Machine Learning for Signal Processing
Data Science, Algorithms, and Computational Statistics

Max A. Little
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Max A. Little 2019
The moral rights of the author have been asserted
First Edition published in 2019
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2019944777
ISBN 978–0–19–871493–4
DOI: 10.1093/oso/9780198714934.001.0001
Printed and bound by
CPI Group (UK) Ltd, Croydon, CR0 4YY

Preface
Digital signal processing (DSP) is one of the ‘foundational’, but somewhat invisible, engineering topics of the modern world, without which many of the technologies we take for granted (the digital telephone, digital radio, television, CD and MP3 players, WiFi, radar, to name just a few) would not be possible. A relative newcomer by comparison,
statistical machine learning is the theoretical backbone of exciting tech-
nologies that are by now starting to reach a level of ubiquity, such as
automatic techniques for car registration plate recognition, speech recog-
nition, stock market prediction, defect detection on assembly lines, robot
guidance and autonomous car navigation. Statistical machine learning
has origins in the recent merging of classical probability and statistics
with artificial intelligence, which exploits the analogy between intelligent
information processing in biological brains and sophisticated statistical
modelling and inference.
DSP and statistical machine learning are of such wide importance to
the knowledge economy that both have undergone rapid changes and
seen radical improvements in scope and applicability. Both DSP and
statistical machine learning make use of key topics in applied mathe-
matics such as probability and statistics, algebra, calculus, graphs and
networks. Therefore, intimate formal links between the two subjects
exist and because of this, an emerging consensus view is that DSP and
statistical machine learning should not be seen as separate subjects. The
many overlaps that exist between the two subjects can be exploited to
produce new digital signal processing tools of surprising utility and effi-
ciency, and wide applicability, highly suited to the contemporary world
of pervasive digital sensors and high-powered and yet cheap comput-
ing hardware. This book gives a solid mathematical foundation to the
topic of statistical machine learning for signal processing, including the
contemporary concepts of the probabilistic graphical model (PGM) and
nonparametric Bayes, concepts which have only more recently emerged
as important for solving DSP problems.
The book is aimed at advanced undergraduates or first-year PhD stu-
dents as well as researchers and practitioners. It addresses the founda-
tional mathematical concepts, illustrated with pertinent and practical
examples across a range of problems in engineering and science. The aim
is to enable students with an undergraduate background in mathemat-
ics, statistics or physics, from a wide range of quantitative disciplines,
to get quickly up to speed with the latest techniques and concepts in
this fast-moving field. The accompanying software will enable readers
to test out the techniques on their own signal analysis problems. The
presentation of the mathematics is much along the lines of a standard
undergraduate physics or statistics textbook, free of distracting techni-
cal complexities and jargon, while not sacrificing rigour. It would be an
excellent textbook for emerging courses in machine learning for signals.
Contents

Preface v

List of Algorithms xiii

List of Figures xv

1 Mathematical foundations 1
1.1 Abstract algebras 1
Groups 1
Rings 3
1.2 Metrics 4
1.3 Vector spaces 5
Linear operators 7
Matrix algebra 7
Square and invertible matrices 8
Eigenvalues and eigenvectors 9
Special matrices 10
1.4 Probability and stochastic processes 12
Sample spaces, events, measures and distributions 12
Joint random variables: independence, conditionals, and
marginals 14
Bayes’ rule 16
Expectation, generating functions and characteristic func-
tions 17
Empirical distribution function and sample expectations 19
Transforming random variables 20
Multivariate Gaussian and other limiting distributions 21
Stochastic processes 23
Markov chains 25
1.5 Data compression and information
theory 28
The importance of the information map 31
Mutual information and Kullback-Leibler (K-L)
divergence 32
1.6 Graphs 34
Special graphs 35
1.7 Convexity 36
1.8 Computational complexity 37
Complexity order classes and big-O notation 38

Tractable versus intractable problems:


NP-completeness 38

2 Optimization 41
2.1 Preliminaries 41
Continuous differentiable problems and critical
points 41
Continuous optimization under equality constraints: La-
grange multipliers 42
Inequality constraints: duality and the Karush-Kuhn-Tucker
conditions 44
Convergence and convergence rates for iterative
methods 45
Non-differentiable continuous problems 46
Discrete (combinatorial) optimization problems 47
2.2 Analytical methods for continuous convex problems 48
L2 -norm objective functions 49
Mixed L2 -L1 norm objective functions 50
2.3 Numerical methods for continuous convex problems 51
Iteratively reweighted least squares (IRLS) 51
Gradient descent 53
Adapting the step sizes: line search 54
Newton’s method 56
Other gradient descent methods 58
2.4 Non-differentiable continuous convex problems 59
Linear programming 59
Quadratic programming 60
Subgradient methods 60
Primal-dual interior-point methods 62
Path-following methods 64
2.5 Continuous non-convex problems 65
2.6 Heuristics for discrete (combinatorial) optimization 66
Greedy search 67
(Simple) tabu search 67
Simulated annealing 68
Random restarting 69

3 Random sampling 71
3.1 Generating (uniform) random numbers 71
3.2 Sampling from continuous distributions 72
Quantile function (inverse CDF) and inverse transform
sampling 72
Random variable transformation methods 74
Rejection sampling 74
Adaptive rejection sampling (ARS) for log-concave densities 75
Special methods for particular distributions 78
3.3 Sampling from discrete distributions 79
Inverse transform sampling by sequential search 79

Rejection sampling for discrete variables 80


Binary search inversion for (large) finite sample
spaces 81
3.4 Sampling from general multivariate
distributions 81
Ancestral sampling 82
Gibbs sampling 83
Metropolis-Hastings 85
Other MCMC methods 88

4 Statistical modelling and inference 93


4.1 Statistical models 93
Parametric versus nonparametric models 93
Bayesian and non-Bayesian models 94
4.2 Optimal probability inferences 95
Maximum likelihood and minimum K-L divergence 95
Loss functions and empirical risk estimation 98
Maximum a-posteriori and regularization 99
Regularization, model complexity and data compression 101
Cross-validation and regularization 105
The bootstrap 107
4.3 Bayesian inference 108
4.4 Distributions associated with metrics and norms 110
Least squares 111
Least Lq -norms 111
Covariance, weighted norms and
Mahalanobis distance 112
4.5 The exponential family (EF) 115
Maximum entropy distributions 115
Sufficient statistics and canonical EFs 116
Conjugate priors 118
Prior and posterior predictive EFs 122
Conjugate EF prior mixtures 123
4.6 Distributions defined through quantiles 124
4.7 Densities associated with piecewise linear loss functions 126
4.8 Nonparametric density estimation 129
4.9 Inference by sampling 130
MCMC inference 130
Assessing convergence in MCMC methods 130

5 Probabilistic graphical models 133


5.1 Statistical modelling with PGMs 133
5.2 Exploring conditional
independence in PGMs 136
Hidden versus observed variables 136
Directed connection and separation 137
The Markov blanket of a node 138
5.3 Inference on PGMs 139

Exact inference 140


Approximate inference 143

6 Statistical machine learning 149


6.1 Feature and kernel functions 149
6.2 Mixture modelling 150
Gibbs sampling for the mixture model 150
E-M for mixture models 152
6.3 Classification 154
Quadratic and linear discriminant analysis (QDA and LDA) 155
Logistic regression 156
Support vector machines (SVM) 158
Classification loss functions and misclassification
count 161
Which classifier to choose? 161
6.4 Regression 162
Linear regression 162
Bayesian and regularized linear regression 163
Linear-in parameters regression 164
Generalized linear models (GLMs) 165
Nonparametric, nonlinear regression 167
Variable selection 169
6.5 Clustering 171
K-means and variants 171
Soft K-means, mean shift and variants 174
Semi-supervised clustering and classification 176
Choosing the number of clusters 177
Other clustering methods 178
6.6 Dimensionality reduction 178
Principal components analysis (PCA) 179
Probabilistic PCA (PPCA) 182
Nonlinear dimensionality reduction 184

7 Linear-Gaussian systems and signal processing 187


7.1 Preliminaries 187
Delta signals and related functions 187
Complex numbers, the unit root and complex exponentials 189
Marginals and conditionals of linear-Gaussian
models 190
7.2 Linear, time-invariant (LTI) systems 191
Convolution and impulse response 191
The discrete-time Fourier transform (DTFT) 192
Finite-length, periodic signals: the discrete Fourier trans-
form (DFT) 198
Continuous-time LTI systems 201
Heisenberg uncertainty 203
Gibb’s phenomena 205
Transfer function analysis of discrete-time LTI systems 206

Fast Fourier transforms (FFT) 208


7.3 LTI signal processing 212
Rational filter design: FIR, IIR filtering 212
Digital filter recipes 220
Fourier filtering of very long signals 222
Kernel regression as discrete convolution 224
7.4 Exploiting statistical stability for linear-Gaussian DSP 226
Discrete-time Gaussian processes (GPs) and DSP 226
Nonparametric power spectral density (PSD) estimation 231
Parametric PSD estimation 236
Subspace analysis: using PCA in DSP 238
7.5 The Kalman filter (KF) 242
Junction tree algorithm (JT) for KF computations 243
Forward filtering 244
Backward smoothing 246
Incomplete data likelihood 247
Viterbi decoding 247
Baum-Welch parameter estimation 249
Kalman filtering as signal subspace analysis 251
7.6 Time-varying linear systems 252
Short-time Fourier transform (STFT) and perfect recon-
struction 253
Continuous-time wavelet transforms (CWT) 255
Discretization and the discrete wavelet transform (DWT) 257
Wavelet design 261
Applications of the DWT 262

8 Discrete signals: sampling, quantization and coding 265


8.1 Discrete-time sampling 266
Bandlimited sampling 267
Uniform bandlimited sampling: Shannon-Whittaker in-
terpolation 267
Generalized uniform sampling 270
8.2 Quantization 273
Rate-distortion theory 275
Lloyd-Max and entropy-constrained quantizer
design 278
Statistical quantization and dithering 282
Vector quantization 286
8.3 Lossy signal compression 288
Audio companding 288
Linear predictive coding (LPC) 289
Transform coding 291
8.4 Compressive sensing (CS) 293
Sparsity and incoherence 294
Exact reconstruction by convex optimization 295
Compressive sensing in practice 296

9 Nonlinear and non-Gaussian signal processing 299


9.1 Running window filters 299
Maximum likelihood filters 300
Change point detection 301
9.2 Recursive filtering 302
9.3 Global nonlinear filtering 302
9.4 Hidden Markov models (HMMs) 304
Junction tree (JT) for efficient HMM computations 305
Viterbi decoding 306
Baum-Welch parameter estimation 306
Model evaluation and structured data classification 309
Viterbi parameter estimation 309
Avoiding numerical underflow in message passing 310
9.5 Homomorphic signal processing 311

10 Nonparametric Bayesian machine learning and signal pro-


cessing 313
10.1 Preliminaries 313
Exchangeability and de Finetti’s theorem 314
Representations of stochastic processes 316
Partitions and equivalence classes 317
10.2 Gaussian processes (GP) 318
From basis regression to kernel regression 318
Distributions over function spaces: GPs 319
Bayesian GP kernel regression 321
GP regression and Wiener filtering 325
Other GP-related topics 326
10.3 Dirichlet processes (DP) 327
The Dirichlet distribution: canonical prior for the cate-
gorical distribution 328
Defining the Dirichlet and related processes 331
Infinite mixture models (DPMMs) 334
Can DP-based models actually infer the number of com-
ponents? 343

Bibliography 345

Index 353
List of Algorithms

2.1 Iteratively reweighted least squares (IRLS). . . . . . . . . 52


2.2 Gradient descent. . . . . . . . . . . . . . . . . . . . . . . . 53
2.3 Backtracking line search. . . . . . . . . . . . . . . . . . . . 54
2.4 Golden section search. . . . . . . . . . . . . . . . . . . . . 55
2.5 Newton’s method. . . . . . . . . . . . . . . . . . . . . . . 56
2.6 The subgradient method. . . . . . . . . . . . . . . . . . . 61
2.7 A primal-dual interior-point method for linear program-
ming (LP). . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.8 Greedy search for discrete optimization. . . . . . . . . . . 67
2.9 (Simple) tabu search. . . . . . . . . . . . . . . . . . . . . . 68
2.10 Simulated annealing. . . . . . . . . . . . . . . . . . . . . . 69
2.11 Random restarting. . . . . . . . . . . . . . . . . . . . . . . 70
3.1 Newton’s method for numerical inverse transform sampling. 73
3.2 Adaptive rejection sampling (ARS) for log-concave den-
sities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.3 Sequential search inversion (simplified version). . . . . . . 80
3.4 Binary (subdivision) search sampling. . . . . . . . . . . . 81
3.5 Gibb’s Markov-Chain Monte Carlo (MCMC) sampling. . . 83
3.6 Markov-Chain Monte Carlo (MCMC) Metropolis-Hastings
(MH) sampling. . . . . . . . . . . . . . . . . . . . . . . . . 86
5.1 The junction tree algorithm for (semi-ring) marginaliza-
tion inference of a single variable clique. . . . . . . . . . . 142
5.2 Iterative conditional modes (ICM). . . . . . . . . . . . . . 144
5.3 Expectation-maximization (E-M). . . . . . . . . . . . . . . 146
6.1 Expectation-maximization (E-M) for general i.i.d. mix-
ture models. . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 The K-means algorithm. . . . . . . . . . . . . . . . . . . . 172
7.1 Recursive, Cooley-Tukey, radix-2, decimation in time, fast
Fourier transform (FFT) algorithm. . . . . . . . . . . . . 210
7.2 Overlap-add FIR convolution. . . . . . . . . . . . . . . . . 224
8.1 The Lloyd-Max algorithm for fixed-rate quantizer design. 279
8.2 The K-means algorithm for fixed-rate quantizer design. . 280
8.3 Iterative variable-rate entropy-constrained quantizer design. 281
9.1 Baum-Welch expectation-maximization (E-M) for hidden
Markov models (HMMs). . . . . . . . . . . . . . . . . . . 308
9.2 Viterbi training for hidden Markov models (HMMs). . . . 310
10.1 Gaussian process (GP) informative vector machine (IVM)
regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
10.2 Dirichlet process mixture model (DPMM) Gibbs sampler. 337

10.3 Dirichlet process means (DP-means) algorithm. . . . . . . 340


10.4 Maximum a-posteriori Dirichlet process mixture collapsed
(MAP-DP) algorithm for conjugate exponential family
distributions. . . . . . . . . . . . . . . . . . . . . . . . . . 342
List of Figures

1.1 Mapping between abstract groups 2


1.2 Rectangle symmetry group 2
1.3 Group Cayley tables 3
1.4 Metric 2D circles for various distance metrics 4
1.5 Important concepts in 2D vector spaces 5
1.6 Linear operator flow diagram 7
1.7 Invertible and non-invertible square matrices 8
1.8 Diagonalizing a matrix 9
1.9 Distributions for discrete and continuous random variables 13
1.10 Empirical cumulative distribution and density functions 20
1.11 The 2D multivariate Gaussian PDF 21
1.12 Markov and non-Markov chains 25
1.13 Shannon information map 28
1.14 Undirected and directed graphs 33
1.15 Convex functions 35
1.16 Convexity, smoothness, non-differentiability 37

2.1 Lagrange multipliers in constrained optimization 42


2.2 Objective functions with differentiable and non-differentiable
points 46
2.3 Analytical shrinkage 50
2.4 Iteratively reweighted least squares (IRLS) 52
2.5 Gradient descent with constant and line search step sizes 54
2.6 Constant step size, backtracking and golden section search 55
2.7 Convergence of Newton’s method with and without line
search 57
2.8 Piecewise linear objective function (1D) 61
2.9 Convergence of the subgradient method 62
2.10 Convergence of primal-dual interior point 64
2.11 Regularization path of total variation denoising 65

3.1 Rejection sampling 75


3.2 Adaptive rejection sampling 77
3.3 Ancestral sampling 82
3.4 Gibb’s sampling for the Gaussian mixture model 83
3.5 Metropolis sampling for the Gaussian mixture model 85
3.6 Slice sampling 90

4.1 Overfitting and underfitting: polynomial models 103



4.2 Minimum description length (MDL) 104


4.3 Cross-validation for model complexity selection 105
4.4 Quantile matching for distribution fitting 125
4.5 Convergence properties of Metropolis-Hastings (MH) sam-
pling 131

5.1 Simple 5-node probabilistic graphical model (PGM) 133


5.2 Notation for repeated connectivity in graphical models 134
5.3 Some complex graphical models 135
5.4 Three basic probabilistic graphical model patterns 137
5.5 D-connectivity in graphical models 138
5.6 The Markov blanket of a node 139
5.7 Junction trees 143
5.8 Convergence of iterative conditional modes (ICM) 145

6.1 Gaussian mixture model for Gamma ray intensity strati-


fication 151
6.2 Convergence of expectation-maximization (E-M) for the
Gaussian mixture model 152
6.3 Classifier decision boundaries 160
6.4 Loss functions for classification 161
6.5 Bayesian linear regression 163
6.6 Linear-in-parameters regression 165
6.7 Logistic regression 166
6.8 Nonparametric kernel regression 167
6.9 Overfitted linear regression on random data: justifying
variable selection 169
6.10 Lasso regression 170
6.11 Mean-shift clustering of human movement data 176
6.12 Semi-supervised versus unsupervised clustering 178
6.13 Dimensionality reduction 179
6.14 Principal components analysis (PCA): traffic count signals 181
6.15 Kernel principal components analysis (KPCA): human
movement data 184
6.16 Locally linear embedding (LLE): chirp signals 185

7.1 The sinc function 189


7.2 Heisenberg uncertainty 204
7.3 Gibb’s phenomena 206
7.4 Transfer function analysis 208
7.5 Transfer function of the simple FIR low-pass filter 214
7.6 Transfer function of the truncated ideal low-pass FIR filter 216
7.7 The bilinear transform 220
7.8 Computational complexity of long-duration FIR imple-
mentations 224
7.9 Impulse response and transfer function of discretized ker-
nel regression 225
7.10 Periodogram power spectral density estimator 233

7.11 Various nonparametric power spectral density estimators 234


7.12 Linear predictive power spectral density estimators 236
7.13 Regularized linear prediction analysis model selection 238
7.14 Wiener filtering of human walking signals 239
7.15 Sinusoidal subspace principal components analysis (PCA)
filtering 241
7.16 Subspace MUSIC frequency estimation 242
7.17 Typical 2D Kalman filter trajectories 243
7.18 Kalman filter smoothing by Tikhonov regularization 252
7.19 Short-time Fourier transform analysis 254
7.20 Time-frequency tiling: Fourier versus wavelet analysis 256
7.21 A selection of wavelet functions 261
7.22 Vanishing moments of the Daubechies wavelet 262
7.23 Discrete wavelet transform wavelet shrinkage 263

8.1 Digital MEMS sensors 265


8.2 Discrete-time sampling hardware block diagram 267
8.3 Non-uniform and uniform bandlimited sampling 268
8.4 Shannon-Whittaker uniform sampling in the frequency
domain 269
8.5 Digital Gamma ray intensity signal from a drill hole 271
8.6 Quadratic B-spline uniform sampling 272
8.7 Rate-distortion curves for an i.i.d. Gaussian source 277
8.8 Lloyd-Max scalar quantization 280
8.9 Entropy-constrained scalar quantization 282
8.10 Shannon-Whittaker reconstruction of density functions 283
8.11 Quantization and dithering 285
8.12 Scalar versus vector quantization 286
8.13 Companding quantization 287
8.14 Linear predictive coding for data compression 289
8.15 Transform coding: bit count versus coefficient variance 291
8.16 Transform coding of musical audio signals 292
8.17 Compressive sensing: discrete cosine transform basis 295
8.18 Random demodulation compressive sensing 297

9.1 Quantile running filter for exoplanetary light curves 301


9.2 First-order log-normal recursive filter: daily rainfall signals 302
9.3 Total variation denoising (TVD) of power spectral density
time series 303
9.4 Bilateral filtering of human walking data 304
9.5 Hidden Markov modelling (HMM) of power spectral time
series data 306
9.6 Cepstral analysis of voice signals 312

10.1 Infinite exchangeability in probabilistic graphical model


form 315
10.2 Linking kernel and regularized basis function regression 319

10.3 Random function draws from the linear basis regression


model 321
10.4 Gaussian process (GP) function draws with various co-
variance functions 322
10.5 Bayesian GP regression in action 323
10.6 Gaussian process (GP) regression for classification 326
10.7 Draws from the Dirichlet distribution for K = 3 327
10.8 The Dirichlet process (DP) 331
10.9 Random draws from a Dirichlet process (DP) 332
10.10 Comparing partition sizes of Dirichlet versus Pitman-Yor
processes 334
10.11 Parametric versus nonparametric estimation of linear pre-
dictive coding (LPC) coefficients 339
10.12 DP-means for signal noise reduction: exoplanet light curves 341
10.13 Probabilistic graphical model (PGM) for the Bayesian
mixture model 342
10.14 Performance of MAP-DP versus DP-means nonparamet-
ric clustering 344
1 Mathematical foundations
Statistical machine learning and signal processing are topics in applied
mathematics, which are based upon many abstract mathematical con-
cepts. Defining these concepts clearly is the most important first step in
this book. The purpose of this chapter is to introduce these foundational
mathematical concepts. It also justifies the statement that much of the
art of statistical machine learning as applied to signal processing lies in
the choice of convenient mathematical models that happen to be useful
in practice. Convenient in this context means that the algebraic conse-
quences of the choice of mathematical modeling assumptions are in some
sense manageable. The seeds of this manageability are the elementary
mathematical concepts upon which the subject is built.

1.1 Abstract algebras


We will take the simple view in this book that mathematics is based on
logic applied to sets: a set is an unordered collection of objects, often
real numbers such as the set {π, 1, e} (which has three elements), or the
set of all real numbers R (with an infinite number of elements). From
this modest origin it is a remarkable fact that we can build the entirety
of the mathematical methods we need. We first start by reviewing some
elementary principles of (abstract) algebras.

Groups
An algebra is a structure that defines the rules of what happens when
pairs of elements of a set are acted upon by operations. A kind of
algebra known as a group (+, R) is the usual notion of addition with
pairs of real numbers. It is a group because it has an identity, the
number zero (when zero is added to any number it remains unchanged,
i.e. a + 0 = 0 + a = a), and every element in the set has an inverse (for any
number a, there is an inverse −a which means that a + (−a) = 0). Finally,
the operation is associative, which is to say that when operating on three
or more numbers, addition does not depend on the order in which the
numbers are added (i.e. a + (b + c) = (a + b) + c). Addition also has
the intuitive property that a + b = b + a, i.e. it does not matter if the
numbers are swapped: the operator is called commutative, and the group
is then called an Abelian group. Mirroring addition is multiplication
acting on the set of real numbers with zero removed (×, R − {0}), which
is also an Abelian group. The identity element is 1, and the inverses


are the reciprocals of each number. Multiplication is also associative,


and commutative. Note that we cannot include zero because this would
require the inclusion of the inverse of zero 1/0, which does not exist
(Figure 1.1).
Groups are naturally associated with symmetries. For example, the
set of rigid geometric transformations of a rectangle that leave the rect-
angle unchanged in the same position, form a group together with com-
positions of these transformations (there are flips along the horizontal
and vertical midlines, one clockwise rotation through 180° about the
centre, and the identity transformation that does nothing). This group
can be denoted as V4 = (◦, {e, h, v, r}), where e is the identity, h the hor-
izontal flip, v the vertical flip, and r the rotation, with the composition
operation ◦. For the rectangle, we can see that h◦v = r, i.e. a horizontal
followed by a vertical flip corresponds to a 180° rotation (Figure 1.2).

Fig. 1.1: Illustrating abstract groups and mapping between them. Shown are the two continuous groups of real numbers, under addition (left column) and multiplication (right column), with identities 0 and 1 respectively. The homomorphism of exponentiation maps addition onto multiplication (left to right column), and the inverse, the logarithm, maps multiplication back onto addition (right to left column). Therefore, these two groups are homomorphic.

Very often, the fact that we are able to make some convenient algebraic calculations in statistical machine learning and signal processing can be traced to the existence of one or more symmetry groups that arise due to the choice of mathematical assumptions, and we will encounter many examples of this phenomenon in later chapters, which often lead to significant computational efficiencies. A striking example of the consequences of groups in classical algebra is the explanation for why there are no solutions that can be written in terms of addition, multiplication and roots, to the general polynomial equation $\sum_{i=0}^{N} a_i x^i = 0$ when
these two groups are homomorphic. N ≥ 5. This fact has many practical consequences, for example, it is
possible to find the eigenvalues of a general matrix of size N × N using
simple analytical calculations when N < 5 (although the analytical cal-
culations do become prohibitively complex), but there is no possibility
of using similar analytical techniques when N ≥ 5, and one must resort
to numerical methods, and these methods sometimes cannot guarantee
to find all solutions!
Many simple groups with the same number of elements are isomorphic
to each other, that is, there is a unique function that maps the elements
of one group to the elements of the other, such that the operations
can be applied consistently to the mapped elements. Intuitively then, the identity in one group is mapped to that of the other group. For example, the rotation group V4 above is isomorphic to the group M8 = (×8, {1, 3, 5, 7}), where ×8 indicates multiplication modulo 8 (that is, taking the remainder of the multiplication on division by 8, see Figure 1.3).

Fig. 1.2: The group of symmetries of the rectangle, V4 = (◦, {e, h, v, r}). It consists of horizontal and vertical flips, and a rotation of 180° about the centre. This group is isomorphic to the group M8 = (×8, {1, 3, 5, 7}) (see Figure 1.3).

Whilst two groups might not be isomorphic, they are sometimes homomorphic: there is a function between one group and the other that maps each element in the first group to one or more elements in the second group, but the mapping is still consistent under the second operation. A very important example is the exponential map, exp(x), that
converts addition over the set of real numbers, to multiplication over
the set of positive real numbers: $e^{a+b} = e^a e^b$; a powerful variant of
this map is widely used in statistical inference to simplify and stabilize
calculations involving the probabilities of independent statistical events,

by converting them into calculations with their associated information


content. This is the negative logarithmic map − ln (x), that converts
probabilities under multiplication, into entropies under addition. This
map is very widely used in statistical inference as we shall see.
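As a quick numerical illustration (not from the text above; a minimal NumPy sketch with arbitrary probabilities), the practical value of this map is easy to see: a product of many small probabilities underflows in double precision, while the equivalent sum of information contents is perfectly stable.

```python
import numpy as np

# Sketch: 2000 independent events, each with (arbitrary) probability 0.001.
p = np.full(2000, 1e-3)

joint_direct = np.prod(p)               # underflows to 0.0 in double precision
joint_information = np.sum(-np.log(p))  # total information content, in nats

print(joint_direct)        # 0.0
print(joint_information)   # about 13815.5; the joint probability is exp(-13815.5)
```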
For a more detailed but accessible background to group theory, read
Humphreys (1996).

Rings
Whilst groups deal with one operation on a set of numbers, rings are a
slightly more complex structure that often arises when two operations
are applied to the same set. The most immediately tangible example
is the operations of addition and multiplication on the set of integers Z
(the positive and negative whole numbers with zero). Using the defini-
tion above, the set of integers under addition form an Abelian group,
whereas under multiplication the integers form a simple structure known
as a monoid – a group without inverses. Multiplication with the integers
is associative, and there is an identity (the positive number one), but the
multiplicative inverses are not integers (they are fractions such as 1/2,
−1/5 etc.) Finally, in combination, integer multiplication distributes
over integer addition: a × (b + c) = a × b + a × c = (b + c) × a. These
properties define a ring: it has one operation that together with the set
forms an Abelian group, and another operation that, with the set, forms
a monoid, and the second operation distributes over the first. As with
integers, the set of real numbers under the usual addition and multipli-
cation also has the structure of a ring. Another very important example is the set of square matrices all of size N × N with real elements under normal matrix addition and multiplication. Here the multiplicative identity element is the identity matrix of size N × N, and the additive identity element is the same size square matrix with all zero elements.

Fig. 1.3: The table for the symmetry group V4 = (◦, {e, h, v, r}) (top), and the group M8 = (×8, {1, 3, 5, 7}) (bottom), showing the isomorphism between them obtained by mapping e ↦ 1, h ↦ 3, v ↦ 5 and r ↦ 7.
Rings are powerful structures that can lead to very substantial com-
putational savings for many statistical machine learning and signal pro-
cessing problems. For example, if we remove the condition that the
additive operation must have inverses, then we have a pair of monoids
that are distributive. This structure is known as a semiring or semifield
and it turns out that the existence of this structure in many machine
learning and signal processing problems makes these otherwise computa-
tionally intractable problems feasible. For example, the classical Viterbi
algorithm for determining the most likely sequence of hidden states in a
Hidden Markov Model (HMM) is an application of the max-sum semifield
on the dynamic Bayesian network that defines the stochastic dependen-
cies in the model.
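A minimal sketch of this idea, assuming NumPy and SciPy and using made-up transition and emission numbers for a two-state, three-step chain (an illustration of the semiring point, not the book's algorithm): the same message-passing recursion run in the sum-product semiring gives the observation likelihood, while in the max-sum semiring it gives the Viterbi (best path) score.

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical toy chain; all numbers are arbitrary. log_emit[t, k] is the
# emission log-probability of the (fixed) observation at step t in state k.
log_trans = np.log(np.array([[0.7, 0.3],
                             [0.4, 0.6]]))
log_emit = np.log(np.array([[0.9, 0.1],
                            [0.2, 0.8],
                            [0.5, 0.5]]))
log_init = np.log(np.array([0.5, 0.5]))

# Same recursion, two semirings: 'multiply' = add logs in both cases;
# 'sum' = logsumexp (sum-product) versus max (max-sum).
alpha_sum = log_init + log_emit[0]
alpha_max = log_init + log_emit[0]
for t in range(1, 3):
    alpha_sum = logsumexp(alpha_sum[:, None] + log_trans, axis=0) + log_emit[t]
    alpha_max = np.max(alpha_max[:, None] + log_trans, axis=0) + log_emit[t]

print("log-likelihood of the observations:", logsumexp(alpha_sum))
print("log-probability of the best state path:", np.max(alpha_max))
```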
Both Dummit and Foote (2004) and Rotman (2000) contain detailed
introductions to abstract algebra including groups and rings.

1.2 Metrics
Distance is a fundamental concept in mathematics. Distance functions
play a key role in machine learning and signal processing, particularly
as measures of similarity between objects, for example, digital signals
encoded as items of a set. We will also see that a statistical model often
implies the use of a particular measure of distance, and this measure
determines the properties of statistical inferences that can be made.
A geometry is obtained by attaching a notion of distance to a set: it becomes a metric space. A metric takes two points in the set and returns a single (usually real) value representing the distance between them. A metric must have the following properties to satisfy intuitive notions of distance:

(1) Non-negativity: d(x, y) ≥ 0,
(2) Symmetry: d(x, y) = d(y, x),
(3) Coincidence: d(x, x) = 0, and
(4) Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

Respectively, these requirements are that (1) distance cannot be negative, (2) the distance going from x to y is the same as that from y to x, (3) only points lying on top of each other have zero distance between them, and (4) the length of any one side of a triangle defined by three points cannot be greater than the sum of the length of the other two sides. For example, the Euclidean metric on a D-dimensional set is:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{D} (x_i - y_i)^2} \qquad (1.1)$$

This represents the notion of distance that we experience in everyday geometry. The defining properties of distance lead to a vast range of possible geometries, for example, the city-block geometry is defined by the absolute distance metric:

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{D} |x_i - y_i| \qquad (1.2)$$

City-block distance is so named because it measures distances on a grid parallel to the co-ordinate axes. Distance need not take on real values, for example, the discrete metric is defined as d(x, y) = 0 if x = y and d(x, y) = 1 otherwise. Another very important metric is the Mahalanobis distance:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T \Sigma^{-1} (\mathbf{x} - \mathbf{y})} \qquad (1.3)$$

This distance may not be axis-aligned: it corresponds to finding the Euclidean distance after applying an arbitrary stretch or compression along each axis followed by an arbitrary D-dimensional rotation (indeed if Σ = I, the identity matrix, this is identical to the Euclidean distance).

Fig. 1.4: Metric 2D circles d(x, 0) = c for various distance metrics. From top to bottom, Euclidean distance, absolute distance, the distance $d(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{D} |x_i - y_i|^{0.3}\right)^{0.3^{-1}}$, and the Mahalanobis distance for Σ11 = Σ22 = 1.0, Σ12 = Σ21 = −0.5. The contours are c = 1 (red lines) and c = 0.5 (blue lines).

Figure 1.4 shows plots of 2D circles d (x, 0) = c for various metrics, in


particular c = 1 which is known as the unit circle in that metric space.
For further reading, metric spaces are introduced in Sutherland (2009)
in the context of real analysis and topology.
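The three metrics above are straightforward to compute numerically; the following is a minimal NumPy sketch with arbitrary points, using the covariance values quoted in the Figure 1.4 caption.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([-0.5, 0.5])

euclidean = np.sqrt(np.sum((x - y) ** 2))              # equation (1.1)
city_block = np.sum(np.abs(x - y))                     # equation (1.2)

Sigma = np.array([[1.0, -0.5],
                  [-0.5, 1.0]])
d = x - y
mahalanobis = np.sqrt(d @ np.linalg.solve(Sigma, d))   # equation (1.3)

print(euclidean, city_block, mahalanobis)
```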

1.3 Vector spaces


A space is just the name given to a set endowed with some additional
mathematical structure. A (real) vector space is the key structure of
linear algebra that is a central topic in most of classical signal pro-
cessing — all digital signals are vectors, for example. The definition
of a (finite) vector space begins with an ordered set (often written as
a column) of N real numbers called a vector, and a single real num-
ber called a scalar. To that vector we attach the addition operation
which is both associative and commutative, that simply adds every cor-
responding element of the numbers in each vector together, written as
v + u. The identity for this operation is the vector with N zeros, 0.
Additionally, we define a scalar multiplication operation that multi-
plies each element of the vector with a scalar, λ. Using the scalar
multiplication by λ = −1, we can then form inverses of any vector.
Scalar multiplication should not matter in which order two scalar mul-
tiplications occur, e.g. λ (µv) = (λµ) v = (µλ) v = µ (λv). We
also require that scalar multiplication distributes over vector addition,
λ (v + u) = λv + λu, and scalar addition distributes over scalar multi-
plication, (λ + µ) v = λv + µv.
Every vector space has at least one basis for the space: this is a set of
linearly independent vectors, such that every vector in the vector space
can be written as a unique linear combination of these basis vectors
(Figure 1.5). Since our vectors have N entries, there are always N
vectors in the basis. Thus, N is the dimension of the vector space.
The simplest basis is the so-called standard basis, consisting of the N vectors $e_1 = (1, 0, \ldots, 0)^T$, $e_2 = (0, 1, \ldots, 0)^T$ etc. It is easy to see that a vector $v = (v_1, v_2, \ldots, v_N)^T$ can be expressed in terms of this basis as $v = v_1 e_1 + v_2 e_2 + \cdots + v_N e_N$.

Fig. 1.5: Important concepts in 2D vector spaces. The standard basis (e1, e2) is aligned with the axes. Two other vectors (v1, v2) can also be used as a basis for the space; all that is required is that they are linearly independent of each other (in the 2D case, they are not simply scalar multiples of each other). Then x can be represented in either basis. The dot product between two vectors is proportional to the cosine of the angle between them: cos(θ) ∝ ⟨v1, v2⟩. Because (e1, e2) are at mutual right angles, they have zero dot product and therefore the basis is orthogonal. Additionally, it is an orthonormal basis because they are unit norm (length) (‖e1‖ = ‖e2‖ = 1). The other basis vectors are neither orthogonal nor unit norm.

By attaching a norm to a vector space (see below), we can measure the length of any vector; the vector space is then referred to as a normed space. To satisfy intuitive notions of length, a norm V(u) must have the following properties:

(1) Non-negativity: V(u) ≥ 0,
(2) Positive scalability: V(αu) = |α| V(u),
(3) Separation: If V(u) = 0 then u = 0, and
(4) Triangle inequality: V(u + v) ≤ V(u) + V(v).

Often, the notation ‖u‖ is used. Probably most familiar is the Euclidean norm $\|u\|_2 = \sqrt{\sum_{i=1}^{N} u_i^2}$, but another norm that gets heavy use
in statistical machine learning is the $L_p$-norm:

$$\|u\|_p = \left( \sum_{i=1}^{N} |u_i|^p \right)^{1/p} \qquad (1.4)$$

of which the Euclidean (p = 2) and city-block (p = 1) norms are special cases. Also of importance is the max-norm $\|u\|_\infty = \max_{i=1 \ldots N} |u_i|$, which is just the length of the largest co-ordinate.
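A short sketch, assuming NumPy (whose numpy.linalg.norm implements these norms directly), checking equation (1.4) and the max-norm on an arbitrary vector:

```python
import numpy as np

u = np.array([3.0, -4.0, 1.0])

# Compare the library routine against the explicit Lp-norm formula.
for p in (1, 2, 5):
    assert np.isclose(np.linalg.norm(u, p), np.sum(np.abs(u) ** p) ** (1 / p))

print(np.linalg.norm(u, 1))        # city-block norm, 8.0
print(np.linalg.norm(u, 2))        # Euclidean norm, about 5.099
print(np.linalg.norm(u, np.inf))   # max-norm, 4.0
```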
There are several ways in which a product of vectors can be formed. The most important in our context is the inner product between two vectors:

$$\alpha = \langle u, v \rangle = \sum_{i=1}^{N} u_i v_i \qquad (1.5)$$

This is sometimes also described as the dot product u · v. For complex vectors, this is defined as:

$$\langle u, v \rangle = \sum_{i=1}^{N} u_i \bar{v}_i \qquad (1.6)$$

where ā is the complex conjugate of a ∈ C.


We will see later that the dot product plays a central role in the sta-
tistical notion of correlation. When two vectors have zero inner prod-
uct, they are said to be orthogonal; geometrically they meet at a right-
angle. This also has a statistical interpretation: for certain random vari-
ables, orthogonality implies statistical independence. Thus, orthogonal-
ity leads to significant simplifications in common calculations in classical
DSP.
A special and very useful kind of basis is an orthogonal basis where
the inner product between every pair of distinct basis vectors is zero:

$$\langle v_i, v_j \rangle = 0 \quad \text{for all } i \neq j, \; i, j = 1, 2, \ldots, N \qquad (1.7)$$


In addition, the basis is orthonormal if every basis vector has unit
norm kv i k = 1 – the standard basis is orthonormal, for example (Figure
1.5). Orthogonality/orthonormality dramatically simplifies many calcu-
lations over vector spaces, partly because it is straightforward to find
the N scalar coefficients ai of an arbitrary vector u in this basis using
the inner product:
$$a_i = \frac{\langle u, v_i \rangle}{\|v_i\|^2} \qquad (1.8)$$

which simplifies to $a_i = \langle u, v_i \rangle$ in the orthonormal case. Orthonormal
bases are the backbone of many methods in DSP and machine learning.
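A minimal sketch of equation (1.8) in NumPy, using an arbitrary orthogonal (but not orthonormal) basis chosen purely for illustration: the coefficients recovered by the inner product reconstruct the original vector exactly.

```python
import numpy as np

# Mutually orthogonal basis vectors, not unit norm (arbitrary choice).
basis = [np.array([1.0, 1.0, 0.0]),
         np.array([1.0, -1.0, 0.0]),
         np.array([0.0, 0.0, 2.0])]

u = np.array([3.0, -1.0, 4.0])
coeffs = [np.dot(u, v) / np.dot(v, v) for v in basis]   # a_i = <u, v_i> / ||v_i||^2

reconstruction = sum(a * v for a, v in zip(coeffs, basis))
print(coeffs)                          # [1.0, 2.0, 2.0]
print(np.allclose(reconstruction, u))  # True
```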
We can express the Euclidean norm using the inner product: $\|u\|_2 = \sqrt{u \cdot u}$. An inner product satisfies the following properties:

(1) Non-negativity: u · v ≥ 0,
(2) Symmetry: u · v = v · u, and

(3) Linearity: (αu) · v = α (u · v).


There is an intuitive connection between distance and length: assuming that the metric is homogeneous, $d(\alpha u, \alpha v) = |\alpha|\, d(u, v)$, and translation invariant, $d(u, v) = d(u + a, v + a)$, a norm can be defined as the distance to the origin $\|u\| = d(0, u)$. A commonly occurring example of this is the so-called squared $L_2$ weighted norm $\|u\|_A^2 = u^T A u$, which is just the squared Mahalanobis distance $d(0, u)^2$ discussed earlier with $\Sigma^{-1} = A$.
On the other hand, there is one sense in which every norm induces an
associated metric with the construction d (u, v) = ku − vk. This con-
struction enjoys extensive use in machine learning and statistical DSP to
quantify the “discrepancy” or “error” between two signals. In fact, since
norms are convex (discussed later), it follows that metrics constructed
this way from norms, are also convex, a fact of crucial importance in
practice.
A final product we will have need for in later chapters is the elemen-
twise product w = u ◦ v which is obtained by multiplying each element
in the vector together wn = un vn .

Linear operators
A linear operator or map acts on vectors to create other vectors, and
while doing so, preserve the operations of vector addition and scalar
multiplication. They are homomorphisms between vector spaces. Lin-
ear operators are fundamental to classical digital signal processing and
statistics, and so find heavy use in machine learning. Linear operators
L have the linear combination property:
$$L[\alpha_1 u_1 + \alpha_2 u_2 + \cdots + \alpha_N u_N] = \alpha_1 L[u_1] + \alpha_2 L[u_2] + \cdots + \alpha_N L[u_N] \qquad (1.9)$$

Fig. 1.6: A ‘flow diagram’ depicting linear operators. All linear operators share the property that the operator L applied to the scaled sum of (two or more) vectors α1 u1 + α2 u2 (top panel), is the same as the scaled sum of the same operator applied to each of these vectors first (bottom panel). In other words, it does not matter whether the operator is applied before or after the scaled sum.

What this says is that the operator commutes with scalar multiplication and vector addition: we get the same result if we first scale, then add the vectors, and then apply the operator to the result, or, apply the operator to each vector, then scale them, and then add up the results (Figure 1.6).
Matrices (which we discuss next), differentiation and integration, and
expectation in probability are all examples of linear operators. The
linearity of integration and differentiation are standard rules which can
be derived from the basic definitions. Linear maps in two-dimensional
space have a nice geometric interpretation: straight lines in the vector
space are mapped onto other straight lines (or onto a point if they are
degenerate maps). This idea extends to higher dimensional vector spaces
in the natural way.
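The linear combination property (1.9) is easy to check numerically for a matrix operator; the following NumPy sketch uses random vectors and scalars (all values arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(3, 3))               # the operator, here a random matrix
u1, u2 = rng.normal(size=3), rng.normal(size=3)
a1, a2 = 2.5, -0.7

lhs = L @ (a1 * u1 + a2 * u2)             # scale and add first, then apply L
rhs = a1 * (L @ u1) + a2 * (L @ u2)       # apply L first, then scale and add
print(np.allclose(lhs, rhs))              # True
```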

Matrix algebra
When vectors are ‘stacked together’ they form a powerful structure

which is a central topic of much of signal processing, statistics and ma-


chine learning: matrix algebra. A matrix is a ‘rectangular’ array of
N × M elements, for example, the 3 × 2 matrix A is:

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} \qquad (1.10)$$
This can be seen to be two length three vectors stacked side-by-side.
The elements of a matrix are often written using the subscript notation
aij where i = 1, 2, . . . , N and j = 1, 2, . . . , M . Matrix addition of two
matrices, is commutative: C = A + B = B+A, where the addition is
element-by-element i.e. cij = aij + bij .
As with vectors, there are many possible ways in which matrix multi-
plication could be defined: the one most commonly encountered is the
row-by-column inner product. For two matrices A of size N × M and
B of size M × P , the product C = A × B is a new matrix of size N × P
defined as:

$$c_{ij} = \sum_{k=1}^{M} a_{ik} b_{kj}, \qquad i = 1, 2, \ldots, N, \; j = 1, 2, \ldots, P \qquad (1.11)$$
This can be seen to be the matrix of all possible inner products of each
row of A by each column of B. Note that the number of columns of the
left hand matrix must match the number of rows of the right hand one.
Matrix multiplication is associative, it distributes over matrix addition,
and it is compatible with scalar multiplication: αA = B simply gives the
new matrix with entries bij = αaij , i.e. it is just columnwise application
of vector scalar multiplication. Matrix multiplication is, however, non-
commutative: it is not true in general that A × B gives the same result
as B × A.
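A small NumPy sketch (arbitrary entries) of the row-by-column product in equation (1.11), checked against the built-in matrix product, together with a pair of matrices showing non-commutativity:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                    # N x M = 3 x 2
B = np.array([[1.0, 0.0, 2.0],
              [-1.0, 3.0, 1.0]])              # M x P = 2 x 3

C = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        C[i, j] = np.sum(A[i, :] * B[:, j])   # inner product: row i with column j

print(np.allclose(C, A @ B))                  # True

S = np.array([[0.0, 1.0], [0.0, 0.0]])
T = np.array([[0.0, 0.0], [1.0, 0.0]])
print(np.allclose(S @ T, T @ S))              # False: S T != T S in general
```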
A useful matrix operator is the transpose that swaps rows with
columns; if A is an N × M matrix then AT = B is the M × N matrix
$b_{ji} = a_{ij}$. Some of the properties of the transpose are: it is self-inverse, $(A^T)^T = A$; it respects addition, $(A + B)^T = A^T + B^T$; and it reverses the order of factors in multiplication, $(AB)^T = B^T A^T$.

Fig. 1.7: A depiction of the geometric effect of invertible and non-invertible square matrices. The invertible square matrix A maps the triangle at the bottom to the ‘thinner’ triangle (for example, by transforming the vector for each vertex). It scales the area of the triangle by the determinant |A| ≠ 0. However, the non-invertible square matrix B collapses the triangle onto a single point with no area because |B| = 0. Therefore, A−1 is well-defined, but B−1 is not.

Square and invertible matrices

So far, we have not discussed how to solve matrix equations. The case of addition is easy because we can use scalar multiplication to form the negative of a matrix, i.e. given C = A + B, finding B requires us to calculate B = C − A = C + (−1)A. In the case of multiplication, we need to find the “reciprocal” of a matrix, e.g. to solve C = AB for B we would naturally calculate A−1C = A−1AB = B by the usual algebraic rules. However, things become more complicated because A−1 does not
exist in general. We will discuss the conditions under which a matrix
does have a multiplicative inverse next.
All square matrices of size N × N can be summed or multiplied to-

gether in any order. A square matrix A with all zero elements except
for the main diagonal, i.e. aij = 0 unless i = j is called a diagonal ma-
trix. A special diagonal matrix, I, is the identity matrix where the main
diagonal entries aii = 1. Then, if the equality AB = I = BA holds, the matrix B must be well-defined (and unique) and it is the inverse of A, i.e. B = A−1. We then say that A is invertible; if it is not invertible then it is degenerate or singular. (The N × N identity matrix is denoted $I_N$, or simply I when the context is clear and the size can be omitted.)
There are many equivalent conditions for matrix invertibility, for ex-
ample, the only solution to the equation Ax = 0 is the vector x = 0
or the columns of A are linearly independent. But one particularly
important way to test the invertibility of a matrix is to calculate the
determinant |A|: if the matrix is singular, the determinant is zero. It
follows that all invertible matrices have |A| ≠ 0. The determinant calcu-
lation is quite elaborate for a general square matrix, formulas exist but
geometric intuition helps to understand these calculations: when a linear
map defined by a matrix acts on a geometric object in vector space with
a certain volume, the determinant is the scaling factor of the mapping.
Volumes under the action of the map are scaled by the magnitude of
the determinant. If the determinant is negative, the orientation of any
geometric object is reversed. Therefore, invertible transformations are
those that do not collapse the volume of any object in the vector space
to zero (Figure 1.7).
Another matrix operator which finds significant use is the trace tr(A) of a square matrix: this is just the sum of the diagonals, i.e. $\mathrm{tr}(A) = \sum_{i=1}^{N} a_{ii}$. The trace is invariant to addition, tr(A + B) = tr(A) + tr(B), transpose, $\mathrm{tr}(A^T) = \mathrm{tr}(A)$, and multiplication, tr(AB) = tr(BA). With products of three or more matrices the trace is invariant to cyclic permutations; with three matrices: tr(ABC) = tr(CAB) = tr(BCA).
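The following NumPy sketch (arbitrary matrices) illustrates these points: the determinant as an invertibility test, solving a matrix equation without forming the inverse explicitly, and the cyclic invariance of the trace.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])                 # second row is twice the first

print(np.linalg.det(A))                    # 1.0, so A is invertible
print(np.linalg.det(B))                    # 0.0, so B is singular

X_true = np.array([[0.5, -1.0], [2.0, 3.0]])
C = A @ X_true
X = np.linalg.solve(A, C)                  # solves AX = C, recovering X_true
print(np.allclose(X, X_true))              # True

P, Q, R = np.eye(2), A, C
print(np.isclose(np.trace(P @ Q @ R), np.trace(R @ P @ Q)))   # True
```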

Eigenvalues and eigenvectors


A ubiquitous computation that arises in connection with algebraic prob-
lems in vector spaces is the eigenvalue problem for the given N ×N square
matrix A:

Av = λv (1.12)
Any non-zero N × 1 vector v which solves this equation is known as
an eigenvector of A, and the scalar value λ is known as the associated
eigenvalue. Eigenvectors are not unique: they can be multiplied by any non-zero scalar and still remain eigenvectors with the same eigenvalues. Thus, often the unit length eigenvectors are sought as the solutions to (1.12).

Fig. 1.8: An example of diagonalizing a matrix. The diagonalizable square matrix A has diagonal matrix D containing the eigenvalues, and transformation matrix P containing the eigenbasis, so A = PDP−1. A maps the rotated square (top), to the rectangle in the same orientation (at left). This is equivalent to first ‘unrotating’ the square (the effect of P−1) such that it is aligned with the co-ordinate axes, then stretching/compressing the square along each axis (the effect of D), and finally rotating back to the original orientation (the effect of P).

It should be noted that (1.12) arises for vector spaces in general, e.g. linear operators. An important example occurs in the vector space of functions f(x) with the differential operator $L = \frac{d}{dx}$. Here, the corresponding eigenvalue problem is the differential equation $L[f(x)] = \lambda f(x)$, for which the solution is $f(x) = a e^{\lambda x}$ for any (non-zero) scalar value a. This is known as an eigenfunction of the differential operator L.
If they exist, the eigenvectors and eigenvalues of a square matrix A can
be found by obtaining all scalar values λ such that|(A − λI)| = 0. This
holds because Av − λv = 0 if and only if |(A − λI)| = 0. Expanding out
this determinant equation leads to an N -th order polynomial equation
in λ, namely aN λN + aN −1 λN −1 + · · · + a0 = 0, and the roots of this
equation are the eigenvalues.
This polynomial is known as the characteristic polynomial for A and
determines the existence of a set of eigenvectors that is also a basis for
the space, in the following way. The fundamental theorem of algebra
states that this polynomial has exactly N roots, but some may be re-
peated (i.e. occur more than once). If there are no repeated roots of the
characteristic polynomial, then the eigenvalues are all distinct, and so
there are N eigenvectors which are all linearly independent. This means
that they form a basis for the vector space, which is the eigenbasis for
the matrix.
Not all matrices have an eigenbasis. However matrices that do are
also diagonalizable, that is, they have the same geometric effect as a
diagonal matrix, but in a different basis other than the standard one.
This basis can be found by solving the eigenvalue problem. Placing all
the eigenvectors into the columns of a matrix P and all the corresponding
eigenvalues into a diagonal matrix D, then the matrix can be rewritten:

A = PDP−1 (1.13)
See Figure 1.8. A diagonal matrix simply scales all the coordinates of
the space by a different, fixed amount. They are very simple to deal with,
and have important applications in signal processing and machine learn-
ing. For example, the Gaussian distribution over multiple variables, one
of the most important distributions in practical applications, encodes
the probabilistic relationship between each variable in the problem with
the covariance matrix. By diagonalizing this matrix, one can find a lin-
ear mapping which makes all the variables statistically independent of
each other: this dramatically simplifies many subsequent calculations.
Despite the central importance of the eigenvectors and eigenvalues of a
linear problem, it is generally not possible to find all the eigenvalues by
analytical calculation. Therefore one generally turns to iterative numer-
ical algorithms to obtain an answer to a certain precision.
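As an illustrative sketch (not from the text; it assumes Python with numpy, and the matrix values are arbitrary), the eigendecomposition (1.13) can be computed and checked numerically:

```python
# Minimal sketch: compute the eigendecomposition of a small symmetric matrix
# and verify the factorization A = P D P^{-1} of equation (1.13).
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # an arbitrary diagonalizable matrix

eigenvalues, P = np.linalg.eig(A)       # columns of P are unit-length eigenvectors
D = np.diag(eigenvalues)                # eigenvalues placed on the diagonal

print(np.allclose(A, P @ D @ np.linalg.inv(P)))   # True, up to rounding error
```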

Special matrices
Beyond what has been already discussed, there is not that much more to
be said about general matrices which have N × M degrees of freedom.
Special matrices with fewer degrees of freedom have very interesting
properties and occur frequently in practice.
Some of the most interesting special matrices are symmetric matrices
with real entries – self-transpose and so square by definition, i.e. A^T =
A. These matrices are always diagonalizable, and have an orthogonal
eigenbasis. The eigenvalues are always real. If the inverse exists, it is
also symmetric. A symmetric matrix has ½N(N + 1) unique entries, on
the order of half the N² entries of an arbitrary square matrix.
Positive-definite matrices are a special kind of symmetric matrix for
which v^T A v > 0 for any non-zero vector v. All the eigenvalues are
positive. Take any real, invertible matrix B (so that Bv ≠ 0 for all
such v) and let A = B^T B; then v^T B^T B v = (Bv)^T (Bv) = ‖Bv‖²₂ > 0,
making A positive-definite. As will be described in the next section,
these kinds of matrices are very important in machine learning and signal
processing because the covariance matrix of a set of random variables is
positive-definite for exactly this reason.
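A brief numerical sketch of this construction (illustrative only; B is an arbitrary random matrix and numpy is assumed):

```python
# Sketch: form A = B^T B from an invertible B and check the two properties
# discussed above: positive quadratic form and positive eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))                 # almost surely invertible
A = B.T @ B                                 # symmetric and positive-definite

v = rng.normal(size=4)                      # an arbitrary non-zero vector
print(v @ A @ v > 0)                        # quadratic form is positive
print(np.all(np.linalg.eigvalsh(A) > 0))    # all eigenvalues are positive
```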
Orthonormal matrices have columns which form an orthonormal basis
for the space. The determinant of these matrices is either +1 or −1.
Like symmetric matrices they are always diagonalizable, although the
eigenvalues are generally complex with modulus 1. An orthonormal
matrix is always invertible; the inverse is also orthonormal and equal
to the transpose, A^T = A^{-1}. The subset with determinant +1
corresponds to rotations in the vector space.
For upper (lower) triangular matrices, only the diagonal and the entries
above (below) the diagonal may be non-zero; the rest are zero. These matrices
often occur when solving matrix problems such as Av = b, because the
matrix equation Lv = b is simple to solve by forward substitution if L
is lower-triangular. Forward substitution is a straightforward sequential
procedure which first obtains v1 in terms of b1 and l11, then v2 in terms
of v1, b2, l21 and l22, etc. The same holds for upper triangular matrices and
backward substitution. Because of the simplicity of these substitution
procedures, there exist methods for decomposing a matrix into a product
of upper or lower triangular matrices and a companion matrix.
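A minimal sketch of forward substitution for Lv = b (assuming a non-singular lower-triangular L; in practice a library routine such as scipy.linalg.solve_triangular would be used):

```python
# Sketch of forward substitution for the lower-triangular system L v = b.
import numpy as np

def forward_substitution(L, b):
    N = len(b)
    v = np.zeros(N)
    for i in range(N):
        # v_i depends on b_i, the diagonal entry l_ii, and the previously
        # computed entries v_0 ... v_{i-1}.
        v[i] = (b[i] - L[i, :i] @ v[:i]) / L[i, i]
    return v

L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, 5.0, 6.0]])
b = np.array([2.0, 5.0, 32.0])
print(forward_substitution(L, b))        # agrees with np.linalg.solve(L, b)
```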
Toeplitz matrices are matrices with 2N − 1 degrees of freedom that
have constant diagonals, that is, the elements of A have entries aij =
ci−j . All discrete convolutions can be represented as Toeplitz matrices,
and as we will discuss later, this makes them of fundamental importance
in DSP. Because of the reduced degrees of freedom and special structure
of the matrix, a Toeplitz matrix problem Ax = b is computationally
easier to solve than a general matrix problem: a method known as the
Levinson recursion dramatically reduces the number of arithmetic oper-
ations needed.
Circulant matrices are Toeplitz matrices where each row is obtained
from the row above by rotating it one element to the right. With only
N degrees of freedom they are highly structured and can be understood
as discrete circular convolutions. The eigenbasis which diagonalizes the
matrix is the discrete Fourier basis which is one of the cornerstones of
classical DSP. It follows that any circulant matrix problem can be very
efficiently solved using the fast Fourier transform (FFT).
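The following sketch (illustrative, with arbitrary values; scipy is assumed) solves a small circulant system by diagonalizing it in the discrete Fourier basis with the FFT:

```python
# Sketch: a circulant matrix performs circular convolution with its first
# column, so Ax = b can be solved in the Fourier domain with FFTs.
import numpy as np
from scipy.linalg import circulant

c = np.array([4.0, 1.0, 0.5, 0.25])      # first column defines the matrix
A = circulant(c)
b = np.array([1.0, 2.0, 3.0, 4.0])

x = np.real(np.fft.ifft(np.fft.fft(b) / np.fft.fft(c)))   # O(N log N) solve

print(np.allclose(A @ x, b))             # True: matches a dense solver
```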
Dummit and Foote (2004) contains an in-depth exposition of vector
spaces from an abstract point of view, whereas Kaye and Wilson (1998)
is an accessible and more concrete introduction.

1.4 Probability and stochastic processes


Probability is a formalization of the intuitive notion of uncertainty.
Statistics is built on probability. Therefore, statistical DSP and machine
learning have, at their root, the quantitative manipulation of uncertainties.
Probability theory provides the axiomatic foundation for reasoning about
uncertainty.

Sample spaces, events, measures and distributions


We start with a set of elements, say, Ω, which are known as outcomes.
This is what we get as the result of a measurement or experiment. The
set Ω is known as the sample space or universe. For example, the die
has six possible outcomes, so the sample space is Ω = {1, 2, 3, 4, 5, 6}.
Given these outcomes, we want to quantify the probability of certain
events occurring, for example, that we get a six or an even number in
any throw. These events form an abstract σ-algebra, F: the collection of
subsets of outcomes that can be constructed by applying the elementary set
operations of complement and (countable) union to a selection of the
elements of 2^Ω (the set of all subsets of Ω). The elements of F are the
events. For example, in the coin toss there are two possible outcomes,
heads and tails, so Ω = {H, T}. A set of events that are of interest makes
up a σ-algebra F = {∅, {H}, {T}, Ω}, so that
we can calculate the probability of heads or tails, none, or heads or
tails occurring (N.B. the last two events are in some senses ‘obvious’ —
the first is impossible and the second inevitable — so they require no
calculation to evaluate, but we will see that to do probability calculus
we always need the empty set and the set of all outcomes).
Given the pair Ω, F we want to assign probabilities to events, which
are real numbers lying between 0 and 1. An event with probability 0 is
impossible and will never occur, whereas if the event has probability 1
then it is certain to occur. A mapping that determines the probability
of any event is known as a measure function µ : F → R. An example
would be the measure function for the fair coin toss, which is µ(∅) = 0,
µ({H, T}) = 1, µ({H}) = µ({T}) = ½. A measure satisfies the
following rules:

(1) Non-negativity: µ (A) ≥ 0 for all A ∈ F,


(2) Unit measure: µ (Ω) = 1 and,
(3) Disjoint additivity: µ(∪_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} µ(A_i) if the events A_i ∈
F do not overlap with each other (that is, they are mutually dis-
joint and so contain no elements from the sample space in com-
mon).

We mainly use the notation P(A) for the probability (measure) of event


A. We can derive some important consequences of these rules. For
example, if one event is wholly contained inside another, it must have
smaller probability: if A ⊆ B then P (A) ≤ P (B) with equality if
A = B. Similarly, the probability of the event not occurring is one
minus the probability of that event: P(Ā) = 1 − P(A).

Of great importance to statistics is the sample space of real numbers.


A useful σ-algebra is the Borel algebra formed from all possible (open)
intervals of the real line. With this algebra we can assign probabilities
to ranges of real numbers, e.g. P ([a, b]) for the real numbers a ≤ b. An
important consequence of the axioms is that P ({a}) = 0, i.e. point set
events have zero probability. This differs from discrete (countable) sam-
ple spaces where the probability of any single element from the sample
space can be non-zero.
Given a set of all possible events it is often natural to associate nu-
merical ‘labels’ to each event. This is extremely useful because then we
can perform meaningful numerical computations on the events. Ran-
dom variables are functions that map the outcomes to numerical val-
ues, for example the random variable that maps coin tosses into the set
{0, 1}, X ({T }) = 0 and X ({H}) = 1. Cumulative distribution func-
tions (CDFs) are measures as defined above, but where the events are
selected through the random variable. For example, the CDF of the
(fair) coin toss as described above would be:
P({A ∈ {H, T} : X(A) ≤ x}) = 1/2 for x = 0, and 1 for x = 1    (1.14)
This is a special case of the Bernoulli distribution (see below). Two
common shorthand ways of writing the CDF are FX (x) and P (X ≤ x).
When the sample space is the real line, the random variable is contin-
uous. The associated probability of an event, in this case, a (half open)
interval of the real line is:

P((a, b]) = P({A ∈ R : a < X(A) ≤ b}) = F_X(b) − F_X(a)    (1.15)

Often we can also define a distribution through a probability density
function (PDF) f_X(x):

P(A) = ∫_A f_X(x) dx    (1.16)

where A ∈ F, and in practice statistical DSP and machine learning is
most often (though not exclusively) concerned with F being the set of
all open intervals of the real line (the Borel algebra), or some subset of
the real line such as [0, ∞). To satisfy the unit measure requirement, we
must have that ∫_R f_X(x) dx = 1 (for the case of the whole real line). In
the discrete case, the equivalent is the probability mass function (PMF)
that assigns a probability measure to each separate outcome. To simplify
the notation, we often drop the random variable subscript when the
context is clear, writing e.g. F(x), f(x).

Fig. 1.9: Distribution functions and probabilities for ranges of discrete
(top) and continuous (bottom) random variables X and Y respectively.
Cumulative distribution functions (CDF) F_X and F_Y are shown on the left,
and the associated probability mass (PMF) and probability density functions
(PDF) f_X and f_Y on the right. For discrete X defined on the integers, the
probability of the event A such that a ≤ X(A) ≤ b is Σ_{x=a}^{b} f_X(x),
which is the same as F_X(b) − F_X(a − 1). For the continuous Y defined on
the reals, the probability of the event A = (a, b] is the area under the
curve of f_Y, i.e. P(A) = ∫_A f_Y(y) dy. This is just F_Y(b) − F_Y(a) by the
fundamental theorem of calculus.
We can deduce certain properties of CDFs. Firstly, they must be non-
decreasing, because the associated PMF/PDFs must be non-negative.
Secondly, if X is defined on the range [a, b], we must have that FX (a) = 0
and FX (b) = 1 (in the commonly occurring case where either a or b
are infinite, then we would have, e.g. limx→−∞ FX (x) = 0 and/or

limx→∞ FX (x) = 1). An important distinction to make here between


discrete and continuous random variables, is that the PDF can have
f (x) > 1 for some x in the range of the random variable, whereas PMFs
must have 0 ≤ f (x) ≤ 1 for all values of x in range. In the case of PMFs
this is necessary to satisfy the unit measure property. These concepts
are illustrated in Figure 1.9.
An elementary example of a PMF is the fair coin for which:
f(x) = 1/2 for x ∈ {0, 1}    (1.17)

To satisfy unit measure, we must have Σ_{a∈X(Ω)} f(a) = 1. The measure
of an event is similarly:

P(A) = Σ_{a∈X(A)} f(a)    (1.18)

Some ubiquitous PMFs include the Bernoulli distribution which rep-


resents the binary outcome:
f(x) = 1 − p for x = 0, and p for x = 1    (1.19)

A compact representation is f(x) = (1 − p)^{1−x} p^x. A very important
continuous distribution is the Gaussian distribution, whose density func-
tion is:

f(x; µ, σ) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))    (1.20)
The semicolon is used to separate the random variable from the ad-
justable (non-random) parameters that determine the form of the pre-
cise distribution of X. When the parameters are considered random
variables, the bar notation f ( x| µ, σ) is used instead, indicating that X
depends, in a consistent probabilistic sense, on the value of the parame-
ters. This latter situation occurs in the Bayesian framework as we will
discuss later.
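As a small illustrative sketch (not from the text; scipy.stats is assumed and the parameter values are arbitrary), the Bernoulli PMF (1.19) and Gaussian PDF (1.20) can be evaluated directly:

```python
# Sketch: evaluating the Bernoulli PMF and the Gaussian PDF numerically.
import numpy as np
from scipy.stats import bernoulli, norm

p = 0.3
print(bernoulli.pmf([0, 1], p))                 # [1 - p, p] = [0.7, 0.3]

mu, sigma = -2.0, 1.0
x = np.array([-3.0, -2.0, -1.0])
print(norm.pdf(x, loc=mu, scale=sigma))         # f(x; mu, sigma)
print(norm.cdf(np.inf, loc=mu, scale=sigma))    # 1.0: unit measure
```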

Joint random variables: independence, conditionals, and marginals
Often we are interested in the probability of multiple simultaneous events
occurring. A consistent way to construct an underlying sample space is
to form the set of all possible combinations of events. This is known
as the product sample space. For example, the product sample space
of two coin tosses is the set Ω = {(H, H) , (H, T ) , (T, H) , (T, T )} with
σ−algebra F = {∅, {(H, H)} , {(H, T )} , {(T, H)} , {(T, T )} , Ω}. As with
single outcomes, we want to define a probability measure so that we can
evaluate the probability of any joint outcome. This measure is known
as the joint CDF :

FXY (x, y) = P (X ≤ x and Y ≤ y) (1.21)


In words, the joint CDF is the probability that the pair of random
variables X, Y simultaneously take on values that are equal to x, y at
the most. For the case of continuous random variables, each defined on
the whole real line, this probability is a multiple integration:
F_XY(x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f(u, v) du dv    (1.22)

where f(u, v) is the joint PDF. The sample space is now the plane R²,
and so in order to satisfy the unit measure axiom, it must be that
∫_R ∫_R f(u, v) du dv = 1. The probability of any region A of R² is then
the multiple integral over that region: P(A) = ∫_A f(u, v) du dv.
The corresponding discrete case has that P(A × B) =
Σ_{a∈X(A)} Σ_{b∈Y(B)} f(a, b) for any product of events where A ∈ Ω_X,
B ∈ Ω_Y, and Ω_X, Ω_Y are the sample spaces of X and Y respectively,
and f(a, b) is the joint PMF. The joint PMF must sum to one over the
whole product sample space: Σ_{a∈X(Ω_X)} Σ_{b∈Y(Ω_Y)} f(a, b) = 1.
More general joint events over N variables are defined similarly and
associated with multiple CDFs, PDFs and PMFs, e.g.
fX1 X2 ...XN (x1 , x2 . . . xN ) and, when the context is clear from the ar-
guments of the function, we drop the subscript in the name of the
function for notational simplicity. This naturally allows us to define
distribution functions over vectors of random variables, e.g. f (x) for
X = (X_1, X_2, ..., X_N)^T where typically, each element of the vector comes
from the same sample space.
Given the joint PMF/PDF, we can always ‘remove’ one or more of
the variables in the joint set by integrating out this variable, e.g.:
f(x_1, x_3, ..., x_N) = ∫_R f(x_1, x_2, x_3, ..., x_N) dx_2    (1.23)
This computation is known as marginalization.
When considering joint events, we can perform calculations about the
conditional probability of one event occurring, when another has already
occurred (or is otherwise fixed). This conditional probability is written
using the bar notation P ( X = x| Y = y): described as the ‘probability
that the random variable X = x, given that Y = y’. For PMFs and
PDFs we will shorten this to f ( x| y). This probability can be calculated
from the joint and single distributions of the conditioning variable:
f(x | y) = f(x, y) / f(y)    (1.24)
In effect, the conditional PMF/PDF is what we obtain from restricting
the joint sample space to the set for which Y = y, and calculating the
measure of the intersection of the joint sample space for any chosen x.
The division by f(y) ensures that the conditional distribution is itself a
normalized measure on this restricted sample space, as we can show by

marginalizing out X from the right hand side of the above equation.
If the distribution of X does not depend upon Y , we say that X
is independent of Y . In this case f ( x| y) = f (x). This implies that
f (x, y) = f (x) f (y), i.e. the joint distribution over X, Y factorizes into
a product of the marginal distributions over X, Y . Independence is a
central topic in statistical DSP and machine learning because whenever
two or more variables are independent, this can lead to very significant
simplifications that in some cases, make the difference between whether a
problem is tractable at all. In fact, it is widely recognized these days that
the main goal of statistical machine learning is to find good factorizations
of the joint distribution over all the random variables of a problem.
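These ideas are easy to see on a small discrete sketch (the joint PMF below is made up for illustration): marginals are row/column sums, the conditional is a renormalized slice, and independence holds exactly when the joint equals the outer product of the marginals.

```python
# Sketch: marginalization, conditioning and an independence check for a
# joint PMF over two discrete variables X (rows) and Y (columns).
import numpy as np

f_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])            # joint PMF f(x, y), sums to 1

f_x = f_xy.sum(axis=1)                     # marginalize out Y
f_y = f_xy.sum(axis=0)                     # marginalize out X

f_x_given_y0 = f_xy[:, 0] / f_y[0]         # conditional f(x | y = 0), eq. (1.24)
print(f_x_given_y0.sum())                  # 1.0: a normalized distribution

print(np.allclose(f_xy, np.outer(f_x, f_y)))   # False: X and Y are dependent
```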

Bayes’ rule
If we have the distribution function of a random variable conditioned on
another, is it possible to swap the role of conditioned and conditioning
variables? The answer is yes: provided that we have all the marginal
distributions. This leads us into the territory of Bayesian reasoning.
The calculus is straightforward, but the consequences are of profound
importance to statistical DSP and machine learning. We will illustrate
the concepts using continuous random variables, but the principles are
general and apply to random variables over any sample space. Suppose
we have two random variables X, Y and we know the conditional dis-
tribution of X given Y , then the conditional distribution of Y on X
is:
f(y | x) = f(x | y) f(y) / f(x)    (1.25)
This is known as Bayes’ rule. In the Bayesian formalism, f ( x| y) is
known as the likelihood, f (y) is known as the prior, f (x) is the evidence
and f ( y| x) is the posterior.
Often, we do not know the distribution over X; but since the nu-
merator in Bayes’ rule is the joint probability of X and Y , this can be
obtained by marginalizing out Y from the numerator:
f(y | x) = f(x | y) f(y) / ∫_R f(x | y) f(y) dy    (1.26)
This form of Bayes’ rule is ubiquitous because it allows calculation of
the posterior knowing only the likelihood and the prior.
Unfortunately, one of the hardest and most computationally intractable
problems in applying the Bayesian formalism arises when attempting
to evaluate integrals over many variables to calculate the posterior in
(1.26). Fortunately however, there are common situations in which it is
not necessary to know the evidence probability. A third restatement of
Bayes’ rule makes it clear that the evidence probability can be consid-
ered a ‘normalizer’ for the posterior, ensuring that the posterior satisfies
the unit measure property:

f ( y| x) ∝ f ( x| y) f (y) (1.27)
This form is very commonly encountered in many statistical inference
problems in machine learning. For example, when we wish to know
the value of a parameter or random variable given some data which
maximizes the posterior, and the evidence probability is independent of
this variable or parameter, then we can exclude the evidence probability
from the calculations.
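A small numerical sketch of (1.26)–(1.27) (a hypothetical model, not from the text: Gaussian likelihood for an unknown mean with a Gaussian prior, evaluated on a grid) shows that normalizing the product of likelihood and prior recovers the posterior without an explicit evidence calculation:

```python
# Sketch: grid-based Bayes' rule. The evidence is just the normalizer of
# likelihood times prior.
import numpy as np
from scipy.stats import norm

y = np.linspace(-5.0, 5.0, 1001)          # grid over the unknown parameter
dy = y[1] - y[0]

x_observed = 1.2                          # a single made-up observation
likelihood = norm.pdf(x_observed, loc=y, scale=1.0)   # f(x | y)
prior = norm.pdf(y, loc=0.0, scale=2.0)               # f(y)

unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dy)  # divide by the evidence

print(np.isclose(posterior.sum() * dy, 1.0))   # unit measure holds
print(y[np.argmax(posterior)])                 # posterior mode, approx. 0.96
```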

Expectation, generating functions and characteristic functions
There are many ways of summarizing the distribution of a random vari-
able. Of particular importance are measures of central tendency such
as the mean and median. The mean of a (continuous) random variable
X is the sum over all possible outcomes weighted by the probability of
that outcome:
E[X] = ∫ x f(x) dx    (1.28)

Where not obvious from the context, we write E_X[X] to indicate that
this integral is with respect to the random variable X. In the case of
discrete variables this is E[X] = Σ_{x∈X(Ω)} x f(x). As discussed earlier,
expectation is a linear operator, i.e. E[Σ_i a_i X_i] = Σ_i a_i E[X_i]
for arbitrary constants a_i. A constant is invariant under expectation:
E [a] = a. The mean is also known as the expected value, and the in-
tegral is called the expectation. The expectation plays a central role
in probability and statistics, and can in fact be used to construct an
entirely different axiomatic view on probability. The expectation with
respect to an arbitrary transformation of a random variable, g (X) is:
E[g(X)] = ∫ g(x) f(x) dx    (1.29)

Using this we can define a hierarchy of summaries of the distribution of
a random variable, known as the k-th moments:
E[X^k] = ∫ x^k f(x) dx    (1.30)

From the unit measure property of probability it can be seen that the
zeroth moment E[X^0] = 1. The first moment coincides with the mean.
Central moments are those defined “around” the mean:
µ_k = E[(X − E[X])^k] = ∫ (x − µ)^k f(x) dx    (1.31)

where µ is the mean. A very important central moment is the variance,
var[X] = µ_2, which is a measure of spread (about the mean) of the
distribution. The standard deviation is the square root of this,
std[X] = √µ_2. Higher
order central moments such as skewness (µ3 ) and kurtosis (µ4 ) measure

aspects such as the asymmetry and sharpness of distribution, respec-


tively.
For joint distributions with joint density function f (x, y), the expec-
tation is:
E[g(X, Y)] = ∫_{Ω_Y} ∫_{Ω_X} g(x, y) f(x, y) dx dy    (1.32)
From this, we can derive the joint moments:
E[X^j Y^k] = ∫_{Ω_Y} ∫_{Ω_X} x^j y^k f(x, y) dx dy    (1.33)
An important special case is the joint second central moment, known
as the covariance:

cov[X, Y] = E[(X − E[X])(Y − E[Y])]    (1.34)
          = ∫_{Ω_Y} ∫_{Ω_X} (x − µ_X)(y − µ_Y) f(x, y) dx dy

where µX , µY are the means of X, Y respectively.


Sometimes, the hierarchy of moments of a distribution serve to define
the distribution uniquely. A very important kind of expectation is the
moment generating function (MGF), for discrete variables:
M(s) = E[exp(sX)] = Σ_{x∈X(Ω)} exp(sx) f(x)    (1.35)

The real variable s becomes the new independent variable replacing the
discrete variable x. When the sum (1.35) converges absolutely, then the
MGF exists and can be used to find all the moments for the distribution
of X:
E[X^k] = (d^k M / ds^k)(0)    (1.36)
This can be shown to follow from the series expansion of the exponential
function. Using the Bernoulli example above, the MGF is M (s) =
1 − p + p exp (s). Often, the distribution of a random variable has a
simple form under the MGF that makes the task of manipulating random
variables relatively easy. For example, given a linear combination of
independent random variables:
X_N = Σ_{n=1}^{N} a_n X_n    (1.37)
it is not a trivial matter to calculate the distribution of XN . However,
the MGF of the sum is just:
M_{X_N}(s) = Π_{n=1}^{N} M_{X_n}(a_n s)    (1.38)

from which the distribution of the sum can sometimes be recognized


immediately. As an example, the MGF for an (unweighted) sum of N
i.i.d. Bernoulli random variables with parameter p, is:
M_{X_N}(s) = (1 − p + p exp(s))^N    (1.39)
which is just the MGF of the binomial distribution.
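This can be confirmed by simulation (an illustrative sketch with arbitrary N and p; numpy and scipy.stats are assumed):

```python
# Sketch: the unweighted sum of N i.i.d. Bernoulli(p) variables follows the
# Binomial(N, p) distribution, as the MGF (1.39) indicates.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
N, p, trials = 10, 0.3, 200_000

sums = rng.binomial(1, p, size=(trials, N)).sum(axis=1)   # sums of Bernoullis
empirical = np.bincount(sums, minlength=N + 1) / trials

theoretical = binom.pmf(np.arange(N + 1), N, p)
print(np.max(np.abs(empirical - theoretical)))            # small discrepancy
```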
A similar expectation is the characteristic function (CF), for contin-
uous variables:
ψ(s) = E[exp(isX)] = ∫ exp(isx) f(x) dx    (1.40)

where i = √−1. This can be understood as the Fourier transform of
the density function. An advantage over the MGF is that the CF al-
ways exists. It can therefore be used as an alternative way to define a
distribution, a fact which is necessary for some well-known distributions
such as the Levy or alpha-stable distributions. Well-known properties of
Fourier transforms make it easy to use the CF to manipulate random
variables. For example, given a random variable X with CF ψX (s), the
random variable Y = X + m where m is a constant is:

ψY (s) = ψX (s) exp (ism) (1.41)


From this, given that the CF of the standard normal Gaussian with
mean zero and unit variance, is exp(−½s²), the shifted random variable
Y has CF ψ_Y(s) = exp(ism − ½s²). Another property, similar to the


MGF, is the linear combination property:
ψ_{X_N}(s) = Π_{i=1}^{N} ψ_{X_i}(a_i s)    (1.42)

We can use this to show that for a linear combination (1.37) of indepen-
dent Gaussian random variables with mean µn and variance σn2 , the CF
of the sum is:
ψ_{X_N}(s) = exp(is Σ_{n=1}^{N} a_n µ_n − ½ s² Σ_{n=1}^{N} a_n² σ_n²)    (1.43)
which can be recognized as another Gaussian with mean Σ_{n=1}^{N} a_n µ_n
and variance Σ_{n=1}^{N} a_n² σ_n². This shows that the Gaussian is invariant to
linear transformations, a property known as (statistical) stability, which
is of fundamental importance in classical statistical DSP.
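A simulation sketch of this stability property (the weights, means and variances below are arbitrary; numpy assumed):

```python
# Sketch: a linear combination of independent Gaussians is Gaussian with
# mean sum(a_n mu_n) and variance sum(a_n^2 sigma_n^2), per (1.43).
import numpy as np

rng = np.random.default_rng(2)
a = np.array([2.0, -1.0, 0.5])
mu = np.array([1.0, 0.0, -3.0])
sigma = np.array([1.0, 2.0, 0.5])

samples = rng.normal(mu, sigma, size=(500_000, 3)) @ a    # weighted sums

print(samples.mean(), a @ mu)                     # both approx. 0.5
print(samples.var(), np.sum(a**2 * sigma**2))     # both approx. 8.0625
```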

Empirical distribution function and sample expectations
If we start with a PDF or PMF, then the specific values of the parameters
of these functions determine the mathematical form of the distribution.
However, often we want some given data to “speak for itself” and de-
termine a distribution function directly. An important, and simple way

to do this, is using the empirical cumulative distribution function, or


ECDF:
F_N(x) = (1/N) Σ_{n=1}^{N} 1[x_n ≤ x]    (1.44)

where 1[.] takes a logical condition as argument, and is 1 if the condition
is true, 0 if false. Thus, the ECDF counts the number of data points
that are equal to or below the value of the variable x. It looks like a
staircase, jumping up one count at the value of each data point. The
ECDF estimates the CDF of the distribution of the data, and it can be
shown that, in a specific, probabilistic sense, this estimator converges
on the true CDF given an infinite amount of data. By differentiating
(1.44), the associated PMF (PDF) is a sum of Kronecker (Dirac) delta
functions:

f_N(x) = (1/N) Σ_{n=1}^{N} δ[x_n − x]    (1.45)

Fig. 1.10: Empirical cumulative distribution function (ECDF) and density
functions (EPDF) based on a sample of size N = 20 from a Gaussian random
variable with µ = −2 and σ = 1.0. For the estimated ECDF (top), the black
‘steps’ occur at the value of each sample, whereas the blue curve is the
theoretical CDF for the Gaussian. The EPDF (bottom) consists of an infinitely
tall Dirac ‘spike’ occurring at each sample.

See Figure 1.10. The simplicity of this estimator makes it very useful
in practice. For example, the expectation with respect to the function
g(X) of the ECDF for a continuous random variable is:

E[g(X)] = ∫_Ω g(x) f_N(x) dx    (1.46)
        = ∫_Ω g(x) (1/N) Σ_{n=1}^{N} δ[x_n − x] dx
        = (1/N) Σ_{n=1}^{N} ∫_Ω g(x) δ[x_n − x] dx = (1/N) Σ_{n=1}^{N} g(x_n)

using the sift property of the delta function, ∫ f(x) δ[x − a] dx = f(a).
Therefore, the expectation of a random variable can be estimated from
the average of the expectation function applied to the data. These estimates
are known as sample expectations, the most well-known of which
is the sample mean µ = E[X] = (1/N) Σ_{n=1}^{N} x_n.
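A short sketch of the ECDF (1.44) and the resulting sample expectations, using simulated Gaussian data in the spirit of Figure 1.10 (illustrative only; numpy assumed):

```python
# Sketch: empirical CDF and sample expectations computed from data.
import numpy as np

rng = np.random.default_rng(3)
x_data = rng.normal(loc=-2.0, scale=1.0, size=20)     # N = 20 samples

def ecdf(x, data):
    # F_N(x) = (1/N) * number of samples less than or equal to x, eq. (1.44)
    return np.mean(data[:, None] <= x, axis=0)

print(ecdf(np.array([-4.0, -2.0, 0.0]), x_data))      # staircase values

print(x_data.mean())          # sample mean: estimate of E[X], near -2
print(np.mean(x_data**2))     # sample estimate of the second moment E[X^2]
```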

Transforming random variables


We often need to apply some kind of transformation to a random vari-
able. What happens to the distribution over this random variable af-
ter transformation? Often, it is straightforward to compute the CDF,
PMF or PDF of the transformed variable. In general for a random
variable X with CDF FX (x), if the transformation Y = g (X) is in-
vertible, then F_Y(y) = F_X(g^{-1}(y)) if g^{-1} is increasing, or F_Y(y) =
1 − F_X(g^{-1}(y)) if g^{-1} is decreasing. For the corresponding PDFs,
f_Y(y) = f_X(g^{-1}(y)) |dg^{-1}(y)/dy|. This calculus can be extended to the
case where g is generally non-invertible except on any number of isolated

points. An example of this process is the transformation g (X) = X 2 ,


which converts the Gaussian into the chi-squared distribution.
Another ubiquitous example is the effect of a linear transformation,
i.e. Y = σX + µ, for arbitrary scalar values µ, σ, for which f_Y(y) =
(1/|σ|) f_X((y − µ)/σ). This can be used to prove that, for example, a great
many distributions are invariant under scaling by a positive real value
(exponential and gamma distributions are important examples), or are
contained in a location-scale family, that is, each member of the family
of distributions can be obtained from any other in the same family by
an appropriate translation and scaling (this holds for the Gaussian and
logistic distributions).
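Both rules are easy to check by simulation (an illustrative sketch; numpy and scipy.stats assumed): squaring a standard Gaussian gives a chi-squared variable with one degree of freedom, and an affine map of a Gaussian stays in the Gaussian location-scale family.

```python
# Sketch: transformations of random variables checked against known CDFs.
import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(4)
x = rng.standard_normal(500_000)

y = x**2                                  # g(X) = X^2
for q in [0.5, 1.0, 2.0]:
    print(np.mean(y <= q), chi2.cdf(q, df=1))     # empirical vs chi-squared(1)

z = 3.0 * x + 1.0                         # linear transformation Y = sigma*X + mu
print(np.mean(z <= 4.0), norm.cdf(4.0, loc=1.0, scale=3.0))
```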

Multivariate Gaussian and other limiting distributions

As discussed earlier in this section, invariance to linear transformations
(stability) is one of the hallmarks of the Gaussian distribution. Another
is the central limit theorem: for an infinite sequence of random variables
with finite mean and variance which are all independent of each other,
the distribution of the sum of this sequence of random variables will tend
to the Gaussian. A simple proof using CFs exists. In many contexts this
theorem is given as a justification for choosing the Gaussian distribution
as a model for some given data.
These desirable properties of a single Gaussian random variable carry
over to vectors with D elements; over which the multivariate Gaussian
distribution is:

f(x; µ, Σ) = (1/√((2π)^D |Σ|)) exp(−½ (x − µ)^T Σ^{-1} (x − µ))    (1.47)

where x = (x_1, x_2, ..., x_D)^T, the mean vector µ = (µ_1, µ_2, ..., µ_D)^T and
Σ is the covariance matrix. Equal probability density contours of the
multivariate Gaussian are, in general, (hyper)-ellipses in D dimensions.
The maximum probability density occurs at x = µ. The positive-definite
covariance matrix can be decomposed into a rotation and expansion or
contraction in each axis, around the point µ. Another very special and
important property of the multivariate Gaussian is that all marginals
are also (multivariate) Gaussians. Similarly, this means that conditioning
one multivariate Gaussian on another gives another multivariate
Gaussian. All these properties are depicted in Figure 1.11 and described
algebraically below.

Fig. 1.11: An example of the multivariate (D = 2) Gaussian with PDF
f((x_1, x_2)^T; µ, Σ) (top). Height is probability density value, and the
other two axes are x_1, x_2. The contours of constant probability are
ellipses (middle), here shown for probability density values of 0.03 (blue),
0.05 (cyan) and 0.08 (black). The maximum probability density coincides with
the mean (here µ = (−1, 1)^T). The marginals are all single-variable
Gaussians; an example PDF of one of the marginals with mean µ = −1 is shown
at bottom.

It is simple to extend the statistical stability property of the univariate
Gaussian to multiple dimensions to show that this property also applies
to the multivariate normal. Firstly, we need the notion of the CF for
joint random variables X ∈ R^D, which is ψ_X(s) = E[exp(is^T X)] for the
variable s ∈ C^D. Now, consider an i.i.d. vector of standard normal
univariate RVs, X ∈ R^D. The joint CF will be ψ_X(s) =
Π_{i=1}^{D} exp(−½ s_i²) = exp(−½ s^T s). This is the CF of the standard mul-
tivariate Gaussian which has zero mean vector and identity covariance,
X ∼ N (0, I). What happens if we apply the (affine) transformation
Y = AX + b with the (full-rank) matrix A ∈ RD×D and b ∈ RD ? The
effect of this transformation on an arbitrary CF is:

ψ_Y(s) = E_X[exp(is^T (AX + b))]
       = exp(is^T b) E_X[exp(i (A^T s)^T X)]    (1.48)
       = exp(is^T b) ψ_X(A^T s)

Inserting the CF of the standard multivariate normal, we get:

ψ_Y(s) = exp(is^T b) exp(−½ (A^T s)^T (A^T s))
       = exp(is^T b − ½ s^T A A^T s)    (1.49)
which is the CF of another multivariate normal, from which we can
say that Y is multivariate normal with mean µ = b and covariance
Σ = AAT , or Y ∼ N (µ, Σ). It is now straightforward to predict
that application of another affine transformation Z = BY + c leads to
another multivariate Gaussian:

ψ_Z(s) = exp(is^T c) exp(i (B^T s)^T µ − ½ (B^T s)^T Σ (B^T s))
       = exp(is^T (Bµ + c) − ½ s^T B Σ B^T s)    (1.50)

which has mean Bµ + c and covariance BΣBT . This construction can


also be used to show that all marginals are also multivariate normal, if
we consider the orthogonal projection, with a set of dimension indices
P , e.g. projection down onto P = {2, 6, 7}. Then the projection matrix
P has entries p_ii = 1 for each i ∈ P, and the rest of the entries are zero. Then we
have the following CF:
ψ_Z(s) = exp(is^T (Pµ) − ½ s^T P Σ P^T s)
       = exp(is_P^T µ_P − ½ s_P^T Σ_P s_P)    (1.51)
where a_P indicates the sub-vector keeping only the elements indexed by P, and
A_P indicates the sub-matrix keeping only the rows and columns indexed by P. This is yet
another multivariate Gaussian. Similar arguments can be used to show
that the conditional distributions are also multivariate normal. Partition
X into two sub-vectors X P and X P̄ where P̄ = {1, 2, . . . D} \P such
that P ∪ P̄ = {1, 2, . . . , D}. Then f (xP |xP̄ ) = N (xP ; m, S) where
m = µ_P + Σ_{P P̄} Σ_{P̄ P̄}^{-1} (x_{P̄} − µ_{P̄}) and S = Σ_{P P} − Σ_{P P̄} Σ_{P̄ P̄}^{-1} Σ_{P̄ P}. For
detailed proofs, see Murphy (2012, section 4.3.4).
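A numerical sketch of these conditioning formulas (an arbitrary three-dimensional example, conditioning the first coordinate on the remaining two; numpy assumed):

```python
# Sketch: conditional of a multivariate Gaussian. With P = [0] and
# Pbar = [1, 2], f(x_P | x_Pbar) = N(x_P; m, S) with
#   m = mu_P + Sigma_PPbar Sigma_PbarPbar^{-1} (x_Pbar - mu_Pbar)
#   S = Sigma_PP - Sigma_PPbar Sigma_PbarPbar^{-1} Sigma_PbarP
import numpy as np

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

P, Pbar = [0], [1, 2]
x_Pbar = np.array([2.0, 0.0])                     # observed values (made up)

S_ab = Sigma[np.ix_(P, Pbar)]
S_bb = Sigma[np.ix_(Pbar, Pbar)]

m = mu[P] + S_ab @ np.linalg.solve(S_bb, x_Pbar - mu[Pbar])
S = Sigma[np.ix_(P, P)] - S_ab @ np.linalg.solve(S_bb, S_ab.T)
print(m, S)                                       # conditional mean and covariance
```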
Although the Gaussian is special, it is not the only distribution with
the statistical stability property: another important example is the α-
stable distribution (which includes the Gaussian as a special case); in
fact, a generalization of the central limit theorem states that the dis-
tribution of the sum of independent random variables with infinite vari-
ance tends towards the α-stable distribution. Yet another broad class
of distributions that are also invariant to linear transformations are
the elliptical distributions whose densities, when they exist, are defined
in terms of a function of the Mahalanobis distance from the mean,
d(x, µ) = √((x − µ)^T Σ^{-1} (x − µ)). These useful distributions also have
elliptically-distributed marginals, and the multivariate Gaussian is a spe-
cial case.
Just as the central limit theorem is often given as a justification for
choosing the Gaussian, the extreme value theorem is often a justification
for choosing one of the extreme value distributions. Consider an infinite
sequence of identically distributed but independent random variables: the
maximum of this sequence (suitably normalized, in the limit) is either Frechet,
Weibull or Gumbel distributed, regardless of the distribution of the random
variables in the sequence.

Stochastic processes
Stochastic processes are key objects in statistical signal processing and
machine learning. In essence, a stochastic process is just a collection of
random variables Xt on the same sample space Ω (also known as the
state space), where the index t comes from an arbitrary set T which may
be finite or infinite in size, uncountable or countable. When the index
set is finite such as T = {1, 2 . . . N }, then we may consider the collection
to coincide with a vector of random variables.
In this book the index is nearly always synonymous with (relative)
time, and therefore plays a crucial role since nearly all signals are time-
based. When the index is a real number i.e T = R, the collection is
known as a continuous-time stochastic process. Countable collections are
known as discrete-time stochastic processes. Although continuous-time
processes are important for theoretical purposes, all recorded signals we
capture from the real world are discrete-time and finite. These signals
must be stored digitally so that signal processing computations may be
performed on them. A signal is typically sampled from the real world at
uniform intervals in time, and each sample takes up a finite number of
digital bits.
This latter constraint influences the choice of sample space for each
of the random variables in the process. A finite number of bits can only
encode a finite range of discrete values, so, being faithful to the digital
representation might suggest Ω = {0, 1, 2, ..., K − 1} where K = 2^B for
B bits, but this digital representation can also be arranged to encode
real numbers with a finite precision, a common example being the 32

bit floating point representation, and this might be more faithful to how
we model the real-world process we are sampling (see Chapter 8 for
details). Therefore, it is often mathematically realistic and convenient
to work with stochastic processes on the sample space of real numbers
Ω = R. Nevertheless, the choice of sample space depends crucially
upon the real-world interpretation of the recorded signal, and we should
not forget that the actual computations may only ever amount to an
approximation of the mathematical model upon which they are based.
If each member of the collection of variables is independent of every
other and each one has the same distribution, then to characterize the
distributional properties of the process, all that matters is the distribu-
tion of every Xt , say f (x) for the PDF over the real state space. Simple
processes with this property are said to be independent and identically
distributed (i.i.d.) and they are of crucial importance in many applica-
tions in this book, because the joint distribution over the entire process
factorizes into a product of the individual distributions which leads to
considerable computational simplifications in statistical inference prob-
lems.
However, much more interesting signals are those for which the stochas-
tic process is time-dependent, where each of the Xt is not, in general, in-
dependent. The distributional properties of any process can be analyzed
by consideration of the joint distributions of all finite-length collections
of the constituent random variables, known as the finite-dimensional
distributions (f.d.d.s) of the process. For the real state space the f.d.d.s
are defined by a vector of N time indices t = (t_1, t_2, ..., t_N)^T, where the
vector (X_{t_1}, X_{t_2}, ..., X_{t_N})^T has the PDF f_t(x) for x = (x_1, x_2, ..., x_N)^T.
It is these f.d.d.s which encode the dependence structure between the
random variables in the process.
The f.d.d.s encode much more besides dependence structure. Statis-
tical DSP and machine learning make heavy use of this construct. For
example, Gaussian processes which are fundamental to ubiquitous topics
such as Kalman filtering in signal processing and nonparametric Bayes’
inference in machine learning are processes for which the f.d.d.s are all
multivariate Gaussians. Another example is the Dirichlet process used
in nonparametric Bayes’ which has Dirichlet distributions as f.d.d.s. (see
Chapter 10).
Strongly stationary processes are special processes in which the f.d.d.s
are invariant to time translation, i.e. (X_{t_1}, X_{t_2}, ..., X_{t_N})^T has the same
f.d.d. as (X_{t_1+τ}, X_{t_2+τ}, ..., X_{t_N+τ})^T, for τ > 0. This says that the lo-
cal distributional properties will be the same at all times. This is yet
another mathematical simplification and it is used extensively. A less
restrictive notion of stationarity occurs when the first and second joint
moments are invariant to time delays: cov [Xt , Xs ] = cov [Xt+τ , Xs+τ ]
and E [Xt ] = E [Xs ] for all t, s and τ > 0. This is known as weak
stationarity. Strong stationarity implies weak stationarity, but the im-
plication does not necessarily work the other way except in special cases
(stationary Gaussian processes are one example for which this is true).
The temporal covariance (autocovariance) of a weakly stationary process

depends only on time translation τ : cov [Xt , Xt+τ ] = cov [X0 , Xτ ] (see
Chapter 7).

Markov chains
Another broad class of simple non-i.i.d. stochastic processes are those
for which the dependence in time has a finite effect, known as Markov
chains. Given a discrete-time process with index t ∈ Z over a discrete
state space, such a process satisfies the Markov property:

f (Xt+1 = xt+1 |Xt = xt , Xt−1 = xt−1 . . . X1 = x1 )


= f (Xt+1 = xt+1 |Xt = xt ) (1.52)

for all t ≥ 1 and all xt . This conditional probability is called the transi-
tion distribution.
What this says is that the probability of the random variable at any
time t depends only upon the previous time index, or stated another
way, the process is independent given knowledge of the random variable
at the previous time index. The Markov property leads to considerable
simplifications which allow broad computational savings for inference in
hidden Markov models, for example. (Note that a chain which depends
on M ≥ 1 previous time indices is known as an M-th order Markov chain,
but in this book we will generally only consider the case where M = 1,
because any higher-order chain can always be rewritten as a 1st-order
chain by the choice of an appropriate state space). Markov chains are
illustrated in Figure 1.12.

Fig. 1.12: Markov (left) and non-Markov (right) chains. For a Markov chain,
the distribution of the random variable at time t, X_t, depends only upon
the value of the previous random variable X_{t−1}; this is true for the
chain on the left. The chain on the right is non-Markov because X_3 depends
upon X_1, and X_{T−1} depends upon an ‘external’ random variable Y.

A further simplification occurs if the temporal dependency is time
translation invariant:
f (Xt+1 = xt+1 |Xt = xt ) = f (X2 = x2 |X1 = x1 ) (1.53)
for all t. Such Markov processes are strongly stationary. Now, using the
basic properties of probability we can find the probability distribution
over the state at time t + 1:

Σ_{x′} f(X_{t+1} = x_{t+1} | X_t = x′) f(X_t = x′) = Σ_{x′} f(X_{t+1} = x_{t+1}, X_t = x′)
                                                   = f(X_{t+1} = x_{t+1})    (1.54)

In words: the unconditional probability over the state at time t + 1 is


obtained by marginalizing out the probability over time t from the joint
distribution of the state at time t + 1 and t. But, by definition of the
transition distribution, the joint distribution is just the product of the
transition distribution times the unconditional distribution at time t.
For a chain over a finite state space of size K, we can always re-
label each state with a natural number Ω = {1, 2 . . . K}. This allows us
to construct a very computationally convenient notation for any finite

discrete-state (strongly stationary) chain:

pij = f (X2 = i |X1 = j ) (1.55)


The K × K matrix P is called the transition matrix of the chain.
In order to satisfy the axioms of probability, this matrix must have
non-negative entries, and it must be the case that Σ_i p_ij = 1 for all
j = 1, 2, ..., K. These properties make it a (left) stochastic matrix. This
matrix represents the conditional probabilities of the chain taking one
‘step’ from t to t + 1 and furthermore, P^s is the transition matrix taking
the state of the chain multiple steps from t to index t + s, as we now
show. Letting the K-element vector p^t contain the PMF for X_t, i.e.
p_i^t = f(X_t = i), then equation (1.54) can be written in component form:

p_i^{t+1} = Σ_j p_ij p_j^t = (P p^t)_i    (1.56)

or in matrix-vector form:

p^{t+1} = P p^t    (1.57)


In other words, the PMF at time index t + 1 can be found by multi-
plying the PMF at time index t by the transition matrix. Now, equa-
tion (1.57) is just a linear recurrence system whose solution is simply
p^{t+1} = P^t p^1, or more generally, p^{t+s} = P^s p^t. This illustrates how the
distributional properties of stationary Markov chains over discrete state
spaces can be found using tools from matrix algebra.
We will generally be very interested in stationary distributions q of
strongly stationary chains, which are invariant under application of the
transition matrix, so that Pq = q. These occur as limiting PMFs, i.e.
pt → q as t → ∞ from some starting PMF p1 . Finding a stationary
distribution amounts to solving an eigenvector problem with eigenvalue
1: the corresponding eigenvector will have strictly positive entries and
sum to 1.
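A sketch of this eigenvector computation for a small, made-up transition matrix (numpy assumed):

```python
# Sketch: stationary distribution q of a left-stochastic transition matrix P
# (columns sum to one), found as the eigenvector with eigenvalue 1.
import numpy as np

P = np.array([[0.9, 0.2],
              [0.1, 0.8]])                  # column j holds f(X_{t+1} = i | X_t = j)

eigenvalues, eigenvectors = np.linalg.eig(P)
k = np.argmin(np.abs(eigenvalues - 1.0))    # eigenvalue closest to 1
q = np.real(eigenvectors[:, k])
q = q / q.sum()                             # normalize to a probability vector

print(q)                                    # [2/3, 1/3]
print(np.allclose(P @ q, q))                # invariant: P q = q
```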
Whether there are multiple stationary distributions, just one, or none,
is a key question to answer of any chain. If there is one, and only one,
limiting distribution, it is known as the equilibrium distribution µ, and
pt → µ from all possible starting PMFs, effectively “forgetting” the
initial state p1 . This property is crucial to many applications in machine
learning. Two important concepts which help to establish this property
for a chain are irreducibility: that is, whether it is possible to get from
any state to any other state, taking a finite number of transitions, and
aperiodicity: the chain does not get stuck in a repeating “cycle” of states.
Formally, for an irreducible chain, for every pair of states i, j ∈ {1, 2, ..., K}
there must be some number of steps s > 0 for which:

(P^s)_ij > 0    (1.58)


There are several equivalent ways to define aperiodicity, here is one.
For all i ∈ {1, 2, ..., K}:

gcd {s : (P^s)_ii > 0} = 1    (1.59)


where gcd is the greatest common divisor. This definition implies that
the chain returns to the same state i at irregular times.
Another very important class of chains are those which satisfy the
detailed balance condition, in component form:

pij qj = pji qi (1.60)


for some distribution over states q with q_i = f(X = i). Ex-
panding this out, we get:

f(X = i | X′ = j) f(X′ = j) = f(X = j | X′ = i) f(X′ = i)
f(X = i, X′ = j) = f(X = j, X′ = i)    (1.61)

which implies that pij = pji , in other words, these chains have symmetric
transition matrices, P = PT , which are doubly stochastic in that both
the columns and rows are normalized. Interestingly, the distribution q
is also a stationary distribution of the chain because:
Σ_j p_ij q_j = Σ_j p_ji q_i = q_i Σ_j p_ji = q_i    (1.62)

This symmetry also implies that such chains are reversible if in the
stationary state, i.e. pt = q:

f (Xt+1 = i |Xt = j ) f (Xt = j) = f (Xt+1 = j |Xt = i ) f (Xt = i)


f (Xt+1 = i, Xt = j) = f (Xt = i, Xt+1 = j) (1.63)

So, the roles of Xt+1 and Xt in the joint distribution of states at


adjacent indices can be swapped, and this means that the f.d.d.s, and
hence all the distributional properties of the chain, are invariant to time
reversal.
In the case of continuous state spaces where Ω = R, the probability of
each time index is a PDF, f (Xt = xt ), and the transition probabilities
are conditional PDFs, f (Xt+1 = xt+1 |Xt = xt ). For example, consider
the following linear stochastic recurrence relation:

X_{t+1} = a X_t + ε_t    (1.64)


with X_1 = x_1, a is an arbitrary real scalar, and ε_t are zero-mean, i.i.d.
Gaussian random variables with standard deviation σ. This is a discrete-
time Markov process with the following transition PDF:

f(X_{t+1} = x_{t+1} | X_t = x_t) = N(x_{t+1}; a x_t, σ²)    (1.65)




i.e. a Gaussian with mean a x_t and standard deviation σ. This im-
portant example of a Gaussian Markov chain is known as an AR(1 ) or

first-order autoregressive process; more general AR(q) processes defined


by chains of higher order, are ubiquitous in signal processing and time
series analysis and will be discussed in detail in Chapter 7.
Although we can no longer use finite matrix algebra in order to predict
probabilistic properties of chains with continuous state spaces, much of
the theory developed above for finite state chains applies with the obvi-
ous modifications. In the case of continuous sample space, the equivalent
to equation (1.54) for the PDF of the next time index is:

f(X_{t+1} = x_{t+1}) = ∫ f(X_{t+1} = x_{t+1} | X_t = x′) f(X_t = x′) dx′    (1.66)

so that, if it exists, an invariant PDF satisfies:

f(X = x) = ∫ f(X_{t+1} = x | X_t = x′) f(X = x′) dx′    (1.67)

Detailed balance requires:

f(X = x | X′ = x′) f(X′ = x′) = f(X = x′ | X′ = x) f(X′ = x)    (1.68)

Indeed, for the AR(1) case described above, it can be shown that the chain
is reversible with Gaussian stationary distribution f(x) = N(x; 0, σ²/(1 − a²)).
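A simulation sketch of this AR(1) chain (arbitrary a and σ; numpy assumed), comparing the long-run sample variance with the stationary value σ²/(1 − a²):

```python
# Sketch: simulate X_{t+1} = a X_t + eps_t and compare the sample variance
# with the stationary variance sigma^2 / (1 - a^2).
import numpy as np

rng = np.random.default_rng(5)
a, sigma, T = 0.8, 1.0, 200_000

x = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = a * x[t] + rng.normal(0.0, sigma)

print(x[1000:].var())             # sample variance after a burn-in period
print(sigma**2 / (1 - a**2))      # theoretical value, approx. 2.78
```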
For further reading, Grimmett and Stirzaker (2001) is a very comprehensive
introduction to probability and stochastic processes, whereas Kemeny and
Snell (1976) contains highly readable coverage of the basics of finite,
stationary Markov chains.

1.5 Data compression and information theory

Much of mathematical and statistical modelling of the world can be
considered a form of data compression — representing measured physical
phenomena in a form that is more compact than the ‘size’ of the recorded
signal obtained from the real world. There are many open questions
about exactly how to measure the size of a mathematical model and a
set of measured data. For example, the Kolmogorov complexity, K[S], is
the size (number of instructions or number of symbols) of the smallest
computer program that can reproduce the sequence of symbols S making
up a digital representation of the measurement. This measure of size of
a data set has the desirable property that it is invariant to the choice
of computer programming language used (up to a constant).

Fig. 1.13: The (Shannon) information map (top) and entropy (bottom). The
information map −ln(P) maps probabilities in the range [0, 1] to entropies
over [+∞, 0]. Entropy is the expected information of a random variable,
H[X] = E[−ln f(X)], which is shown here for the Bernoulli random variable
with PMF f(x) = p^x (1 − p)^{1−x} for x ∈ {0, 1}, as a function of p. Since
the Bernoulli distribution is uniform for p = ½ (f(x) = ½), it follows that
the entropy is maximized at that value, which is clear from the plot.
For instance, the algorithmic complexity of the sequence S1 =
(1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3) is (at most) just the length
of the program which outputs the sub-sequence (1, 3) nine times.

Whereas, the Kolmogorov complexity of the sequence S2 =


(8, 2, 7, 8, 2, 1, 8, 2, 6, 8, 2, 8, 8, 2, 3, 8, 2, 9), which has the same length as
S1 , does not, apparently, have such an obviously simple program which
can reproduce it (although we can certainly imagine simple schemes
which could encode this sequence more compactly than the length of
S2 ). Hence, we can hypothesize that K [S1 ] < K [S2 ] i.e. that the
complexity of S2 is greater than that of S1.
We can always say that an upper bound (but not a ‘tight’ bound)
on the complexity is K [S] ≤ |S| + C, i.e. the length of S plus some
(small) constant C, because a program which simply has a table from
which it retrieves the items from the sequence in the correct order, can
reproduce any sequence. This upper bound program has to contain the
entire sequence (hence the |S| part), and the program which retrieves the
sequence in order (the C part). We can also exhibit programs which are
too simple: these are short but do not exactly reproduce the sequence
(for example, for S2 above, the program which outputs the sub-sequence
(8, 2, 4) six times has complexity less than |S|, but it gets every third el-
ement of the sequence wrong). These ‘lossy’ programs can serve as lower
bounds. The actual K [S] lies between those bounds somewhere. The
main difficulty is that finding K [S] is non-computable, that is, we can
prove that it is impossible, in general, to write a program which, given
any sequence, can output the Kolmogorov complexity of that sequence
(Cover and Thomas, 2006, page 163). This means that for any sequence
and a program which reproduces it, we can only ever be sure that we
have an upper bound for K [S].
A related quantity which is computable, applicable to random vari-
ables, is the Shannon information map − ln P for the probability value
P . This is the foundation of the (somewhat poorly-named) discipline
of information theory. Since P ∈ [0, 1], the corresponding information
ranges over [+∞, 0] (Figure 1.13). Entropy measures the average infor-
mation contained in the distribution of a random variable. The Shannon
entropy of a random variable X, defined on the sample space Ω
with PMF or PDF f(x), is:

H[X] = E[−ln f(X)]    (1.69)

For discrete random variables this is H[X] = −Σ_{x∈X(Ω)} f(x) ln f(x),
whereas for continuous variables this becomes
H[X] = −∫_Ω f(x) ln f(x) dx. In the continuous case, H[X] is known
as the differential entropy. It is important to note that while this is
indeed an expectation of a function of the random variable, the expec-
tation function is the distribution function of the variable (Figure 1.13).
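A short sketch computing the Shannon entropy of the Bernoulli distribution plotted in Figure 1.13 (illustrative; numpy assumed):

```python
# Sketch: Shannon entropy H[X] = E[-ln f(X)] of a Bernoulli(p) variable.
import numpy as np

def bernoulli_entropy(p):
    f = np.array([1.0 - p, p])              # the PMF over {0, 1}
    return -np.sum(f * np.log(f))

for p in [0.1, 0.5, 0.9]:
    print(p, bernoulli_entropy(p))          # maximized at p = 0.5 (ln 2 ~ 0.693)
```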
To date there are some reasonable controversies about the unique
axiomatic foundation of information theory, but the following axioms
are necessary and sufficient to define Shannon entropy:
(1) Permutation invariance: The events can be relabeled without chang-
ing H [X],
(2) Finite maximum: Of all distributions on a fixed sample space,
H [X] is maximized by the uniform distribution (for the discrete


case), or the Gaussian distribution (for the continuous case),
(3) Non-informative new impossible events: Extending a random vari-
able by including additional events that have zero probability mea-
sure does not change H [X],
(4) Additivity: H [X, Y ] = H [X] + EX [H [Y |X ]], in words: the en-
tropy of a joint distribution is the sum of the entropy of X, plus
the average entropy of Y conditioned on X.

Note that axiom 4 above implies that if X, Y are independent, then


H [X, Y ] = H [X] + H [Y ], i.e. the entropy of two independent random
variables is the sum of the entropies of the individual variables. Axiom 4
can, for example, be replaced by this weaker statement, in which
case the Shannon entropy is not the only entropy that satisfies all the
axioms. However, the Shannon entropy is the only entropy that behaves
in a consistent manner for conditional probabilities:

P(X | Y) = P(X, Y) / P(Y)  =⇒  H[X | Y] = H[X, Y] − H[Y]    (1.70)
This is intuitive: it states that the information in X given Y , is what
we get by subtracting the information in Y from the joint information
contained in X and Y simultaneously. For this reason we will generally
work with the Shannon entropy in this book, so that by referring to
‘entropy’ we mean Shannon entropy.
For random processes, Kolmogorov complexity and entropy are inti-
mately related. Consider a discrete-time, i.i.d. process Xn on the dis-
crete, finite set Ω. Also, denote by K [S|N ] the conditional Kolmogorov
complexity where the length of the sequence N is assumed known to the
program. We can show that as N → ∞ (Cover and Thomas, 2006, page
154):

E [K [(X1 , . . . , XN ) |N ]] → N H2 [X] (1.71)


where H2 [X] is the entropy with the logarithm taken to base 2 in order
to measure it in bits. In other words, the expected (conditional) Kol-
mogorov complexity of an i.i.d. sequence, converges to the total entropy
of the sequence. This remarkable connection arises because Shannon
entropy can also be defined using a universally good coding scheme for
all i.i.d. random sequences (for example, Huffman coding which cre-
ates a uniquely decodable sequence of bits), that is, a way of construct-
ing compressed representations of realizations of any random variable.
Shannon’s famous source-coding theorem shows that the number of bits
in that (universal) compressed representation lies in between N H2 [X]
and N H2 [X] + N (Cover and Thomas, 2006, page 88).
Another way of expressing this compression is that given |Ω| ≤ 2^L,
then the integer L is the number of bits needed to represent all members
of the sample space Ω. So, it would take N × L bits to represent the

sequence S in an uncompressed binary representation. By contrast,


using entropy coding, then since H2 [X] ≤ log2 |Ω| and provided the
distribution of X is non-uniform, we can achieve a reduction in the
number of bits required to represent S without loss. The entropy tells
us exactly how much compression we can achieve on average in the best
case.
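A small sketch of this comparison (the PMF below is made up; numpy assumed): the fixed-length code costs L = log₂|Ω| bits per symbol, while entropy coding approaches H₂[X] bits per symbol on average.

```python
# Sketch: entropy (bits per symbol) versus the naive fixed-length code for a
# non-uniform PMF over |Omega| = 4 outcomes.
import numpy as np

f = np.array([0.7, 0.15, 0.1, 0.05])        # made-up, non-uniform PMF
H2 = -np.sum(f * np.log2(f))                # entropy in bits per symbol

N = 1000                                    # length of the i.i.d. sequence
print(N * np.log2(len(f)))                  # uncompressed: N * L = 2000 bits
print(N * H2)                               # achievable on average: about 1320 bits
```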

The importance of the information map


Why is information theory and data compression so important to ma-
chine learning and statistical signal processing? The most fundamental
is that, in logarithmic base 2, the information map − log2 f (x) of a
distribution over a random sequence X = (X1 , . . . , XN ), quantifies the
expected length of the compressed sequence, as follows. For an i.i.d.
sequence of length N, over the discrete set with M = |Ω| elements, if
each outcome has probability p_i, i ∈ {1, ..., M}, then outcome i will
be expected to occur N pi times. The distribution of the sequence is:
f(x) = Π_{n=1}^{N} p_{x_n} ≈ Π_{i=1}^{M} p_i^{N p_i}    (1.72)

Now, applying the information map:

−log_2 f(x) ≈ −Σ_{i=1}^{M} N p_i log_2 p_i = −N Σ_{i=1}^{M} p_i log_2 p_i = N H[X]
which approximately coincides with both the total entropy, and the ex-
pected Kolmogorov complexity, asymptotically. The exact nature of this
approximation is made rigorous through the concepts of typical sets and
the asymptotic equipartition principle (Cover and Thomas, 2006, chap-
ter 3). So, the information map gets us directly from the probabilistic
to the data compression modeling points of view in machine learning.
A key algebraic observation here is that applying the information map
− ln P (X) has the effect of converting multiplicative probability calcu-
lations into additive information calculations. For conditional probabil-
ities:

−ln P(X | Y) = −ln (P(X, Y) / P(Y)) = −ln P(X, Y) + ln P(Y)    (1.73)
and for independent random variables:

− ln P (X, Y ) = − ln (P (X) P (Y )) = − ln P (X) − ln P (Y ) (1.74)

This mapping has computational advantages. Products of many prob-


abilities can become unmanageably small and we run into difficulties
practically my own. I could be as solitary as a hermit if I so desired;
on the other hand, I had but to give the order, and five hundred
men would cater for my amusement. It seems therefore the more
unfortunate that to this pleasant arrangement I should have to
attribute the calamities which it is the purpose of this series of
stories to narrate.
On the third morning of my stay I woke early. When I had
examined my watch I discovered that it wanted an hour of daylight,
and, not feeling inclined to go to sleep again, I wondered how I
should employ my time until my servant should bring me my chota
hazri, or early breakfast. On proceeding to my window I found a
perfect morning, the stars still shining, though in the east they were
paling before the approach of dawn. It was difficult to realize that in
a few hours the earth which now looked so cool and wholesome
would be lying, burnt up and quivering, beneath the blazing Indian
sun.
I stood and watched the picture presented to me for some
minutes, until an overwhelming desire came over me to order a
horse and go for a long ride before the sun should make his
appearance above the jungle trees. The temptation was more than I
could resist, so I crossed the room and, opening the door, woke my
servant, who was sleeping in the ante-chamber. Having bidden him
find a groom and have a horse saddled for me, without rousing the
household, I returned and commenced my toilet. Then, descending
by a private staircase to the great courtyard, I mounted the animal I
found awaiting me there, and set off.
Leaving the city behind me I made my way over the new bridge
with which His Highness has spanned the river, and, crossing the
plain, headed towards the jungle, that rises like a green wall upon
the other side. My horse was a waler of exceptional excellence, as
every one who knows the Maharajah's stable will readily understand,
and I was just in the humor for a ride. But the coolness was not
destined to last long, for by the time I had left the second village
behind me, the stars had given place to the faint grey light of dawn.
A soft, breeze stirred the palms and rustled the long grass, but its
freshness was deceptive; the sun would be up almost before I could
look round, and then nothing could save us from a scorching day.
After I had been riding for nearly an hour it struck me that, if I
wished to be back in time for breakfast, I had better think of
returning. At the time I was standing in the center of a small plain,
surrounded by jungle. Behind me was the path I had followed to
reach the place; in front, and to the right and left, others leading
whither I could not tell. Having no desire to return by the road I had
come, I touched up my horse and cantered off in an easterly
direction, feeling certain that even if I had to make a divergence, I
should reach the city without very much trouble.
By the time I had put three miles or so behind me the heat had
become stifling, the path being completely shut in on either side by
the densest jungle I have ever known. For all I could see to the
contrary, I might have been a hundred miles from any habitation.
Imagine my astonishment, therefore, when, on turning a corner
of the track, I suddenly left the jungle behind me, and found myself
standing on the top of a stupendous cliff, looking down upon a lake
of blue water. In the center of this lake was an island, and on the
island a house. At the distance I was from it the latter appeared to
be built of white marble, as indeed I afterward found to be the case.
Anything, however, more lovely than the effect produced by the blue
water, the white building, and the jungle-clad hills upon the other
side, can scarcely be imagined. I stood and gazed at it in delighted
amazement. Of all the beautiful places I had hitherto seen in India
this, I could honestly say, was entitled to rank first. But how it was
to benefit me in my present situation I could not for the life of me
understand.
Ten minutes later I had discovered a guide, and also a path
down the cliff to the shore, where, I was assured, a boat and a man
could be obtained to transport me to the palace. I therefore bade
my informant precede me, and after some minutes' anxious
scrambling my horse and I reached the water's edge.
Once there, the boatman was soon brought to light, and, when
I had resigned my horse to the care of my guide, I was rowed
across to the mysterious residence in question.
On reaching it we drew up at some steps leading to a broad
stone esplanade, which, I could see, encircled the entire place. Out
of a grove of trees rose the building itself, a confused jumble of
Eastern architecture crowned with many towers. With the exception
of the vegetation and the blue sky, everything was of a dazzling
white, against which the dark green of palms contrasted with
admirable effect.
Springing from the boat I made my way up the steps, imbued
with much the same feeling of curiosity as the happy Prince, so
familiar to us in our nursery days, must have experienced when he
found the enchanted castle in the forest. As I reached the top, to my
unqualified astonishment, an English man-servant appeared through
a gate-way and bowed before me.
"Breakfast is served," he said, "and my master bids me say that
he waits to receive your lordship."
Though I thought he must be making a mistake, I said nothing,
but followed him along a terrace, through a magnificent gateway, on
the top of which a peacock was preening himself in the sunlight,
through court after court, all built of the same white marble, through
a garden in which a fountain was playing to the rustling
accompaniment of pipal and pomegranate leaves, to finally enter the
veranda of the main building itself.
Drawing aside the curtain which covered the finely-carved
doorway, the servant invited me to enter, and as I did so announced
"His Excellency the Viceroy."
The change from the vivid whiteness of the marble outside to
the cool semi-European room in which I now found myself was
almost disconcerting in its abruptness. Indeed, I had scarcely time to
recover my presence of mind before I became aware that my host
was standing before me. Another surprise was in store for me. I had
expected to find a native, instead of which he proved to be an
Englishman.
"I am more indebted than I can say to your Excellency for the
honor of this visit," he began, as he extended his hand. "I can only
wish I were better prepared for it."
"You must not say that," I answered. "It is I who should
apologize. I fear I am an intruder. But to tell you the truth I had lost
my way, and it is only by chance that I am here at all. I was foolish
to venture out without a guide, and have none to blame for what
has occurred but myself."
"In this case I must thank the Fates for their kindness to me,"
returned my host. "But don't let me keep you standing. You must be
both tired and hungry after your long ride, and breakfast, as you
see, is upon the table. Shall we show ourselves sufficiently blind to
the conventionalities to sit down to it without further preliminaries?"
Upon my assenting he struck a small gong at his side, and
servants, acting under the instructions of the white man who had
conducted me to his master's presence, instantly appeared in
answer to it. We took our places at the table, and the meal
immediately commenced.
While it was in progress I was permitted an excellent
opportunity of studying my host, who sat opposite me, with such
light as penetrated the jhilmills falling directly upon his face. I doubt,
however, vividly as my memory recalls the scene, whether I can give
you an adequate description of the man who has since come to be a
sort of nightmare to me.
In height he could not have been more than five feet two. His
shoulders were broad, and would have been evidence of
considerable strength but for one malformation, which completely
spoilt his whole appearance. The poor fellow suffered from curvature
of the spine of the worst sort, and the large hump between his
shoulders produced a most extraordinary effect. But it is when I
endeavor to describe his face that I find myself confronted with the
most serious difficulty.
How to make you realize it I hardly know.
To begin with, I do not think I should be overstepping the mark
were I to say that it was one of the most beautiful countenances I
have ever seen in my fellow-men. Its contour was as perfect as that
of the bust of the Greek god Hermes, to whom, all things
considered, it is only fit and proper he should bear some
resemblance. The forehead was broad, and surmounted with a
wealth of dark hair, in color almost black. His eyes were large and
dreamy, the brows almost pencilled in their delicacy; the nose, the
most prominent feature of his face, reminded me more of that of the
great Napoleon than any other I can recall.
His mouth was small but firm, his ears as tiny as those of an
English beauty, and set in closer to his head than is usual with those
organs. But it was his chin that fascinated me most. It was plainly
that of a man accustomed to command; that of a man of iron will
whom no amount of opposition would deter from his purpose. His
hands were small and delicate, and his fingers taper, plainly those of
the artist, either a painter or a musician. Altogether he presented a
unique appearance, and one that once seen would not be easily
forgotten.
During the meal I congratulated him upon the possession of
such a beautiful residence, the like of which I had never seen
before.
"Unfortunately," he answered, "the place does not belong to
me, but is the property of our mutual host, the Maharajah. His
Highness, knowing that I am a scholar and a recluse, is kind enough
to permit me the use of this portion of the palace; and the value of
such a privilege I must leave you to imagine."
"You are a student, then?" I said, as I began to understand
matters a little more clearly.
"In a perfunctory sort of way," he replied. "That is to say, I have
acquired sufficient knowledge to be aware of my own ignorance."
I ventured to inquire the subject in which he took most interest.
It proved to be china and the native art of India, and on these two
topics we conversed for upwards of half-an-hour. It was evident that
he was a consummate master of his subject. This I could the more
readily understand when, our meal being finished, he led me into an
adjoining room, in which stood the cabinets containing his treasures.
Such a collection I had never seen before. Its size and completeness
amazed me.
"But surely you have not brought all these specimens together
yourself?" I asked in astonishment.
"With a few exceptions," he answered. "You see it has been the
hobby of my life. And it is to the fact that I am now engaged upon a
book upon the subject, which I hope to have published in England
next year, that you may attribute my playing the hermit here."
"You intend, then, to visit England?"
"If my book is finished in time," he answered, "I shall be in
London at the end of April or the commencement of May. Who
would not wish to be in the chief city of Her Majesty's dominions
upon such a joyous and auspicious occasion?"
As he said this he took down a small vase from a shelf, and, as
if to change the subject, described its history and its beauties to me.
A stranger picture than he presented at that moment it would be
difficult to imagine. His long fingers held his treasure as carefully as
if it were an invaluable jewel, his eyes glistened with the fire of the
true collector, who is born but never made, and when he came to
that part of his narrative which described the long hunt for, and the
eventual purchase of, the ornament in question, his voice fairly
shook with excitement. I was more interested than at any other time
I should have thought possible, and it was then that I committed the
most foolish action of my life. Quite carried away by his charm I
said:
"I hope when you do come to London, you will permit me to be
of any service I can to you."
"I thank you," he answered gravely, "our lordship is very kind,
and if the occasion arises, as I hope it will, I shall most certainly
avail myself of your offer."
"We shall be very pleased to see you," I replied; "and now, if
you will not consider me inquisitive, may I ask if you live in this great
place alone?"
"With the exception of my servants I have no companions."
"Really! You must surely find it very lonely?"
"I do, and it is that very solitude which endears it to me. When
His Highness so kindly offered me the place for a residence, I
inquired if I should have much company. He replied that I might
remain here twenty years and never see a soul unless I chose to do
so. On hearing that I accepted his offer with alacrity."
"Then you prefer the life of a hermit to mixing with your fellow-
men?"
"I do. But next year I shall put off my monastic habits for a few
months, and mix with my fellow-men, as you call them, in London."
"You will find hearty welcome, I am sure."
"It is very kind of you to say so; I hope I shall. But I am
forgetting the rules of hospitality. You are a great smoker, I have
heard. Let me offer you a cigar."
As he spoke he took a small silver whistle from his pocket, and
blew a peculiar note upon it. A moment later the same English
servant who had conducted me to his presence, entered, carrying a
number of cigar boxes upon a tray. I chose one, and as I did so
glanced at the man. In outward appearance he was exactly what a
body servant should be, of medium height, scrupulously neat, clean
shaven, and with a face as devoid of expression as a blank wall.
When he had left the room again my host immediately turned to me.
"Now," he said, "as you have seen my collection, will you like to
explore the palace?"
To this proposition I gladly assented, and we set off together.
An hour later, satiated with the beauty of what I had seen, and
feeling as if I had known the man beside me all my life, I bade him
good-bye upon the steps and prepared to return to the spot where
my horse was waiting for me.
"One of my servants will accompany you," he said, "and will
conduct you to the city."
"I am greatly indebted to you," I answered. "Should I not see
you before, I hope you will not forget your promise to call upon me
either in Calcutta, before we leave, or in London next year." He
smiled in a peculiar way.
"You must not think me so blind to my own interests as to
forget your kind offer," he replied. "It is just possible, however, that I
may be in Calcutta before you leave."
"I shall hope to see you then," I said, and having shaken him by
the hand, stepped into the boat which was waiting to convey me
across.
Within an hour I was back once more to the palace, much to
the satisfaction of the Maharajah and my staff, to whom my absence
had been the cause of considerable anxiety.
It was not until the evening that I found a convenient
opportunity, and was able to question His Highness about his
strange protégé. He quickly told me all there was to know about
him. His name, it appeared, was Simon Carne. He was an
Englishman and had been a great traveller. On a certain memorable
occasion he had saved His Highness' life at the risk of his own, and
ever since that time a close intimacy had existed between them. For
upwards of three years the man in question had occupied a wing of
the island palace, going away for months at a time presumably in
search of specimens for his collection, and returning when he
became tired of the world. To the best of His Highness' belief he was
exceedingly wealthy, but on this subject little was known. Such was
all I could learn about the mysterious individual I had met earlier in
the day.
Much as I wanted to do so, I was unable to pay another visit to
the palace on the lake. Owing to pressing business, I was compelled
to return to Calcutta as quickly as possible. For this reason it was
nearly eight months before I saw or heard anything of Simon Carne
again. When I did meet him we were in the midst of our
preparations for returning to England. I had been for a ride, I
remember, and was in the act of dismounting from my horse, when
an individual came down the steps and strolled towards me. I
recognized him instantly as the man in whom I had been so much
interested in Malar-Kadir. He was now dressed in fashionable
European attire, but there was no mistaking his face. I held out my
hand.
"How do you do, Mr. Carne?" I cried. "This is an unexpected
pleasure. Pray how long have you been in Calcutta?"
"I arrived last night," he answered, "and leave to-morrow
morning for Burma. You see, I have taken your Excellency at your
word."
"I am very pleased to see you," I replied. "I have the liveliest
recollection of your kindness to me the day that I lost my way in the
jungle. As you are leaving so soon, I fear we shall not have the
pleasure of seeing much of you, but possibly you can dine with us
this evening?"
"I shall be very glad to do so," he answered simply, watching
me with his wonderful eyes, which somehow always reminded me of
those of a collie.
"Her ladyship is devoted to Indian pottery and brass work," I
said, "and she would never forgive me if I did not give her an
opportunity of consulting you upon her collection."
"I shall be very proud to assist in any way I can," he answered.
"Very well, then, we shall meet at eight. Good-bye."
That evening we had the pleasure of his society at dinner, and I
am prepared to state that a more interesting guest has never sat at
a vice-regal table. My wife and daughters fell under his spell as
quickly as I had done. Indeed, the former told me afterwards that
she considered him the most uncommon man she had met during
her residence in the East, an admission scarcely complimentary to
the numerous important members of my council who all prided
themselves upon their originality. When he said good-bye we had
extorted his promise to call upon us in London, and I gathered later
that my wife was prepared to make a lion of him when he should
put in an appearance.
How he did arrive in London during the first week of the
following May; how it became known that he had taken Porchester
House, which, as every one knows, stands at the corner of Belverton
Street and Park Lane, for the season, at an enormous rental; how he
furnished it superbly, brought an army of Indian servants to wait
upon him, and was prepared to astonish the town with his
entertainments, are matters of history. I welcomed him to England,
and he dined with us on the night following his arrival, and thus it
was that we became, in a manner of speaking, his sponsors in
Society. When one looks back on that time, and remembers how
vigorously, even in the midst of all that season's gaiety, our social
world took him up, the fuss that was made of him, the manner in
which his doings were chronicled by the Press, it is indeed hard to
realize how egregiously we were all being deceived.
During the months of June and July he was to be met at every
house of distinction. Even royalty permitted itself to become on
friendly terms with him, while it was rumored that no fewer than
three of the proudest beauties in England were prepared at any
moment to accept his offer of marriage. To have been a social lion
during such a brilliant season, to have been able to afford one of the
most perfect residences in our great city, and to have written a book
which the foremost authorities upon the subject declare a
masterpiece, are things of which any man might be proud. And yet
this was exactly what Simon Carne was and did.
And now, having described his advent among us, I must refer to
the greatest excitement of all that year. Unique as was the occasion
which prompted the gaiety of London, constant as were the arrivals
and departures of illustrious folk, marvelous as were the social
functions, and enormous the amount of money expended, it is
strange that the things which attracted the most attention should be
neither royal, social, nor political.
As may be imagined, I am referring to the enormous robberies
and swindles which will forever be associated with that memorable
year. Day after day, for weeks at a time, the Press chronicled a series
of crimes, the like of which the oldest Englishman could not
remember. It soon became evident that they were the work of one
person, and that that person was a master hand was as certain as
his success.
At first the police were positive that the depredations were
conducted by a foreign gang, located somewhere in North London,
and that they would soon be able to put their fingers on the culprits.
But they were speedily undeceived. In spite of their efforts the
burglaries continued with painful regularity. Hardly a prominent
person escaped. My friend Lord Orpington was despoiled of his
priceless gold and silver plate; my cousin, the Duchess of Wiltshire,
lost her world-famous diamonds; the Earl of Calingforth his race-
horse "Vulcanite;" and others of my friends were despoiled of their
choicest possessions. How it was that I escaped I can understand
now, but I must confess that it passed my comprehension at the
time.
Throughout the season Simon Carne and I scarcely spent a day
apart. His society was like chloral; the more I took of it the more I
wanted. And I am now told that others were affected in the same
way. I used to flatter myself that it was to my endeavors he owed
his social success, and I can only, in justice, say that he tried to
prove himself grateful. I have his portrait hanging in my library now,
painted by a famous Academician, with this inscription upon the
lozenge at the base of the frame:
"To my kind friend, the Earl of Amberley, in remembrance of a
happy and prosperous visit to London, from Simon Carne."
The portrait represents him standing before a book-case in a
half-dark room. His extraordinary face, with its dark penetrating
eyes, is instinct with life, while his lips seem as if opening to speak.
To my thinking it would have been a better picture had he not been
standing in such a way that the light accentuated his deformity; but
it appears that this was the sitter's own desire, thus confirming
what, on many occasions, I had felt compelled to believe, namely,
that he was, for some peculiar reason, proud of his misfortune.
It was at the end of the Cowes week that we parted company.
He had been racing his yacht the Unknown Quantity, and, as if not
satisfied with having won the Derby, must needs appropriate the
Queen's Cup. It was on the day following that now famous race that
half the leaders of London Society bade him farewell on the deck of
the steam yacht that was to carry him back to India.
A month later, and quite by chance, the dreadful truth came
out. Then it was discovered that the man of whom we had all been
making so much fuss, the man whom royalty had condescended to
treat almost as a friend, was neither more nor less than a Prince of
Swindlers, who had been utilizing his splendid opportunities to the
very best advantage.
Every one will remember the excitement which followed the first
disclosure of this dreadful secret and the others which followed it. As
fresh discoveries came to light, the popular interest became more
and more intense, while the public's wonderment at the man's
almost superhuman cleverness waxed every day greater than before.
My position, as you may suppose was not an enviable one. I saw
how cleverly I had been duped, and when my friends, who had most
of them, suffered from his talents, congratulated me on my
immunity, I could only console myself with the reflection that I was
responsible for more than half the acquaintances the wretch had
made. But, deeply as I was drinking of the cup of sorrow, I had not
come to the bottom of it yet.
One Saturday evening--the 7th of November, if I recollect
aright--I was sitting in my library, writing letters after dinner, when I
heard the postman come round the square and finally ascend the
steps of my house. A few moments later a footman entered bearing
some letters, and a large packet, upon a salver. Having read the
former, I cut the string which bound the parcel, and opened it.
To my surprise, it contained a bundle of manuscript and a letter.
The former I put aside, while I broke open the envelope and
extracted its contents. To my horror, it was from Simon Carne, and
ran as follows:

"On the High Seas.

MY DEAR LORD AMBERLEY,--

"It is only reasonable to suppose that by this time you have become
acquainted with the nature of the peculiar services you have
rendered me. I am your debtor for as pleasant, and, at the same
time, as profitable a visit to London as any man could desire. In
order that you may not think me ungrateful, I will ask you to accept
the accompanying narrative of my adventures in your great
metropolis. Since I have placed myself beyond the reach of capture,
I will permit you to make any use of it you please. Doubtless you will
blame me, but you must at least do me the justice to remember
that, in spite of the splendid opportunities you permitted me, I
invariably spared yourself and family. You will think me mad thus to
betray myself, but, believe me, I have taken the greatest precautions
against discovery, and as I am proud of my London exploits, I have
not the least desire to hide my light beneath a bushel.
"With kind regards to Lady Amberley and yourself,
"I am, yours very sincerely,
"SIMON CARNE."

Needless to say I did not retire to rest before I had read the
manuscript through from beginning to end, with the result that the
morning following I communicated with the police. They were
hopeful that they might be able to discover the place where the
packet had been posted, but after considerable search it was found
that it had been handed by a captain of a yacht, name unknown, to
the commander of a homeward bound brig, off Finisterre, for
postage in Plymouth. The narrative, as you will observe, is written in
the third person, and, as far as I can gather, the handwriting is not
that of Simon Carne. As, however, the details of each individual
swindle coincide exactly with the facts as ascertained by the police,
there can be no doubt of their authenticity.
A year has now elapsed since my receipt of the packet. During
that time the police of almost every civilized country have been on
the alert to effect the capture of my whilom friend, but without
success. Whether his yacht sank and conveyed him to the bottom of
the ocean, or whether, as I suspect, she only carried him to a certain
part of the seas where he changed into another vessel and so
eluded justice, I cannot say. Even the Maharajah of Malar-Kadir has
heard nothing of him since. The fact, however, remains, I have,
innocently enough, compounded a series of felonies, and, as I said
at the commencement of this preface, the publication of the
narrative I have so strangely received is intended to be, as far as
possible, my excuse.
CHAPTER II.
THE DEN OF INIQUITY.

The night was close and muggy, such a night, indeed, as only
Calcutta, of all the great cities of the East, can produce. The reek of
the native quarter, that sickly, penetrating odor which once smelt, is
never forgotten, filled the streets and even invaded the sacred
precincts of Government House, where a man of gentlemanly
appearance, but sadly deformed, was engaged in bidding Her
Majesty the Queen of England's representative in India an almost
affectionate farewell.
"You will not forget your promise to acquaint us with your arrival
in London," said His Excellency as he shook his guest by the hand.
"We shall be delighted to see you, and if we can make your stay
pleasurable as well as profitable to you, you may be sure we shall
endeavor to do so."
"Your lordship is most hospitable, and I think I may safely
promise that I will avail myself of your kindness," replied the other.
"In the meantime 'good-bye,' and a pleasant voyage to you."
A few minutes later he had passed the sentry, and was making
his way along the Maidan to the point where the Chitpore Road
crosses it. Here he stopped and appeared to deliberate. He smiled a
little sardonically as the recollection of the evening's entertainment
crossed his mind, and, as if he feared he might forget something
connected with it, when he reached a lamp-post, took a note-book
from his pocket and made an entry in it.
"Providence has really been most kind," he said as he shut the
book with a snap, and returned it to his pocket. "And what is more, I
am prepared to be properly grateful. It was a good morning's work
for me when His Excellency decided to take a ride through the
Maharajah's suburbs. Now I have only to play my cards carefully and
success should be assured."
He took a cigar from his pocket, nipped off the end, and then lit
it. He was still smiling when the smoke had cleared away.
"It is fortunate that Her Excellency is, like myself, an
enthusiastic admirer of Indian art," he said. "It is a trump card, and I
shall play it for all it's worth when I get to the other side. But to-
night I have something of more importance to consider. I have to
find the sinews of war. Let us hope that the luck which has followed
me hitherto will still hold good, and that Liz will prove as tractable as
usual."
Almost as he concluded his soliloquy a ticcagharri made its
appearance, and, without being hailed, pulled up beside him. It was
evident that their meeting was intentional, for the driver asked no
question of his fare, who simply took his seat, laid himself back upon
the cushions, and smoked his cigar with the air of a man playing a
part in some performance that had been long arranged.
Ten minutes later the coachman had turned out of the Chitpore
Road into a narrow by-street. From this he broke off into another,
and at the end of a few minutes into still another. These offshoots of
the main thoroughfare were wrapped in inky darkness, and, in order
that there should be as much danger as possible, they were crowded
to excess. To those who know Calcutta this information will be
significant.
There are slums in all the great cities of the world, and every
one boasts its own peculiar characteristics. The Ratcliffe Highway in
London, and the streets that lead off it, can show a fair assortment
of vice; the Chinese quarters of New York, Chicago, and San
Francisco can more than equal them; Little Bourke Street,
Melbourne, a portion of Singapore, and the shipping quarter of
Bombay, have their own individual qualities, but surely for the lowest
of all the world's low places one must go to Calcutta, the capital of
our great Indian Empire.
Surrounding the Lai, Machua, Burra, and Joira Bazaars are to be
found the most infamous dens that mind of man can conceive. But
that is not all. If an exhibition of scented, high-toned, gold-lacquered
vice is required, one has only to make one's way into the streets that
lie within a stone's throw of the Chitpore Road to be accommodated.
Reaching a certain corner, the gharri came to a standstill and
the fare alighted. He said something in an undertone to the driver as
he paid him, and then stood upon the footway placidly smoking until
the vehicle had disappeared from view. When it was no longer in
sight he looked up at the houses towering above his head; in one a
marriage feast was being celebrated; across the way the sound of a
woman's voice in angry expostulation could be heard. The passers-
by, all of whom were natives, scanned him curiously, but made no
remark. Englishmen, it is true, were sometimes seen in that quarter
and at that hour, but this one seemed of a different class, and it is
possible that nine out of every ten took him for the most detested of
all Englishmen, a police officer.
For upwards of ten minutes he waited, but after that he seemed
to become impatient. The person he had expected to find at the
Welcome to our website – the perfect destination for book lovers and
knowledge seekers. We believe that every book holds a new world,
offering opportunities for learning, discovery, and personal growth.
That’s why we are dedicated to bringing you a diverse collection of
books, ranging from classic literature and specialized publications to
self-development guides and children's books.

More than just a book-buying platform, we strive to be a bridge


connecting you with timeless cultural and intellectual values. With an
elegant, user-friendly interface and a smart search system, you can
quickly find the books that best suit your interests. Additionally,
our special promotions and home delivery services help you save time
and fully enjoy the joy of reading.

Join us on a journey of knowledge exploration, passion nurturing, and


personal growth every day!

ebookbell.com

You might also like