Deep Generative Modeling


Jakub M. Tomczak
Vrije Universiteit Amsterdam
Amsterdam
Noord-Holland, The Netherlands

ISBN 978-3-030-93157-5 ISBN 978-3-030-93158-2 (eBook)


https://doi.org/10.1007/978-3-030-93158-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my beloved wife Ewelina,
my parents, and brother.
Foreword

In the last decade, with the advance of deep learning, machine learning has made
enormous progress. It has completely changed entire subfields of AI such as
computer vision, speech recognition, and natural language processing. And more
fields are being disrupted as we speak, including robotics, wireless communication,
and the natural sciences.
Most advances have come from supervised learning, where the input (e.g., an
image) and the target label (e.g., a “cat”) are available for training. Deep neural
networks have become uncannily good at predicting objects in visual scenes and
translating between languages. But obtaining labels to train such models is often
time consuming, expensive, unethical, or simply impossible. That’s why the field
has come to the realization that unsupervised (or self-supervised) methods are key
to make further progress.
This is no different for human learning: when human children grow up, the
amount of information that is consumed to learn about the world is mostly
unlabeled. How often does anyone really tell you what you see or hear in the
world? We must learn the regularities of the world unsupervised, and we do this
by searching for patterns and structure in the data.
And there is lots of structure to be learned! To illustrate this, imagine that we
choose the three colors of each pixel of an image uniformly at random. The result
will be an image that with overwhelmingly large probability will look like gibberish.
The vast majority of image space is filled with images that do not look like anything
we see when we open our eyes. This means that there is a huge amount of structure
that can be discovered, and so there is a lot to learn for children!
Of course, kids do not just stare into the world. Instead, they constantly interact
with it. When children play, they test their hypotheses about the laws of physics,
sociology, and psychology. When predictions are wrong, they are surprised and
presumably update their internal models to make better predictions next time. It is
reasonable to assume that this interactive play of an embodied intelligence is key to
at least arrive at the type of human intelligence we are used to. This type of learning
has clear parallels with reinforcement learning, where machines make plans, say to
play a game of chess, observe if they win or lose, and update their models of the
world and strategies to act in them.
But it’s difficult to make robots move around in the world to test hypotheses and
actively acquire their own annotations. So, the more practical approach to learning
with lots of data is unsupervised learning. This field has gained a huge amount of
attention and has seen stunning progress recently. One only needs to look at the
kind of images of non-existent human faces that we can now generate effortlessly
to experience the uncanny sense of progress the field has made.
Unsupervised learning comes in many flavors. This book is about the kind we
call probabilistic generative modeling. The goal of this subfield is to estimate a
probabilistic model of the input data. Once we have such a model, we can generate
new samples from it (i.e., new images of faces of people that do not exist).
A second goal is to learn abstract representations of the input. This latter field is
called representation learning. The high-level representations self-organize the input
into “disentangled” concepts, which could be the objects we are familiar with, such
as cars and cats, and their relationships.
While disentangling has a clear intuitive meaning, it has proven to be a
rather slippery concept to properly define. In the 1990s, people were thinking of
statistically independent latent variables. The goal of the brain was to transform the
highly dependent pixel representation into a much more efficient and less redundant
representation of independent latent variables, which compresses the input and
makes the brain more energy and information efficient.
Learning and compression are deeply connected concepts. Learning requires
lossy compression of data because we are interested in generalization and not in
storing the data. At the level of datasets, machine learning itself is about transferring
a tiny fraction of the information present in a dataset into the parameters of a model
and forgetting everything else.
Similarly, at the level of a single datapoint, when we process for example an
input image, we are ultimately interested in the abstract high-level concepts present
in that image, such as objects and their relations, and not in detailed, pixel-level
information. With our internal models we can reason about these objects, manipulate
them in our head and imagine possible counterfactual futures for them. Intelligence
is about squeezing out the relevant predictive information from the correlated soup
of pixel-level information that hits our senses and representing that information in a
useful manner that facilitates mental manipulation.
But the objects that we are familiar with in our everyday lives are not really
all that independent. A cat that is chasing a bird is not statistically independent
of it. And so, people also made attempts to define disentangling in terms of
(subspaces of variables) that exhibit certain simple transformation properties when
we transform the input (a.k.a. equivariant representations), or as variables that one
can independently control in order to manipulate the world around us, or as causal
variables that are activating certain independent mechanisms that describe the world,
and so on.
The simplest way to train a model without labels is to learn a probabilistic
generative model (or density) of the input data. There are a number of techniques
in the field of probabilistic generative models that focus directly on maximizing
the log-probability (or a bound on the log probability) of the data under the
generative model. Besides VAEs and GANs, this book explains normalizing flows,
autoregressive models, energy-based models, and the latest cool kid on the block:
deep diffusion models.
One can also learn representations that are good for a broad range of subsequent
prediction tasks without ever training a generative model. The idea is to design
tasks for the representation to solve that do not require one to acquire annotations.
For instance, when considering time varying data, one can simply predict the future,
which is fortunately always there for you. Or one can invent more exotic tasks such
as predicting whether a patch was to the right or the left of another patch, or whether
a movie is playing forward or backward, or predicting a word in the middle of
a sentence from the words around it. This type of unsupervised learning is often
called self-supervised learning, although I should admit that this term, too, seems to
be used in different ways by different people.
Many approaches can indeed be understood in this “auxiliary tasks” view
of unsupervised learning, including some probabilistic generative models. For
instance, a variational autoencoder (VAE) can be understood as predicting its own
input back by first pushing the information through an information bottleneck. A
GAN can be understood as predicting whether a presented input is a real image
(datapoint) or a fake (self-generated) one. Noise contrastive estimation can be seen
as predicting in latent space whether the embedding of an input patch was close or
far in space and/or time.
This book discusses the latest advances in deep probabilistic generative models.
And it does so in a very accessible way. What makes this book special is that, like
the child who is building a tower of bricks to understand the laws of physics, the
student who uses this book can learn about deep probabilistic generative models
by playing with code. And it really helps that the author has earned his spurs by
having published extensively in this field. It is a great tool to teach this topic in
the classroom.
What will the future of our field bring? It seems obvious that progress towards
AGI will heavily rely on unsupervised learning. It’s interesting to see that the
scientific community seems to be divided into two camps: the “scaling camp”
believes that we will achieve AGI by scaling our current technology to ever larger models
trained with more data and more compute power. Intelligence will automatically
emerge from this scaling. The other camp believes we need new theory and new
ideas to make further progress, such as the manipulation of discrete symbols (a.k.a.
reasoning), causality, and the explicit incorporation of common-sense knowledge.
And then there is of course the increasingly important and urgent discussion
of how humans will interact with these models: can they still understand what is
happening under the hood or should we simply give up on interpretability? How will
our lives change by models that understand us better than we do, and where humans
who follow the recommendations of algorithms are more successful than those who
resist? Or what information can we still trust if deepfakes become so realistic that
we cannot distinguish them anymore from the real thing? Will democracy still be
able to function under this barrage of fake news? One thing is certain: this field is one
of the hottest in town, and this book is an excellent introduction to start engaging
with it. But everyone should be keenly aware that mastering this technology comes
with new responsibilities towards society. Let’s progress the field with caution.

October 30, 2021 Max Welling


Preface

We live in a world where Artificial Intelligence (AI) has become a widely used
term: there are movies about AI, journalists writing about AI, and CEOs talking
about AI. Most importantly, there is AI in our daily lives, turning our phones,
TVs, fridges, and vacuum cleaners into smartphones, smart TVs, smart fridges,
and vacuum robots. We use AI; however, we still do not fully understand what
“AI” is and how to formulate it, even though AI was established as a separate
field in the 1950s. Since then, many researchers have pursued the holy grail of creating
an artificial intelligence system that is capable of mimicking, understanding, and
aiding humans through processing data and knowledge. In many cases, we have
succeeded in outperforming human beings on particular tasks in terms of speed and
accuracy! Current AI methods do not necessarily imitate human processing (neither
biologically nor cognitively) but rather are aimed at making a quick and accurate
decision, like navigating while cleaning a room or enhancing the quality of a displayed
movie. In such tasks, probability theory is key since limited or poor-quality data
or the intrinsic behavior of a system forces us to quantify uncertainty. Moreover, deep
learning has become a leading learning paradigm that allows learning hierarchical
data representations. It draws its motivation from biological neural networks;
however, the correspondence between deep learning and biological neurons is rather
far-fetched. Nevertheless, deep learning has brought AI to the next level, achieving
state-of-the-art performance in many decision-making tasks. The next step seems
to be a combination of these two paradigms, probability theory and deep learning,
to obtain powerful AI systems that are able to quantify their uncertainties about
environments they operate in.
What Is This Book About Then? This book tackles the problem of formulating AI
systems by combining probabilistic modeling and deep learning. Moreover, it goes
beyond the typical predictive modeling and brings together supervised learning and
unsupervised learning. The resulting paradigm, called deep generative modeling,
utilizes the generative perspective on perceiving the surrounding world. It assumes
that each phenomenon is driven by an underlying generative process that defines
a joint distribution over random variables and their stochastic interactions, i.e.,
how events occur and in what order. The adjective “deep” comes from the fact
that the distribution is parameterized using deep neural networks. There are two
distinct traits of deep generative modeling. First, the application of deep neural
networks allows rich and flexible parameterization of distributions. Second, the
principled manner of modeling stochastic dependencies using probability theory
ensures rigorous formulation and prevents potential flaws in reasoning. Moreover,
probability theory provides a unified framework where the likelihood function plays
a crucial role in quantifying uncertainty and defining objective functions.
Who Is This Book for Then? The book is designed to appeal to curious students,
engineers, and researchers with a modest mathematical background in undergraduate
calculus, linear algebra, probability theory, and the basics of machine
learning, deep learning, and programming in Python and PyTorch (or other deep
learning libraries). It should appeal to students and researchers from a variety of
backgrounds, including computer science, engineering, data science, physics, and
bioinformatics, who wish to become familiar with deep generative modeling. In order
to engage with a reader, the book introduces fundamental concepts with specific
examples and code snippets. The full code accompanying the book is available
online at:
https://github.com/jmtomczak/intro_dgm
The ultimate aim of the book is to outline the most important techniques in deep
generative modeling and, eventually, enable readers to formulate new models and
implement them.
The Structure of the Book The book consists of eight chapters that could be read
separately and in (almost) any order. Chapter 1 introduces the topic and highlights
important classes of deep generative models and general concepts. Chapters 2, 3
and 4 discuss modeling of marginal distributions, while Chaps. 5 and 6 outline the
material on modeling of joint distributions. Chapter 7 presents a class of latent
variable models that are not learned through the likelihood-based objective. The
last chapter, Chap. 8, indicates how deep generative modeling could be used in the
fast-growing field of neural compression. All chapters are accompanied by code
snippets to help understand how the presented methods could be implemented. The
references are generally meant to indicate the original source of the presented material
and provide further reading. Deep generative modeling is a broad field of study,
and including all fantastic ideas is nearly impossible. Therefore, I would like to
apologize for missing any paper. If anyone feels left out, it was not intentional on
my part.
In the end, I would like to thank my wife, Ewelina, for her help and presence that
gave me the strength to carry on with writing this book. I am also grateful to my
parents for always supporting me, and my brother who spent a lot of time checking
the first version of the book and the code.

Amsterdam, The Netherlands Jakub M. Tomczak


November 1, 2021
Acknowledgments

This book, like many other books, would not have been possible without the
contribution and help from many people. During my career, I was extremely privileged
and lucky to work on deep generative modeling with an amazing set of people
whom I would like to thank here (in alphabetical order): Tameem Adel, Rianne
van den Berg, Taco Cohen, Tim Davidson, Nicola De Cao, Luka Falorsi, Eliseo
Ferrante, Patrick Forré, Ioannis Gatopoulos, Efstratios Gavves, Adam Gonczarek,
Amirhossein Habibian, Leonard Hasenclever, Emiel Hoogeboom, Maximilian Ilse,
Thomas Kipf, Anna Kuzina, Christos Louizos, Yura Perugachi-Diaz, Ties van
Rozendaal, Victor Satorras, Jerzy Świątek, Max Welling, Szymon Zaręba, and
Maciej Zięba.
I would like to thank other colleagues with whom I worked on AI and had plenty
of fascinating discussions (in alphabetical order): Davide Abati, Ilze Auzina, Babak
Ehteshami Bejnordi, Erik Bekkers, Tijmen Blankevoort, Matteo De Carlo, Fuda van
Diggelen, A.E. Eiben, Ali El Hassouni, Arkadiusz Gertych, Russ Greiner, Mark
Hoogendoorn, Emile van Krieken, Gongjin Lan, Falko Lavitt, Romain Lepert, Jie
Luo, ChangYong Oh, Siamak Ravanbakhsh, Diederik Roijers, David W. Romero,
Annette ten Teije, Auke Wiggers, and Alessandro Zonta.
I am especially thankful to my brother, Kasper, who patiently read all sections,
and ran and checked every single line of code in this book. You can’t even imagine
my gratitude for that!
I would like to thank my wife, Ewelina, for supporting me all the time and giving
me the strength to finish this book. Without her help and understanding, it would
be nearly impossible to accomplish this project. I would like to also express my
gratitude to my parents, Elżbieta and Ryszard, for their support at different stages
of my life because without them I would never be who I am now.

Contents

1 Why Deep Generative Modeling? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


1.1 AI Is Not Only About Decision Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Where Can We Use (Deep) Generative Modeling? . . . . . . . . . . . . . . . . . . . 3
1.3 How to Formulate (Deep) Generative Modeling?. . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Autoregressive Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Flow-Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Latent Variable Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Energy-Based Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.5 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Purpose and Content of This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Autoregressive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Autoregressive Models Parameterized by Neural Networks . . . . . . . . . 14
2.2.1 Finite Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Long-Range Memory Through RNNs . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Long-Range Memory Through Convolutional Nets . . . . . . . . . . 16
2.3 Deep Generative Autoregressive Model in Action! . . . . . . . . . . . . . . . . . . . 19
2.3.1 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Is It All? No! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Flow-Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Flows for Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.2 Change of Variables for Deep Generative Modeling . . . . . . . . . 30
3.1.3 Building Blocks of RealNVP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3.1 Coupling Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3.2 Permutation Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3.3 Dequantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.4 Flows in Action! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


3.1.5 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.6 Is It All? Really? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1.7 ResNet Flows and DenseNet Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Flows for Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Flows in R or Maybe Rather in Z? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.3 Integer Discrete Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.4 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.5 What’s Next? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4 Latent Variable Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Probabilistic Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Variational Auto-Encoders: Variational Inference for
Non-linear Latent Variable Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 The Model and the Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 A Different Perspective on the ELBO. . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.3 Components of VAEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.3.1 Parameterization of Distributions . . . . . . . . . . . . . . . . . . . 63
4.3.3.2 Reparameterization Trick. . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.4 VAE in Action! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.5 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.6 Typical Issues with VAEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.7 There Is More! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4 Improving Variational Auto-Encoders. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.1 Priors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.1.1 Standard Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4.1.2 Mixture of Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.1.3 VampPrior: Variational Mixture of
Posterior Prior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1.4 GTM: Generative Topographic Mapping . . . . . . . . . . . 85
4.4.1.5 GTM-VampPrior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.4.1.6 Flow-Based Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.1.7 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.2 Variational Posteriors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.2.1 Variational Posteriors with Householder
Flows [20] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.2.2 Variational Posteriors with Sylvester
Flows [16] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.4.2.3 Hyperspherical Latent Space . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5 Hierarchical Latent Variable Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5.2 Hierarchical VAEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.5.2.1 Two-Level VAEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

4.5.2.2 Top-Down VAEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


4.5.2.3 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.5.2.4 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5.3 Diffusion-Based Deep Generative Models . . . . . . . . . . . . . . . . . . . . 112
4.5.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.5.3.2 Model Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.5.3.3 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5 Hybrid Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.1.1 Approach 1: Let’s Be Naive! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.1.2 Approach 2: Shared Parameterization! . . . . . . . . . . . . . . . . . . . . . . . . 131
5.2 Hybrid Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.3 Let’s Implement It! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.4 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.5 What’s Next? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.2 Model Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.4 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.5 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.6 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7 Generative Adversarial Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.2 Implicit Modeling with Generative Adversarial Networks
(GANs). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.3 Implementing GANs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.4 There Are Many GANs Out There! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8 Deep Generative Modeling for Neural Compression. . . . . . . . . . . . . . . . . . . . . 173
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
8.2 General Compression Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.3 A Short Detour: JPEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.4 Neural Compression: Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.5 What’s Next? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

A Useful Facts from Algebra and Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189


A.1 Norms & Inner Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
A.2 Matrix Calculus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

B Useful Facts from Probability Theory and Statistics. . . . . . . . . . . . . . . . . . . . . 191


B.1 Commonly Used Probability Distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
B.2 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Chapter 1
Why Deep Generative Modeling?

1.1 AI Is Not Only About Decision Making

Before we start thinking about (deep) generative modeling, let us consider a simple
example. Imagine we have trained a deep neural network that classifies images (x ∈
Z^D) of animals (y ∈ Y, where Y = {cat, dog, horse}). Further, let us assume that
this neural network is trained really well so that it always assigns a high probability
p(y|x) to the proper class. So far so good, right? A problem could occur,
though. As pointed out in [1], adding noise to images could result in a completely
false classification. An example of such a situation is presented in Fig. 1.1, where
adding noise shifts the predicted probabilities of labels even though the image is
barely changed (at least to us, human beings).
This example indicates that neural networks that are used to parameterize the
conditional distribution p(y|x) seem to lack semantic understanding of images.
Further, we even hypothesize that learning discriminative models is not enough for
proper decision making and creating AI. A machine learning system cannot rely on
learning how to make a decision without understanding the reality and being able to
express uncertainty about the surrounding world. How can we trust such a system
if even a small amount of noise could change its internal beliefs and also shift its
certainty from one decision to the other? How can we communicate with such a
system if it is unable to properly express its opinion about whether its surroundings
are new or not?
To motivate the importance of the concepts like uncertainty and understanding
in decision making, let us consider a system that classifies objects, but this time
into two classes: orange and blue. We assume we have some two-dimensional data
(Fig. 1.2, left) and a new datapoint to be classified (a black cross in Fig. 1.2). We
can make decisions using two approaches. First, a classifier could be formulated
explicitly by modeling the conditional distribution p(y|x) (Fig. 1.2, middle). Sec-
ond, we can consider a joint distribution p(x, y) that could be further decomposed
as p(x, y) = p(y|x) p(x) (Fig. 1.2, right).


Fig. 1.1 An example of adding noise to an almost perfectly classified image that results in a shift
of predicted label

Fig. 1.2 An example of data (left) and two approaches to decision making: (middle) a
discriminative approach and (right) a generative approach

After training a model using the discriminative approach, namely, the conditional
distribution p(y|x), we obtain a clear decision boundary. Then, we see that the black
cross is farther away from the orange region; thus, the classifier assigns a higher
probability to the blue label. As a result, the classifier is certain about the decision!
On the other hand, if we additionally fit a distribution p(x), we observe that the
black cross is not only far away from the decision boundary, but it is also distant
from the region where the blue datapoints lie. In other words, the black point is far away
from the region of high probability mass. As a result, the (marginal) probability
of the black cross, p(x = black cross), is low, the joint distribution p(x = black cross, y = blue)
will be low as well and, thus, the decision is uncertain!
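To make this concrete, here is a minimal sketch in PyTorch. The two-dimensional blobs, the classifier, and the position of the "black cross" are synthetic, illustrative choices standing in for Fig. 1.2, not taken from the book's code. It contrasts the conditional p(y|x), which is confidently "blue" far from the decision boundary, with the joint p(x, y) = p(y|x) p(x), which becomes tiny once a fitted density p(x) reveals that the point lies far from the region of high probability mass.

```python
import torch
from torch.distributions import MultivariateNormal

torch.manual_seed(0)

# Synthetic stand-in for the data in Fig. 1.2: one blue blob and one orange blob.
x_blue = MultivariateNormal(torch.tensor([-2., 0.]), torch.eye(2)).sample((200,))
x_orange = MultivariateNormal(torch.tensor([2., 0.]), torch.eye(2)).sample((200,))
x = torch.cat([x_blue, x_orange])
y = torch.cat([torch.zeros(200), torch.ones(200)])  # 0 = blue, 1 = orange

# Discriminative part: a linear classifier for p(y|x).
w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w, b], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.binary_cross_entropy_with_logits(x @ w + b, y)
    loss.backward()
    opt.step()

# Generative part: a crude density p(x), one Gaussian per class, mixed 50/50.
def log_p_x(query):
    comp_blue = MultivariateNormal(x_blue.mean(0), torch.diag(x_blue.var(0)))
    comp_orange = MultivariateNormal(x_orange.mean(0), torch.diag(x_orange.var(0)))
    comps = torch.stack([comp_blue.log_prob(query), comp_orange.log_prob(query)])
    return torch.logsumexp(comps, dim=0) + torch.log(torch.tensor(0.5))

# The "black cross": far from the decision boundary AND far from all the data.
x_new = torch.tensor([-20., 0.])
with torch.no_grad():
    p_blue = torch.sigmoid(-(x_new @ w + b))        # p(y = blue | x): close to 1
    log_joint = torch.log(p_blue) + log_p_x(x_new)  # log p(x, y = blue): very low
print(p_blue.item(), log_joint.item())
```

The classifier alone is (over)confident, while the joint probability correctly signals that the decision should not be trusted.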
This simple example clearly indicates that if we want to build AI systems that
make reliable decisions and can communicate with us, human beings, they must
understand the environment first. For this purpose, they cannot simply learn how
to make decisions, but they should be able to quantify their beliefs about their
surrounding using the language of probability [2, 3]. In order to do that, we claim
that estimating the distribution over objects, p(x), is crucial.
From the generative perspective, knowing the distribution p(x) is essential
because:
• It could be used to assess whether a given object has been observed in the past or
not.
• It could help to properly weight the decision.
• It could be used to assess uncertainty about the environment.
• It could be used to actively learn by interacting with the environment (e.g., by
asking for labels for objects with low p(x)).
• And, eventually, it could be used to generate (synthesize) new objects.
Typically, in the literature of deep learning, generative models are treated as
generators of new data. However, here we try to convey a new perspective where
having p(x) has much broader applicability, and this could be essential for building
successful AI systems. Lastly, we would like to also make an obvious connection to
generative modeling in machine learning, where formulating a proper generative
process is crucial for understanding the phenomena of interest [3, 4]. However,
in many cases, it is easier to focus on the other factorization, namely, p(x, y) =
p(x|y) p(y). We claim that considering p(x, y) = p(y|x) p(x) has clear advantages
as mentioned before.

1.2 Where Can We Use (Deep) Generative Modeling?

With the development of neural networks and the increase in computational power,
deep generative modeling has become one of the leading directions in AI. Its
applications vary from typical modalities considered in machine learning, i.e., text
analysis (e.g., [5]), image analysis (e.g., [6]), audio analysis (e.g., [7]), to problems
in active learning (e.g., [8]), reinforcement learning (e.g., [9]), graph analysis (e.g.,
[10]), and medical imaging (e.g., [11]). In Fig. 1.3, we present graphically potential
applications of deep generative modeling.
Fig. 1.3 Various potential applications of deep generative modeling

In some applications, it is indeed important to generate (synthesize) objects or
modify features of objects to create new ones (e.g., an app turns a young person
into an old one). However, in others like active learning it is important to ask for
uncertain objects, i.e., objects with low p(x), that should be labeled by an oracle. In
reinforcement learning, on the other hand, generating the next most likely situations
(states) is crucial for taking actions by an agent. For medical applications, explaining
a decision, e.g., in terms of the probability of the label and the object, is definitely
more informative to a human doctor than simply assisting with a diagnosis label.
If an AI system were able to indicate how certain it is and also quantify
whether the object is suspicious (i.e., has low p(x)) or not, then it might be used as
an independent specialist that outlines its own opinion.
These examples clearly indicate that many fields, if not all, could highly benefit
from (deep) generative modeling. Obviously, there are many mechanisms that AI
systems should be equipped with. However, we claim that the generative modeling
capability is definitely one of the most important ones, as outlined in the above-
mentioned cases.

1.3 How to Formulate (Deep) Generative Modeling?

At this point, after highlighting the importance and wide applicability of (deep)
generative modeling, we should ask ourselves how to formulate (deep) generative
models. In other words, how can we express p(x), which we have mentioned already
multiple times?
We can divide (deep) generative modeling into four main groups (see Fig. 1.4):
• Autoregressive generative models (ARM)
• Flow-based models
• Latent variable models
• Energy-based models
We use deep in brackets because most of what we have discussed so far could be
modeled without using neural networks. However, neural networks are flexible and
powerful and, therefore, they are widely used to parameterize generative models.
From now on, we focus entirely on deep generative models.
As a side note, please treat this taxonomy as a guideline that helps us to navigate
through this book, not something written in stone. Personally, I am not a big fan of
spending too much time on categorizing and labeling science because it very often
results in antagonizing and gatekeeping. Anyway, there is also a group of models
based on the score matching principle [12–14] that do not necessarily fit our simple
taxonomy. However, as pointed out in [14], these models share a lot of similarities
with latent variable models (if we treat consecutive steps of a stochastic process as
latent variables) and, thus, we treat them as such.
Fig. 1.4 A taxonomy of deep generative models: autoregressive models (e.g., PixelCNN),
flow-based models (e.g., RealNVP), latent variable models, which split into implicit models
(e.g., GANs) and prescribed models (e.g., VAEs), and energy-based models

1.3.1 Autoregressive Models

The first group of deep generative models utilizes the idea of autoregressive
modeling (ARM). In other words, the distribution over x is represented in an
autoregressive manner:


p(x) = p(x_0) ∏_{i=1}^{D} p(x_i | x_{<i}),    (1.1)

where x_{<i} denotes all x's up to the index i.


Modeling all conditional distributions p(x_i | x_{<i}) would be computationally
inefficient. However, we can take advantage of causal convolutions as presented
in [7] for audio and in [15, 16] for images. We will discuss ARMs more in depth in
Chap. 2.
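As a small illustration of both ingredients, the factorization in Eq. (1.1) and causal convolutions, here is a sketch of a tiny ARM for binary sequences. It is a generic toy example, not the model developed in Chap. 2; the architecture, kernel sizes, and hidden width are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """A 1D convolution that only looks to the left: padding on the left makes the
    output at position i depend only on inputs at positions <= i."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = dilation * (kernel_size - 1)

    def forward(self, x):
        return super().forward(F.pad(x, (self.left_pad, 0)))

class TinyARM(nn.Module):
    """Parameterizes p(x) = p(x_0) * prod_i p(x_i | x_{<i}) for binary sequences: the
    network outputs one logit per position, and shifting the input by one step
    guarantees that the logit for x_i never sees x_i itself."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(1, hidden, kernel_size=3), nn.ReLU(),
            CausalConv1d(hidden, 1, kernel_size=3),
        )

    def log_prob(self, x):                      # x: (batch, 1, D), values in {0, 1}
        x_shifted = F.pad(x, (1, 0))[..., :-1]  # feed x_{<i}, never x_i
        logits = self.net(x_shifted)
        nll = F.binary_cross_entropy_with_logits(logits, x, reduction='none')
        return -nll.sum(dim=(1, 2))             # one log-likelihood per sequence

x = torch.randint(0, 2, (4, 1, 16)).float()
print(TinyARM().log_prob(x))                    # the quantity to maximize during training
```

Training then amounts to maximizing this log-likelihood over a dataset, while sampling proceeds one position at a time, which is why ARMs are slow to sample from.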

1.3.2 Flow-Based Models

The change of variables formula provides a principled manner of expressing a
density of a random variable by transforming it with an invertible transformation
f [17]:

p(x) = p(z = f(x)) |J_f(x)|,    (1.2)

where J_f(x) denotes the Jacobian matrix.


We can parameterize f using deep neural networks; however, it cannot be an
arbitrary neural network, because we must be able to calculate the Jacobian matrix.
The first ideas of using the change of variables formula focused on linear, volume-
preserving transformations that yield |J_f(x)| = 1 [18, 19]. Further attempts utilized
theorems on matrix determinants that resulted in specific non-linear transformations,
namely, planar flows [20] and Sylvester flows [21, 22]. A different approach focuses
on formulating invertible transformations for which the Jacobian-determinant could
be calculated easily, as for coupling layers in RealNVP [23]. Recently, arbitrary
neural networks have been constrained in such a way that they are invertible and the
Jacobian-determinant is approximated [24–26].
In the case of the discrete distributions (e.g., integers), for the probability mass
functions, there is no change of volume and, therefore, the change of variables
formula takes the following form:
 
p(x) = p(z = f(x)).    (1.3)

Integer discrete flows propose to use affine coupling layers with rounding
operators to ensure integer-valued outputs [27]. A generalization of the affine
coupling layer was further investigated in [28].
All generative models that take advantage of the change of variables formula
are referred to as flow-based models or flows for short. We will discuss flows in
Chap. 3.
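To make the change of variables formula concrete, here is a sketch of a single affine coupling layer in the spirit of RealNVP, together with the log-likelihood computation of Eq. (1.2) under a standard Gaussian base distribution. It is a simplified illustration rather than the implementation from Chap. 3; the network size and the tanh-bounded log-scale are arbitrary choices.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One coupling layer: keep x_a unchanged, scale-and-shift x_b conditioned on x_a.
    The Jacobian is triangular, so log|det J_f(x)| is just the sum of the log-scales."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):                      # x -> z, returns (z, log|det J_f(x)|)
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(xa).chunk(2, dim=1)
        log_s = torch.tanh(log_s)              # keep the scales in a sane range
        zb = xb * torch.exp(log_s) + t
        return torch.cat([xa, zb], dim=1), log_s.sum(dim=1)

    def inverse(self, z):                      # z -> x, used for sampling
        za, zb = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(za).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        return torch.cat([za, (zb - t) * torch.exp(-log_s)], dim=1)

# log p(x) via Eq. (1.2) with a standard Gaussian base distribution p(z).
flow = AffineCoupling(dim=4)
x = torch.randn(8, 4)
z, log_det = flow(x)
log_px = torch.distributions.Normal(0., 1.).log_prob(z).sum(dim=1) + log_det
print(log_px)
```

Stacking several such layers, with permutations of dimensions in between, yields an expressive yet tractable density; this is exactly the construction discussed in Chap. 3.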

1.3.3 Latent Variable Models

The idea behind latent variable models is to assume a lower-dimensional latent
space and the following generative process:

z ∼ p(z)
x ∼ p(x|z).

In other words, the latent variables correspond to hidden factors in data, and the
conditional distribution p(x|z) could be treated as a generator.
The most widely known latent variable model is the probabilistic Principal
Component Analysis (pPCA) [29] where p(z) and p(x|z) are Gaussian distributions,
and the dependency between z and x is linear.
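The linear-Gaussian case can be written down in a few lines. The sketch below, with illustrative (untrained) values of W, b, and σ, samples from the two-step generative process and evaluates the exact marginal p(x), which is available in closed form only because everything is linear and Gaussian.

```python
import torch
from torch.distributions import MultivariateNormal, Normal

# The generative process of pPCA: z ~ N(0, I), then x | z ~ N(W z + b, sigma^2 I).
D, M = 5, 2            # data dimension and latent dimension
W = torch.randn(D, M)  # illustrative parameters; in pPCA they are fitted to data
b = torch.zeros(D)
sigma = 0.1

z = MultivariateNormal(torch.zeros(M), torch.eye(M)).sample((10,))  # 1) sample latents
x = Normal(z @ W.t() + b, sigma).sample()                           # 2) decode and add noise

# Linear-Gaussian structure gives the marginal in closed form: x ~ N(b, W W^T + sigma^2 I).
p_x = MultivariateNormal(b, W @ W.t() + sigma ** 2 * torch.eye(D))
print(p_x.log_prob(x))  # exact log-likelihood of the sampled batch
```

Replacing the linear map W z + b with a neural network is precisely what breaks this tractability and motivates the variational machinery described next.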
A non-linear extension of the pPCA with arbitrary distributions is the Varia-
tional Auto-Encoder (VAE) framework [30, 31]. To make the inference tractable,
variational inference is utilized to approximate the posterior p(z|x), and neural
networks are used to parameterize the distributions. Since the publication of the
seminal papers by Kingma and Welling [30] and Rezende et al. [31], there have been
multiple extensions of this framework, including work on more powerful variational
posteriors [19, 21, 22, 32], priors [33, 34], and decoders [35]. Interesting directions
include considering different topologies of the latent space, e.g., the hyperspherical
latent space [36]. In VAEs and the pPCA all distributions must be defined upfront
and, therefore, they are called prescribed models. We will pay special attention to
this group of deep generative models in Chap. 4.
So far, ARMs, flows, the pPCA, and VAEs are probabilistic models with
the objective function being the log-likelihood function that is closely related
to using the Kullback–Leibler divergence between the data distribution and the
model distribution. A different approach utilizes an adversarial loss in which a
discriminator D(·) determines a difference between real data and synthetic data
provided by a generator in the implicit form, namely, p(x|z) = δ(x − G(z)),
where δ(·) is the Dirac delta. This group of models is called implicit models, and
Generative Adversarial Networks (GANs) [6] became one of the first successful
deep generative models for synthesizing realistic-looking objects (e.g., images). See
Chap. 7 for more details.
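Sampling from such an implicit model is trivial even though its density is not available: draw z from a simple prior and push it through a deterministic generator network G. The sketch below only shows this sampling step with a placeholder architecture; how G is actually trained against a discriminator D is the subject of Chap. 7.

```python
import torch
import torch.nn as nn

# An implicit model: p(x|z) = delta(x - G(z)), so a sample is simply x = G(z).
G = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())

z = torch.randn(4, 16)  # z ~ N(0, I)
x_fake = G(z)           # four synthetic "images" (flattened 28x28 pixels here)
print(x_fake.shape)     # torch.Size([4, 784])
```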

1.3.4 Energy-Based Models

Physics provides an interesting perspective on defining a group of generative
models through defining an energy function, E(x), and, eventually, the Boltzmann
distribution:

p(x) = exp{−E(x)} / Z,    (1.4)

where Z = Σ_x exp{−E(x)} is the partition function.
In other words, the distribution is defined by the exponentiated energy function
that is further normalized to obtain values between 0 and 1 (i.e., probabilities). There
is much more to it if we think about physics, but we do not need to delve into
that here. I refer to [37] as a great starting point.
Models defined by an energy function are referred to as energy-based models
(EBMs) [38]. The main idea behind EBMs is to formulate the energy function and
calculate (or rather approximate) the partition function. The largest group of EBMs
consists of Boltzmann Machines that entangle x’s through a bilinear form, i.e.,
E(x) = x^T W x [39, 40]. Introducing latent variables and taking E(x, z) = x^T W z
results in Restricted Boltzmann Machines [41]. The idea of Boltzmann machines
could be further extended to the joint distribution over x and y as it is done, e.g., in
classification Restricted Boltzmann Machines [42]. Recently, it has been shown that
an arbitrary neural network could be used to define the joint distribution [43]. We
will discuss how this could be accomplished in Chap. 6.
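The sketch below illustrates the Boltzmann distribution of Eq. (1.4) for a bilinear energy E(x) = x^T W x on binary vectors. The toy dimensionality (D = 4) is chosen so that the partition function can still be computed by brute-force enumeration; for realistic D this sum over 2^D configurations is exactly the intractable quantity mentioned above.

```python
import torch

# Energy-based model on binary vectors: log p(x) = -E(x) - log Z, with E(x) = x^T W x.
D = 4
W = torch.randn(D, D) * 0.1
energy = lambda x: torch.einsum('bi,ij,bj->b', x, W, x)

# Enumerate all 2^D binary configurations to get Z exactly (feasible only for tiny D).
all_x = torch.tensor([[(i >> j) & 1 for j in range(D)] for i in range(2 ** D)]).float()
log_Z = torch.logsumexp(-energy(all_x), dim=0)

x = torch.tensor([[1., 0., 1., 1.]])
print(-energy(x) - log_Z)  # exact log-probability of this configuration
```

In practice, one works with the unnormalized term −E(x) and approximates the effect of Z, e.g., with Monte Carlo methods, as discussed in Chap. 6.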

1.3.5 Overview

In Table 1.1, we compare all four groups of models (with a distinction between
implicit latent variable models and prescribed latent variable models) using arbitrary
criteria like:
• Whether training is typically stable
• Whether it is possible to calculate the likelihood function
• Whether one can use a model for lossy or lossless compression
• Whether a model could be used for representation learning
All likelihood-based models (i.e., ARMs, flows, EBMs, and prescribed models
like VAEs) can be trained in a stable manner, while implicit models like GANs
suffer from instabilities. In the case of the non-linear prescribed models like VAEs,
we must remember that the likelihood function cannot be exactly calculated, and
only a lower-bound could be provided. Similarly, EBMs require calculating the
partition function that is analytically intractable problem. As a result, we can get
the unnormalized probability or an approximation at best. ARMs constitute one
of the best likelihood-based models; however, their sampling process is extremely
slow due to the autoregressive manner of generating new content. EBMs require
running a Monte Carlo method to receive a sample. Since we operate on high-
dimensional objects, this is a great obstacle for using EBMs widely in practice. All
other approaches are relatively fast. In the case of compression, VAEs are models
that allow us to use a bottleneck (the latent space). On the other hand, ARMs and
flows could be used for lossless compression since they are density estimators
and provide the exact likelihood value. Implicit models cannot be directly used
for compression; however, recent works use GANs to improve image compression
[44]. Flows, prescribed models, and EBMs (if they use latents) could be used for
representation learning, namely, learning a set of random variables that summarize
data in some way and/or disentangle factors in data. The question about what is a
good representation is a different story, and we refer a curious reader to the literature,
e.g., [45].

Table 1.1 A comparison of deep generative models

Generative models       Training   Likelihood    Sampling   Compression  Representation
Autoregressive models   Stable     Exact         Slow       Lossless     No
Flow-based models       Stable     Exact         Fast/slow  Lossless     Yes
Implicit models         Unstable   No            Fast       No           No
Prescribed models       Stable     Approximate   Fast       Lossy        Yes
Energy-based models     Stable     Unnormalized  Slow       Rather not   Yes

1.4 Purpose and Content of This Book

This book is intended as an introduction to the field of deep generative modeling.
Its goal is to convince you, dear reader, of the philosophy of generative modeling
and show you its beauty! Deep generative modeling is an interesting hybrid
that combines probability theory, statistics, probabilistic machine learning, and
deep learning in a single framework. However, to be able to follow the ideas
presented in this book it is advised to possess knowledge in algebra and calculus,
probability theory and statistics, the basics of machine learning and deep learning,
and programming with Python. Knowing PyTorch1 is highly recommended since
all code snippets are written in PyTorch. However, knowing other deep learning
frameworks like Keras, Tensorflow, or JAX should be sufficient to understand the
code.
In this book, we will not review machine learning concepts or building blocks
in deep learning unless it is essential to comprehend a given topic. Instead, we
will delve into models and training algorithms of deep generative models. We
will either discuss the marginal models, such as autoregressive models (Chap. 2),
flow-based models (Chap. 3): RealNVP, Integer Discrete Flows, and residual and
DenseNet flows, latent variable models (Chap. 4): Variational Auto-Encoder and its
components, hierarchical VAEs, and Diffusion-based deep generative models, or
frameworks for modeling the joint distribution like hybrid modeling (Chap. 5) and
energy-based models (Chap. 6). Eventually, we will present how deep generative
modeling could be useful for data compression within the neural compression
framework (Chap. 8). In general, the book is organized in such a way that each
chapter could be followed independently from the others and in an order that suits a
reader best.
So who is the target audience of this book? Well, hopefully everybody who is
interested in AI, but there are two groups who could definitely benefit from the
presented content. The first target audience is university students who want to go
beyond standard courses on machine learning and deep learning. The second group
is research engineers who want to broaden their knowledge on AI or prefer to make
the next step in their careers and learn about the next generation of AI systems.
Either way, the book is intended for curious minds who want to understand AI and
learn not only about theory but also how to implement the discussed material. For
this purpose, each topic is associated with general discussion and introduction that
is further followed by formal formulations and a piece of code (in PyTorch). The
intention of this book is to truly understand deep generative modeling, which, in the
humble opinion of the author, is only possible if one can not only derive
a model but also implement it. Therefore, this book is accompanied by the following
code repository:
https://github.com/jmtomczak/intro_dgm

1 https://pytorch.org/.

References

1. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International
Conference on Learning Representations, ICLR 2014, 2014.
2. Christopher M Bishop. Model-based machine learning. Philosophical Transactions of the
Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20120222,
2013.
3. Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature,
521(7553):452–459, 2015.
4. Julia A Lasserre, Christopher M Bishop, and Thomas P Minka. Principled hybrids of generative
and discriminative models. In 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’06), volume 1, pages 87–94. IEEE, 2006.
5. Samuel Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy
Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL
Conference on Computational Natural Language Learning, pages 10–21, 2016.
6. Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint
arXiv:1406.2661, 2014.
7. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
8. Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972–
5981, 2019.
9. David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
10. Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pages 412–422. Springer, 2018.
11. Maximilian Ilse, Jakub M Tomczak, Christos Louizos, and Max Welling. DIVA: Domain
invariant variational autoencoders. In Medical Imaging with Deep Learning, pages 322–348.
PMLR, 2020.
12. Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score
matching. Journal of Machine Learning Research, 6(4), 2005.
13. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution. arXiv preprint arXiv:1907.05600, 2019.
14. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In
International Conference on Learning Representations, 2020.
15. Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks.
In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
16. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray
Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Proceedings of the
30th International Conference on Neural Information Processing Systems, pages 4797–4805,
2016.
17. Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep
density models. arXiv preprint arXiv:1302.5125, 2013.
18. Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components
estimation. arXiv preprint arXiv:1410.8516, 2014.
19. Jakub M Tomczak and Max Welling. Improving variational auto-encoders using householder
flow. arXiv preprint arXiv:1611.09630, 2016.
20. Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Inter-
national Conference on Machine Learning, pages 1530–1538. PMLR, 2015.

21. Rianne Van Den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester
normalizing flows for variational inference. In 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, pages 393–402. Association For Uncertainty in Artificial
Intelligence (AUAI), 2018.
22. Emiel Hoogeboom, Victor Garcia Satorras, Jakub M Tomczak, and Max Welling. The
convolution exponential and generalized Sylvester flows. arXiv preprint arXiv:2006.01910,
2020.
23. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP.
arXiv preprint arXiv:1605.08803, 2016.
24. Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen.
Invertible residual networks. In International Conference on Machine Learning, pages 573–
582. PMLR, 2019.
25. Ricky TQ Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows
for invertible generative modeling. arXiv preprint arXiv:1906.02735, 2019.
26. Yura Perugachi-Diaz, Jakub M Tomczak, and Sandjai Bhulai. Invertible DenseNets with
Concatenated LipSwish. Advances in Neural Information Processing Systems, 2021.
27. Emiel Hoogeboom, Jorn WT Peters, Rianne van den Berg, and Max Welling. Integer discrete
flows and lossless compression. arXiv preprint arXiv:1905.07376, 2019.
28. Jakub M Tomczak. General invertible transformations for flow-based generative modeling.
INNF+, 2021.
29. Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622,
1999.
30. Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint
arXiv:1312.6114, 2013.
31. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation
and approximate inference in deep generative models. In International conference on machine
learning, pages 1278–1286. PMLR, 2014.
32. Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max
Welling. Improved variational inference with inverse autoregressive flow. Advances in Neural
Information Processing Systems, 29:4743–4751, 2016.
33. Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schul-
man, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint
arXiv:1611.02731, 2016.
34. Jakub Tomczak and Max Welling. VAE with a VampPrior. In International Conference on
Artificial Intelligence and Statistics, pages 1214–1223. PMLR, 2018.
35. Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David
Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv
preprint arXiv:1611.05013, 2016.
36. Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak.
Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, pages 856–865. Association For Uncertainty in Artificial
Intelligence (AUAI), 2018.
37. Edwin T Jaynes. Probability theory: The logic of science. Cambridge university press, 2003.
38. Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-
based learning. Predicting structured data, 1(0), 2006.
39. David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for
Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
40. Geoffrey E Hinton, Terrence J Sejnowski, et al. Learning and relearning in Boltzmann
machines. Parallel distributed processing: Explorations in the microstructure of cognition,
1(282-317):2, 1986.
41. Geoffrey E Hinton. A practical guide to training restricted Boltzmann machines. In Neural
networks: Tricks of the trade, pages 599–619. Springer, 2012.

42. Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann
machines. In Proceedings of the 25th international conference on Machine learning, pages
536–543, 2008.
43. Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad
Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should
treat it like one. In International Conference on Learning Representations, 2019.
44. Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity
generative image compression. Advances in Neural Information Processing Systems, 33, 2020.
45. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review
and new perspectives. IEEE transactions on pattern analysis and machine intelligence,
35(8):1798–1828, 2013.
Chapter 2
Autoregressive Models

2.1 Introduction

Before we start discussing how we can model the distribution p(x), we refresh our
memory about the core rules of probability theory, namely, the sum rule and the
product rule. Let us introduce two random variables x and y. Their joint distribution
is p(x, y). The product rule allows us to factorize the joint distribution in two
manners, namely:

p(x, y) = p(x|y)p(y) (2.1)


= p(y|x)p(x). (2.2)

In other words, the joint distribution could be represented as a product of a


marginal distribution and a conditional distribution. The sum rule tells us that if
we want to calculate the marginal distribution over one of the variables, we must
integrate out (or sum out) the other variable, that is:

p(x) = ∑_y p(x, y).    (2.3)

These two rules will play a crucial role in probability theory and statistics and, in
particular, in formulating deep generative models.
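To make these rules a bit more tangible, here is a tiny numerical sketch (our own illustration in PyTorch, with a hypothetical 2×2 joint distribution): the sum rule gives the marginal, and the product rule recovers the joint from the marginal and the conditional.

    import torch

    # A hypothetical joint distribution p(x, y) over two binary variables,
    # written as a 2x2 table: rows index x, columns index y.
    p_xy = torch.tensor([[0.1, 0.3],
                         [0.2, 0.4]])

    # Sum rule: p(x) is obtained by summing out y.
    p_x = p_xy.sum(dim=1)  # -> tensor([0.4, 0.6])

    # Product rule: p(y|x) = p(x, y) / p(x), so p(x, y) = p(y|x) p(x).
    p_y_given_x = p_xy / p_x.unsqueeze(1)

    # Sanity check: the factorization recovers the joint distribution.
    assert torch.allclose(p_y_given_x * p_x.unsqueeze(1), p_xy)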
Now, let us consider a high-dimensional random variable x ∈ X^D where X = {0, 1, . . . , 255} (e.g., pixel values) or X = R. Our goal is to model p(x). Before we jump into thinking of a specific parameterization, let us first apply the product rule to express the joint distribution in a different manner:


p(x) = p(x1) ∏_{d=2}^{D} p(xd | x<d),    (2.4)


where x<d = [x1, x2, . . . , xd−1]. For instance, for x = [x1, x2, x3], we have p(x) = p(x1)p(x2|x1)p(x3|x1, x2).
As we can see, the product rule applied multiple times to the joint distribu-
tion provides a principled manner of factorizing the joint distribution into many
conditional distributions. That’s great news! However, modeling all conditional
distributions p(xd |x<d ) separately is simply infeasible! If we did that, we would
obtain D separate models, and the complexity of each model would grow due to
varying conditioning. A natural question is whether we can do better, and the answer
is yes.

2.2 Autoregressive Models Parameterized by Neural Networks

As mentioned earlier, we aim at modeling the joint distribution p(x) using conditional distributions. A potential solution to the issue of using D separate models is utilizing a single, shared model for the conditional distributions. However, we need to make some assumptions to use such a shared model. In other words, we look for an autoregressive model (ARM). In the next subsections, we outline ARMs parameterized with various neural networks. After all, we are talking about deep generative models, so using a neural network should not be surprising, should it?

2.2.1 Finite Memory

The first attempt at limiting the complexity of a conditional model is to assume a


finite memory. For instance, we can assume that each variable is dependent on no
more than two other variables, namely:


p(x) = p(x1)p(x2|x1) ∏_{d=3}^{D} p(xd | xd−1, xd−2).    (2.5)

Then, we can use a small neural network, e.g., a multilayer perceptron (MLP),
to predict the distribution of xd . If X = {0, 1, . . . , 255}, the MLP takes xd−1 , xd−2
and outputs probabilities for the categorical distribution of xd , θd . The MLP could
be of the following form:

[xd−1, xd−2] → Linear(2, M) → ReLU → Linear(M, 256) → softmax → θd,    (2.6)
where M denotes the number of hidden units, e.g., M = 300. An example of this
approach is depicted in Fig. 2.1.
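As a sketch of this idea (not code from the book's repository; the class name and default values are illustrative assumptions), the mapping in Eq. (2.6) could be implemented in PyTorch as follows:

    import torch
    import torch.nn as nn

    class FiniteMemoryARM(nn.Module):
        # A single, shared MLP predicting the categorical distribution of x_d
        # from the pair (x_{d-1}, x_{d-2}).
        def __init__(self, M=300, num_vals=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(2, M), nn.ReLU(), nn.Linear(M, num_vals))

        def forward(self, x_prev):
            # x_prev: a batch of pairs [x_{d-1}, x_{d-2}] of shape (batch_size, 2)
            return torch.softmax(self.net(x_prev), dim=-1)  # theta_d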

Fig. 2.1 An example of applying a shared MLP depending on two last inputs. Inputs are denoted
by blue nodes (bottom), intermediate representations are denoted by orange nodes (middle), and
output probabilities are denoted by green nodes (top). Notice that a probability θd is not dependent
on xd

It is important to notice that now we use a single, shared MLP to predict


probabilities for xd . Such a model is not only non-linear but also its parameterization
is convenient due to a relatively small number of weights to be trained. However,
the obvious drawback of this approach is a limited memory (i.e., only the last two
variables in our example). Moreover, it is unclear a priori how many variables we
should use in conditioning. In many problems, e.g., image processing, learning long-
range statistics is crucial to understand complex patterns in data; therefore, having
long-range memory is essential.

2.2.2 Long-Range Memory Through RNNs

A possible solution to the problem of the short-range memory of an MLP-based model relies on applying a recurrent neural network (RNN) [1, 2]. In other words, we can
model the conditional distributions as follows [3]:

p(xd |x<d ) = p (xd |RNN(xd−1 , hd−1 )) , (2.7)

where hd = RNN(xd−1 , hd−1 ), and hd is a hidden context, which acts as a memory


that allows learning long-range dependencies. An example of using an RNN is
presented in Fig. 2.2.
This approach gives a single parameterization, thus, it is efficient and also solves
the problem of a finite memory. So far so good! Unfortunately, RNNs suffer from
other issues, namely:
• They are sequential, hence, slow.

Fig. 2.2 An example of applying an RNN depending on two last inputs. Inputs are denoted by blue
nodes (bottom), intermediate representations are denoted by orange nodes (middle), and output
probabilities are denoted by green nodes (top). Notice that compared to the approach with a shared
MLP, there is an additional dependency between intermediate nodes hd

• If they are badly conditioned (i.e., the eigenvalues of a weight matrix are larger or smaller than 1), then they suffer from exploding or vanishing gradients, respectively, which hinders learning long-range dependencies.
There exist methods that help in training RNNs, like gradient clipping or, more generally, gradient regularization [4], or orthogonal weights [5]. However, here we are not interested in such rather specific solutions to new problems. We seek a different parameterization that could solve our original problem, namely, modeling long-range dependencies in an ARM.
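A minimal sketch of the RNN-based parameterization in Eq. (2.7) could look as follows; it uses a GRU cell, and all names and sizes are illustrative assumptions rather than a reference implementation:

    import torch
    import torch.nn as nn

    class RNNARM(nn.Module):
        # Predicts p(x_d | x_{<d}) by summarizing the past in a hidden context h_d.
        def __init__(self, hidden_dim=256, num_vals=256):
            super().__init__()
            self.cell = nn.GRUCell(input_size=1, hidden_size=hidden_dim)
            self.out = nn.Linear(hidden_dim, num_vals)

        def forward(self, x):
            # x: (batch_size, D) sequence of pixel values
            batch_size, D = x.shape
            h = torch.zeros(batch_size, self.cell.hidden_size, device=x.device)
            probs = []
            for d in range(D):
                # theta_d depends only on x_{<d} through the hidden context h
                probs.append(torch.softmax(self.out(h), dim=-1))
                h = self.cell(x[:, d:d + 1].float(), h)
            return torch.stack(probs, dim=1)  # (batch_size, D, num_vals)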

2.2.3 Long-Range Memory Through Convolutional Nets

In [6, 7] it was noticed that convolutional neural networks (CNNs) could be used
instead of RNNs to model long-range dependencies. To be more precise, one-
dimensional convolutional layers (Conv1D) could be stacked together to process
sequential data. The advantages of such an approach are the following:
• Kernels are shared (i.e., an efficient parameterization).
• The processing is done in parallel, which greatly speeds up computations.
• By stacking more layers, the effective kernel size grows with the network depth.
These three traits seem to place Conv1D-based neural networks as a perfect solution
to our problem. However, can we indeed use them straight away?
A Conv1D can be applied to calculate embeddings like in [7], but it cannot be
used for autoregressive models. Why? Because we need convolutions to be causal
[8]. Causal in this context means that a Conv1D layer is dependent on the last k inputs, either excluding the current one (option A) or including the current one (option B). In other words, we must “cut” the kernel in half and forbid it to look into the next variables

Fig. 2.3 An example of applying causal convolutions. The kernel size is 2, but by applying dilation
in higher layers, a much larger input could be processed (red edges), thus, a larger memory is
utilized. Notice that the first layer must be of option A to ensure proper processing

(look into the future). Importantly, the option A is required in the first layer because
the final output (i.e., the probabilities θd ) cannot be dependent on xd . Additionally,
if we are concerned about the effective kernel size, we can use dilation larger
than 1.
In Fig. 2.3 we present an example of a neural network consisting of 3 causal
Conv1D layers. The first CausalConv1D is of type A, i.e., it takes into account only the last k inputs, excluding the current one. Then, in the next two layers, we use CausalConv1D (option B) with dilations 2 and 3. Typically, the dilation values are 1, 2, 4, and 8 [9]; however, dilations of 2 and 4 would not fit nicely in a figure. We highlight in red all connections that go from the output layer to the input layer. As we can notice, stacking CausalConv1D layers with dilation larger than 1 allows us to learn long-range dependencies (in this example, by looking at the last 7 inputs).
An example of an implementation of a CausalConv1D layer is presented below. If you are still confused about options A and B, please analyze the code snippet step by step.
import torch
import torch.nn as nn


class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation, A=False, **kwargs):
        super(CausalConv1d, self).__init__()

        # The general idea is the following: We take the built-in PyTorch Conv1D.
        # Then, we must pick a proper padding, because we must ensure the
        # convolution is causal. Eventually, we must remove some final elements
        # of the output, because we simply don't need them! Since CausalConv1D is
        # still a convolution, we must define the kernel size, dilation, and
        # whether it is option A (A=True) or option B (A=False). Remember that by
        # playing with the dilation we can enlarge the size of the memory.

        # attributes:
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.A = A  # whether option A (A=True) or B (A=False)
        self.padding = (kernel_size - 1) * dilation + A * 1

        # we will do padding by ourselves in the forward pass!
        self.conv1d = torch.nn.Conv1d(in_channels, out_channels,
                                      kernel_size, stride=1,
                                      padding=0,
                                      dilation=dilation, **kwargs)

    def forward(self, x):
        # We do padding only from the left! This is a more efficient implementation.
        x = torch.nn.functional.pad(x, (self.padding, 0))
        conv1d_out = self.conv1d(x)
        if self.A:
            # Remember, we cannot be dependent on the current component;
            # therefore, the last element is removed.
            return conv1d_out[:, :, :-1]
        else:
            return conv1d_out

Listing 2.1 Causal convolution 1D
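As a quick sanity check (our own illustrative snippet, not part of the accompanying repository), we can stack the three layers from Fig. 2.3 and use gradients to verify which inputs the output at a given position actually depends on:

    import torch

    # Stack three causal layers as in Fig. 2.3: option A first, then option B with dilations 2 and 3.
    layers = torch.nn.Sequential(
        CausalConv1d(1, 1, kernel_size=2, dilation=1, A=True),
        CausalConv1d(1, 1, kernel_size=2, dilation=2, A=False),
        CausalConv1d(1, 1, kernel_size=2, dilation=3, A=False),
    )

    x = torch.randn(1, 1, 16, requires_grad=True)
    y = layers(x)

    # The gradient of the output at position d reveals which inputs it depends on.
    d = 12
    y[0, 0, d].backward()
    receptive = (x.grad[0, 0] != 0).nonzero().flatten()
    print(receptive)  # positions d-7, ..., d-1; the current input x_d is never used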

The CausalConv1D layers are better-suited to modeling sequential data than


RNNs. They obtain not only better results (e.g., classification accuracy) but also
allow learning long-range dependencies more efficiently than RNNs [8]. Moreover,
they do not suffer from exploding/vanishing gradient issues. As a result, they seem
to be a perfect parameterization for autoregressive models! Their supremacy has
been proven in many cases, including audio processing by WaveNet, a neural
network consisting of CausalConv1D layers [9], or image processing by PixelCNN,
a model with CausalConv2D components [10].
Then, is there any drawback of applying autoregressive models parameterized by
causal convolutions? Unfortunately, yes, there is and it is connected with sampling.
If we want to evaluate probabilities for given inputs, we need to calculate the
forward pass where all calculations are done in parallel. However, if we want to
sample new objects, we must iterate through all positions (think of a big for-loop,
from the first variable to the last one) and iteratively predict probabilities and sample
new values. Since we use convolutions to parameterize the model, we must do D full
forward passes to get the final sample. That is a big waste, but, unfortunately, that
is the price we must pay for all “goodies” following from the convolution-based parameterization of the ARM. Fortunately, there is ongoing research on speeding up computations, e.g., see [11].

2.3 Deep Generative Autoregressive Model in Action!

Alright, let us talk more about details and how to implement an ARM. Here, and in
the whole book, we focus on images, e.g., x ∈ {0, 1, . . . , 15}^64. Since images are represented by integers, we will use the categorical distribution to represent them (in the next chapters, we will comment on the choice of distribution for images and present
some alternatives). We model p(x) using an ARM parameterized by CausalConv1D
layers. As a result, each conditional is the following:

p(xd | x<d) = Categorical(xd | θd(x<d))    (2.8)
            = ∏_{l=1}^{L} θd,l^[xd = l],    (2.9)

where [a = b] is the Iverson bracket (i.e., [a = b] = 1 if a = b, and [a = b] = 0 if a ≠ b), and θd(x<d) ∈ [0, 1]^16 is the output of the CausalConv1D-based neural network with the softmax in the last layer, so ∑_{l=1}^{L} θd,l = 1. To be very clear, the last layer must have 16 output channels (because there are 16 possible values per pixel), and the softmax is taken over these 16 values. We stack CausalConv1D layers with non-linear activation functions in between (e.g., LeakyReLU). Of course, we must remember to take the option A CausalConv1D as the first layer! Otherwise, we would break the assumption that θd does not depend on xd.
What about the objective function? ARMs are likelihood-based models, so for given N i.i.d. datapoints D = {x1, . . . , xN}, we aim at maximizing the logarithm of the likelihood function, that is (we will use the product and sum rules again):

ln p(D) = ln ∏_n p(xn)    (2.10)
        = ∑_n ln p(xn)    (2.11)
        = ∑_n ln ∏_d p(xn,d | xn,<d)    (2.12)
        = ∑_n ∑_d ln p(xn,d | xn,<d)    (2.13)
        = ∑_n ∑_d ln Categorical(xn,d | θd(xn,<d))    (2.14)
        = ∑_n ∑_d ( ∑_{l=1}^{L} [xn,d = l] ln θd,l(xn,<d) ).    (2.15)

For simplicity, we assumed that x<1 = ∅, i.e., no conditioning. As we can notice,


the objective function takes a very nice form! First, the logarithm over the i.i.d. data
D results in a sum over datapoints of the logarithm of individual distributions p(xn ).
Second, applying the product rule, together with the logarithm, results in another
sum, this time over dimensions. Eventually, by parameterizing the conditionals by
CausalConv1D, we can calculate all θd in one forward pass and then check the pixel
value (see the last line of ln p(D)). Ideally, we want θd,l to be as close to 1 as
possible if xd = l.
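To connect this objective with standard deep learning practice, note that the last line of ln p(D) is exactly the negative of the cross-entropy loss summed over pixels; the snippet below is a small illustrative check of this correspondence (all tensor names and sizes are ours):

    import torch
    import torch.nn.functional as F

    batch_size, D, L = 4, 64, 16
    logits = torch.randn(batch_size, D, L)    # unnormalized network outputs for every position
    theta = torch.softmax(logits, dim=-1)     # theta_d(x_{<d}) for every d
    x = torch.randint(0, L, (batch_size, D))  # pixel values in {0, ..., 15}

    # Log-likelihood per image: sum over dimensions of log theta_{d, x_d}.
    log_p = torch.log(theta.gather(-1, x.unsqueeze(-1))).squeeze(-1).sum(-1)

    # The same quantity computed with the standard cross-entropy loss (up to the sign).
    nll = F.cross_entropy(logits.reshape(-1, L), x.reshape(-1), reduction='sum') / batch_size
    assert torch.allclose(-log_p.mean(), nll, atol=1e-4)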

2.3.1 Code

Uff... Alright, let's take a look at some code. The full code is available at the following: https://github.com/jmtomczak/intro_dgm. Here, we focus only on the code for the model. We provide details in the comments.
class ARM(nn.Module):
    def __init__(self, net, D=2, num_vals=256):
        super(ARM, self).__init__()

        # Remember, always credit the author, even if it's you ;)
        print('ARM by JT.')

        # This is a definition of a network. See the next listing.
        self.net = net
        # This is how many values a pixel can take.
        self.num_vals = num_vals
        # This is the problem dimensionality (the number of pixels).
        self.D = D

    # This function calculates the ARM output.
    def f(self, x):
        # First, we apply causal convolutions.
        h = self.net(x.unsqueeze(1))
        # In channels, we have the number of values. Therefore, we change the order of dims.
        h = h.permute(0, 2, 1)
        # We apply softmax to calculate probabilities.
        p = torch.softmax(h, 2)
        return p

    # The forward pass calculates the log-probability of an image.
    def forward(self, x, reduction='avg'):
        if reduction == 'avg':
            return -(self.log_prob(x).mean())
        elif reduction == 'sum':
            return -(self.log_prob(x).sum())
        else:
            raise ValueError('reduction could be either `avg` or `sum`.')

    # This function calculates the log-probability (log-categorical).
    # See the full code in the separate file for details.
    def log_prob(self, x):
        mu_d = self.f(x)
        log_p = log_categorical(x, mu_d, num_classes=self.num_vals, reduction='sum', dim=-1).sum(-1)

        return log_p

    # This function implements the sampling procedure.
    def sample(self, batch_size):
        # As you can notice, we first initialize a tensor with zeros.
        x_new = torch.zeros((batch_size, self.D))

        # Then, iteratively, we sample a value for a pixel.
        for d in range(self.D):
            p = self.f(x_new)
            x_new_d = torch.multinomial(p[:, d, :], num_samples=1)
            x_new[:, d] = x_new_d[:, 0]

        return x_new

Listing 2.2 Autoregressive model parameterized by causal convolutions 1D
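The helper log_categorical used above is provided in the accompanying repository; a minimal sketch that is consistent with the way it is called here (our reconstruction, not necessarily the exact implementation) could be:

    import torch
    import torch.nn.functional as F

    EPS = 1e-5

    def log_categorical(x, p, num_classes=256, reduction=None, dim=None):
        # x: integer-valued tensor of targets, p: probabilities with shape (..., num_classes)
        x_one_hot = F.one_hot(x.long(), num_classes=num_classes)
        log_p = x_one_hot * torch.log(torch.clamp(p, EPS, 1. - EPS))
        if reduction == 'avg':
            return torch.mean(log_p, dim)
        elif reduction == 'sum':
            return torch.sum(log_p, dim)
        else:
            return log_p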

# An example of a network. NOTICE: The first layer is A=True, while all the others are A=False.
# At this point we should know already why :)
MM = 256        # the number of channels in the hidden layers
kernel = 7      # the kernel size (an example value)
num_vals = 16   # the number of possible pixel values (16 in our running example)

net = nn.Sequential(
    CausalConv1d(in_channels=1, out_channels=MM, dilation=1, kernel_size=kernel, A=True, bias=True),
    nn.LeakyReLU(),
    CausalConv1d(in_channels=MM, out_channels=MM, dilation=1, kernel_size=kernel, A=False, bias=True),
    nn.LeakyReLU(),
    CausalConv1d(in_channels=MM, out_channels=MM, dilation=1, kernel_size=kernel, A=False, bias=True),
    nn.LeakyReLU(),
    CausalConv1d(in_channels=MM, out_channels=num_vals, dilation=1, kernel_size=kernel, A=False, bias=True))

Listing 2.3 An example of a network
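To see how these pieces fit together, here is a hedged sketch of a possible training loop and sampling call, using the ARM class and the network defined above; the random dummy data, the optimizer, and the hyperparameter values are illustrative assumptions, not the exact setup behind Fig. 2.4:

    import torch

    D = 64  # the problem dimensionality in our running example (e.g., flattened 8x8 images)

    model = ARM(net, D=D, num_vals=num_vals)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # A dummy dataset of random "images"; it is here only to make the sketch self-contained.
    data = torch.randint(0, num_vals, (128, D)).float()
    loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

    for epoch in range(5):
        for x_batch in loader:
            optimizer.zero_grad()
            loss = model(x_batch, reduction='avg')  # the (average) negative log-likelihood
            loss.backward()
            optimizer.step()

    # After training, unconditional samples are generated pixel by pixel (D forward passes).
    x_sampled = model.sample(batch_size=16)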



Fig. 2.4 An example of outcomes after the training: (a) Randomly selected real images. (b)
Unconditional generations from the ARM. (c) The validation curve during training


Fig. 2.5 An example of a masked 3×3 kernel (i.e., a causal 2D kernel): (left) A difference between
a standard kernel (all weights are used; denoted by green) and a masked kernel (some weights are
masked, i.e., not used; in red). For the masked kernel, we denoted the node (pixel) in the middle in
violet, because it is either masked (option A) or not (option B). (middle) An example of an image
(light orange nodes: zeros, light blue nodes: ones) and a masked kernel (option A). (right) The
result of applying the masked kernel to the image (with padding equal to 1)

Perfect! Now we are ready to run the full code. After training our ARM, we
should obtain results similar to those in Fig. 2.4.

2.4 Is It All? No!

First of all, we discussed one-dimensional causal convolutions that are typically


insufficient for modeling images due to their spatial dependencies in 2D (or 3D if
we consider more than 1 channel; for simplicity, we focus on a 2D case). In [10], a
CausalConv2D was proposed. The idea is similar to that discussed so far, but now
we need to ensure that the kernel will not look into future pixels in both the x-axis
and y-axis. In Fig. 2.5, we present the difference between a standard kernel where
all kernel weights are used and a masked kernel with some weights zeroed-out (or
masked). Notice that in CausalConv2D we must also use option A for the first layer
(i.e., we skip the pixel in the middle) and we can pick option B for the remaining
layers. In Fig. 2.6, we present the same example as in Fig. 2.5 but using numeric
values.
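A common way to obtain such a causal (masked) 2D convolution is to zero out part of the kernel before every forward pass; the sketch below follows this idea (it illustrates the masking logic only and is not code from the accompanying repository):

    import torch
    import torch.nn as nn

    class MaskedConv2d(nn.Conv2d):
        # A Conv2d whose kernel is masked so that a pixel never sees "future" pixels.
        # Option A (mask_type='A') also hides the current pixel; option B keeps it.
        def __init__(self, mask_type, *args, **kwargs):
            super().__init__(*args, **kwargs)
            assert mask_type in ('A', 'B')
            _, _, kH, kW = self.weight.shape
            mask = torch.ones_like(self.weight)
            # zero out everything to the right of the center in the middle row ...
            mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0
            # ... and all rows below the center
            mask[:, :, kH // 2 + 1:, :] = 0
            self.register_buffer('mask', mask)

        def forward(self, x):
            self.weight.data *= self.mask
            return super().forward(x)

For instance, MaskedConv2d('A', 1, 64, kernel_size=3, padding=1) could play the role of the first layer, followed by layers with mask_type='B'.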

Fig. 2.6 The same example as in Fig. 2.5 but with numeric values

In [12], the authors propose a further improvement on the causal convolutions.


The main idea relies on creating a block that consists of vertical and horizontal
convolutional layers. Moreover, they use a gated non-linearity, namely:

h = tanh(Wx) ⊙ σ(Vx).    (2.16)

See Figure 2 in [12] for details.
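As a small sketch of Eq. (2.16), the gated non-linearity could be implemented with a single convolution that produces 2·C channels, which are then split into the tanh-part and the sigmoid-part (this is our illustration; in the actual PixelCNN block, W and V are realized by masked/causal convolutions):

    import torch
    import torch.nn as nn

    class GatedActivation(nn.Module):
        # h = tanh(W x) ⊙ sigmoid(V x), with W and V realized as one conv with 2C output channels.
        def __init__(self, channels, kernel_size=3):
            super().__init__()
            self.conv = nn.Conv2d(channels, 2 * channels, kernel_size, padding=kernel_size // 2)

        def forward(self, x):
            wx, vx = self.conv(x).chunk(2, dim=1)  # split into the W-part and the V-part
            return torch.tanh(wx) * torch.sigmoid(vx)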


Further improvements on ARMs applied to images are presented in [13]. Therein,
the authors propose to replace the categorical distribution used for modeling pixel
values with the discretized logistic distribution. Moreover, they suggest using a mixture of discretized logistic distributions to further increase the flexibility of their
ARMs.
The introduction of the causal convolution opened multiple opportunities for
deep generative modeling and allowed obtaining state-of-the-art generations and
density estimations. It is impossible to review all papers here, so we name only a few
interesting directions/applications that are worth remembering:
• An alternative ordering of pixels was proposed in [14]. Instead of using the
ordering from left to right, a “zig–zag” pattern was proposed that allows pixels
to depend on pixels previously sampled to the left and above.
• ARMs could be used as stand-alone models, or they can be used in combination
with other approaches. For instance, they can be used for modeling a prior in the
(Variational) Auto-Encoders [15].
• ARMs could be also used to model videos [16]. Factorization of sequential data
like video is very natural, and ARMs fit this scenario perfectly.
• A possible drawback of ARMs is a lack of latent representation because all
conditionals are modeled explicitly from data. To overcome this issue, [17]
proposed to use a PixelCNN-based decoder in a Variational Auto-Encoder.
• An interesting and important research direction is about proposing new architec-
tures/components of ARMs or speeding them up. As mentioned earlier, sampling
from ARMs could be slow, but there are ideas to improve on that by predictive
sampling [11, 18].
• Alternatively, we can replace the likelihood function with other similarity met-
rics, e.g., the Wasserstein distance between distributions as in quantile regression.
In the context of ARMs, quantile regression was applied in [19], requiring only
minor architectural changes, which resulted in improved quality scores.

• Transformers [20] constitute an important class of models. These models use self-attention layers instead of causal convolutions.
• Multi-scale ARMs were proposed to scale high-quality images logarithmically
instead of quadratically. The idea is to make local independence assumptions
[21] or impose a partitioning on the spatial dimensions [22]. Even though these
ideas allow lowering the memory requirements, sampling remains rather slow.

References

1. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical
evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint
arXiv:1412.3555, 2014.
2. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
3. Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural
networks. In ICML, 2011.
4. Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent
neural networks. In International conference on machine learning, pages 1310–1318. PMLR,
2013.
5. Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks.
In International Conference on Machine Learning, pages 1120–1128. PMLR, 2016.
6. Ronan Collobert and Jason Weston. A unified architecture for natural language processing:
Deep neural networks with multitask learning. In Proceedings of the 25th international
conference on Machine learning, pages 160–167, 2008.
7. Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network
for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics, pages 212–217. Association for Computational Linguistics, 2014.
8. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolu-
tional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
9. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
10. Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks.
In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
11. Auke Wiggers and Emiel Hoogeboom. Predictive sampling with forecasting autoregressive
models. In International Conference on Machine Learning, pages 10260–10269. PMLR, 2020.
12. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4797–4805, 2016.
13. Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
14. Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model. In International Conference on Machine Learning, pages 864–872. PMLR, 2018.
15. Amirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen. Video
compression with rate-distortion autoencoders. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 7033–7042, 2019.
Other documents randomly have
different content
“A piece shall not by removing itself uncover the
kindred King to the attack of a hostile piece.”
Thus, it is clear, that a pinned piece is a disqualified piece; its
powers are dormant and by the laws of the game it is temporarily
reduced to an inert mass, and deprived of every faculty normally
appertaining to it as a Chess-piece. On the other hand, as is equally
obvious, the pinning piece is in full possession of its normal powers
and is qualified to perform every function.
To hold that a piece disqualified by the laws of the game can
nullify the activities of a piece in full possession of its powers, is to
assert that black is white, that the moon is made of green cheese,
that the tail can wag the dog, or any other of those things which
have led the German to promulgate his caustic formula on the
Anglo-Saxon.
Hence, artificially to nullify the normal powers of an active and
potential piece which is operating in conformity to the laws of the
game, and artificially to revivify the dormant powers of a piece
disqualified by the same laws; to debar the former from exercising
its legitimate functions and to permit the latter to exercise functions
from which by law, it specifically is debarred, is a self-evident
incongruity and any argument whereby such procedure is upheld,
necessarily and obviously, is sophistry.

No less interesting than instructive and conclusive, is reference of


this question to those intellectual principles which give birth to the
game of Chess, per se, viz.:
As a primary fundamental, with the power to give check, is
associated concurrently the obligation upon the King thus checked,
not to remain in check.
Secondly: The totality of powers assigned to the Chess-pieces is
the ability to move, provided the King be free from check. This
totality of powers may be denoted by the indefinite symbol, X.
The play thus has for its object:
The reduction to zero of the adverse X, by the operation of the
kindred X.
This result is checkmate in its generalized form. In effect, it is the
destruction of the power of the adverse pieces to move, by means of
check made permanent.
By the law of continuity it is self-evident that:
The power to move appertaining either to White or to Black, runs
from full power to move any piece (a power due to freedom from
check), down to total inability to move any piece, due to his King
being permanently checked, i.e., checkmated.
This series cannot be interrupted without obvious violation of the
ethics of the game; because, so long as any part of X remains, the
principle from which the series emanated still operates, and this
without regard to quantity of X remaining unexpended.
Thus, a game of Chess is a procedure from total ability to total
disability; i.e., from one logical whole to another; otherwise, from X
to zero.
Checkmate, furnishes the limit to the series; the game and X
vanish together.
This is in perfect keeping with the law of continuity, which acts
and dominates from beginning to end of the series, and so long as
any part of X remains.
Hence to permit either White or Black to move any piece, leaving
his King in check, is an anomaly.

Denial to the Pawn of ability to move to the rear is an accurate


interpretation of military ethics.
Of those puerile hypotheses common to the man who does not
know, one of the most entrancing to the popular mind, is the notion
that Corps d’armee properly are of equal numbers and of the same
composition.
This supposition is due to ignorance of the fact that the
multifarious duties of applied Strategetics, require for their execution
like variety of instruments, which diversity of means is strikingly
illustrated by the differing movements of the Chess-pieces.
The inability of the Pawn to move backward strategically
harmonizes with its functions as a Corps of Position, in contradiction
to the movements of the pieces, which latter are Corps of Evolution.
This restriction in the move of the Pawn is in exact harmony with
the inability of the Queen to move on obliques, of the Rook to move
on obliques or on diagonals, of the Bishop to move on obliques,
verticals and horizontals, of the Knight to move on diagonals,
verticals, and horizontals, and of the King to move like any other
piece.

Possessed of the invaluable privilege of making the first move in


the game, knowing that no move should be made without an object,
understanding that the true object of every move is to minimize the
adverse power for resistance and comprehending that all power for
resistance is derived from facility of movement, the student easily
deduces the true object of White’s initial move in every game of
Chess, viz.:

PRINCIPLE
To make the first of a series of movements, each of which shall
increase the mobility of the kindred pieces and correspondingly
decrease the mobility of the adverse pieces.
As the effect of such policy, the power for resistance appertaining
to Black, ultimately must become so insufficient that he no longer
will be able adequately to defend:

1. His base of operations.


2. The communications of his army with its base.
3. The communications of his corps d’armee with each other, or,
4. To prevent the White hypothetical force penetrating to its
Logistic Horizon.

To produce this fatal weakness in the Black position by the


advantage of the first move is much easier for White than commonly
is supposed.
The process consists in making only those movements by means
of which the kindred corps d’armee, progressively occupying
specified objectives, are advanced, viz.:

I. To the Strategetic Objective, when acting against the


communications of the adverse Determinate Force and its
Base of Operations.
II. To the Logistic Horizon, when acting against the
communications between the adverse Determinate and the
adverse Hypothetical Forces.
III. To the Strategic Vertices, when acting against the
communications of the hostile corps d’armee with each
other.

To bring about either of these results against an opponent equally


equipped and capable, of course is a much more difficult task than
to checkmate an enemy incapable of movement.
Yet such achievement is possible to White and with exact play it
seemingly is a certainty that he succeeds in one or the other, owing
to his inestimable privilege of first move.
For the normal advantage that attaches to the first move in a
game of Chess is vastly enhanced by a peculiarity in the
mathematical make-up of the surface of the Chess-board, whereby,
he who makes the first move may secure to himself the advantage
in mobility, and conversely may inflict upon the second player a
corresponding disadvantage in mobility.
This peculiar property emanates from this fact:
The sixty-four points, i.e., the sixty-four centres of
the squares into which the surface of the Chess-board
is divided, constitute, when taken collectively, the
quadrant of a circle, whose radius is eight points in
length.
Hence, in Chessic mathematics, the sides of the Chessboard do
not form a square, but the segment of a circumference.
To prove the truth of this, one has but to count the points
contained in the verticals and horizontals and in the hypothenuse of
each corresponding angle, and in every instance it will be found that
the number of points contained in the base, perpendicular, and
hypothenuse, is the same.
For example:
Let the eight points of the King’s Rook’s file form the perpendicular
of a right angle triangle, of which the kindred first horizontal forms
the base; then, the hypothenuse of the given angle, will be that
diagonal which extends from QR1 to KR8. Now, merely by the
processes of simple arithmetic, it may be shown that there are,

1. Eight points in the base.


2. Eight points in the perpendicular.
3. Eight points in the hypothenuse.

Consequently the three sides of this given right angled triangle are
equal to each other, which is a geometric impossibility.
Therefore, it is self-evident that there exists a mathematical
incongruity in the surface of the Chess-board.
That is, what to the eye seems a right angled triangle, is in its
relations to the movements of the Chess-pieces, an equilateral
triangle. Hence, the Chess-board, in its relations to the pieces when
the latter are at rest, properly may be regarded as a great square
sub-divided into sixty-four smaller squares; but on the contrary, in
those calculations relating to the Chess-pieces in motion, the Chess-
board must be regarded as the quadrant of a circle of eight points
radius. The demonstration follows, viz.:
Connect by a straight line the points KR8 and QR8. Connect by
another straight line the points QR8 and QR1. Connect each of the
fifteen points through which these lines pass with the point KR1, by
means of lines passing through the least number of points
intervening.
Then the line KR8 and QR8 will represent the segment of a circle
of which latter the point KR1 is the center. The lines KR1-KR8 and
KR1-QR1 will represent the sides of a quadrant contained in the
given circle and bounded by the given segment, and the lines drawn
from KR1 to the fifteen points contained in the given segment of the
given circumference, will be found to be fifteen equal radii each
eight points in length.

Having noted the form of the Static or positional surface of the


Chess-board and its relations to the pieces at rest, and having
established the configuration of the Dynamic surface upon which the
pieces move, it is next in sequence to deduce that fundamental fact
and to give it that geometric expression which shall mathematically
harmonize these conflicting geometric figures in their relations to
Chess-play.
As the basic fact of applied Chessic forces, it is to be noted, that:
PRINCIPLE
The King is the SOURCE from whence the Chess-pieces derive all
power of movement; and from his ability to move, emanates
ALL power for attack and for defence possessed by a Chessic
army.

This faculty of mobility, derived from the existence of the kindred


King, is the all essential element in Chess-play, and to increase the
mobility of the kindred pieces and to reduce that of the adverse
pieces is the simple, sure and only scientific road to victory; and by
comparison of the Static with the Dynamic surface of the Chess-
board, the desired principle readily is discovered, viz.,
The Static surface of the Chess-board being a
square, its least division is into two great right angled
triangles having a common hypothenuse.
The Dynamic surface being the quadrant of a circle,
its least division also is into two great sections, one of
which is a right angled triangle and the other a semi-
circle.
Comparing the two surfaces of the Chess-board thus divided, it
will be seen that these three great right angled triangles are equal,
each containing thirty-six points; and having for their common
vertices, the points KR1, QR1 and R8.
Furthermore, it will be seen that the hypothenuse common to
these triangles, also is the chord of that semi-circle which appertains
to the Dynamic surface.
Again, it will be perceived that this semi-circle, like the three right
angled triangles, is composed of thirty-six points, and consequently
that all of the four sub-divisions of the Static and Dynamic surfaces
of the Chess-board are equal.
Thus it obviously follows, that:
1. The great central diagonal, always is one side of each of the
four chief geometric figures into which the Chess board is
divided; that:
2. It mathematically perfects each of these figures and
harmonizes each to all, and that:
3. By means of it each figure becomes possessed of eight more
points than it otherwise would contain.

Hence, the following is self-evident:

PRINCIPLE
That Chessic army which can possess itself of the great central
diagonal, thereby acquires the larger number of points upon
which to act and consequently greater facilities for movement;
and conversely:

By the loss of the great central diagonal, the mobility of the


opposing army is correspondingly decreased.

It therefore is clear that the object of any series of movements by


a Chessic army acting otherwise than on Line of Operations, should
be:

PRINCIPLE
Form the kindred army upon the hypothenuse of the right angled
triangle which is contained within the Dynamic surface of the
Chess-board; and conversely,

Compel the adverse army to act exclusively within that semi-circle


which appertains to the same surface.
Under these circumstances, the kindred corps will be possessed of
facilities for movement represented by thirty-six squares; while the
logistic area of the opposing army will be restricted to twenty-eight
squares.
There are, of course, two great central diagonals of the Chess-
board; but as the student is fully informed that great central
diagonal always is to be selected, which extends towards the
Objective Plane.

Mobility, per se, increases or decreases with the number of


squares open to occupation.
But in all situations there will be points of no value, while other
points are of value inestimable; for the reason that the occupation of
the former will not favorably affect the play, or may even lose the
game; while by the occupation of the latter, victory is at once
secured.
But it is not the province of Mobility to pass on the values of
points; this latter is the duty of Strategy. It is sufficient for Mobility
that it provide superior facilities for movement; it is for Strategy to
define the Line of Movement; for Logistics, by means of this Line of
Movement, to bring into action in proper times and sequence, the
required force, and for Tactics, with this force, to execute the proper
evolutions.
Mobility derives its importance from three things which may occur
severally or in combination, viz.:

1. All power for offense or for defense is eliminated from a


Chess-piece the instant it loses its ability to move.
2. The superiority possessed by corps acting offensively over
adverse corps acting defensively, resides in that the attack
of a piece is valid at every point which it menaces; while
the defensive effort of a piece, as a rule, is valid only at a
single point. Consequently:

PRINCIPLE
Increased facilities for movement enhance the power of attacking
pieces in a much greater degree than like facilities enhance the
power of defending pieces.

Such increasing facilities for movement ultimately render an


attacking force irresistible, for the reason that it finally becomes a
physical impossibility for the opposing equal force to provide valid
defences for the numerous tactical keys, which at a given time
become simultaneously assailed. Hence:

PRINCIPLE
Superior facilities for occupying any point at any time and with any
force, always ensure the superior force at a given point, at a
given time.

The relative advantage in mobility possessed by one army over an


opposing army always can be determined by the following, viz.:

RULE
1. That army whose strategic front of operations is established upon
the Strategetic Center has the relative advantage in Mobility.

2. To utilize the advantage in Mobility extend the Strategic Front in


the direction of the objective plane.
3. To neutralize the relative disadvantage in Mobility eliminate that
adverse Corps d’armee which tactically expresses such adverse
advantage; or so post the Prime Strategetic Point as to vitiate
the adverse Strategic front.

Advantage in Mobility is divided into two classes, viz.:

I. General Advantage in Mobility.


II. Special Advantage in Mobility.

A General Advantage in Mobility consists in the ability to act


simultaneously against two or more vital points by means of interior
logistic radii due to position between:—

1. The adverse army and its Base of Operations.


2. Two or more adverse Grand Columns.
3. The wings of a hostile Grand Column.
4. Two or more isolated adverse Corps d’armee.

Such position upon interior lines of movement is secured by


occupying either of the Prime Offensive Origins, i.e.:

1. Strategic Center vs. Adverse Formation in Mass.


2. Logistic Center vs. Adverse Formation by Grand Columns.
3. Tactical Center vs. Adverse Formation by Wings.
4. Logistic Triune vs. Adverse Formation by Corps.

Special Advantage in Mobility consists in the ability of a corps


d’armee to traverse greater or equal distances in lesser times than
opposing corps.
MILITARY EXAMPLES

“Never interrupt your enemy when he is making a false


movement.”—Napoleon.

In the year (366 B.C.) the King of Sparta, with an army of 30,000
men marched to the aid of the Mantineans against Thebes.
Epaminondas took up a post with his army from whence he equally
threatened Mantinea and Sparta. Agesilaus incautiously moved too
far towards the coast, whereupon Epaminondas, with 70,000 men
precipitated himself upon Lacedaemonia, laying waste the country
with fire and sword, all but taking by storm the city of Sparta and
showing the women of Lacedaemonia the campfire of an enemy for
the first time in six hundred years.

Flaminius advancing incautiously to oppose Hannibal, the latter


took up a post with his army from whence he equally threatened the
city of Rome and the army of the Consul. In the endeavor to rectify
his error, the Roman general committed a worse and was destroyed
with his entire army.

At Thapsus, April 6, 46 B.C., Caesar took up a post with his army


from whence he equally threatened the Roman army under Scipio
and the African army under Juba. Scipio having marched off with his
troops to a better camp some miles distant, Caesar attacked and
annihilated Juba’s army.
At Pirna, Frederic the Great, captured the Saxon army entire, and
at Rossbach, Leuthern and Zorndorf destroyed successively a
French, an Austrian and a Russian army merely by occupying a post
from whence he equally threatened two or more vital points,
awaiting the time when one would become inadequately defended.

Washington won the Revolutionary War merely by occupying a


post from whence he equally threatened the British armies at New
York and Philadelphia; refusing battle and building up an army of
Continental regular troops enlisted for the war and trained by the
Baron von Steuben in the system of Frederic the Great.

Bonaparte won at Montenotte, Castiglione, Arcola, Rivoli and


Austerlitz his most perfect exhibitions of generalship, merely by
passively threatening two vital points and in his own words: “By
never interrupting an enemy when he is making a false movement.”

Perfection in Mobility is attained whenever the kindred army is


able to act unrestrainedly in any and all directions, while the
movements of the hostile army are restricted.
NUMBERS

“In warfare the advantage in numbers never is to be


despised.”—Von Moltke.

“Arguments avail but little against him whose opinion is


voiced by thirty legions.”—Roman Proverb.

“That king who has the most iron is master of those who
merely have the more gold.”—Solon.

“It never troubles the wolf how many sheep there are.”—
Agesilaus.
NUMBERS

“A handful of troops inured to Warfare proceed to certain


victory; while on the contrary, numerous hordes of raw and
undisciplined men are but a multitude of victims dragged to
slaughter.”—Vegetius.

“Turenne always was victorious with armies infinitely


inferior in numbers to those of his enemies; because he
moved with expedition, knew how to secure himself from
being attacked in every situation and always kept near his
enemy.”—Count de Saxe.

“Numbers are of no significance when troops are once


thrown into confusion.”—Prince Eugene.

Humanity is divisible into two groups, one of which relatively is


small and the other, by comparison, very large.
The first of these groups is made up comparatively of but a few
persons, who, by virtue of circumstances are possessed of
everything except adequate physical strength; and the second group
consists of those vast multitudes of mankind, which are destitute of
everything except of incalculable prowess, due to their
overwhelming numbers.
Hence, at every moment of its existence, organized Society is face
to face with the possibility of collision into the Under World; and
because of the knowledge that such encounter is inevitable,
unforeseeable and perhaps immediately impending, Civilization, so-
called, ever is beset by an unspeakable and all-corroding fear.
To deter a multitude, destitute of everything except the power to
take, from despoiling by means of its irresistible physique, those
few who are possessed of everything except ability to defend
themselves, in all Ages has been the chiefest problem of
mankind; and to the solution of this problem has been devoted
every resource known to Education, Legislation, Ecclesiasticism
and Jurisprudence.

This condition further is complicated by a peculiar outgrowth of


necessary expedients, always more or less unstable, due to that
falsity of premise in which words do not agree with acts.
Of these expedients the most incongruous is the arming and
training of the children of the mob for the protection of the upper
stratum; and that peculiar mental insufficiency of hoi polloi, whereby
it ever is induced to accept as its leaders the sons of the Patrician
class.
That a social structure founded upon such anomalies should
endure, constitutes in itself the real Nine Wonders of the World; and
is proof of that marvellous ingenuity with which the House of Have
profits by the chronic predeliction of the House of Want to fritter
away time and opportunity, feeding on vain hope.

The advantage in Numbers consists in having in the aggregate


more Corps d’armee than has the adversary.
All benefit to be derived from the advantage in Numbers is limited
to the active and scientific use of every corp d’armee; otherwise
excess of Numbers, not only is of no avail, but easily may
degenerate into fatal disadvantage by impeding the decisive action
of other kindred corps. Says Napoleon: “It is only the troops brought
into action, that avails in battles and campaigns—the rest does not
count.”
A loss in Numbers at chess-play occurs only when two pieces are
lost for one, or three for two, or one for none, and the like. No
diminution in aggregate of force can take place on the Chess-board,
so long as the number of the opposing pieces are equal.
This is true although all the pieces on one side are Queens and
those of the other side all Pawns.
The reason for this is:
All the Chess-pieces are equal in strength, one to the other. The
Pawn can overthrow and capture any piece—the Queen can do no
more.
That is to say, at its turn to move, any piece can capture any
adverse piece; and this is all that any piece can do.
It is true that the Queen, on its turn to move, has a maximum
option of twenty-seven squares, while the Pawn’s maximum never is
more than three. But as the power of the Queen can be exerted only
upon one point, obviously, her observation of the remaining twenty-
six points is merely a manifestation of mobility, and her display of
force is limited to a single square. Hence, the result in each case is
identical, and the display of force equal.
The relative advantage in Numbers possessed by one army over
an opposing army always can be determined by the following, viz.:

RULE
That army which contains more Corps d’armee than an opposing
army has the relative advantage in Numbers.
“With the inferiority in Numbers, one must depend more
upon conduct and contrivance than upon strength.”—Caesar.

MILITARY EXAMPLES

“He who has the advantage in Numbers, if he be not a


blockhead, incessantly will distract his enemy by
detachments, against all of which it is impossible to provide a
remedy.”—Frederic the Great.

“He that hath the advantage in Numbers usually should


exchange pieces freely, because the fewer that remain the
more readily are they oppressed by a superior force.”—Dal
Rio.

At Thymbra, Cyrus the Great, king of the Medes and Persians, with
10,000 horse cuirassiers, 20,000 heavy infantry, 300 chariots and
166,000 light troops, conquered Croesus, King of Assyria whose
army consisted of 360,000 infantry and 60,000 cavalry. This victory
made Persia dominant in Asia.

At Marathon, 10,000 Athenian and 1,000 Plataean heavy infantry,


routed 110,000 Medes and Persians. This victory averted the
overthrow of Grecian civilization by Asiatic barbarism.
At Leuctra, Epaminondas, general of the Thebans, with 6000
heavy infantry and 400 heavy horse, routed the Lacedaemonean
army, composed of 22,000 of the bravest and most skillful soldiers of
the known world, and extinguished the military ascendency which
for centuries Sparta had exercised over the Grecian commonwealths.
At Issus, Alexander the Great with 40,000 heavy infantry and
7,000 heavy cavalry destroyed the army of Darius Codomannus, King
of Persia, which consisted of 1,000,000 infantry, 40,000 cavalry, 200
chariots and 15 elephants. This battle, in which white men
encountered elephants for the first time, established the military
supremacy of Europe over Asia.

Alexander the Great invaded Asia (May, 334 B.C.), whose armies
aggregated 3,000,000 men trained to war, with 30,000 heavy
infantry, 4,000 heavy cavalry, $225,000 in money and thirty
days’ provisions.
At Arbela, Alexander the Great with 45,000 heavy infantry and
8,000 heavy horse, annihilated the last resources of Darius and
reduced Persia to a Greek province. The Persian army consisted of
about 600,000 infantry and cavalry, of whom 300,000 were killed.

Hannibal began his march from Spain (218 B.C.) to invade the
Roman commonwealth, with 90,000 heavy infantry and 12,000
heavy cavalry. He arrived at Aosta in October (218 B.C.) with only
20,000 infantry and 6,000 cavalry to encounter a State that could
put into the field 700,000 of the bravest and most skillful soldiers
then alive.
At Cannae, Hannibal destroyed the finest army Rome ever put in
the field. Out of 90,000 of the flower of the commonwealth only
about 3,000 escaped. The Carthaginian army consisted of 40,000
heavy infantry and 10,000 heavy cavalry.

At Alesia (51 B.C.) Caesar completed the subjugation of Gaul by
destroying in detail two hostile armies aggregating 470,000 men.
The Roman army consisted of 43,000 heavy infantry, 10,000 heavy
cavalry and 10,000 light cavalry.

At Pharsaleus (48 B.C.) Caesar with 22,000 Roman veterans routed
45,000 soldiers under Pompey and acquired the chief place in
the Roman state.

At Angora (1402) Tamerlane, with 1,400,000 Asiatics, destroyed
the Turkish army of 900,000 men, commanded by the Ottoman
Sultan Bajazet, in the most stupendous battle of authentic record.
After giving his final instructions to his officers, Tamerlane, it is
recorded, betook himself to his tent and played at Chess until the
crisis of the battle arrived, whereupon he proceeded to the decisive
point and in person directed those evolutions which resulted in the
destruction of the Ottoman army.
The assumption that the great Asiatic warrior was playing at Chess
during the earlier part of the battle of Angora, undoubtedly is
erroneous. Most probably he followed the progress of the conflict by
posting chess-pieces upon the Chessboard and moving these
according to reports sent him momentarily by his lieutenants.
Obviously, in the days when field telegraphy and the telephone
were unknown, such a method was entirely feasible and satisfactory
to the Master of Strategetics, and far superior to any attempt to
overlook such a confused and complicated concourse.

At Bannockburne (June 24, 1314), Robert Bruce, King of Scotland,
with 30,000 Scots annihilated the largest army that England ever put
upon a battlefield.
This army was led by Edward II and consisted of over 100,000 of
the flower of England’s nobility, gentry and yeomanry. The victory
established the independence of Scotland and cost England 30,000
troops, which could not be replaced in that generation.

Gustavus Adolphus invaded Germany with an army of 27,000 men,
over one-half of whom were Scots and English. At that time the
Catholic armies in the field aggregated several hundred thousand
trained and hardened soldiers, led by brave and able generals.
At Leipsic, after 20,000 Saxon allies had fled from the battlefield,
Gustavus Adolphus with 22,000 Swedes, Scots and English routed
44,000 of the best troops of the day, commanded by Gen. Tilly. This
victory delivered the Protestant princes of Continental Europe from
Catholic domination.

At Zentha (Sept. 11, 1697), Prince Eugene with 60,000 Austrians
routed 150,000 Turks, commanded by the Sultan Kara-Mustapha,
with the loss of 38,000 killed, 4,000 prisoners and 160 cannon. This
victory established the military reputation of this celebrated French
General.

At Turin (Sept. 7, 1706) Prince Eugene with 30,000 Austrians
routed 80,000 French under the Duke of Orleans. Gen. Daun, whose
brilliant evolutions decided the battle, afterward, as Field-Marshal of
the Austrian armies, was routed by Frederic the Great at Leuthern.

At Peterwaradin (Aug. 5, 1716) Prince Eugene with 60,000
Austrians destroyed 150,000 Turks. This victory delivered Europe for
all time from the menace of Mahometan dominion.

At Belgrade (Aug. 26, 1717) Prince Eugene with 55,000 Austrians
destroyed a Turkish army of 200,000 men.

At Rosbach (Nov. 5, 1757) Frederic the Great with 22,000
Prussians, in open field, destroyed a French army of 70,000 regulars
commanded by the Prince de Soubisse.

At Leuthern (Dec. 5, 1757) Frederic the Great with 33,000
Prussians destroyed, in open field, an Austrian army of 93,000
regulars, commanded by Field-Marshal Daun. The Austrians lost
54,000 men and 200 cannon.
At Zorndorf (Aug. 25, 1758) Frederic the Great with 45,000
Prussians destroyed a Russian army of 60,000 men commanded by
Field-Marshal Fermor. The Russians left 18,000 men dead on the
field.

At Leignitz (Aug. 15, 1760) Frederic the Great with 30,000 men
out-manoeuvred the combined Austrian and Russian armies
aggregating 130,000 men, defeated them with the loss of 10,000
men, and escaped.

At Torgau (Nov. 5, 1760) Frederic the Great with 45,000 Prussians
destroyed an Austrian army of 90,000 men, commanded by Field-
Marshal Daun.

Washington, with 7,000 Americans, while pursued by 20,000
British and Hessians under Lord Cornwallis, captured a Hessian
advance column at Trenton (Dec. 25, 1776) and destroyed a British
detachment at Princeton, (Jan. 3, 1777).

Bonaparte, with 30,000 infantry, 3,000 cavalry and 40 cannon,
invaded Italy (March 26, 1796), which was defended by 100,000
Piedmontese and Austrian regulars under Generals Colli and
Beaulieu. In fifteen days he had captured the former, driven the
latter to his own country and compelled Piedmont to sign a treaty of
peace and alliance with France.
At Castiglione, Arcole, Bassano and Rivoli, with an army not
exceeding 40,000 men Bonaparte destroyed four Austrian armies,
each aggregating about 100,000 men.

At Wagram, Napoleon, with less than 100,000 men, overthrew the
main Austrian army of 150,000 men, foiled the attempts at succor of
the secondary Austrian army of 40,000 men, and compelled Austria
to accept peace with France.
In the campaign of 1814, Napoleon, with never more than 70,000
men, twice repulsed from the walls of Paris, and drove backward
nearly to the Rhine River, an allied army of nearly 300,000
Austrians, Prussians and Russians.

In the year 480 B.C., Xerxes, King of Persia, invaded Greece with
an army which is estimated by Herodotus, Plutarch and Isocrates
at 2,641,610 men at arms, exclusive of servants, butlers, women
and camp followers.
Arriving at the Pass of Thermopylae, the invaders found their
march arrested by Leonidas, King of Sparta, with an army made up
of 300 Spartans, 400 Thebans, 700 Thespians, 1,000 Phocians and
3,000 from various Grecian States, posted behind a barricade built
across the entrance.
This celebrated defile is about a mile in length. It runs between
Mount Oeta and an impassable morass, which forms the edge of the
Gulf of Malia, and at each end is so narrow that a wagon can barely
pass.
