
Bayesian Analysis (2022) TBA, Number TBA, pp.

Data Augmentation for Bayesian Deep Learning


Yuexi Wang∗, Nicholas Polson∗ and Vadim O. Sokolov†

Abstract. Deep Learning (DL) methods have emerged as one of the most powerful tools for functional approximation and prediction. While the representation properties of DL have been well studied, uncertainty quantification remains challenging and largely unexplored. Data augmentation techniques are a natural approach to provide uncertainty quantification and to incorporate stochastic Monte Carlo search into stochastic gradient descent (SGD) methods. The purpose of our paper is to show that training DL architectures with data augmentation leads to efficiency gains. We use the theory of scale mixtures of normals to derive data augmentation strategies for deep learning. This allows variants of the expectation-maximization and MCMC algorithms to be brought to bear on these high-dimensional nonlinear deep learning models. To demonstrate our methodology, we develop data augmentation algorithms for a variety of commonly used activation functions: logit, ReLU, leaky ReLU and SVM. Our methodology is compared to traditional stochastic gradient descent with back-propagation. Our optimization procedure leads to a version of iteratively re-weighted least squares and can be implemented at scale with accelerated linear algebra methods, providing substantial improvements in speed. We illustrate our methodology on a number of standard datasets. Finally, we conclude with directions for future research.

Keywords: deep learning, data augmentation, MCMC, back-propagation, SGD.

1 Introduction
Deep neural networks (DNNs) have become a central tool for Artificial Intelligence (AI) applications such as image processing (ImageNet, Krizhevsky et al. (2012)), object recognition (ResNet, He et al. (2016)) and game intelligence (AlphaGoZero, Silver et al. (2016)). The approximability (Poggio et al., 2017; Bauer and Kohler, 2019) and rate of convergence of deep learning, whether in the frequentist fashion (Schmidt-Hieber, 2020) or from a Bayesian predictive point of view (Polson and Rockova, 2018; Wang and Rockova, 2020), have been well explored and understood. Fan et al. (2021) provide a selective overview of deep learning. However, training deep learners is challenging due to the high-dimensional search space and the non-convex objective function. Deep neural networks have also suffered from issues such as local traps, miscalibration and overfitting.
Various efforts have been made to improve the generalization performance and many
of their roots lie in Bayesian modeling. For example, Dropout (Wager et al., 2013) is
commonly used and can be viewed as a deterministic ridge $\ell_2$ regularization. Sparsity
structure via spike-and-slab priors (Polson and Rockova, 2018) on weights helps DNNs

arXiv: 1903.09668
∗ Booth School of Business, University of Chicago, Chicago, IL, [email protected],
[email protected].
† Volgenau School of Engineering, George Mason University, Fairfax, VA, [email protected].

© 2022 International Society for Bayesian Analysis

adapt to smoothness and avoid overfitting. Rezende et al. (2014) propose stochastic
back-propagation through the use of latent Gaussian variables.
In this paper, following the spirit of hierarchical Bayesian modeling, we develop
data augmentation strategies for deep learning with a complete data likelihood function
equivalent to weighted least squares regression. By using the theory of mean-variance
mixtures of Gaussians, our latent variable representation brings all of the conditionally
linear model theory to deep learning. For example, it allows for the straightforward
specification of uncertainty at each layer of deep learning and for a wide range of reg-
ularization penalties. Our method applies to commonly used activation functions such
as ReLU, leaky ReLU, logit (see also Gan et al. (2015)), and provides a general frame-
work for training and inference in DNNs. It inherits the advantages and disadvantages of
data augmentation schemes. Approximation methods such as Expectation-Maximization (EM) and Minorize-Maximization (MM) are stable because they monotonically increase the objective, but they can be slow in the neighborhood of the maximum even with acceleration schemes such as Nesterov acceleration, and their performance depends heavily on the properties of the objective function. Stochastic exploratory methods like
MCMC have the main advantage of addressing uncertainty quantification (UQ) and are stable in the sense that they require no tuning. Hyper-parameter estimation is immediately
available using traditional Bayesian methods. DA augments the objective function with
extra hidden units which allow for efficient step size selection for the gradient descent
search. In some of the applications, data augmentation methods can be formulated in
terms of complete data sufficient statistics, a considerable advantage when dealing with
large datasets where most of the computational expense comes from repeatedly iterating
over the data. By combining the MCMC methods with the J-copies trick (Jacquier et al., 2007), we can move faster towards the posterior mode and avoid local maxima. Traditional methods for training deep learning models, such as stochastic gradient descent (SGD), have none of the above advantages. We also note that we exploit the advantages of SGD
and accelerated linear algebra methods when we implement our weighted least squares
regression step.
Data augmentation strategies are commonplace in statistical algorithms and ac-
celerated convergence (Nesterov, 1983; Green, 1984) is available. Our goal is to show
similar efficiency improvements for deep learning. Our work builds on Deng et al. (2019)
who provide adaptive empirical Bayes methods. In particular, we show how to handle standard activation functions, including ReLU (Polson and Rockova, 2018), logistic (Zhou et al., 2012; Hernández-Lobato and Adams, 2015) and SVM (Mallick et al., 2005), and provide specific data augmentation strategies and algorithms. The core subroutine of the resulting algorithms solves a least squares problem.
Scalable linear algebra libraries such as Compute Unified Device Architecture (CUDA)
and accelerated linear algebra (XLA) are available for implementation. To illustrate our approach, we experiment empirically with two benchmark datasets using Pólya-Gamma data augmentation for logit activation functions. For the deep architecture embedded in our approach, we adopt deep ReLU networks. Deep networks are able to achieve the same level of approximation accuracy with exponentially fewer parameters for compositional functions (Mhaskar et al., 2017). Poggio et al. (2017) further show how deep networks can avoid the curse of dimensionality. The ReLU function is favored due to its ability to avoid vanishing gradients and for its expressibility and inherent sparsity. Approximation properties of deep ReLU networks have been developed
in Montufar et al. (2014), Telgarsky (2017), and Liang and Srikant (2017). Yarotsky
(2017) and Schmidt-Hieber (2020) show that deep ReLU networks can yield a rate-
optimal approximation of smooth functions of an arbitrary order. Polson and Rockova
(2018) provide posterior rates of convergence for sparse deep learning.
There is another active area of research that revives traditional statistical models
with the computational power of DL (Bhadra et al., 2021). Examples include Gaus-
sian Process models (Higdon et al., 2008; Gramacy and Lee, 2008), Generalized Linear
Models (GLM) and Generalized Linear Mixed Models (GLMM) (Tran et al., 2020) and
Partial Least Squares (PLS) (Polson et al., 2021). Our method benefits from the computational efficiency and expressive flexibility of deep neural networks. In addition, our work builds on the sampling-optimization literature (Pincus, 1968, 1970), which now uses MCMC methods. Other examples include Ma et al. (2019), who show that sampling can be faster than optimization, and Neelakantan et al. (2017), who show that gradient noise can improve learning for very deep networks. Gan et al. (2015) implement data augmentation for learning deep sigmoid belief networks. Neal (2011) and Chen et al. (2014) provide Hamiltonian Monte Carlo (HMC) algorithms for MCMC. Duan et al. (2018) propose a family of calibrated data-augmentation algorithms to increase the effective sample size.
The rest of our paper is outlined as follows. Section 2 provides the general setting
of deep neural networks and shows how DA can be integrated into deep learning us-
ing the duality between Bayesian simulation and optimization. Section 3 describes our
data augmentation (DA) schemes and two approaches to implement them. Section 4
provides applications to Gaussian regression, support vector machines and logistic re-
gression using Pólya-Gamma augmentation (Polson et al., 2013). Section 5 presents experiments with DA on both regression and classification problems. Section 6 concludes with directions for future research.

2 Bayesian Deep Learning


In deep learning we wish to recover a multivariate predictive map $f_\theta(\cdot)$ denoted by
$$ y = f_\theta(x), $$
where $y = (y_1, \ldots, y_n)^\top$, $y_i \in \mathbb{R}$, denotes a univariate output and $x = (x_1, \ldots, x_n)^\top$, $x_i \in \mathbb{R}^p$, a high-dimensional set of inputs. Given training data of input-output pairs $\{y_i, x_i\}_{i=1}^n$, the goal is to provide a predictive rule that generalizes well out-of-sample for a new input variable $x_\star$,
$$ y_\star = f_{\hat\theta}(x_\star), $$
where $\hat\theta$ is estimated from the training data, typically using SGD. The interest in deep learners lies in their ability to outperform additive rules in such interpolation and prediction problems. Other statistical alternatives include Gaussian processes, but they often have difficulty handling higher dimensions.

Deep learners use compositions (Kolmogorov, 1957; Vitushkin, 1964) of ridge functions rather than the additive functions that are commonplace in statistical applications. We denote by $L \in \mathbb{N}$ the number of hidden layers and by $p_l \in \mathbb{N}$ the number of neurons at the $l$th layer. Setting $p_{L+1} = p$ and $p_0 = p_1 = 1$, we denote by $\mathbf{p} = (p_0, p_1, \ldots, p_{L+1}) \in \mathbb{N}^{L+2}$ the vector of neuron counts for the entire network. Composing $L$ layers, a deep predictor takes the form
$$ y = f_\theta(x) = (f_{W_0,b_0} \circ f_{W_1,b_1} \circ \cdots \circ f_{W_L,b_L})(x), \tag{2.1} $$
where $b_l \in \mathbb{R}^{p_l}$ is a shift vector, $W_l \in \mathbb{R}^{p_{l-1}\times p_l}$ is a weight matrix linking the neurons of the $(l-1)$th and $l$th layers, and $f_{W_l,b_l}(x) = f_l(W_l x + b_l)$ is a semi-affine function. We denote by $\theta = \{(W_0, b_0), (W_1, b_1), \ldots, (W_L, b_L)\}$ the stacked parameters. We can rewrite the compositions in (2.1) with a set of latent variables $Z = (Z_1, Z_2, \ldots, Z_L)^\top$ as
$$ y = f_0(Z_1 W_0 + b_0), \qquad Z_l = f_l(Z_{l+1} W_l + b_l), \quad l = 1, \ldots, L, \qquad Z_{L+1} = x, \tag{2.2} $$
where $Z_l \in \mathbb{R}^{n\times p_l}$ is the matrix of hidden nodes in the $l$th layer. We only consider the case $p_1 = 1$, so that $Z_1 \in \mathbb{R}^n$, in this work; we discuss extensions to $p_1 > 1$ for some of our applications in Section 4.
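To fix notation, the recursion (2.2) with an identity top-layer link can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the layer sizes and the ReLU choice below are assumptions for the example.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def forward(x, weights, biases):
    """Evaluate the deep learner of (2.2): Z_{L+1} = x, Z_l = f_l(Z_{l+1} W_l + b_l),
    with an identity top-layer link f_0, so y_hat = Z_1 W_0 + b_0."""
    Z = x                                               # Z_{L+1} = x, shape (n, p)
    for W, b in zip(weights[:0:-1], biases[:0:-1]):     # layers L, ..., 1
        Z = relu(Z @ W + b)                             # Z_l = f_l(Z_{l+1} W_l + b_l)
    return Z @ weights[0] + biases[0]                   # top layer (f_0 = identity)

# Toy example: n = 5 observations, p = 3 inputs, one hidden layer of 4 units.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
weights = [rng.normal(size=(4, 1)), rng.normal(size=(3, 4))]  # [W_0, W_1]
biases = [np.zeros(1), np.zeros(4)]                           # [b_0, b_1]
print(forward(x, weights, biases).shape)                      # (5, 1)
```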

2.1 Bayesian Simulation and Regularization Duality


The problem of deep learning regularization (Polson and Sokolov, 2017) is to find a set of parameters $\theta$ that minimizes a combination of a negative log-likelihood $\ell(y, f_\theta(x))$ and a penalty function $\phi(\theta)$:
$$ \hat\theta := \arg\min_\theta \ \sum_{i=1}^n \ell(y_i, f_\theta(x_i)) + \lambda\sum_{j=1}^{\#\theta}\phi(\theta_j), \tag{2.3} $$
where $\lambda$ controls the amount of regularization and $\#\theta$ denotes the number of parameters in $\theta$.

When the function $f_\theta(x)$ is a deep learner as in (2.1), we can specify a different penalty level $\lambda_l$ and regularization function $\phi_l(\cdot)$ for each layer. The objective function can then be written as
$$ \hat\theta = \arg\min_\theta \ \frac{1}{n}\sum_{i=1}^n \ell(y_i, f_\theta(x_i)) + \sum_{l=0}^{L}\lambda_l\phi_l(W_l, b_l). \tag{2.4} $$
Commonly used regularization techniques for deep learners include $L_2$ (weight decay), spike-and-slab regularization (Polson and Rockova, 2018) and dropout (Wager et al., 2013), which can also be viewed as a variant of $L_2$-regularization.
As such the optimization problem in (2.4) of training a deep learner fθ (·) involves a
highly nonlinear objective function. Stochastic gradient descent (SGD) is a popular tool
based on back-propagation (a.k.a. the chain rule), but it often suffers from local traps
and overfitting due to the non-convex nature of the problem. We propose data augmen-
tation techniques which can be seamlessly applied in this context and provide efficiency
gains. This is achieved via the hierarchical duality between optimization with regulariza-
tion and finding the maximum a posteriori (MAP) estimate (Polson and Scott, 2011),
as described in the following lemma.
Lemma 2.1. The regularization problem
$$ \hat\theta = \arg\min_\theta \Big\{\frac{1}{n}\sum_{i=1}^n \ell(y_i, f_\theta(x_i)) + \sum_{l=0}^{L}\lambda_l\phi_l(W_l, b_l)\Big\} $$
is equivalent to finding the Bayesian MAP estimator defined by
$$ \arg\max_\theta p(\theta \mid y) = \arg\max_\theta \exp\Big\{-\frac{1}{n}\sum_{i=1}^n \ell(y_i, f_\theta(x_i)) - \sum_{l=0}^{L}\lambda_l\phi_l(W_l, b_l)\Big\}, $$
which corresponds to the mode of the posterior distribution characterized as
$$ p(\theta \mid y) = p(y \mid \theta)\, p(\theta)/p(y), $$
with
$$ p(y \mid \theta) \propto \exp\Big\{-\sum_{i=1}^n \ell\big(y_i, f_\theta(x_i)\big)\Big\}, \qquad p(\theta) \propto \exp\Big\{-\sum_{l=0}^{L}\lambda_l\phi_l(W_l, b_l)\Big\}. $$
Here $p(\theta)$ can be interpreted as a prior probability distribution and the negative log-prior as the regularization penalty.
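To make the duality concrete, consider squared-error loss with a ridge penalty; the following correspondence is a standard special case, added here only to illustrate Lemma 2.1:
$$ \frac{1}{n}\sum_{i=1}^n \tfrac{1}{2}\big(y_i - f_\theta(x_i)\big)^2 + \lambda\sum_{j=1}^{\#\theta}\theta_j^2 \quad\Longleftrightarrow\quad p(y \mid \theta) \propto \exp\Big\{-\frac{1}{2n}\sum_{i=1}^n\big(y_i - f_\theta(x_i)\big)^2\Big\}, \quad \theta_j \overset{iid}{\sim} N\Big(0, \frac{1}{2\lambda}\Big), $$
since the Gaussian prior gives $-\log p(\theta) = \lambda\sum_j\theta_j^2 + \text{const}$, so minimizing the penalized loss is exactly maximizing the posterior $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$.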

2.2 A Stochastic Top Layer


By exploiting the duality from Lemma 2.1, we wish to use a Bayesian framework to add stochastic layers, so as to fully account for the uncertainty in estimating the predictive rule $f_\theta(\cdot)$. Thus, we convert the sequence of composite functions in the deep learner (2.2) to a stochastic version given by
$$ y \mid Z_1 \sim p(y \mid Z_1), \qquad Z_l \sim N\big(f_l(W_l Z_{l+1} + b_l), \tau_l^2\big), \quad l = 1, 2, \ldots, L, \qquad Z_{L+1} = x. \tag{2.5} $$
The hidden variables $Z = (Z_1, \ldots, Z_L)^\top$ can now be viewed as data augmentation variables, which also allows for fast scalable algorithms for inference and prediction.
For ease of computation, we only replace the top layer of the DNN with a stochastic layer. We denote the network structure below the top layer by $B = \{(W_1, b_1), \ldots, (W_L, b_L)\}$, so that the network can be rewritten as
$$ y = f_0(Z_1 W_0 + b_0), \qquad Z_1 = f_B(x), $$
where $f_0(Z_1 W_0 + b_0)$ is the top-layer structure and $f_B(x)$ is the network architecture below the top layer. Considering the objective function in (2.4), we implement the solution as a two-step iterative search. At iteration $t$:

1. DA-update for the top layer $W_0, b_0$ as the MAP estimator of the distribution
$$ p(W_0, b_0 \mid Z_1^{(t)}, y) \propto p(y, Z_1^{(t)} \mid W_0, b_0)\, p(W_0, b_0) \propto \exp\Big\{-\Big(\frac{1}{n}\sum_{i=1}^n \ell(y_i, f_\theta(x_i) \mid B^{(t)}) + \lambda_0\phi_0(W_0, b_0)\Big)\Big\}. \tag{2.6} $$

2. SGD-update for the deep architecture $B$:
$$ B^{(t+1)} = \arg\min_B\ \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, f_\theta(x_i) \mid (W_0, b_0)^{(t+1)}\big) + \sum_{l=1}^{L}\lambda_l\phi_l(W_l, b_l) = \arg\min_B\ \frac{1}{n}\sum_{i=1}^n \ell\big(Z_1^{(t)}, f_B(x_i)\big) + \sum_{l=1}^{L}\lambda_l\phi_l(W_l, b_l). $$

3. Sample $Z_1^{(t+1)}$ from a normal distribution $N(\mu_z^{(t)}, \sigma_z^{(t)})$, where $\mu_z^{(t)}$ and $\sigma_z^{(t)}$ are determined jointly by $\{\theta^{(t)}, x, y\}$.
The main contribution of our work comes from two aspects: (1) we update the top layer weights $\{W_0, b_0\}$ conditional on $B$ as in (2.6), which is equivalent to conditioning on $Z_1$, using the data augmentation techniques shown in Section 3; and (2) the latent variable $Z_1$ is sampled from a normal distribution rather than optimized by gradient descent. $Z_1$ serves as a bridge connecting a weighted $L_2$-norm model $f_0$ and a deep learner $f_B$. Commonly used activation functions $\{f_l\}_{l=1}^L$ include affine functions, rectified linear units (ReLU), the sigmoid, and the hyperbolic tangent (tanh). We illustrate our methods with a deep ReLU network, i.e., $\{f_l\}_{l=1}^L$ are ReLU functions, due to its expressibility and inherent sparsity. In the next section, we introduce our data augmentation strategies and show how the stochastic layers can be realized via data augmentation.

3 Data Augmentation for Deep Learning


Data augmentation introduces a vector of auxiliary variables, denoted by $\omega = (\omega_1, \ldots, \omega_n)^\top$ with $\omega_i \in \mathbb{R}$, such that the posterior can be written as
$$ p(\theta \mid y) = E_\omega\big[p(\theta, \omega \mid y)\big], $$
where the augmented distribution $p(\theta, \omega \mid y)$ factorizes nicely into the complete conditionals $p(\theta \mid \omega, y)$ and $p(\omega \mid \theta, y)$. A crucial ingredient is that $p(\theta \mid \omega, y)$ is easily managed, typically via conditional Gaussians.

Data augmentation tricks allow us to express the likelihood as an expectation of a weighted $L_2$-norm. Specifically, we write
$$ \exp\big\{-\ell\big(y, f_\theta(x)\big)\big\} = E_\omega\Big[\exp\big\{-Q\big(y \mid f_\theta(x), \omega\big)\big\}\Big] = \int_0^\infty \exp\big\{-Q\big(y \mid f_\theta(x), \omega\big)\big\}\, p(\omega)\, d\omega, $$
where $p(\omega)$ is the prior on the auxiliary variables $\omega = (\omega_1, \ldots, \omega_n)^\top$ and the function $Q\big(y \mid f_\theta(x), \omega\big)$ is designed to be a quadratic form given the data augmentation variables $\omega$. The function $f_\theta(x) = (f_0 \circ \cdots \circ f_L)(x)$ is a deep learner.
Table 1 shows that standard activation functions such as ReLU, logit, lasso and check can be expressed in this mixture form. Commonly used activation functions for deep learning, with appropriate stochastic assumptions on $\omega$ (for notational simplicity, we state the single-observation case), can be expressed as
$$ \exp\{-\max(1 - x, 0)\} = E_\omega\Big[\frac{1}{\sqrt{2\pi\omega}}\exp\Big\{-\frac{1}{2\omega}(x - 1 - \omega)^2\Big\}\Big], \quad \text{where } \omega \sim \mathrm{GIG}(1, 0, 0), $$
$$ \exp\{-\log(1 + e^x)\} = E_\omega\Big[\exp\Big\{-\frac{1}{2}\omega x^2\Big\}\Big], \quad \text{where } \omega \sim \mathrm{PG}(1, 0), $$
$$ \exp\{-|x|\} = E_\omega\Big[\frac{1}{\sqrt{2\pi\omega}}\exp\Big\{-\frac{1}{2\omega}x^2\Big\}\Big], \quad \text{where } \omega \sim \mathcal{E}\Big(\frac{1}{2}\Big). $$
Here GIG denotes the Generalized Inverse Gaussian distribution, PG the Pólya-Gamma distribution (Polson et al., 2013), and $\mathcal{E}$ the exponential distribution.
$\ell(W, b)$ | Mixture identity | $p(\omega)$
ReLU: $\max(1 - z_i, 0)$ | $\int_0^\infty \frac{1}{\sqrt{2\pi c\lambda}}\exp\big\{-\frac{(x + a\lambda)^2}{2c\lambda}\big\}\, d\lambda = \frac{1}{a}\exp\big\{-\frac{2}{c}\max(ax, 0)\big\}$ | $\mathrm{GIG}(1, 0, 0)$
Logit: $\log(1 + e^{z_i})$ | $\frac{1}{2^b}\, e^{(a - b/2)\psi}\int_0^\infty e^{-\omega\psi^2/2}\, p(\omega)\, d\omega = \frac{(e^\psi)^a}{(1 + e^\psi)^b}$ | $\mathrm{PG}(b, 0)$
Lasso: $|z_i/\sigma|$ | $\int_0^\infty \frac{1}{\sqrt{2\pi c\lambda}}\exp\big\{-\frac{x^2}{2c\lambda}\big\}\, e^{-\frac{1}{2}\lambda}\, d\lambda = \frac{1}{c}\exp\big\{-\frac{|x|}{c}\big\}$ | $\mathcal{E}(\frac{1}{2})$
Check: $|z_i| + (2\tau - 1)z_i$ | $\int_0^\infty \frac{1}{\sqrt{2\pi c\lambda}}\exp\big\{-\frac{(x + (2\tau - 1)\lambda)^2}{2c^2\lambda}\big\}\, e^{-2\tau(1 - \tau)\lambda}\, d\lambda = \frac{1}{c}\exp\big\{-\frac{2}{c}\rho_\tau(x)\big\}$ | $\mathrm{GIG}(1, 0, 2\sqrt{\tau - \tau^2})$

Table 1: Data Augmentation Strategies. Here $\rho_\tau(x) = \frac{1}{2}|x| + \big(\tau - \frac{1}{2}\big)x$ is the check function.
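The integral identities in Table 1 can be checked numerically. The snippet below verifies the ReLU row, as reconstructed above, by one-dimensional quadrature; this is an editorial sanity check rather than material from the original paper.

```python
import numpy as np
from scipy.integrate import quad

def mixture_lhs(x, a=1.0, c=1.0):
    """Gaussian scale-mixture integral on the ReLU row of Table 1."""
    f = lambda lam: np.exp(-(x + a * lam) ** 2 / (2 * c * lam)) / np.sqrt(2 * np.pi * c * lam)
    return quad(f, 0.0, np.inf)[0]

def mixture_rhs(x, a=1.0, c=1.0):
    """Closed form on the right-hand side: (1/a) exp{-(2/c) max(ax, 0)}."""
    return np.exp(-2.0 * max(a * x, 0.0) / c) / a

for x in (-2.0, -0.5, 0.0, 0.7, 1.5):
    print(f"x = {x:+.1f}: integral = {mixture_lhs(x):.6f}, closed form = {mixture_rhs(x):.6f}")
```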

Using the data augmentation strategies, the objectives are represented as mixtures
of Gaussians. DA can perform such an optimization with only the use of a sequence of
iteratively re-weighted L2 -norms. This allows us to use XLA techniques to accelerate
the training.
Remark 3.1. The log-posterior is optimized given the training data $\{y_i, x_i\}_{i=1}^n$. Deep learning possesses the key property that $\nabla_\theta \log p(y \mid \theta, x)$ is computationally inexpensive to evaluate using tensor methods, even for very complicated architectures, and admits fast implementation on large datasets. One caveat is that the posterior is highly multi-modal, and good hyperparameter tuning can be expensive. This is clearly a fruitful area of research, where state-of-the-art stochastic MCMC methods could provide more efficient algorithms. For shallow architectures, the alternating direction method of multipliers (ADMM) is an efficient solution to the optimization problem.

Similarly, we can represent the regularization penalty $\exp\{-\lambda\phi(\theta)\}$ in data augmentation form. Hence, using the duality in Lemma 2.1, we can replace the optimization problem in (2.4) with
$$ \hat\theta := \arg\max_\theta\ E_\omega\Big[\exp\Big\{-\frac{1}{n}\sum_{i=1}^n Q\big(y_i \mid f_\theta(x_i), \omega\big) - \sum_{l=0}^{L}\lambda_l\phi_l(W_l, b_l)\Big\}\Big]. \tag{3.1} $$


There are two approaches to Monte Carlo optimization that could handle our data augmentation (Geyer, 1996): missing data methods such as Expectation-Maximization (EM), and stochastic search methods such as Markov chain Monte Carlo (MCMC). The first approach is based on a probabilistic approximation of the objective function (3.1) and is less concerned with exploring $\Theta$. The second is more exploratory: it aims to optimize the objective by visiting the entire range of $\Theta$ and is less tied to the properties of the function.
For EM algorithms, we construct a surrogate optimization problem with the same solution as (3.1) (Lange et al., 2000). Specifically, we define a new objective function
$$ H(\theta) = E_{\omega\mid\theta}\Big[\exp\Big\{-\frac{1}{n}\sum_{i=1}^n Q\big(y_i \mid f_\theta(x_i), \omega\big) - \sum_{l=0}^{L}\lambda_l\phi_l(W_l, b_l)\Big\}\Big], $$
to be maximized. A natural surrogate function can be constructed using Jensen's inequality as
$$ G\big(\theta \mid \theta^{(t)}\big) = -\frac{1}{n}\sum_{i=1}^n E_{\omega\mid\theta^{(t)}}\Big[Q\big(y_i \mid f_\theta(x_i), \omega\big)\Big] - \sum_{l=0}^{L}\lambda_l\phi_l(W_l, b_l), $$
where each $\omega_i$ is drawn from the conditional distribution $p(\omega_i \mid \theta) \propto p(\omega_i, \theta)$, and the minorization
$$ \log H(\theta) \geq G\big(\theta \mid \theta^{(t)}\big) $$
is satisfied. Maximizing $G\big(\theta \mid \theta^{(t)}\big)$ with respect to $\theta$ drives $H(\theta)$ uphill. The ascent property of the EM algorithm relies on the nonnegativity of the Kullback-Leibler divergence between two conditional probability densities (Hunter and Lange, 2004; Lange, 2013a). The EM algorithm enjoys numerical stability, as it steadily increases the likelihood without wildly overshooting or undershooting. It simplifies the optimization problem by (1) avoiding large matrix inversions; (2) linearizing the objective function; and (3) separating the variables of the optimization problem (Lange, 2013b). In Section 4.3 we show how Pólya-Gamma augmentation leads to an EM algorithm for logistic regression.
The exploratory alternative for solving (3.1) is stochastic search, such as MCMC. The data augmentation strategies enable us to sample from the joint posterior
$$ p(\theta \mid y) \propto \exp\Big\{-\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, f_\theta(x_i)\big) - \sum_{l=0}^{L}\lambda_l\phi_l(W_l, b_l)\Big\} = E_\omega\Big[\exp\Big\{-\frac{1}{n}\sum_{i=1}^n Q\big(y_i \mid f_\theta(x_i), \omega\big) - \sum_{l=0}^{L}\lambda_l\phi_l(W_l, b_l)\Big\}\Big] = \int_0^\infty \exp\Big\{-\frac{1}{n}\sum_{i=1}^n Q\big(y_i \mid f_\theta(x_i), \omega\big)\Big\}\, p(\omega)\, p(\theta)\, d\omega, $$
where the prior is related to the regularization penalty via $p(\theta) \propto \exp\big\{-\sum_{l=0}^{L}\lambda_l\phi_l(W_l, b_l)\big\}$.
Hence, we can run an MCMC algorithm in the augmented space $(\theta, \omega)$ and simulate from the joint posterior distribution, denoted by $p(\theta, \omega \mid y)$, namely
$$ p(\theta, \omega \mid y) \propto \exp\big\{-Q(y \mid f_\theta(x), \omega)\big\}\, p(\theta)\, p(\omega). $$
A sequence can be simulated using the MCMC Gibbs conditionals
$$ p\big(\theta^{(t)} \mid \omega^{(t)}, y\big) \propto \exp\big\{-Q(y \mid f_\theta(x), \omega^{(t)})\big\}\, p(\theta), \qquad p\big(\omega^{(t+1)} \mid \theta^{(t)}, y\big) \propto \exp\big\{-Q(y \mid f_{\theta^{(t)}}(x), \omega)\big\}\, p(\omega). $$
We then recover stochastic draws $\theta^{(t)} \sim p(\theta \mid y)$ from the marginal posterior. These draws can be used in prediction to account for predictive uncertainty, namely
$$ p\big(y_\star \mid f(x_\star)\big) = \int p\big(y_\star \mid \theta, f_\theta(x_\star)\big)\, p(\theta \mid y)\, d\theta \approx \frac{1}{T}\sum_{t=1}^{T} p\big(y_\star \mid \theta^{(t)}, f_{\theta^{(t)}}(x_\star)\big). \tag{3.2} $$

As $Q(y \mid f_\theta(x), \omega)$ is conditionally quadratic, the update step for $\theta \mid \omega, y$ can be carried out using SGD or a weighted $L_2$-norm: the weights $\omega$ are adaptive and provide an automatic choice of the learning rate, thus avoiding backtracking, which can be computationally expensive. Moreover, the performance of the MCMC search is less tied to the statistical properties (i.e., convexity or concavity) of the objective function. We provide examples of how Gaussian regression and SVMs can be implemented in Section 4.1 and Section 4.2.
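Given stored posterior draws, the Monte Carlo average in (3.2) is a one-liner; a minimal sketch, where `density` and the Gaussian example below are hypothetical stand-ins for the model-specific conditional $p(y_\star \mid \theta, f_\theta(x_\star))$:

```python
import numpy as np

def posterior_predictive(y_star, x_star, theta_draws, density):
    """Monte Carlo average in (3.2): mean of p(y* | theta, f_theta(x*)) over T draws."""
    return np.mean([density(y_star, x_star, theta) for theta in theta_draws])

# Example conditional density for a Gaussian top layer: y* | theta ~ N(W0 f_B(x*) + b0, tau0^2).
def gaussian_density(y_star, x_star, theta, tau0=0.1):
    W0, b0, fB = theta                           # fB: callable returning f_B(x*)
    mu = W0 * fB(x_star) + b0
    return np.exp(-(y_star - mu) ** 2 / (2 * tau0**2)) / np.sqrt(2 * np.pi * tau0**2)
```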

3.1 MCMC with J-copies


The MCMC methods offer a full description of the objective function (3.1) over the entire space $\Theta$. Inspired by the simulated annealing algorithm (Metropolis et al., 1953), we introduce a scaling factor $J$ that allows faster moves over the surface of (3.1) and helps avoid the trapping attraction of local maxima. In addition, the corresponding posterior is connected to the Boltzmann distribution, whose density is prescribed by the energy potential $f(\theta)$ and temperature $J$ as
$$ \pi_J(\theta) = \exp\{-J f(\theta)\}/Z_J \quad \text{for } \theta \in \Theta, \tag{3.3} $$
where $Z_J = \int_\Theta \exp\{-J f(\theta)\}\, d\theta$ is the appropriate normalizing constant.

To simulate from the posterior mode without evaluating the likelihood directly (Jacquier et al., 2007), we sample $J$ independent copies of the hidden variable $Z_1$. Denoting the copies by $Z_1^1, \ldots, Z_1^J$, we sample them simultaneously and independently from the posterior distribution
$$ Z_1^j \mid \theta, x, y \overset{iid}{\sim} N(\mu_z, \sigma_z^2), \quad j = 1, \ldots, J, $$
where $\mu_z, \sigma_z$ are determined by $\{x, y, \theta\}$. We stack the $J$ copies as
$$ y^{(S)} = (y, y, \ldots, y)^\top, \qquad Z_1^{(S)} = (Z_1^1, Z_1^2, \ldots, Z_1^J)^\top, \qquad f_B(x^{(S)}) = (f_B(x), f_B(x), \ldots, f_B(x))^\top, \tag{3.4} $$
where $y^{(S)}$, $Z_1^{(S)}$ and $f_B(x^{(S)})$ are $(n \times J)$-dimensional vectors. We use $Z_1^{(S)}$ to amplify the information in $y$, which is especially useful in finite-sample problems. Figure 1 illustrates our network architecture.

Figure 1: J-copies Network Architecture

With the stacked system, the joint distribution of the parameters $\theta$ and the augmented hidden variables $Z_1^{(S)}$ given data $y, x$ can be written as
$$ \pi_J\big(\theta, Z_1^{(S)} \mid x, y\big) \propto \prod_{j=1}^{J} p(y \mid \theta, Z_1^j)\, p(Z_1^j \mid \theta, x, y)\, p(\theta). $$
Hence, the marginal joint posterior
$$ p(\theta \mid x, y) = \int \pi_J\big(\theta, Z_1^{(S)} \mid x, y\big)\, dZ_1^{(S)} $$
concentrates on the density proportional to $p(x, y \mid \theta)^J p(\theta)$ and provides us with a simulation-based solution to finding the MAP estimator (Pincus, 1968, 1970).
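Constructing the stacked system (3.4) amounts to replication plus i.i.d. sampling; a minimal NumPy sketch (the function name and argument layout are ours, not the paper's):

```python
import numpy as np

def stack_j_copies(y, fB_x, mu_z, sigma_z, J, rng):
    """Build the stacked system (3.4): replicate y and f_B(x) J times and draw
    J i.i.d. copies of the latent layer Z_1 from N(mu_z, sigma_z^2)."""
    n = len(y)
    y_S = np.tile(y, J)                                    # y^(S), length n*J
    fB_S = np.tile(fB_x, J)                                # f_B(x^(S)), length n*J
    Z1_S = rng.normal(mu_z, sigma_z, size=(J, n)).ravel()  # Z_1^(S), copies stacked in order
    return y_S, Z1_S, fB_S
```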
Another way to simulate from the posterior mode is Hamiltonian Monte Carlo (Neal, 2011), a modification of the Metropolis-Hastings (MH) sampler. It adds a momentum variable $\nu$ to the Boltzmann distribution in (3.3) and generates draws from the joint distribution
$$ \pi_J(\theta, \nu) \propto \exp\big\{-J f(\theta) - \tfrac{1}{2}\nu^\top M^{-1}\nu\big\}, $$
where $M$ is a mass matrix. Chen et al. (2014) adopt this approach in a deep learning setting.

3.2 Connection to Diffusion Theory


An alternative to the MCMC algorithm can be derived from diffusion theory (Phillips and Smith, 1996). For example, we can approximate the random walk Metropolis-Hastings algorithm with the Langevin diffusion $L_t$ defined by the stochastic differential equation $dL_t = dB_t + \frac{1}{2}\nabla\log f(L_t)\, dt$, where $B_t$ is standard Brownian motion. More specifically, with $d := |\theta|$, we write the random-walk-like transition as
$$ \theta^{(t+1)} = \theta^{(t)} + \frac{\sigma^2}{2}\nabla\log f(\theta^{(t)}) + \sigma\epsilon_t, $$
where $\epsilon_t \sim N_d(0, I_d)$ and $\sigma^2$ corresponds to the discretization step size.
This can also be derived by taking a second-order approximation of $\log f$, namely
$$ \log f(\theta^{(t+1)}) = \log f(\theta^{(t)}) + \big(\theta^{(t+1)} - \theta^{(t)}\big)^\top \nabla\log f(\theta^{(t)}) - \frac{1}{2}\big(\theta^{(t+1)} - \theta^{(t)}\big)^\top H(\theta^{(t)})\big(\theta^{(t+1)} - \theta^{(t)}\big), $$
where $H(\theta^{(t)}) = -\nabla^2\log f(\theta^{(t)})$ is the Hessian matrix. Exponentiating both sides, the random-walk-type approximation to $f(\theta^{(t+1)})$ is
$$ f(\theta^{(t+1)}) \propto \exp\Big\{\big(\theta^{(t+1)} - \theta^{(t)}\big)^\top \nabla\log f(\theta^{(t)}) - \frac{1}{2}\big(\theta^{(t+1)} - \theta^{(t)}\big)^\top H(\theta^{(t)})\big(\theta^{(t+1)} - \theta^{(t)}\big)\Big\} \propto \exp\Big\{-\frac{1}{2}\big(\theta^{(t+1)} - \widetilde\theta^{(t)}\big)^\top H(\theta^{(t)})\big(\theta^{(t+1)} - \widetilde\theta^{(t)}\big)\Big\}, $$
where $\widetilde\theta^{(t)} = \theta^{(t)} + H^{-1}(\theta^{(t)})\nabla\log f(\theta^{(t)})$. If we simplify this approximation by replacing $H(\theta^{(t)})$ with $\sigma^{-2}I_d$, the Taylor approximation leads to the updating step
$$ \theta^{(t+1)} = \theta^{(t)} + \sigma^2\nabla\log f(\theta^{(t)}) + \sigma\epsilon_t. $$

Roberts and Rosenthal (1998) give further discussion on the choice of $\sigma$ that would yield an acceptance rate of 0.574 to achieve the optimal convergence rate.
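A minimal implementation of this discretized Langevin update, applied to a standard Gaussian target so the behavior is easy to verify (the target and step size are illustrative assumptions):

```python
import numpy as np

def langevin_step(theta, grad_log_f, sigma, rng):
    """One discretized Langevin move: drift (sigma^2 / 2) grad log f, noise scale sigma."""
    return theta + 0.5 * sigma**2 * grad_log_f(theta) + sigma * rng.normal(size=theta.shape)

# Sanity check on a standard Gaussian target, where grad log f(theta) = -theta.
rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(5000):
    theta = langevin_step(theta, lambda t: -t, sigma=0.3, rng=rng)
# After burn-in, theta is (approximately) distributed as N(0, I_3).
```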
Mandt et al. (2017) show that SGD can be interpreted as a multivariate Ornstein-Uhlenbeck process
$$ d\theta^{(t)} = -\eta A\,\theta^{(t)}\, dt + \eta\sqrt{\frac{C}{S}}\, dW^{(t)}, $$
where $\eta$ is the constant learning rate, $A$ is the symmetric Hessian matrix at the optimum, and $\frac{C}{S}$ is the covariance of the mini-batch (of size $S$) gradient noise, which is assumed to be approximately constant near the local optimum of the loss. They also provide results on the discrete-time dynamics of other stochastic gradient MCMC algorithms, such as stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011) and stochastic gradient Fisher scoring (Ahn et al., 2012).
Combining their results with the Langevin dynamics of MCMC algorithms, we can write the approximation of our DA-DL updating scheme as
$$ \begin{pmatrix} W_0 \\ b_0 \end{pmatrix}^{(t+1)} = \begin{pmatrix} W_0 \\ b_0 \end{pmatrix}^{(t)} + \sigma^2\nabla\log f_0\big(Z_1^{(t)}W_0^{(t)} + b_0^{(t)}\big) + \sigma\epsilon_{0t}, \qquad B^{(t+1)} = B^{(t)} - \eta\nabla^2 f_{B^\ast}(x)\, B^{(t)} + \sqrt{\frac{C}{S}}\,\eta\,\epsilon_{Bt}. $$

Similar adaptive dynamics are also observed in other methods. Geman and Hwang (1986) show the convergence of the annealing process using Langevin equations. Slice sampling (Neal, 2003) adaptively chooses the step size based on local properties of the density function; by constructing local quadratic approximations, it can adapt to the dependencies between variables. Murray et al. (2010) further propose elliptical slice sampling, which operates on an ellipse of states.

4 Applications
To illustrate our methodology, we provide three examples: (1) a standard Gaussian regression model with squared loss; (2) a binary classification model under the support vector machine framework; and (3) a logistic regression model paired with a Pólya mixing distribution. For the Gaussian regression and SVM models, we implement the J-copies stacking strategy to target the posterior mode.
Before diving into the examples, we introduce the notation used throughout this section. We continue to denote the output by $y = (y_1, \ldots, y_n)^\top$, $y_i \in \mathbb{R}$, the input by $x = (x_1, \ldots, x_n)^\top$, $x_i \in \mathbb{R}^p$, and the latent variable of the top layer by $Z_1 = (z_{1,1}, \ldots, z_{1,n})^\top$, $z_{1,i} \in \mathbb{R}$, with the stacked version as in (3.4). We introduce stochastic noise $\epsilon_0 = (\epsilon_{0,1}, \ldots, \epsilon_{0,n})^\top$ in the top layer and $\epsilon_z = (\epsilon_{z,1}, \ldots, \epsilon_{z,n})^\top$ in the second layer, where $\epsilon_{0,i} \overset{iid}{\sim} N(0, \tau_0^2)$ and $\epsilon_{z,i} \overset{iid}{\sim} N(0, \tau_z^2)$. The scale parameters $\tau_0$ and $\tau_z$ are pre-specified and determine the level of randomness, or uncertainty, in the DA-update and the SGD-update respectively. We use $\eta$ to denote the learning rate used in the SGD updates and $T$ the number of training epochs. We use $\|\cdot\|$ to denote the $\ell_2$-norm, $\|y\|^2 = \sum_{i=1}^n y_i^2$, and the matrix-weighted norm $\|y\|_\Sigma^2 = y^\top\Sigma y$.
Our models differ from standard deep learning models and some newly proposed Bayesian approaches in the adoption of the stochastic noises $\epsilon_0$ and $\epsilon_z$, which distinguishes our model from other deterministic neural networks. By letting $\epsilon_z$ follow a spiky distribution that puts most of its mass around zero, we can steer the estimate toward the posterior mode instead of the posterior mean. The randomness allows us to adopt a stacked system and make the best use of the data, especially when the dataset is small.

4.1 Gaussian Regression


We consider the regression model
$$ y_i = z_{1,i}W_0 + b_0 + \epsilon_{0,i}, \qquad y_i \in (-\infty, \infty), \quad \epsilon_{0,i} \overset{iid}{\sim} N(0, \tau_0^2), $$
$$ z_{1,i} = f_B(x_i) + \epsilon_{z,i}, \qquad \epsilon_{z,i} \overset{iid}{\sim} N(0, \tau_z^2). $$
The posterior updates are given by
$$ \hat W_0 = \mathrm{Cov}(Z_1, y)/\mathrm{Var}(Z_1), \tag{4.1} $$
$$ \hat b_0 = \bar y - W_0\bar Z_1, \tag{4.2} $$
$$ p(Z_1 \mid y, x, \theta) = C_z\exp\Big\{-\frac{1}{2\tau_0^2}\|y - Z_1 W_0 - b_0\|^2 - \frac{1}{2\tau_z^2}\|Z_1 - f_B(x)\|^2\Big\}, $$
where $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$ and $C_z$ is a normalizing constant. The latent variable $Z_1$ is drawn from the normal distribution $Z_1 \sim N(\mu_Z, \sigma_Z^2)$ with mean and variance
$$ \mu_Z = \frac{\tau_z^2 W_0(y - b_0) + \tau_0^2 f_B(x)}{W_0^2\tau_z^2 + \tau_0^2}, \qquad \sigma_Z^2 = \frac{\tau_0^2\tau_z^2}{W_0^2\tau_z^2 + \tau_0^2}. \tag{4.3} $$
The $J$ copies of $Z_1$ are simulated and stacked as
$$ Z_1^j \overset{iid}{\sim} N(\mu_Z, \sigma_Z^2), \qquad Z_1^{(S)} = (Z_1^1, \ldots, Z_1^J)^\top. $$
The updating scheme for this Gaussian regression is summarized in Algorithm 1.
Algorithm 1 Data Augmentation with J-copies for Gaussian Regression (DA-GR)
1: Initialize $B^{(0)}, W_0^{(0)}, b_0^{(0)}$
2: for epoch $t = 1, \ldots, T$ do
3:   Update the weights in the top layer with $\{y^{(S)}, Z_1^{(t,S)}\}$:
       $W_0^{(t)} = \mathrm{Cov}\big(Z_1^{(t,S)}, y^{(S)}\big)/\mathrm{Var}\big(Z_1^{(t,S)}\big)$, $\quad b_0^{(t)} = \bar y^{(S)} - W_0^{(t)}\bar Z_1^{(t,S)}$
4:   Update the deep learner $f_B$ with $\{Z_1^{(t,S)}, x^{(S)}\}$ (SGD step):
       $B^{(t)} = B^{(t-1)} - \eta\nabla f_{B^{(t-1)}}\big(x^{(S)} \mid Z_1^{(t,S)}\big)$
5:   Update $Z_1^{(S)}$ jointly from the deep learner $f_B$ and the sampling layer $f_0$:
       $Z_1^{j,(t+1)} \mid W_0^{(t)}, b_0^{(t)}, y, f_{B^{(t)}}(x) \overset{iid}{\sim} N\big(\mu_z^{(t)}, \sigma_z^{(t)2}\big)$, $\quad j = 1, \ldots, J$
6: return $\hat y = W_0^{(T)} f_{B^{(T)}}(x) + b_0^{(T)}$

The model can also be generalized to multivariate $y$. Let $y_i$ be a $q$-dimensional vector with components $y_{ik}$, $k = 1, \ldots, q$; the model is
$$ y_{ik} = z_{1,i}W_{0k} + b_{0k} + \epsilon_{0,ik}, \qquad \epsilon_{0,ik} \overset{iid}{\sim} N(0, \tau_0^2), $$
$$ z_{1,i} = f_B(x_i) + \epsilon_{z,i}, \qquad \epsilon_{z,i} \overset{iid}{\sim} N(0, \tau_z^2), $$
where $W_0 = (W_{01}, \ldots, W_{0q})^\top$ is now a $q$-dimensional vector with each $W_{0k}$ computed as in (4.1), and $b_0 = (b_{01}, \ldots, b_{0q})^\top$ is also $q$-dimensional with each $b_{0k}$ computed as in (4.2). The posterior update for $Z_1$ becomes
$$ p(Z_1 \mid y, x, \theta) = C_z\exp\Big\{-\frac{1}{2\tau_0^2}\sum_{k=1}^q\|y_k - Z_1 W_{0k} - b_{0k}\|^2 - \frac{1}{2\tau_z^2}\|Z_1 - f_B(x)\|^2\Big\}, $$
which is a multivariate normal distribution with mean and variance
$$ \mu_Z = \frac{\tau_z^2\sum_{k=1}^q W_{0k}(y_k - b_{0k}) + \tau_0^2 f_B(x)}{\tau_z^2\sum_{k=1}^q W_{0k}^2 + \tau_0^2}, \qquad \sigma_Z^2 = \frac{\tau_0^2\tau_z^2}{\tau_z^2\sum_{k=1}^q W_{0k}^2 + \tau_0^2}. $$
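To make Algorithm 1 concrete, here is a compact NumPy sketch of DA-GR. The one-hidden-layer ReLU learner, its manual gradient step, and all sizes are illustrative simplifications (the paper's experiments use standard deep learning tooling such as Adam); only the three update steps mirror the algorithm.

```python
import numpy as np

def da_gr(x, y, J=5, epochs=10, tau0=0.1, tauz=1.0, eta=1e-2, H=16, seed=0):
    """Sketch of Algorithm 1 (DA-GR) with a one-hidden-layer ReLU learner f_B."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    W1, b1 = rng.normal(0, 0.1, (p, H)), np.zeros(H)   # deep learner parameters B
    w2 = rng.normal(0, 0.1, H)                         # linear read-out of f_B
    fB = lambda u: np.maximum(u @ W1 + b1, 0) @ w2
    yS, xS = np.tile(y, J), np.tile(x, (J, 1))         # stacked copies, cf. (3.4)
    Z1 = np.tile(fB(x), J)
    W0, b0 = 1.0, 0.0
    for _ in range(epochs):
        # Step 3: top-layer update from the stacked least-squares solution
        C = np.cov(Z1, yS)
        W0 = C[0, 1] / C[0, 0]
        b0 = yS.mean() - W0 * Z1.mean()
        # Step 4: one SGD step moving f_B towards the latent targets Z1
        h = np.maximum(xS @ W1 + b1, 0)
        r = h @ w2 - Z1                                # residual f_B(x) - Z1
        dh = np.outer(r, w2) * (h > 0)
        w2 -= eta * h.T @ r / len(r)
        W1 -= eta * xS.T @ dh / len(r)
        b1 -= eta * dh.mean(axis=0)
        # Step 5: redraw the J copies of Z_1 from N(mu_z, sigma_z^2), cf. (4.3)
        var = tau0**2 * tauz**2 / (W0**2 * tauz**2 + tau0**2)
        mu = var * (W0 * (yS - b0) / tau0**2 + np.tile(fB(x), J) / tauz**2)
        Z1 = rng.normal(mu, np.sqrt(var))
    return W0 * fB(x) + b0                             # Step 6: fitted values
```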

4.2 Support Vector Machines (SVMs)


Support vector machines require data augmentation for rectified linear unit (ReLU) activation functions. Polson and Scott (2011) and Mallick et al. (2005) write the support vector machine model as
$$ y = Z_1 W_0 + \lambda + \sqrt{\lambda}\,\epsilon_0, \qquad \lambda \sim p(\lambda), $$
where $p(\lambda)$ is a flat uniform prior and the square root is taken elementwise. The augmentation variables $\lambda = (\lambda_1, \ldots, \lambda_n)^\top$ can be regarded as slacks admitting fuzzy boundaries between the classes.
Incorporating the augmentation variable $\lambda$, the ReLU deep learning model can be written as
$$ y_i = z_{1,i}W_0 + \lambda_i + \sqrt{\lambda_i}\,\epsilon_{0,i}, \qquad y_i \in \{-1, 1\}, \quad \epsilon_{0,i} \overset{iid}{\sim} N(0, \tau_0^2), $$
$$ z_{1,i} = f_B(x_i) + \epsilon_{z,i}, \qquad \epsilon_{z,i} \overset{iid}{\sim} N(0, \tau_z^2). $$
From a probabilistic perspective, the likelihood function for the output $y$ is given by
$$ p(y_i \mid W_0, z_{1,i}) \propto \exp\Big\{-\frac{2}{\tau_0^2}\max(1 - y_i z_{1,i}W_0,\, 0)\Big\} \propto \int_0^\infty \frac{1}{\tau_0\sqrt{2\pi\lambda_i}}\exp\Big\{-\frac{(1 + \lambda_i - y_i z_{1,i}W_0)^2}{2\tau_0^2\lambda_i}\Big\}\, d\lambda_i. $$
Derived from this augmented likelihood function, the conditional updates are
$$ p(W_0 \mid y, Z_1, \lambda) \propto \Big(\prod_{i=1}^n \frac{1}{\tau_0\sqrt{\lambda_i}}\Big)\exp\Big\{-\frac{1}{2\tau_0^2}\sum_{i=1}^n \frac{(1 + \lambda_i - y_i z_{1,i}W_0)^2}{\lambda_i}\Big\}, $$
$$ p(Z_1 \mid y, x, W_0, B) \propto \exp\Big\{-\frac{1}{2\tau_0^2}\|y - Z_1 W_0\|^2_{\Lambda^{-1}} - \frac{1}{2\tau_z^2}\|Z_1 - f_B(x)\|^2\Big\}, $$
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal matrix of the augmentation variables.
To generate the latent variables, we use conditional Gibbs sampling:
$$ \lambda_i^{-1} \mid W_0, y_i, z_{1,i} \sim \mathrm{IG}\big(|1 - y_i z_{1,i}W_0|^{-1},\, \tau_0^{-2}\big), \tag{4.4} $$
$$ W_0 \mid y, Z_1, \lambda \sim N(\mu_w, \sigma_w^2), \tag{4.5} $$
$$ Z_1 \mid y, x, W_0, B \sim N(\mu_z, \sigma_z^2), \tag{4.6} $$
with means and variances given by
$$ \mu_w = \frac{\sum_{i=1}^n y_i z_{1,i}\,\frac{1 + \lambda_i}{\lambda_i}}{\sum_{i=1}^n \frac{y_i^2 z_{1,i}^2}{\lambda_i}}, \qquad \sigma_w^2 = \frac{1}{\tau_0^{-2}\sum_{i=1}^n \frac{y_i^2 z_{1,i}^2}{\lambda_i}}, \qquad \mu_z = \frac{W_0\tau_z^2\, y + \tau_0^2 f_B(x)\,\Lambda\mathbf{1}}{W_0^2\tau_z^2 + \tau_0^2\,\Lambda\mathbf{1}}, \qquad \sigma_z^2 = \frac{\tau_0^2\tau_z^2\,\Lambda\mathbf{1}}{W_0^2\tau_z^2 + \tau_0^2\,\Lambda\mathbf{1}}, $$
where IG denotes the Inverse Gaussian distribution, $\mathbf{1} = (1, \ldots, 1)^\top$ is the $n$-dimensional unit vector, and the expressions for $\mu_z$ and $\sigma_z^2$ are taken elementwise.
The $J$-copies strategy can also be adopted here: $Z_1^j$ and $\lambda^j$ need to be sampled independently for $j = 1, \ldots, J$. Algorithm 2 summarizes the updating scheme with $J$ copies for SVMs.

Algorithm 2 Data Augmentation with J-copies for SVM (DA-SVM)
1: Initialize $B^{(0)}, W_0^{(0)}, \lambda^{(0)}$
2: for epoch $t = 1, \ldots, T$ do
3:   Update the weights and slack variables in the top layer with $\{y^{(S)}, Z_1^{(t,S)}\}$:
       $\{\lambda^{(t,S)}\}^{-1} \mid W_0^{(t-1)}, y^{(S)}, Z_1^{(t,S)} \sim \mathrm{IG}\big(|1 - y^{(S)}Z_1^{(t,S)}W_0^{(t-1)}|^{-1},\, \tau_0^{-2}\big)$
       $W_0^{(t)} \mid y^{(S)}, Z_1^{(t,S)}, \lambda^{(t,S)} \sim N\big(\mu_w^{(t)}, \sigma_w^{(t)2}\big)$
4:   Update the deep learner $f_B$ with $\{Z_1^{(t,S)}, x^{(S)}\}$ (SGD step):
       $B^{(t)} = B^{(t-1)} - \eta\nabla f_{B^{(t-1)}}\big(x^{(S)} \mid Z_1^{(t,S)}\big)$
5:   Update $Z_1^{(S)}$ jointly from the deep learner $f_B$ and the sampling layer $W_0$:
       $Z_1^{j,(t+1)} \mid W_0^{(t)}, \lambda^{j,(t)}, y, f_{B^{(t)}}(x) \overset{iid}{\sim} N\big(\mu_z^{(t)}, \sigma_z^{(t)2}\big)$, $\quad j = 1, \ldots, J$
6: return $\hat y = 1$ if $W_0^{(T)}f_{B^{(T)}}(x) > 0$, and $\hat y = -1$ otherwise
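A single Gibbs sweep of (4.4)-(4.6) can be written directly with NumPy's inverse-Gaussian (Wald) sampler; a minimal sketch, assuming no observation sits exactly on the margin (so the Wald mean stays finite):

```python
import numpy as np

def da_svm_sweep(y, Z1, fB_x, W0, tau0, tauz, rng):
    """One sweep of the DA-SVM conditionals (4.4)-(4.6); y takes values in {-1, +1}."""
    # (4.4): lambda_i^{-1} ~ InverseGaussian(|1 - y_i z_i W0|^{-1}, tau0^{-2})
    lam = 1.0 / rng.wald(1.0 / np.abs(1.0 - y * Z1 * W0), 1.0 / tau0**2)
    # (4.5): W0 | y, Z1, lambda ~ N(mu_w, sigma_w^2)
    prec = np.sum(y**2 * Z1**2 / lam) / tau0**2       # 1 / sigma_w^2
    mu_w = np.sum(y * Z1 * (1.0 + lam) / lam) / (tau0**2 * prec)
    W0 = rng.normal(mu_w, 1.0 / np.sqrt(prec))
    # (4.6): Z1 | y, x, W0, B ~ N(mu_z, sigma_z^2), elementwise
    denom = W0**2 * tauz**2 + tau0**2 * lam
    mu_z = (W0 * tauz**2 * y + tau0**2 * lam * fB_x) / denom
    Z1 = rng.normal(mu_z, np.sqrt(tau0**2 * tauz**2 * lam / denom))
    return lam, W0, Z1
```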

4.3 Logistic Regression


The aim of this example is to show how an EM algorithm can be implemented via a weighted $L_2$-norm in deep learning. Adopting the logistic regression model from Polson and Scott (2013), we focus on the penalization of $W_0$, with parameter optimization given by
$$ \hat W_0^{\mathrm{DL}} = \arg\min_{W_0}\ \frac{1}{n}\sum_{i=1}^n \log\Big(1 + \exp\big\{-y_i f_B(x_i)W_0\big\}\Big) + \phi(W_0 \mid \tau). $$
The outcomes $y_i$ are coded as $\pm 1$, and $\tau$ is assumed fixed.
For the likelihood function $\ell$ and regularization penalty $\phi$, we assume
$$ p(y_i \mid \sigma) \propto \int_0^\infty \frac{\sqrt{\omega_i}}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\frac{\omega_i}{2\sigma^2}\Big(y_i f_B(x_i)W_0 - \frac{1}{2\omega_i}\Big)^2\Big\}\, p(\omega_i)\, d\omega_i, \tag{4.7} $$
$$ p(W_0 \mid \tau) = \int_0^\infty \frac{\sqrt{\lambda}}{\sqrt{2\pi}\,\tau}\exp\Big\{-\frac{\lambda}{2\tau^2}\big(W_0 - \mu_W - \kappa_W\lambda^{-1}\big)^2\Big\}\, p(\lambda)\, d\lambda, \tag{4.8} $$
where $\mu_W, \kappa_W$ are pre-specified terms controlling the prior on the penalty term, and $\lambda$ is endowed with a Pólya distribution prior $p(\lambda)$. Letting $\omega_i^{-1}$ have a Pólya distribution with $\alpha = 1$, $\kappa = 1/2$, the following three updates generate a sequence of estimates that converges to a stationary point of the posterior:
$$ W_0^{(t+1)} = \big(\tau^{-2}\Lambda^{(t)} + x_*^\top\Omega^{(t)}x_*\big)^{-1}\Big(\frac{1}{2}x_*^\top\mathbf{1}\Big), $$
$$ \omega_i^{(t+1)} = \frac{1}{z_i^{(t+1)}}\Big(\frac{e^{z_i^{(t+1)}}}{1 + e^{z_i^{(t+1)}}} - \frac{1}{2}\Big), \qquad \lambda^{(t+1)} = \Bigg(\frac{\kappa_W + \tau^2\phi'(W_0^{(t)} \mid \tau)}{W_0^{(t)} - \mu_W}\Bigg)^2, $$
where $z_i^{(t)} = y_i z_{1,i}W_0^{(t)} = y_i\,\mathrm{logit}(\hat y_i^t)$, $x_*$ is a matrix with rows $x_{*i}^\top = y_i z_{1,i}$, and $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ are diagonal matrices. Note that $x_*$ can be written as $x_* = \mathrm{diag}(y)Z_1$, and $\phi'(\cdot)$ denotes the derivative of the standard normal density function.
In the non-penalized case, with $\lambda_i = 0$ for every $i$, the updates simplify to weighted least squares:
$$ W_0^{(t+1)} = \big(Z_1^{(t)\top}\mathrm{diag}(y)\,\Omega^{(t)}\,\mathrm{diag}(y)Z_1^{(t)}\big)^{-1}\Big(\frac{1}{2}y^\top Z_1^{(t)}\Big), \qquad \omega_i^{(t+1)} = \frac{1}{z_i^{(t+1)}}\Big(\frac{e^{z_i^{(t+1)}}}{1 + e^{z_i^{(t+1)}}} - \frac{1}{2}\Big). $$
We focus on the non-penalized binary classification case, and Algorithm 3 summarizes our approach. Further generalizations are available; for example, a ridge-regression penalty, along with the generalized double-Pareto prior (Armagan et al., 2013), can be implemented by adding a sample-wise $L_2$-regularizer. A multinomial generalization of this model can be found in Polson and Scott (2013).
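In the non-penalized case, the two updates reduce to a scalar weighted least squares step plus a closed-form E-step; a minimal sketch, using the identity $\frac{1}{z}\big(\mathrm{sigmoid}(z) - \frac{1}{2}\big) = \frac{\tanh(z/2)}{2z}$ for numerical stability:

```python
import numpy as np

def da_logit_em(y, Z1, iters=50):
    """Non-penalized PG-EM for the top-layer logit weight W0 (scalar p1 = 1 case).
    y in {-1, +1}; Z1 = f_B(x) is the n-vector of top-layer inputs."""
    W0, omega = 0.0, np.full(len(y), 0.25)       # omega -> 1/4 as z -> 0
    for _ in range(iters):
        # M-step: weighted least squares (y_i^2 = 1 since y_i is +/- 1)
        W0 = 0.5 * np.sum(y * Z1) / np.sum(omega * Z1**2)
        # E-step: omega_i = (1/z_i)(e^{z_i}/(1+e^{z_i}) - 1/2) = tanh(z_i/2)/(2 z_i)
        z = y * Z1 * W0
        with np.errstate(invalid="ignore", divide="ignore"):
            omega = np.tanh(z / 2.0) / (2.0 * z)
        omega[~np.isfinite(omega)] = 0.25        # continuous limit at z = 0
    return W0
```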

5 Experiments
We illustrate the performance of our methods on both synthetic and real datasets,
compared to the deep ReLU networks without the data augmentation layer. We refer
to the latter as DL in our results. We denote the data augmented gaussian regression in
Algorithm 1 as DA-GR, the SVM implementation in Algorithm 2 as DA-SVM and the
logistic regression in Algorithm 3 as DA-logit. For appropriate comparison, we adopt the
same network structures, such as the number of layers, the number of hidden nodes, and
regularizations like dropout rates, for DL and our methods. The differences between our
Wang et al. 17

Algorithm 3 Data Augmentation for Logistic Regression (DA-logit)
1: Initialize $W_0^{(0)}, b_0^{(0)}, B^{(0)}$
2: for epoch $t = 1, \ldots, T$ do
3:   Retrieve the input and output of the top layer:
       $Z_1^{(t)} = f_{B^{(t-1)}}(x)$ (input), $\quad y^{(t)} = \mathrm{sigmoid}\big(W_0^{(t-1)}Z_1^{(t)} + b_0^{(t-1)}\big)$ (output)
4:   Calculate the sample-wise weights:
       $z^{(t)} = y \cdot \mathrm{logit}(y^{(t)})$ (transformed responses), $\quad \omega^{(t)} = \frac{1}{z^{(t)}}\big(\mathrm{sigmoid}(z^{(t)}) - \frac{1}{2}\big)$ (weights)
5:   Update the entire deep learner $f_\theta$ with $\{y, x\}$ (SGD step):
       $\theta^{(t)} = \theta^{(t-1)} - \eta\nabla f_{\theta^{(t-1)}}\big(x \mid y,\ \text{sample weights} = \omega^{(t)}\big)$
6: return $\hat y = 1$ if $f_{\theta^{(T)}}(x) > \frac{1}{2}$, and $\hat y = -1$ otherwise

The differences between our methods and DL are that (1) the top layer weights $W_0, b_0$ of DL are updated via SGD optimization, while in our methods they are updated via MCMC or EM; and (2) for binary classification, DA-logit and DL adopt a sigmoid activation function in the top layer to produce a binary output, while DA-SVM uses a linear function in the top layer and the augmented sampling layer transforms the continuous value into a binary output. For all experiments, the datasets are randomly partitioned into 70% training and 30% testing. For the optimization we use a modification of the SGD algorithm, the adaptive moment estimation (Adam; Kingma and Ba, 2015) algorithm. Adam combines the estimate of the stochastic gradient with the earlier estimate of the gradient, and scales this using an estimate of the second moment of the unit-level gradient. We have also explored the RMSprop optimizer (Tieleman and Hinton, 2012) and observe similar decreases in regression and classification errors.

To illustrate how the choice of $J$ affects the speed of convergence, we include implementations of DA-GR and DA-SVM with $J = 2, 5, 10$. We have explored different sampling noise variances $\tau_0, \tau_z$, but these choices, in general, do not affect the results significantly.

5.1 Friedman Data


The benchmark (Friedman, 1991) setup uses a regression of the form
$$ y_i = 10\sin(\pi x_{i1}x_{i2}) + 20(x_{i3} - 0.5)^2 + 10x_{i4} + 5x_{i5} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), $$
where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and only the first 5 covariates are predictive of $y_i$. We run the experiments with $n = 100, 1\,000$ and $p = 10, 50, 100, 1\,000$ to explore performance in both low-dimensional and high-dimensional scenarios. We implement both one-layer ($L = 1$) and two-layer ($L = 2$) ReLU networks with 64 hidden units in each layer. For the DA-GR model, we set $\tau_0 = 0.1$, $\tau_z = 1$. The experiments are repeated 50 times with different random seeds.
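The Friedman response surface above is easy to simulate. In the sketch below, the uniform design on $[0, 1]^p$ follows the original Friedman (1991) benchmark and is an assumption here, since this section does not restate the covariate distribution.

```python
import numpy as np

def friedman(n=1000, p=10, sigma=1.0, seed=0):
    """Simulate the Friedman (1991) benchmark: only the first five of p covariates matter."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n, p))                       # assumed U(0,1)^p design
    y = (10 * np.sin(np.pi * x[:, 0] * x[:, 1]) + 20 * (x[:, 2] - 0.5) ** 2
         + 10 * x[:, 3] + 5 * x[:, 4] + rng.normal(0, sigma, n))
    return x, y
```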


Figure 2 reports the quartiles of the out-of-sample mean squared errors (MSEs). The top row shows the performance of the one-layer networks and the bottom row that of the two-layer networks. The two-layer networks perform better and converge faster. DA-GR with $J = 5$ or $J = 10$ converges significantly faster, and its prediction errors are also smaller. When $J = 2$, the performance of DA-GR is relatively similar to the deep learning model with only SGD updates. This is because DA-GR with $J$ copies learns the posterior mode, which is equivalent to the minimizer of the objective function, and it concentrates on the mode faster as $J$ becomes larger.
The computation costs of DA are higher, as shown in Figure 3. This is not entirely unexpected, since we introduce sampling steps. When $J$ increases, the computation costs also increase slightly. Given the improvements in convergence speed and prediction errors, our data augmentation strategies are still worthwhile despite some extra computation cost. In addition, within each epoch the sample-wise posteriors can be drawn in parallel, which further mitigates the gap in computation time.


Figure 2: Quartiles of out-of-sample MSEs under the Friedman setup. We explore cases where $n = 1\,000$ and $p = 10, 50, 100, 1\,000$. The tests are repeated 50 times. The medians of out-of-sample MSEs after training for 1 to 10 epochs are plotted as lines, and the vertical bars mark the 25% and 75% quantiles of the MSEs. DA-GR refers to the DA Gaussian regression of Algorithm 1, and DL stands for the ReLU networks without the data augmentation layer.


Figure 3: Computation time under the Friedman setup, with $n = 1\,000$ and $p = 10, 50, 100, 1\,000$. The average time (over 50 repetitions) for computing 1 to 10 epochs is plotted as lines, and the vertical bars mark the 25% and 75% quantiles of the computation times. We include only one computation-time comparison since the scale is similar across all cases.

5.2 Boston Housing Data


Another classical regression benchmark is the Boston Housing dataset¹; see, for example, Hernández-Lobato and Adams (2015). The data contain $n = 506$ observations with 13 features. To show the robustness of DA, we repeat the experiment 20 times with different training subsets. We adopt ReLU networks with one hidden layer of 64 units and set the dropout rate to 0.5. For the DA-GR model, we set $\tau_0 = 0.1$, $\tau_z = 1$.
Figure 4 shows the prediction errors of all methods. DA-GR with $J = 10$ performs significantly better than the others, in terms of both prediction errors and convergence rates. Meanwhile, DA-GR with $J = 2$ behaves similarly to SGD at the beginning, but converges significantly faster than SGD after a few epochs. This again shows that, with the $J$-copies strategy, our method helps the optimization converge faster, and injecting the noise helps the model generalize well out-of-sample.

1 https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


Figure 4: Out-of-sample MSEs for the Boston Housing dataset. The experiment is repeated 20 times with different training subsamples. The medians of MSEs after training for 1 to 50 epochs are shown, with the vertical bars marking the 25% and 75% quantiles of the errors. DA-GR refers to the data augmentation strategy of Algorithm 1, and DL stands for the ReLU networks without the data augmentation layer.

5.3 Wine Quality Data Set


The Wine Quality Data Set² contains 4 898 observations with 11 features. The output, the wine rating, is an integer variable ranging from 0 to 10 (the observed range in the data is from 3 to 9). The frequency of each rating is reported in Table 2.

rating     3    4    5     6     7    8    9
frequency  20   163  1457  2198  880  175  5

Table 2: Frequencies of Different Wine Ratings
The most frequent ratings are 5 and 6. Since we focus on binary classification problems, we provide two types of classification, both of which have relatively balanced categories: (1) wine with a rating of 5 versus a rating of 6 (Test 1); and (2) wine with a rating of $\leq 5$ versus $> 5$ (Test 2). We use the same network architectures adopted in Friedman's example, with $\tau_0 = \tau_z = 0.1$.

² P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, 'Wine Quality Data Set', UCI Machine Learning Repository.

Figure 5 provides results for the two types of binary classification. In both cases, DA-SVM performs better than DA-logit and DL. The advantage of a large $J$ is still significant and helps convergence, especially in the early phase. DA-logit outperforms DL in Test 1 when the network is shallow ($L = 1$), while in the other cases it performs similarly to DL.


Figure 5: Binary classifications on the Wine Quality dataset. Two types of binary classification are considered. The experiment is repeated 20 times with different training subsamples. We compare the misclassification rates of DA-SVM (Algorithm 2) with $J = 2, 5, 10$, DA-logit (Algorithm 3), and the ReLU networks without the data augmentation layer (DL), after training for 1 to 10 epochs.

5.4 Airbnb Data Set


The Airbnb Kaggle competition³ provides a more challenging application, with 213 451 observations in total, classified by destination into 12 classes: the 10 most popular countries, 'other', and no destination found (NDF), where 'other' corresponds to any country not among the top 10 and NDF corresponds to situations where no booking was made. The countries are denoted by their standard codes: 'AU' for Australia, 'CA' for Canada, 'DE' for Germany, 'ES' for Spain, 'FR' for France, 'UK' for the United Kingdom, 'IT' for Italy, 'NL' for the Netherlands, 'PT' for Portugal, and 'US' for the United States. Table 3 reports the percentage of each class. We follow the preprocessing steps in Polson and Sokolov (2017). The list of variables contains information from the session records (number of sessions, summary statistics of action types, device types and session duration) and user tables, such as gender, language, affiliate provider, etc. All categorical variables are converted to binary dummies, which leads to 661 features in total. For the neural network architecture, we use a two-layer ReLU network with 64 hidden units in each layer and set the dropout rate to 0.3. For the SVM model, we set $\tau_0 = \tau_z = 0.1$.

³ https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

        AU    CA    DE    ES    FR    UK    IT    NDF    NL    PT    US     other
% obs   0.25  0.67  0.50  1.05  2.35  1.09  1.33  58.35  0.36  0.10  29.22  4.73

Table 3: Percentage of Each Class (#obs = 213 451)

Our goal is to test the binary classification models on this dataset. We consider two types of binary responses, both of which have relatively balanced numbers of observations in each category:
1. Spain (1.05%) versus the United Kingdom (1.09%)
2. The United Kingdom (1.09%) versus Italy (1.33%)


Figure 6: Binary classifications on the Airbnb booking dataset. Two types of binary classification are considered. The experiment is repeated 20 times with different training subsamples. We compare the misclassification rates of DA-SVM (Algorithm 2) with $J = 2, 5, 10$, DA-logit (Algorithm 3), and the ReLU networks without the data augmentation layer (DL), after training for 1 to 20 epochs.
Figure 6 demonstrates the binary classifications for Spain versus the UK and the UK versus Italy. In both cases, the out-of-sample misclassification rates are not small and the fluctuations over epochs are large, suggesting that a better model structure may be needed. However, we still observe that DA-SVM with $J = 5$ or $J = 10$ has smaller classification errors over epochs, and the out-of-sample errors decrease faster during the earlier phase of training.

5.5 Summary of Experiment Results


From the above examples, we observe that DA-logit, which is implemented under the EM principle, does not show an obvious advantage over the vanilla neural network. It shows some improvement in convergence speed when the network is shallow, as in the Wine Quality case in Figure 5. This could be partially because we did not apply regularization on the DA layer in our logit implementation. More importantly, the performance of the EM algorithm is contingent on the statistical properties of the objective function. Although the surrogate function is constructed via only the top layer, whose quadratic form ensures concavity, the objective function as a whole becomes complicated when the deep network architecture is more complex. Since our method also inherits the downsides of EM and MM algorithms, convergence to the global maximum is not guaranteed in the absence of concavity. However, this observation opens the possibility of future research combining EM algorithms with shape-constrained neural networks (Gupta et al., 2020).
In contrast, the MCMC methods with the J-copies strategy significantly improve the prediction errors and convergence speed of the neural networks for both regression and classification problems, and the advantages become more pronounced as $J$ grows. This suggests that stochastic exploratory methods are preferable when the statistical properties of the objective function are unknown or too complex, and the J-copies scheme largely relieves the problem of getting trapped in local modes.
One concern with MCMC methods is the extra computational cost induced by the sampling steps. In our current version, where $p_1 = 1$, the sample-wise sampling steps can be computed in parallel. If one wishes to introduce a higher-dimensional latent variable $Z_1$ with $p_1 > 1$, the computational cost will increase, as it may involve sampling from multivariate distributions. In that case, fast sampling implementations such as Bhattacharya et al. (2016) are recommended to speed up the process.

6 Discussion
Various regularization methods have been deployed in neural networks to prevent over-
fitting, such as early stopping, weight decay, dropout (Hinton et al., 2012), and gradient noise (Neelakantan et al., 2017). Bayesian strategies tackle the regularization problem
by proposing probability structures on the weights. We show that data augmentation
strategies are available for many standard activation functions (ReLU, SVM, logit) used
in deep learning.
Using MCMC provides a natural stochastic search mechanism that avoids procedures such as backtracking and provides a full description of the objective function over the entire range $\Theta$. Training deep neural networks thus benefits from additional hidden stochastic augmentation units (a.k.a. data augmentation). Uncertainty can be injected into the network through probabilistic distributions on only one or two layers, permitting more variability in the network. When more data are observed, the level of uncertainty decreases as more information is learned, and the network becomes more deterministic. We also exploit the duality between maximum a posteriori estimation and optimization. We provide a J-copies stacking scheme to speed up convergence to the posterior mode and to avoid the trapping attraction of local modes. Concerning efficiency, DA provides a natural framework for converting the objective function into weighted least squares and is straightforward to implement within the current deep learning training process.
Our three motivational examples illustrate the advantages of data augmentation. Our work has the potential to generalize to many other data augmentation schemes and different regularization priors. Probabilistic structures on more units and layers are also possible, allowing for more uncertainty.
Our DA-DL methods enjoy the best of both worlds. On the one hand, with the data augmentation on top, they are robust to random weight initialization. Although we still need to specify the learning rates for the deep architecture, the top layer learns adaptively and the entire network becomes less sensitive to the choice of learning rate. On the other hand, the fast SGD updates of the deep architecture largely alleviate the computational concerns relative to a fully Bayesian hierarchical model.
There are many directions for future research, including adding more sampling layers so the model can accommodate more randomness and flexibility, and using the weighted Bayesian bootstrap (Newton et al., 2021) to approximate unweighted posteriors by assigning a random weight to each observation and penalty. Uncertainty quantification for prediction is also possible: although we focus on the training aspect of deep learning, one can collect posterior draws $\theta^{(t)}$ from the MCMC procedure once the training process converges. Using (3.2), we can construct predictive intervals and conduct inference.

References
Ahn, S., Balan, A. K., and Welling, M. (2012). “Bayesian posterior sampling via stochas-
tic gradient fisher scoring.” In Proceedings of the 29th International Conference on
Machine Learning, 1591–1598. 12
Armagan, A., Dunson, D. B., and Lee, J. (2013). “Generalized double Pareto shrinkage.”
Statistica Sinica, 23(1): 119. 16
Bauer, B. and Kohler, M. (2019). “On deep learning as a remedy for the curse of
dimensionality in nonparametric regression.” The Annals of Statistics, 47(4): 2261–
2285. 1
Bhadra, A., Datta, J., Polson, N., Sokolov, V., and Xu, J. (2021). “Merging two cultures:
deep and statistical learning.” arXiv preprint arXiv:2110.11561 . 3
Bhattacharya, A., Chakraborty, A., and Mallick, B. K. (2016). “Fast sampling with
Wang et al. 25

Gaussian scale mixture priors in high-dimensional regression.” Biometrika, asw042.


23
Chen, T., Fox, E., and Guestrin, C. (2014). “Stochastic gradient Hamiltonian Monte Carlo.” In International Conference on Machine Learning, 1683–1691.
Deng, W., Zhang, X., Liang, F., and Lin, G. (2019). “An adaptive empirical Bayesian method for sparse deep learning.” In Advances in Neural Information Processing Systems, 5563–5573.
Duan, L. L., Johndrow, J. E., and Dunson, D. B. (2018). “Scaling up data augmentation MCMC via calibration.” Journal of Machine Learning Research, 19(1): 2575–2608.
Fan, J., Ma, C., and Zhong, Y. (2021). “A selective overview of deep learning.” Statistical Science, 36(2): 264–290.
Friedman, J. H. (1991). “Multivariate adaptive regression splines.” The Annals of Statistics, 19(1): 1–67.
Gan, Z., Henao, R., Carlson, D., and Carin, L. (2015). “Learning deep sigmoid belief networks with data augmentation.” In Artificial Intelligence and Statistics, 268–276.
Geman, S. and Hwang, C.-R. (1986). “Diffusions for global optimization.” SIAM Journal on Control and Optimization, 24(5): 1031–1043.
Geyer, C. J. (1996). “Estimation and optimization of functions.” In Markov Chain Monte Carlo in Practice, 241–258. Chapman and Hall.
Gramacy, R. B. and Lee, H. K. H. (2008). “Bayesian treed Gaussian process models with an application to computer modeling.” Journal of the American Statistical Association, 103(483): 1119–1130.
Green, P. J. (1984). “Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives.” Journal of the Royal Statistical Society: Series B (Methodological), 46(2): 149–170.
Gupta, M., Louidor, E., Mangylov, O., Morioka, N., Narayan, T., and Zhao, S. (2020). “Multidimensional shape constraints.” In International Conference on Machine Learning, 3918–3928. PMLR.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hernández-Lobato, J. M. and Adams, R. (2015). “Probabilistic backpropagation for scalable learning of Bayesian neural networks.” In International Conference on Machine Learning, 1861–1869.
Higdon, D., Gattiker, J., Williams, B., and Rightley, M. (2008). “Computer model calibration using high-dimensional output.” Journal of the American Statistical Association, 103(482): 570–583.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). “Improving neural networks by preventing co-adaptation of feature detectors.” arXiv:1207.0580.
Hunter, D. R. and Lange, K. (2004). “A tutorial on MM algorithms.” The American Statistician, 58(1): 30–37.
Jacquier, E., Johannes, M., and Polson, N. (2007). “MCMC maximum likelihood for latent state models.” Journal of Econometrics, 137(2): 615–640.
Kingma, D. P. and Ba, J. (2015). “Adam: A method for stochastic optimization.” In International Conference on Learning Representations.
Kolmogorov, A. N. (1957). “On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition.” In Doklady Akademii Nauk, volume 114, 953–956. Russian Academy of Sciences.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). “Imagenet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems, 25: 1097–1105.
Lange, K. (2013a). “The MM algorithm.” In Optimization, 185–219. Springer.
— (2013b). Optimization, volume 95. Springer Science & Business Media.
Lange, K., Hunter, D. R., and Yang, I. (2000). “Optimization transfer using surrogate objective functions.” Journal of Computational and Graphical Statistics, 9(1): 1–20.
Liang, S. and Srikant, R. (2017). “Why deep neural networks for function approximation?” In International Conference on Learning Representations.
Ma, Y.-A., Chen, Y., Jin, C., Flammarion, N., and Jordan, M. I. (2019). “Sampling can be faster than optimization.” Proceedings of the National Academy of Sciences, 116(42): 20881–20885.
Mallick, B. K., Ghosh, D., and Ghosh, M. (2005). “Bayesian classification of tumours by using gene expression data.” Journal of the Royal Statistical Society: Series B, 67(2): 219–234.
Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). “Stochastic gradient descent as approximate Bayesian inference.” Journal of Machine Learning Research, 18(1): 4873–4907.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). “Equation of state calculations by fast computing machines.” Journal of Chemical Physics, 21(6): 1087–1092.
Mhaskar, H., Liao, Q., and Poggio, T. A. (2017). “When and why are deep networks better than shallow ones?” In Proceedings of the 31st Conference on Artificial Intelligence, 2343–2349.
Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). “On the number of linear regions of deep neural networks.” In Advances in Neural Information Processing Systems, 2924–2932.
Murray, I., Adams, R., and MacKay, D. (2010). “Elliptical slice sampling.” In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 541–548.
Neal, R. M. (2003). “Slice sampling.” The Annals of Statistics, 705–741.
— (2011). “MCMC using Hamiltonian dynamics.” Handbook of Markov Chain Monte Carlo, 2(11): 2.
Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., and Martens, J. (2017). “Adding gradient noise improves learning for very deep networks.” International Conference on Learning Representations.
Nesterov, Y. (1983). “A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2).” In Doklady AN USSR, volume 269, 543–547.
Newton, M. A., Polson, N. G., and Xu, J. (2021). “Weighted Bayesian bootstrap for scalable posterior distributions.” Canadian Journal of Statistics, 49(2): 421–437.
Phillips, D. B. and Smith, A. F. (1996). “Bayesian model comparison via jump diffusions.” Markov Chain Monte Carlo in Practice, 215: 239.
Pincus, M. (1968). “A closed form solution of certain programming problems.” Operations Research, 16(3): 690–694.
— (1970). “A Monte Carlo method for the approximate solution of certain types of constrained optimization problems.” Operations Research, 18(6): 1225–1228.
Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. (2017). “Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.” International Journal of Automation and Computing, 14(5): 503–519.
Polson, N., Sokolov, V., and Xu, J. (2021). “Deep learning partial least squares.” arXiv preprint arXiv:2106.14085.
Polson, N. G. and Rockova, V. (2018). “Posterior concentration for sparse deep learning.” In Advances in Neural Information Processing Systems, 938–949.
Polson, N. G. and Scott, J. G. (2013). “Data augmentation for non-Gaussian regression models using variance-mean mixtures.” Biometrika, 100(2): 459–471.
Polson, N. G., Scott, J. G., and Windle, J. (2013). “Bayesian inference for logistic models using Pólya–Gamma latent variables.” Journal of the American Statistical Association, 108(504): 1339–1349.
Polson, N. G. and Scott, S. L. (2011). “Data augmentation for Support Vector Machines.” Bayesian Analysis, 6(1): 1–23.
Polson, N. G. and Sokolov, V. (2017). “Deep learning: a Bayesian perspective.” Bayesian Analysis, 12(4): 1275–1304.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). “Stochastic backpropagation and approximate inference in deep generative models.” In International Conference on Machine Learning, 1278–1286.
Roberts, G. O. and Rosenthal, J. S. (1998). “Optimal scaling of discrete approximations to Langevin diffusions.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1): 255–268.
Schmidt-Hieber, J. (2020). “Nonparametric regression using deep neural networks with ReLU activation function.” The Annals of Statistics, 48(4): 1875–1897.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., and Lanctot, M. (2016). “Mastering the game of Go with deep neural networks and tree search.” Nature, 529(7587): 484–489.
Telgarsky, M. (2017). “Neural networks and rational functions.” In Proceedings of the 34th International Conference on Machine Learning, volume 70, 3387–3393. JMLR.org.
Tieleman, T. and Hinton, G. (2012). “RMSProp: Divide the gradient by a running average of its recent magnitude.” COURSERA: Neural Networks for Machine Learning.
Tran, M.-N., Nguyen, N., Nott, D., and Kohn, R. (2020). “Bayesian deep net GLM and GLMM.” Journal of Computational and Graphical Statistics, 29(1): 97–113.
Vitushkin, A. (1964). “Proof of the existence of analytic functions of several complex variables which are not representable by linear superpositions of continuously differentiable functions of fewer variables.” Soviet Mathematics, 5: 793–796.
Wager, S., Wang, S., and Liang, P. S. (2013). “Dropout training as adaptive regularization.” In Advances in Neural Information Processing Systems, 351–359.
Wang, Y. and Rockova, V. (2020). “Uncertainty quantification for sparse deep learning.” In Artificial Intelligence and Statistics.
Welling, M. and Teh, Y. W. (2011). “Bayesian learning via stochastic gradient Langevin dynamics.” In Proceedings of the 28th International Conference on Machine Learning, 681–688.
Yarotsky, D. (2017). “Error bounds for approximations with deep ReLU networks.” Neural Networks, 94: 103–114.
Zhou, M., Hannah, L., Dunson, D., and Carin, L. (2012). “Beta-negative binomial process and Poisson factor analysis.” In Artificial Intelligence and Statistics, 1462–1471.