Abstract. Deep Learning (DL) methods have emerged as one of the most powerful tools for functional approximation and prediction. While the representation properties of DL have been well studied, uncertainty quantification remains challenging and largely unexplored. Data augmentation techniques are a natural approach to provide uncertainty quantification and to incorporate stochastic Monte Carlo search into stochastic gradient descent (SGD) methods. The purpose of our paper is to show that training DL architectures with data augmentation leads to efficiency gains. We use the theory of scale mixtures of normals to derive data augmentation strategies for deep learning. This allows variants of the expectation-maximization and MCMC algorithms to be brought to bear on these high dimensional nonlinear deep learning models. To demonstrate our methodology, we develop data augmentation algorithms for a variety of commonly used activation functions: logit, ReLU, leaky ReLU and SVM. Our methodology is compared to traditional stochastic gradient descent with back-propagation. Our optimization procedure leads to a version of iteratively re-weighted least squares and can be implemented at scale with accelerated linear algebra methods, providing substantial improvements in speed. We illustrate our methodology on a number of standard datasets. Finally, we conclude with directions for future research.
1 Introduction
Deep neural networks (DNNs) have become a central tool for Artificial Intelligence (AI) applications such as image processing (ImageNet, Krizhevsky et al. (2012)), object recognition (ResNet, He et al. (2016)) and game intelligence (AlphaGoZero, Silver et al. (2016)). The approximability (Poggio et al., 2017; Bauer and Kohler, 2019) and rate of convergence of deep learning, either in the frequentist fashion (Schmidt-Hieber, 2020) or from a Bayesian predictive point of view (Polson and Rockova, 2018; Wang and Rockova, 2020), have been well explored and understood. Fan et al. (2021) provide a selective overview of deep learning. However, training deep learners is challenging due to the high dimensional search space and the non-convex objective function. Deep neural networks have also suffered from issues such as local traps, miscalibration and overfitting. Various efforts have been made to improve the generalization performance, and many of their roots lie in Bayesian modeling. For example, Dropout (Wager et al., 2013) is commonly used and can be viewed as a deterministic ridge ($\ell_2$) regularization. Sparsity structure via spike-and-slab priors (Polson and Rockova, 2018) on weights helps DNNs
arXiv: 1903.09668
∗ Booth School of Business, University of Chicago, Chicago, IL, [email protected],
[email protected].
† Volgenau School of Engineering, George Mason University, Fairfax, VA, [email protected].
© 2022 International Society for Bayesian Analysis
adapt to smoothness and avoid overfitting. Rezende et al. (2014) propose stochastic
back-propagation through the use of latent Gaussian variables.
In this paper, following the spirit of hierarchical Bayesian modeling, we develop
data augmentation strategies for deep learning with a complete data likelihood function
equivalent to weighted least squares regression. By using the theory of mean-variance
mixtures of Gaussians, our latent variable representation brings all of the conditionally
linear model theory to deep learning. For example, it allows for the straightforward
specification of uncertainty at each layer of deep learning and for a wide range of reg-
ularization penalties. Our method applies to commonly used activation functions such
as ReLU, leaky ReLU, logit (see also Gan et al. (2015)), and provides a general frame-
work for training and inference in DNNs. It inherits the advantages and disadvantages of
data augmentation schemes. Approximation methods such as Expectation-Maximization (EM) and Minorize-Maximization (MM) are stable because each iteration increases the objective, but they can be slow in the neighborhood of the maximum even with acceleration schemes such as Nesterov's, and their performance depends heavily on the properties of the objective function. Stochastic exploratory methods like MCMC have the main advantage of addressing uncertainty quantification (UQ) and are stable in the sense that they require no tuning. Hyper-parameter estimation is immediately available using traditional Bayesian methods. DA augments the objective function with extra hidden units which allow for efficient step-size selection in the gradient descent search. In some applications, data augmentation methods can be formulated in terms of complete-data sufficient statistics, a considerable advantage when dealing with large datasets where most of the computational expense comes from repeatedly iterating over the data. By combining MCMC methods with the J-copies trick (Jacquier et al., 2007), we can move faster towards the posterior mode and avoid local maxima. Traditional methods for training deep learning models, such as stochastic gradient descent (SGD), have none of the above advantages. We note, however, that we exploit the advantages of SGD and accelerated linear algebra methods when we implement our weighted least squares regression step.
Data augmentation strategies are commonplace in statistical algorithms and accelerated convergence (Nesterov, 1983; Green, 1984) is available. Our goal is to show similar efficiency improvements for deep learning. Our work builds on Deng et al. (2019) who provide adaptive empirical Bayes methods. In particular, we show how to implement standard activation functions, including ReLU (Polson and Rockova, 2018), logistic (Zhou et al., 2012; Hernández-Lobato and Adams, 2015) and SVM (Mallick et al., 2005) activation functions, and provide specific data augmentation strategies and algorithms. The core subroutine of the resulting algorithms solves a least squares problem. Scalable linear algebra libraries such as Compute Unified Device Architecture (CUDA) and accelerated linear algebra (XLA) are available for implementation. To illustrate our approach, we experiment empirically with two benchmark datasets using Pólya-Gamma data augmentation for logit activation functions. For the deep architecture embedded in our approach, we adopt deep ReLU networks. Deep networks are able to achieve the same level of approximation accuracy with exponentially fewer parameters for compositional functions (Mhaskar et al., 2017). Poggio et al. (2017) further show how deep networks can avoid the curse of dimensionality. The ReLU function is
favored due to its ability to avoid vanishing gradients and its expressibility and inher-
ent sparsity. Approximation properties of deep ReLU networks have been developed
in Montufar et al. (2014), Telgarsky (2017), and Liang and Srikant (2017). Yarotsky
(2017) and Schmidt-Hieber (2020) show that deep ReLU networks can yield a rate-
optimal approximation of smooth functions of an arbitrary order. Polson and Rockova
(2018) provide posterior rates of convergence for sparse deep learning.
There is another active area of research that revives traditional statistical models
with the computational power of DL (Bhadra et al., 2021). Examples include Gaus-
sian Process models (Higdon et al., 2008; Gramacy and Lee, 2008), Generalized Linear
Models (GLM) and Generalized Linear Mixed Models (GLMM) (Tran et al., 2020) and
Partial Least Squares (PLS) (Polson et al., 2021). Our method benefits from the computational efficiency and expressive flexibility of deep neural networks. In addition, our work builds on the sampling optimization literature (Pincus, 1968, 1970), which now uses MCMC methods. Other examples include Ma et al. (2019), who show that sampling can be faster than optimization, and Neelakantan et al. (2017), who show that gradient noise can improve learning for very deep networks. Gan et al. (2015) implement data augmentation for learning deep sigmoid belief networks. Neal (2011) and Chen et al. (2014) provide Hamiltonian Monte Carlo (HMC) algorithms for MCMC. Duan et al. (2018) propose a family of calibrated data-augmentation algorithms to increase the effective sample size.
The rest of our paper is outlined as follows. Section 2 provides the general setting
of deep neural networks and shows how DA can be integrated into deep learning us-
ing the duality between Bayesian simulation and optimization. Section 3 describes our
data augmentation (DA) schemes and two approaches to implement them. Section 4
provides applications to Gaussian regression, support vector machines and logistic re-
gression using Pólya-Gamma augmentation (Polson et al., 2013). Section 5 presents experiments with DA on both regression and classification problems. Section 6 concludes with directions for future research.
2 Deep Learning

A deep learner is an input-output mapping of the form
$$y = f_\theta(x).$$
Deep learners use compositions (Kolmogorov, 1957; Vitushkin, 1964) of ridge func-
tions rather than additive functions that are commonplace in statistical applications.
With $L \in \mathbb{N}$ we denote the number of hidden layers and with $p_l \in \mathbb{N}$ the number of neurons at the $l$th layer. Setting $p_{L+1} = p$ and $p_0 = p_1 = 1$, we denote with $p = (p_0, p_1, \ldots, p_{L+1}) \in \mathbb{N}^{L+2}$ the vector of neuron counts for the entire network. Composing $L$ layers, a deep predictor then takes the form
$$y = f_\theta(x) = (f_{W_0,b_0} \circ f_{W_1,b_1} \circ \cdots \circ f_{W_L,b_L})(x), \qquad (2.1)$$
where $b_l \in \mathbb{R}^{p_l}$ is a shift vector, $W_l \in \mathbb{R}^{p_{l-1} \times p_l}$ is a weight matrix that links neurons between the $(l-1)$th and $l$th layers, and $f_{W_l,b_l}(x) = f_l(W_l x + b_l)$ is a semi-affine function. We denote by $\theta = \{(W_0, b_0), (W_1, b_1), \ldots, (W_L, b_L)\}$ the stacked parameters. We can rewrite the composition in (2.1) with a set of latent variables $Z = (Z_1, Z_2, \ldots, Z_L)^\top$ as
$$y = f_0(Z_1 W_0 + b_0),$$
$$Z_l = f_l(Z_{l+1} W_l + b_l), \quad l = 1, \ldots, L, \qquad (2.2)$$
$$Z_{L+1} = x,$$
where $Z_l \in \mathbb{R}^{n \times p_l}$ is the matrix of hidden nodes in the $l$th layer. We only consider the case $p = 1$ with $Z_1 \in \mathbb{R}^n$ in our work. We discuss extensions to cases $p > 1$ for some of our applications in Section 4.
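To make the composition concrete, the following sketch (our illustration, not code from the paper) evaluates (2.2) with ReLU hidden layers and an identity top layer, using a row-major convention for a matrix of $n$ inputs; the layer sizes and random weights are arbitrary choices:

```python
import numpy as np

def deep_predictor(x, weights, biases):
    """Evaluate y = f0(Z1 W0 + b0), Z_l = f_l(Z_{l+1} W_l + b_l), Z_{L+1} = x,
    with ReLU hidden layers and an identity top layer (illustrative choices).
    weights = [W0, W1, ..., WL], ordered from the top layer down."""
    z = x                                               # Z_{L+1} = x, shape (n, p)
    for W, b in zip(weights[:0:-1], biases[:0:-1]):     # layers L, ..., 1
        z = np.maximum(z @ W + b, 0.0)                  # Z_l = relu(Z_{l+1} W_l + b_l)
    return z @ weights[0] + biases[0]                   # top layer f0 = identity

rng = np.random.default_rng(0)
n, p, p1 = 100, 10, 64                                  # one hidden layer, L = 1
weights = [rng.normal(size=(p1, 1)), rng.normal(size=(p, p1))]
biases = [np.zeros(1), np.zeros(p1)]
y = deep_predictor(rng.normal(size=(n, p)), weights, biases)   # shape (n, 1)
```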
Commonly used regularization techniques for deep learners include L2 (weight decay),
spike-and-slab regularization (Polson and Rockova, 2018) and dropout (Wager et al.,
2013), which can also be viewed as a variant of L2 -regularization.
As such the optimization problem in (2.4) of training a deep learner fθ (·) involves a
highly nonlinear objective function. Stochastic gradient descent (SGD) is a popular tool
based on back-propagation (a.k.a. the chain rule), but it often suffers from local traps
and overfitting due to the non-convex nature of the problem. We propose data augmen-
tation techniques which can be seamlessly applied in this context and provide efficiency
gains. This is achieved via the hierarchical duality between optimization with regulariza-
tion and finding the maximum a posteriori (MAP) estimate (Polson and Scott, 2011),
as described in the following lemma.
Lemma 2.1. The regularization problem
$$\hat\theta = \arg\min_\theta \; \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, f_\theta(x_i)\big) + \sum_{l=0}^L \lambda_l \phi_l(W_l, b_l)$$
is equivalent to finding the maximum a posteriori (MAP) estimate under the posterior $p(\theta \mid y) \propto \exp\big(-\frac{1}{n}\sum_{i=1}^n \ell(y_i, f_\theta(x_i))\big)\, p(\theta)$, with prior $p(\theta) \propto \exp\big(-\sum_{l=0}^L \lambda_l \phi_l(W_l, b_l)\big)$.
Here p(θ) can be interpreted as a prior probability distribution and the log-prior as the
regularization penalty.
$$y \mid Z_1 \sim p(y \mid Z_1),$$
$$Z_l \sim N\big(f_l(W_l Z_{l+1} + b_l), \tau_l^2\big), \quad l = 1, 2, \ldots, L, \qquad (2.5)$$
$$Z_{L+1} = x.$$
Now the hidden variables $Z = (Z_1, \ldots, Z_L)^\top$ can be viewed as data augmentation variables, and hence allow fast, scalable algorithms to be brought to bear on inference and prediction.
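The stochastic hierarchy (2.5) replaces each deterministic layer with a Gaussian draw around its semi-affine mean. A minimal forward sampler, assuming ReLU activations and the same row-major convention as the sketch above, is:

```python
import numpy as np

def sample_stochastic_layers(x, weights, biases, taus, rng):
    """Draw Z_L, ..., Z_1 from Z_l ~ N(f_l(Z_{l+1} W_l + b_l), tau_l^2), Z_{L+1} = x.
    weights/biases/taus are ordered from layer 1 up to layer L."""
    z = x
    for W, b, tau in zip(weights[::-1], biases[::-1], taus[::-1]):
        mean = np.maximum(z @ W + b, 0.0)               # f_l(Z_{l+1} W_l + b_l)
        z = mean + tau * rng.normal(size=mean.shape)    # Gaussian augmentation layer
    return z                                            # the top-layer latent Z_1
```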
For ease of computation, we only replace the top layer of the DNN with a stochastic layer. We denote the network structure below the top layer by $B = \{(W_1, b_1), \ldots, (W_L, b_L)\}$, and the network can be rewritten as
$$y = f_0(Z_1 W_0 + b_0), \quad Z_1 = f_B(x),$$
where $f_0(Z_1 W_0 + b_0)$ is the top-layer structure and $f_B(x)$ is the network architecture below the top layer. Considering the objective function in (2.4), we implement the solution with a two-step iterative search. At iteration $t$, we have
1. DA-update for the top layer $W_0, b_0$ as the MAP estimator of the distribution
$$p\big(W_0, b_0 \mid Z_1^{(t)}, y\big) \propto p\big(y, Z_1^{(t)} \mid W_0, b_0\big)\, p(W_0, b_0) \qquad (2.6)$$
$$\propto \exp\Big(-\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, f_\theta(x_i) \mid B^{(t)}\big) - \lambda_0 \phi_0(W_0, b_0)\Big),$$
3 Data Augmentation

The augmented auxiliary distribution $p(\theta, \omega \mid y)$ factorizes nicely into complete conditionals $p(\theta \mid \omega, y)$ and $p(\omega \mid \theta, y)$. A crucial ingredient is that $p(\theta \mid \omega, y)$ is easily managed, typically via conditional Gaussians.
Data augmentation tricks allow us to express the likelihood as an expectation of a weighted $L_2$-norm. Specifically, we write
$$\exp\big(-\ell(y, f_\theta(x))\big) = E_\omega\Big[\exp\Big(-Q\big(y \mid f_\theta(x), \omega\big)\Big)\Big] = \int_0^\infty \exp\Big(-Q\big(y \mid f_\theta(x), \omega\big)\Big)\, p(\omega)\, d\omega,$$
where $p(\omega)$ is the prior on the auxiliary variables $\omega = (\omega_1, \ldots, \omega_n)^\top$ and the function $Q\big(y \mid f_\theta(x), \omega\big)$ is designed to be a quadratic form, given the data augmentation variables $\omega$. The function $f_\theta(x) = (f_0 \circ \cdots \circ f_L)(x)$ is a deep learner.
Table 1 shows that standard activation functions such as ReLU, logit, lasso and check can be expressed in the form of (3.1). Commonly used activation functions for deep learning, with appropriate stochastic assumptions on $\omega$ (for notational simplicity, we derive the standard form for the single-observation case), can be expressed as
$$\exp\big(-\max(1 - x, 0)\big) = E_\omega\Big[\frac{1}{\sqrt{2\pi\omega}}\exp\Big(-\frac{1}{2\omega}(x - 1 - \omega)^2\Big)\Big], \quad \omega \sim GIG(1, 0, 0),$$
$$\exp\big(-\log(1 + e^x)\big) = E_\omega\Big[\tfrac{1}{2}\, e^{-x/2}\exp\Big(-\tfrac{1}{2}\omega x^2\Big)\Big], \quad \omega \sim PG(1, 0),$$
$$\exp(-|x|) = E_\omega\Big[\frac{1}{\sqrt{2\pi\omega}}\exp\Big(-\frac{1}{2\omega}x^2\Big)\Big], \quad \omega \sim \mathcal{E}\big(\tfrac{1}{2}\big).$$
Here GIG denotes the Generalized Inverse Gaussian distribution, PG the Pólya-Gamma distribution (Polson et al., 2013), and $\mathcal{E}$ the exponential distribution.
Table 1 (Data Augmentation Strategies; here $\rho_\tau(x) = \frac{1}{2}|x| + \big(\tau - \frac{1}{2}\big)x$ is the check function). Each row lists the loss $\ell(W, b)$, the mixture identity defining $Q(W, b, \omega)$, and the mixing distribution $p(\omega)$:

ReLU: $\ell = \max(1 - z_i, 0)$, with
$\int_0^\infty \frac{1}{\sqrt{2\pi c\lambda}}\exp\big(-\frac{(x + a\lambda)^2}{2c\lambda}\big)\, d\lambda = \frac{1}{a}\exp\big(-\frac{2\max(ax, 0)}{c}\big)$, and $p(\omega) = GIG(1, 0, 0)$.

Logit: $\ell = \log(1 + e^{z_i})$, with
$\frac{1}{2^b}\, e^{(a - b/2)\psi}\int_0^\infty e^{-\omega\psi^2/2}\, p(\omega)\, d\omega = \frac{(e^\psi)^a}{(1 + e^\psi)^b}$, and $p(\omega) = PG(b, 0)$.

Lasso: $\ell = |z_i/\sigma|$, with
$\int_0^\infty \frac{1}{\sqrt{2\pi c\lambda}}\exp\big(-\frac{x^2}{2c\lambda}\big)\, e^{-\lambda/2}\, d\lambda = \frac{1}{c}\exp\big(-\frac{|x|}{c}\big)$, and $p(\omega) = \mathcal{E}\big(\frac{1}{2}\big)$.

Check: $\ell = |z_i| + (2\tau - 1)z_i$, with
$\int_0^\infty \frac{1}{\sqrt{2\pi c\lambda}}\exp\big(-\frac{(x + (2\tau - 1)\lambda)^2}{2c^2\lambda}\big)\, e^{-2\tau(1-\tau)\lambda}\, d\lambda = \frac{1}{c}\exp\big(-\frac{2}{c}\rho_\tau(x)\big)$, and $p(\omega) = GIG\big(1, 0, 2\sqrt{\tau - \tau^2}\big)$.
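As a sanity check, the ReLU/hinge row of Table 1 can be verified numerically. The sketch below is our illustration (not part of the paper's algorithms): it integrates the Gaussian kernel over the mixing variable by quadrature and compares the result with the closed form, here with the arbitrary choice $a = c = 1$:

```python
import numpy as np
from scipy.integrate import quad

def hinge_mixture(x, a=1.0, c=1.0):
    """LHS of the ReLU/hinge row of Table 1: the normal variance-mean mixture."""
    integrand = lambda lam: (np.exp(-(x + a * lam) ** 2 / (2 * c * lam))
                             / np.sqrt(2 * np.pi * c * lam))
    val, _ = quad(integrand, 0.0, np.inf)
    return val

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    lhs = hinge_mixture(x)
    rhs = np.exp(-2.0 * max(x, 0.0))   # (1/a) exp(-2 max(ax, 0)/c) with a = c = 1
    print(f"x = {x:+.1f}:  mixture = {lhs:.6f},  closed form = {rhs:.6f}")
```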
Using the data augmentation strategies, the objectives are represented as mixtures of Gaussians. DA can then perform the optimization using only a sequence of iteratively re-weighted $L_2$-norms. This allows us to use XLA techniques to accelerate the training.
Remark 3.1. The log-posterior is optimized given the training data, $\{y_i, x_i\}_{i=1}^n$. Deep learning possesses the key property that $\nabla_\theta \log p(y \mid \theta, x)$ is computationally inexpensive to evaluate using tensor methods, even for very complicated architectures, with fast implementations on large datasets. One caveat is that the posterior is highly multi-modal and good hyperparameter tuning can be expensive. This is clearly a fruitful area of research where state-of-the-art stochastic MCMC algorithms could provide more efficient training. For shallow architectures, the alternating direction method of multipliers (ADMM) is an efficient solution to the optimization problem.
$$\hat\theta := \arg\max_\theta\; E_\omega\Big[\exp\Big(-\frac{1}{n}\sum_{i=1}^n Q\big(y_i \mid f_\theta(x_i), \omega\big) - \sum_{l=0}^L \lambda_l \phi_l(W_l, b_l)\Big)\Big], \qquad (3.1)$$
where each $\omega_i$ is drawn from the conditional distribution $p(\omega_i \mid \theta) \propto p(\omega_i, \theta)$ and the minorization
$$\log H(\theta) \ge G\big(\theta \mid \theta^{(t)}\big)$$
is satisfied. Maximizing $G\big(\theta \mid \theta^{(t)}\big)$ with respect to $\theta$ drives $H(\theta)$ uphill. The ascent property of
the EM algorithm relies on the nonnegativity of the Kullback-Leibler divergence of
two conditional probability densities (Hunter and Lange, 2004; Lange, 2013a). The EM algorithm enjoys numerical stability, as it steadily increases the likelihood without wildly overshooting or undershooting. It simplifies the optimization problem by (1)
avoiding large matrix inversion; (2) linearizing the objective function; (3) separating
the variables of the optimization problem (Lange, 2013b). In Section 4.3 we show how
Pólya-Gamma augmentation leads to an EM algorithm for logistic regression.
The exploratory alternative for solving (3.1) is stochastic search via methods such as MCMC. The data augmentation strategies enable us to sample from the joint posterior
$$p(\theta \mid y) \propto \exp\Big(-\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, f_\theta(x_i)\big) - \sum_{l=0}^L \lambda_l \phi_l(W_l, b_l)\Big)$$
$$= E_\omega\Big[\exp\Big(-\frac{1}{n}\sum_{i=1}^n Q\big(y_i \mid f_\theta(x_i), \omega\big) - \sum_{l=0}^L \lambda_l \phi_l(W_l, b_l)\Big)\Big]$$
$$= \int_0^\infty \exp\Big(-\frac{1}{n}\sum_{i=1}^n Q\big(y_i \mid f_\theta(x_i), \omega\big)\Big)\, p(\omega)\, p(\theta)\, d\omega,$$
where the prior is related to the regularization penalty via $p(\theta) \propto \exp\big(-\sum_{l=0}^L \lambda_l \phi_l(W_l, b_l)\big)$.
Hence, we can provide an MCMC algorithm in the augmented space (θ, ω) and
simulate from the joint posterior distribution, denoted by p(θ, ω | y), namely
$$p(\theta, \omega \mid y) \propto \exp\big(-Q(y \mid f_\theta(x), \omega)\big)\, p(\theta)\, p(\omega).$$
Then we recover stochastic draws $\theta^{(t)} \sim p(\theta \mid y)$ from the marginal posterior. These draws can be used in prediction to account for predictive uncertainty, namely
$$p\big(y^\star \mid f(x^\star)\big) = \int p\big(y^\star \mid \theta, f_\theta(x^\star)\big)\, p(\theta \mid y)\, d\theta \approx \frac{1}{T}\sum_{t=1}^T p\big(y^\star \mid \theta^{(t)}, f_{\theta^{(t)}}(x^\star)\big). \qquad (3.2)$$
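The Monte Carlo average in (3.2) is straightforward to implement once posterior draws are available. A toy sketch (with stand-in draws and a hypothetical linear predictor, purely for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta_draws = rng.normal(1.0, 0.2, size=2000)   # stand-in for MCMC draws theta^(t)
f = lambda theta, x: theta * x                  # hypothetical predictor f_theta(x)
x_star, noise_sd = 0.5, 0.5                     # assumed Gaussian observation model

# Average p(y* | theta^(t), f_{theta^(t)}(x*)) over the draws, as in (3.2)
y_grid = np.linspace(-2, 3, 200)
pred_density = norm.pdf(y_grid[:, None],
                        loc=f(theta_draws, x_star), scale=noise_sd).mean(axis=1)
```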
To simulate the posterior mode without evaluating the likelihood directly (Jacquier et al., 2007), we sample $J$ independent copies of the hidden variable $Z_1$. Denoting the copies by $Z_1^1, \ldots, Z_1^J$, we sample them simultaneously and independently from the posterior distribution
$$Z_1^j \mid \theta, x, y \overset{iid}{\sim} N(\mu_z, \sigma_z^2), \quad j = 1, \ldots, J,$$
where $y^{(S)}$, $Z_1^{(S)}$ and $f_B(x^{(S)})$ are $(n \times J)$-dimensional vectors. We use $Z_1^{(S)}$ to amplify the information in $y$, which is especially useful in finite-sample problems. Figure 1 illustrates our network architecture.
With the stacked system, the joint distribution of the parameters $\theta$ and the augmented hidden variables $Z_1^{(S)}$ given data $y, x$ can be written as
$$\pi_J\big(\theta, Z_1^{(S)} \mid x, y\big) \propto \prod_{j=1}^J p\big(y \mid \theta, Z_1^j\big)\, p\big(Z_1^j \mid \theta, x, y\big)\, p(\theta).$$
As $J$ increases, this joint distribution concentrates on the density proportional to $p(x, y \mid \theta)^J p(\theta)$ and provides us with a simulation solution to finding the MAP estimator (Pincus, 1968, 1970).
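The concentration effect is easy to visualize on a grid: raising the likelihood to the power $J$ sharpens the posterior around its mode at rate roughly $1/\sqrt{J}$. A toy illustration of ours, with an arbitrary Gaussian likelihood and weak prior:

```python
import numpy as np

theta = np.linspace(-2.0, 4.0, 2001)
log_lik = -0.5 * (theta - 1.3) ** 2        # log p(y | theta) up to a constant
log_prior = -0.5 * (theta / 10.0) ** 2     # weak Gaussian prior

for J in (1, 5, 25):
    log_post = J * log_lik + log_prior     # pi_J(theta) from the J-copies trick
    post = np.exp(log_post - log_post.max())
    post /= np.trapz(post, theta)
    mean = np.trapz(theta * post, theta)
    sd = np.sqrt(np.trapz((theta - mean) ** 2 * post, theta))
    print(f"J = {J:2d}: posterior sd ~ {sd:.3f}")   # shrinks like 1/sqrt(J)
```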
Another alternative for simulating from the posterior mode is Hamiltonian Monte Carlo (Neal, 2011), a modification of the Metropolis-Hastings (MH) sampler. It adds a momentum variable $\nu$ to the Boltzmann distribution in (3.3) and generates draws from the joint distribution
$$\pi_J(\theta, \nu) \propto \exp\big(-J f(\theta) - \tfrac{1}{2}\nu^\top M^{-1}\nu\big),$$
where $M$ is a mass matrix. Chen et al. (2014) adopt this approach in a deep learning setting.
$$\theta^{(t+1)} = \theta^{(t)} + \frac{\sigma^2}{2}\nabla \log f\big(\theta^{(t)}\big) + \sigma\epsilon_t,$$
where $\epsilon_t \sim N_d(0, I_d)$ and $\sigma^2$ corresponds to the discretization size.
This can also be derived by taking a second-order approximation of $\log f$, namely
$$\log f\big(\theta^{(t+1)}\big) = \log f\big(\theta^{(t)}\big) + \big(\theta^{(t+1)} - \theta^{(t)}\big)^\top \nabla \log f\big(\theta^{(t)}\big) - \frac{1}{2}\big(\theta^{(t+1)} - \theta^{(t)}\big)^\top H\big(\theta^{(t)}\big)\big(\theta^{(t+1)} - \theta^{(t)}\big),$$
where $H(\theta^{(t)}) = -\nabla^2 \log f(\theta^{(t)})$ is the Hessian matrix. Exponentiating both sides, the random-walk type approximation to $f(\theta^{(t+1)})$ is
$$f\big(\theta^{(t+1)}\big) \propto \exp\Big(\big(\theta^{(t+1)} - \theta^{(t)}\big)^\top \nabla \log f\big(\theta^{(t)}\big) - \frac{1}{2}\big(\theta^{(t+1)} - \theta^{(t)}\big)^\top H\big(\theta^{(t)}\big)\big(\theta^{(t+1)} - \theta^{(t)}\big)\Big)$$
$$\propto \exp\Big(-\frac{1}{2}\big(\theta^{(t+1)} - \tilde\theta^{(t)}\big)^\top H\big(\theta^{(t)}\big)\big(\theta^{(t+1)} - \tilde\theta^{(t)}\big)\Big),$$
where $\tilde\theta^{(t)} = \theta^{(t)} + H^{-1}(\theta^{(t)})\nabla \log f(\theta^{(t)})$. If we simplify this approximation by replacing $H(\theta^{(t)})$ with $\sigma^{-2}I_p$, the Taylor approximation leads to the updating step displayed above.
Roberts and Rosenthal (1998) give further discussion of the choice of $\sigma$ that yields an acceptance rate of 0.574 and achieves the optimal convergence rate.
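A minimal Metropolis-adjusted Langevin (MALA) sampler implementing the update above, with the acceptance rate reported so that $\sigma$ can be tuned toward the 0.574 target; this is a one-dimensional sketch of ours, not the paper's production code:

```python
import numpy as np

def mala(log_f, grad_log_f, theta0, sigma, n_iter, rng):
    """Propose theta + (sigma^2/2) * grad log f(theta) + sigma * eps,
    then accept/reject with the Metropolis-Hastings correction."""
    theta, draws, accepted = theta0, [], 0
    for _ in range(n_iter):
        mean_fwd = theta + 0.5 * sigma**2 * grad_log_f(theta)
        prop = mean_fwd + sigma * rng.normal()
        mean_bwd = prop + 0.5 * sigma**2 * grad_log_f(prop)
        log_alpha = (log_f(prop) - log_f(theta)
                     - (theta - mean_bwd) ** 2 / (2 * sigma**2)
                     + (prop - mean_fwd) ** 2 / (2 * sigma**2))
        if np.log(rng.uniform()) < log_alpha:
            theta, accepted = prop, accepted + 1
        draws.append(theta)
    return np.array(draws), accepted / n_iter

rng = np.random.default_rng(2)
# Standard normal target: log f = -theta^2/2, grad log f = -theta
draws, rate = mala(lambda t: -0.5 * t**2, lambda t: -t, 0.0,
                   sigma=1.5, n_iter=5000, rng=rng)
print(f"acceptance rate: {rate:.3f}")   # tune sigma toward ~0.574
```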
Mandt et al. (2017) show that SGD can be interpreted as a multivariate Ornstein-Uhlenbeck process
$$d\theta^{(t)} = -\eta A\,\theta^{(t)}\, dt + \eta\sqrt{\frac{C}{S}}\; dW(t),$$
where $\eta$ is the constant learning rate, $A$ is the symmetric Hessian matrix at the optimum, and $C/S$ is the covariance of the mini-batch (of size $S$) gradient noise, which is assumed to be approximately constant near the local optimum of the loss. They also provide results on the discrete-time dynamics of other stochastic gradient MCMC algorithms, such as Stochastic Gradient Langevin Dynamics (SGLD) by Welling and Teh (2011) and Stochastic Gradient Fisher Scoring by Ahn et al. (2012).
Combining their results with the Langevin dynamics of MCMC algorithms, we can write the approximation to our DA-DL updating scheme as
$$\begin{pmatrix} W_0 \\ b_0 \end{pmatrix}^{(t+1)} = \begin{pmatrix} W_0 \\ b_0 \end{pmatrix}^{(t)} + \sigma^2\, \nabla \log f_0\big(Z_1^{(t)} W_0^{(t)} + b_0^{(t)}\big) + \sigma\epsilon_{0t},$$
$$B^{(t+1)} = B^{(t)} - \eta\, \nabla^2 f_{B^*}(x)\, B^{(t)} + \frac{C}{\sqrt{S}}\,\eta\,\epsilon_{Bt}.$$
Similar adaptive dynamics are also observed in other methods. Geman and Hwang (1986) show the convergence of the annealing process using Langevin equations. Slice sampling (Neal, 2003) adaptively chooses the step size based on the local properties of the density function; by constructing local quadratic approximations, it can adapt to the dependencies between variables. Murray et al. (2010) further propose elliptical slice sampling, which operates on an ellipse of states.
4 Applications
To illustrate our methodology, we provide three examples: (1) a standard Gaussian
regression model with squared loss; (2) a binary classification model under the support
vector machine framework; (3) a logistic regression model paired with a Pólya mixing
distribution. For the Gaussian regression and SVM models, we implement the J-copies stacking strategy to provide full posterior modes.
Before diving into the examples, we introduce the notation used throughout this section. We continue to denote the output by $y = (y_1, \ldots, y_n)^\top$, $y_i \in \mathbb{R}$, the input by $x = (x_1, \ldots, x_n)^\top$, $x_i \in \mathbb{R}^p$, and the latent variable of the top layer by $Z_1 = (z_{1,1}, \ldots, z_{1,n})^\top$, $z_{1,i} \in \mathbb{R}$, with the stacked version as in (3.4). We introduce stochastic noises $\epsilon_0 = (\epsilon_{0,1}, \ldots, \epsilon_{0,n})^\top$ in the top layer and $\epsilon_z = (\epsilon_{z,1}, \ldots, \epsilon_{z,n})^\top$ in the second layer, where $\epsilon_{0,i} \overset{iid}{\sim} N(0, \tau_0^2)$ and $\epsilon_{z,i} \overset{iid}{\sim} N(0, \tau_z^2)$. The scale parameters $\tau_0$ and $\tau_z$ are pre-specified and determine the level of randomness, or uncertainty, for the DA-update and SGD-update respectively. We use $\eta$ to denote the learning rate used in the SGD updates.
Algorithm 1 (DA-GR):
2: for epoch $t = 1, \ldots, T$ do
3: Update the weights in the top layer with $\{y^{(S)}, Z_1^{(t,S)}\}$:
$$W_0^{(t)} = \mathrm{Cov}\big(Z_1^{(t,S)}, y^{(S)}\big) \big/ \mathrm{Var}\big(Z_1^{(t,S)}\big)$$
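In the one-dimensional case, this top-layer refresh is simply the least-squares slope of $y^{(S)}$ on the current latent features; a two-line sketch of ours:

```python
import numpy as np

def update_top_layer(z1, y):
    """W0 = Cov(Z1, y) / Var(Z1): the least-squares coefficient of y on Z1."""
    return np.cov(z1, y)[0, 1] / np.var(z1, ddof=1)
```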
where IG denotes the Inverse Gaussian distribution and $\mathbf{1} = (1, \ldots, 1)^\top$ is an $n$-dimensional unit vector.

The J-copies strategy can also be adopted here: $Z_1^j$ and $\lambda^j$ need to be sampled independently for $j = 1, \ldots, J$. Algorithm 2 summarizes the updating scheme with J-copies for SVMs.
Algorithm 2 (DA-SVM with J-copies):
Sample the augmentation variables and top-layer weights,
$$\{\lambda^{(t,S)}\}^{-1} \mid W_0^{(t-1)}, y^{(S)}, Z_1^{(t,S)} \sim IG\big(\big|1 - y^{(S)} Z_1^{(t,S)} W_0^{(t-1)}\big|^{-1}, \tau_1^2\big),$$
$$W_0^{(t)} \mid y^{(S)}, Z_1^{(t,S)}, \lambda^{(t,S)} \sim N\big(\mu_\omega^{(t)}, \sigma_\omega^{2\,(t)}\big).$$
4: Update the deep learner $f_B$ with $\{Z_1^{(t,S)}, x^{(S)}\}$ via SGD:
$$B^{(t)} = B^{(t-1)} - \eta\, \nabla f_{B^{(t-1)}}\big(x^{(S)} \mid Z_1^{(t,S)}\big).$$
5: Update $Z_1^{(S)}$ jointly from the deep learner $f_B$ and the sampling layer $W_0$:
$$Z_1^{j,(t+1)} \mid W_0^{(t)}, \lambda^{j,(t)}, y, f_{B^{(t)}}(x) \overset{iid}{\sim} N\big(\mu_z^{(t)}, \sigma_z^{2\,(t)}\big), \quad j = 1, \ldots, J.$$
6: return $\hat y = 1$ if $W_0^{(T)} f_{B^{(T)}}(x) > 0$, and $\hat y = -1$ otherwise.
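For intuition, the sketch below implements the two conditionals of the SVM data augmentation in the simplest one-layer case $Z_1 = x$, so that the top-layer weight plays the role of a linear SVM coefficient vector beta; it follows the variance-mean mixture conditionals of Polson and Scott (2011), omits the J-copies stacking and the deep-learner update, and the prior variance nu2 is our illustrative assumption:

```python
import numpy as np

def da_svm_gibbs(X, y, n_iter=500, nu2=1.0, seed=0):
    """Gibbs sampler for a Bayesian linear SVM via the hinge-loss mixture
    representation; y takes values in {-1, +1} and beta ~ N(0, nu2 * I)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xs = X * y[:, None]                          # rows x_{*i} = y_i x_i
    beta = np.zeros(p)
    for _ in range(n_iter):
        # lambda_i^{-1} | beta ~ InverseGaussian(|1 - y_i x_i' beta|^{-1}, 1)
        resid = np.abs(1.0 - Xs @ beta)
        lam = 1.0 / rng.wald(1.0 / np.maximum(resid, 1e-10), 1.0)
        # beta | lambda ~ N(b, B), B^{-1} = nu2^{-1} I + Xs' diag(1/lambda) Xs
        prec = np.eye(p) / nu2 + Xs.T @ (Xs / lam[:, None])
        cov = np.linalg.inv(prec)
        b = cov @ (Xs.T @ (1.0 + 1.0 / lam))
        beta = rng.multivariate_normal(b, cov)
    return beta
```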
$$W_0^{(t+1)} = \big(\tau^{-2}\Lambda^{(t)} + x_*^\top \Omega^{(t)} x_*\big)^{-1}\Big(\frac{1}{2}\, x_*^\top \mathbf{1}\Big),$$
$$\omega_i^{(t+1)} = \frac{1}{z_i^{(t+1)}}\bigg(\frac{e^{z_i^{(t+1)}}}{1 + e^{z_i^{(t+1)}}} - \frac{1}{2}\bigg), \qquad \lambda^{(t+1)} = \frac{\kappa_W + \tau^2\, \phi'\big(W_0^{(t+1)} \mid \tau\big)}{W_0^{(t)} - \mu_W},$$
where $z_i^{(t)} = y_i z_{1,i}^\top W_0^{(t)} = y_i\, \mathrm{logit}(\hat y_i^t)$, $x_*$ is a matrix with rows $x_{*i} = y_i z_{1,i}$, and $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ are diagonal matrices; $x_*$ can be written as $x_* = \mathrm{diag}(y) Z_1$, and $\phi'(\cdot)$ denotes the derivative of the standard normal density function.
In the non-penalized case, with $\lambda_i = 0$ for every $i$, the updates simplify to a weighted least squares regression.
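These EM updates can be sketched in a few lines for the one-layer case with $y_i \in \{0, 1\}$ and a Gaussian prior: the E-step fills in $\omega_i = E[\omega_i \mid \psi_i] = \tanh(\psi_i/2)/(2\psi_i)$, which equals $(1/\psi_i)\big(e^{\psi_i}/(1+e^{\psi_i}) - \tfrac{1}{2}\big)$, and the M-step is the weighted least squares above. The prior scale tau2 is an illustrative choice of ours:

```python
import numpy as np

def pg_em_logit(X, y, tau2=10.0, n_iter=50):
    """MAP logistic regression by EM with Polya-Gamma augmentation:
    E-step  omega_i = tanh(psi_i/2)/(2 psi_i)  with psi = X beta,
    M-step  beta = (X' Omega X + I/tau2)^{-1} X' kappa,  kappa = y - 1/2.
    y takes values in {0, 1}."""
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - 0.5
    for _ in range(n_iter):
        psi = X @ beta
        psi_safe = np.where(np.abs(psi) < 1e-8, 1e-8, psi)  # tanh(u/2)/(2u) -> 1/4 at 0
        omega = np.tanh(psi_safe / 2.0) / (2.0 * psi_safe)
        beta = np.linalg.solve(X.T @ (X * omega[:, None]) + np.eye(p) / tau2,
                               X.T @ kappa)
    return beta
```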
5 Experiments
We illustrate the performance of our methods on both synthetic and real datasets,
compared to the deep ReLU networks without the data augmentation layer. We refer
to the latter as DL in our results. We denote the data-augmented Gaussian regression in
Algorithm 1 as DA-GR, the SVM implementation in Algorithm 2 as DA-SVM and the
logistic regression in Algorithm 3 as DA-logit. For appropriate comparison, we adopt the
same network structures, such as the number of layers, the number of hidden nodes, and
regularizations like dropout rates, for DL and our methods. The differences between our
methods and DL are that (1) the top layer weights W0 , b0 of DL are updated via SGD
optimization, while the weights W0 , b0 of our methods are updated via MCMC or EM;
(2) for binary classification, DA-logit and DL adopt a sigmoid activation function in the
top layer to produce a binary output, while DA-SVM uses a linear function in the top
layer and the augmented sampling layer transforms the continuous value into a binary
output. For all experiments, the datasets are randomly partitioned into 70% training and 30% testing. For the optimization we use a modification of the SGD algorithm, the Adaptive moment estimation (Adam, Kingma and Ba (2015)) algorithm. Adam combines the estimate of the stochastic gradient with the earlier estimate of the gradient, and scales this using an estimate of the second moment of the unit-level gradient. We have also explored the RMSprop optimizer (Tieleman and Hinton, 2012) and observe similar decreases in regression and classification errors.
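For reference, one Adam step as described above; a standard textbook sketch, not tied to the paper's implementation:

```python
import numpy as np

def adam_step(grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma and Ba, 2015): exponential moving averages of the
    gradient (m) and of its elementwise square (v), with bias correction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)           # t starts at 1
    v_hat = v / (1 - b2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    return step, m, v                   # new parameters: theta - step
```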
To illustrate how the choice of J could affect the speed of convergence, we include
different implementations of DA-GR and DA-SVM with J = 2, 5, 10. We have explored different sampling noise variances $\tau_0, \tau_z$, but these choices, in general, do not affect the results significantly.
where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and only the first 5 covariates are predictive of $y_i$. We run the experiments with $n = 100, 1\,000$ and $p = 10, 50, 100, 1\,000$ to explore the performance in both low-dimensional and high-dimensional scenarios. We implement both one-layer ($L = 1$) and two-layer ($L = 2$) ReLU networks with 64 hidden units in each layer. For the DA-GR model, we let $\tau_0 = 0.1$, $\tau_z = 1$. The experiments are repeated 50 times with different random seeds.
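The displayed data-generating equation did not survive extraction; the sketch below assumes the standard Friedman (1991) benchmark, which is consistent with the description that only the first five covariates matter:

```python
import numpy as np

def friedman_data(n, p, sigma=1.0, seed=0):
    """Assumed setup: the Friedman (1991) benchmark, in which only the first
    five of the p covariates drive the response."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n, p))
    y = (10.0 * np.sin(np.pi * x[:, 0] * x[:, 1]) + 20.0 * (x[:, 2] - 0.5) ** 2
         + 10.0 * x[:, 3] + 5.0 * x[:, 4] + sigma * rng.normal(size=n))
    return x, y
```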
Figure 2 reports the three quartiles of the out-of-sample mean squared errors (MSEs).
The top row is the performance of the one-layer networks and the bottom row is the
performance of the two-layer networks. The two-layer networks perform better and
converge faster. For DA-GR, when J = 5 or J = 10, it converges significantly faster
and the prediction errors are also smaller. When J = 2, the performance of DA-GR is
relatively similar to the deep learning model with only SGD updates. This is because DA-GR with J-copies learns the posterior mode, which is equivalent to the minimizer of the objective function, and it concentrates on the mode faster as J becomes larger.
The computation costs of DA are higher as shown in Figure 3. This is not entirely
unexpected since we introduce sampling steps. When J increases, the computation costs
also increase slightly. Given the improvement in convergence speed and prediction errors,
our data augmentation strategies are still worthwhile even with some extra computation
costs. In addition, for each epoch we can draw the sample-wise posteriors in parallel, and the gap in computation time can be further reduced.
Figure 2: Quartiles of out-of-sample MSEs under the Friedman Setup. We explore cases
where n = 1 000 and p = 10, 50, 100, 1 000. The tests are repeated 50 times. The medians
of out-of-sample MSEs after training for 1 to 10 epochs are plotted with lines and
the vertical bars mark the 25% and 75% quantiles of the MSEs. DA-GR refers to
DA Gaussian regression shown in Algorithm 1 and DL stands for the ReLU networks
without the data-augmentation layer.
Figure 3: Computation Time under the Friedman setup. The setups are n = 1 000 and
p = 10, 50, 100, 1 000. The averaged time (over 50 repetitions) for computing 1 to 10
epochs is plotted with lines and the vertical bars mark the 25% and 75% quantiles
of the computation time collected. We only include one figure of computation time
comparison here since the scale is relatively the same for all cases.
1 https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
Figure 4: Out-of-sample MSEs for the Boston Housing dataset. The experiment is re-
peated 20 times with different training subsampling. The medians of MSEs after training
for 1 to 50 epochs are provided, with the vertical bars marking the 25% and 75% quan-
tiles of the errors. DA-GR refers to the data augmentation strategy in Algorithm 1 and
DL stands for the ReLU networks without the data-augmentation layer.
rating:      3    4     5     6    7    8   9
frequency:  20  163  1457  2198  880  175   5

Table 2: Frequencies of Different Wine Ratings.
The most frequent ratings are 5 and 6. Since we focus on binary classification problems, we provide two types of classifications, both of which have relatively balanced categories: (1) wine with a rating of 5 versus a rating of 6 (Test 1); (2) wine with a rating of ≤ 5 versus a rating of > 5 (Test 2). We use the same network architectures adopted in Friedman's example with $\tau_0 = \tau_z = 0.1$.

2 P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, "Wine Quality Data Set", UCI Machine Learning Repository.
Figure 5 provides results for the two types of binary classifications. In both cases, DA-SVM performs better than DA-logit and DL. The advantage of a large J is still significant and helps convergence, especially in the early phase. DA-logit outperforms DL in Test 1 when the network is shallow (L = 1), while in the other cases it performs similarly to DL.
Figure 5: Binary Classifications on the Wine Quality dataset. Two types of binary
classifications are considered here. The experiment is repeated 20 times with different
training subsampling. We compare the misclassification rates of DA-SVM in Algorithm 2
with J = 2, 5, 10, DA-logit in Algorithm 3 and the ReLU networks without the data
augmentation layer (DL), after training for 1 to 10 epochs.
3 https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
booking was made. The countries are denoted with their standard codes, as ‘AU’ for
Australia, ‘CA’ for Canada, ‘DE’ for Germany, ‘ES’ for Spain, ‘FR’ for France, ‘UK’
for United Kingdom, ‘IT’ for Italy, ‘NL’ for Netherlands, ‘PT’ for Portugal, ‘US’ for
United States. Table 3 reports the percentage of each class. We follow the preprocessing
steps in Polson and Sokolov (2017). The list of variables contains information from the
sessions records (number of sessions, summary statistics of action types, device types
and session duration), and user tables such as gender, language, affiliate provider etc.
All categorical variables are converted to binary dummies, which leads to 661 features
in total. For the neural network architecture, we use a two-layer ReLU network with 64
hidden units in each layer and set the dropout rate to 0.3. For the SVM model, we let $\tau_0 = \tau_z = 0.1$.
          AU    CA    DE    ES    FR    UK    IT    NDF    NL    PT     US  other
% obs   0.25  0.67  0.50  1.05  2.35  1.09  1.33  58.35  0.36  0.10  29.22   4.73

Table 3: Percentage of observations in each destination class.
Our goal is to test the binary classification models on this dataset. We consider two types of binary responses, both of which have relatively balanced numbers of observations in each category:
1. Spain (1.05%) vs. United Kingdom (1.09%)
2. United Kingdom (1.09%) vs. Italy (1.33%)
Figure 6: Binary Classifications on the Airbnb Booking Dataset. Two types of binary
classifications are considered here. The experiment is repeated 20 times with different
training subsampling. We compare the misclassification rates of DA-SVM in Algorithm 2
with J = 2, 5, 10, DA-logit in Algorithm 3 and the ReLU networks without the data
augmentation layer, after training for 1 to 20 epochs.
Figure 6 shows the binary classifications for Spain versus UK and UK versus Italy. In both cases, the out-of-sample misclassification rates are not small and the fluctuations over epochs are large, suggesting that a better model structure may be needed. However, we still observe that DA-SVM with J = 5 or J = 10 has smaller classification errors over epochs, and that the out-of-sample errors decrease faster during the early phase of training.
6 Discussion
Various regularization methods have been deployed in neural networks to prevent over-
fitting, such as early stopping, weight decay, dropout (Hinton et al., 2012), gradient
noise (Neelakantan et al., 2017). Bayesian strategies tackle the regularization problem
by proposing probability structures on the weights. We show that data augmentation
strategies are available for many standard activation functions (ReLU, SVM, logit) used
in deep learning.
Using MCMC provides a natural stochastic search mechanism that avoids procedures such as back-tracking and provides a full description of the objective function over the entire range Θ. Training deep neural networks thus benefits from the additional hidden units introduced by data augmentation.
References
Ahn, S., Balan, A. K., and Welling, M. (2012). "Bayesian posterior sampling via stochastic gradient Fisher scoring." In Proceedings of the 29th International Conference on Machine Learning, 1591–1598.
Armagan, A., Dunson, D. B., and Lee, J. (2013). "Generalized double Pareto shrinkage." Statistica Sinica, 23(1): 119–143.
Bauer, B. and Kohler, M. (2019). "On deep learning as a remedy for the curse of dimensionality in nonparametric regression." The Annals of Statistics, 47(4): 2261–2285.
Bhadra, A., Datta, J., Polson, N., Sokolov, V., and Xu, J. (2021). "Merging two cultures: deep and statistical learning." arXiv preprint arXiv:2110.11561.
Bhattacharya, A., Chakraborty, A., and Mallick, B. K. (2016). "Fast sampling with Gaussian scale mixture priors in high-dimensional regression." Biometrika, 103(4): 985–991.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). "Improving neural networks by preventing co-adaptation of feature detectors." arXiv preprint arXiv:1207.0580.
Hunter, D. R. and Lange, K. (2004). "A tutorial on MM algorithms." The American Statistician, 58(1): 30–37.
Jacquier, E., Johannes, M., and Polson, N. (2007). "MCMC maximum likelihood for latent state models." Journal of Econometrics, 137(2): 615–640.
Kingma, D. P. and Ba, J. (2015). "Adam: A method for stochastic optimization." In International Conference on Learning Representations.
Kolmogorov, A. N. (1957). "On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition." In Doklady Akademii Nauk, volume 114, 953–956. Russian Academy of Sciences.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). "Imagenet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems, 25: 1097–1105.
Lange, K. (2013a). "The MM algorithm." In Optimization, 185–219. Springer.
— (2013b). Optimization, volume 95. Springer Science & Business Media.
Lange, K., Hunter, D. R., and Yang, I. (2000). "Optimization transfer using surrogate objective functions." Journal of Computational and Graphical Statistics, 9(1): 1–20.
Liang, S. and Srikant, R. (2017). "Why deep neural networks for function approximation?" In International Conference on Learning Representations.
Ma, Y.-A., Chen, Y., Jin, C., Flammarion, N., and Jordan, M. I. (2019). "Sampling can be faster than optimization." Proceedings of the National Academy of Sciences, 116(42): 20881–20885.
Mallick, B. K., Ghosh, D., and Ghosh, M. (2005). "Bayesian classification of tumours by using gene expression data." Journal of the Royal Statistical Society: Series B, 67(2): 219–234.
Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). "Stochastic gradient descent as approximate Bayesian inference." Journal of Machine Learning Research, 18(1): 4873–4907.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). "Equation of state calculations by fast computing machines." Journal of Chemical Physics, 21(6): 1087–1092.
Mhaskar, H., Liao, Q., and Poggio, T. A. (2017). "When and why are deep networks better than shallow ones?" In Proceedings of the 31st Conference on Artificial Intelligence, 2343–2349.
Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). "On the number of linear regions of deep neural networks." In Advances in Neural Information Processing Systems, 2924–2932.
Murray, I., Adams, R., and MacKay, D. (2010). "Elliptical slice sampling." In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 541–548.
Neal, R. M. (2003). "Slice sampling." The Annals of Statistics, 705–741.
— (2011). "MCMC using Hamiltonian dynamics." Handbook of Markov Chain Monte Carlo, 2(11): 2.
Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., and Martens, J. (2017). "Adding gradient noise improves learning for very deep networks." International Conference on Learning Representations.
Nesterov, Y. (1983). "A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$." In Doklady AN USSR, volume 269, 543–547.
Newton, M. A., Polson, N. G., and Xu, J. (2021). "Weighted Bayesian bootstrap for scalable posterior distributions." Canadian Journal of Statistics, 49(2): 421–437.
Phillips, D. B. and Smith, A. F. (1996). "Bayesian model comparison via jump diffusions." Markov Chain Monte Carlo in Practice, 215–239.
Pincus, M. (1968). "A closed form solution of certain programming problems." Operations Research, 16(3): 690–694.
— (1970). "A Monte Carlo method for the approximate solution of certain types of constrained optimization problems." Operations Research, 18(6): 1225–1228.
Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. (2017). "Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review." International Journal of Automation and Computing, 14(5): 503–519.
Polson, N., Sokolov, V., and Xu, J. (2021). "Deep learning partial least squares." arXiv preprint arXiv:2106.14085.
Polson, N. G. and Rockova, V. (2018). "Posterior concentration for sparse deep learning." In Advances in Neural Information Processing Systems, 938–949.
Polson, N. G. and Scott, J. G. (2013). "Data augmentation for non-Gaussian regression models using variance-mean mixtures." Biometrika, 100(2): 459–471.
Polson, N. G., Scott, J. G., and Windle, J. (2013). "Bayesian inference for logistic models using Pólya-Gamma latent variables." Journal of the American Statistical Association, 108(504): 1339–1349.
Polson, N. G. and Scott, S. L. (2011). "Data augmentation for Support Vector Machines." Bayesian Analysis, 6(1): 1–23.
Polson, N. G. and Sokolov, V. (2017). "Deep learning: a Bayesian perspective." Bayesian Analysis, 12(4): 1275–1304.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). "Stochastic backpropagation and approximate inference in deep generative models." In International Conference on Machine Learning, 1278–1286.