Unbiased Bayes for Big Data:
Paths of Partial Posteriors
Heiko Strathmann
Gatsby Unit, University College London
Oxford ML lunch, February 25, 2015
Joint work
Being Bayesian: Averaging beliefs of the unknown
φ = ∫ dθ ϕ(θ) p(θ|D)    (expectation under the posterior)
where p(θ|D) ∝ p(D|θ) p(θ)    (likelihood of the data D × prior)
Metropolis-Hastings Transition Kernel
Target π(θ) ∝ p(θ|D)
At iteration j + 1, from state θ(j):
Propose θ' ∼ q(θ'|θ(j))
Accept θ(j+1) ← θ' with probability
min{ [π(θ') / π(θ(j))] × [q(θ(j)|θ') / q(θ'|θ(j))], 1 }
Reject, i.e. θ(j+1) ← θ(j), otherwise.
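A minimal sketch of one such transition, to fix notation; the function and argument names are illustrative, not from the talk:

```python
import numpy as np

def mh_step(theta, log_target, propose, log_q, rng=None):
    """One Metropolis-Hastings transition.

    log_target(theta): unnormalised log pi(theta) = log p(D|theta) + log p(theta)
    propose(theta):    draws theta' ~ q(.|theta)
    log_q(a, b):       evaluates log q(a|b)
    """
    rng = rng or np.random.default_rng()
    theta_prop = propose(theta)
    log_ratio = (log_target(theta_prop) - log_target(theta)
                 + log_q(theta, theta_prop) - log_q(theta_prop, theta))
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return theta_prop   # accept
    return theta            # reject
```

For big D, every call to log_target touches all N likelihood terms, which is exactly the bottleneck discussed next.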
Big D & MCMC
Need to evaluate
p(θ|D) ∝ p(D|θ)p(θ)
in every iteration.
For example, for D = {x1, . . . , xN},
p(D|θ) = ∏_{i=1}^{N} p(xi|θ)
Infeasible for growing N
Lots of current research: Can we use subsets of D?
Desiderata for Bayesian estimators
1. No (additional) bias
2. Finite & controllable variance
3. Computational costs sub-linear in N
4. No problems with transition kernel design
Outline
Literature Overview
Partial Posterior Path Estimators
Experiments & Extensions
Discussion
Outline
Literature Overview
Partial Posterior Path Estimators
Experiments & Extensions
Discussion
Stochastic gradient Langevin (Welling & Teh 2011)
∆θ = (εj / 2) [ ∇θ log p(θ) |θ=θ(j) + ∇θ ∑_{i=1}^{N} log p(xi|θ) |θ=θ(j) ] + ηj
Two changes:
1. Noisy gradients with mini-batches. Let I ⊆ {1, . . . , N} and use the log-likelihood gradient ∇θ ∑_{i∈I} log p(xi|θ) |θ=θ(j)
2. Don't evaluate the MH ratio, but always accept; decrease step-size/noise εj → 0 to compensate, with ∑_{j=1}^{∞} εj = ∞ and ∑_{j=1}^{∞} εj² < ∞
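A minimal sketch of the resulting update, assuming the usual N/|I| rescaling of the mini-batch gradient from Welling & Teh; the gradient callables and hyperparameters are placeholders:

```python
import numpy as np

def sgld_update(theta, grad_log_prior, grad_log_lik, X, batch_size, eps, rng):
    """One stochastic gradient Langevin step (sketch).

    grad_log_lik(X_batch, theta): summed gradient of log p(x_i|theta) over the batch.
    The N/|I| factor rescales the mini-batch sum to an unbiased estimate of the
    full-data gradient; eps plays the role of the decreasing step size eps_j.
    """
    N = X.shape[0]
    idx = rng.choice(N, size=batch_size, replace=False)
    grad = grad_log_prior(theta) + (N / batch_size) * grad_log_lik(X[idx], theta)
    noise = rng.normal(scale=np.sqrt(eps), size=np.shape(theta))
    return theta + 0.5 * eps * grad + noise
```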
Austerity (Korattikara, Chen, Welling 2014)
Idea: rewrite MH ratio as hypothesis test
At iteration j, draw u ∼ Uniform[0, 1] and compute
µ0 = (1/N) log[ u × p(θ(j)) / p(θ') × q(θ'|θ(j)) / q(θ(j)|θ') ]
µ = (1/N) ∑_{i=1}^{N} li,   where li := log p(xi|θ') − log p(xi|θ(j))
Accept if µ > µ0; reject otherwise
Subsample the li, central limit theorem, t-test
Increase data if no significance, multiple testing correction
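A sketch of the sequential test at the core of this approach; the finite-population correction and significance threshold ε follow Korattikara et al., but the exact batching schedule is left to the caller and is an assumption here:

```python
import numpy as np
from scipy import stats

def austerity_decision(l_subsample, N, mu0, eps=0.05):
    """Try to decide the MH test from a subsample of the l_i (sketch).

    Returns (decided, accept). If not decided, the caller enlarges the
    subsample and tests again (with a multiple-testing correction).
    """
    l = np.asarray(l_subsample)
    n = len(l)
    mu_hat = l.mean()
    # standard error of the subsample mean, with finite-population correction
    se = l.std(ddof=1) / np.sqrt(n) * np.sqrt(1.0 - (n - 1) / (N - 1))
    t_stat = (mu_hat - mu0) / se
    p_val = 1.0 - stats.t.cdf(np.abs(t_stat), df=n - 1)
    if p_val < eps:
        return True, mu_hat > mu0
    return False, False
```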
Bardenet, Doucet, Holmes 2014
Similar to Austerity, but with analysis:
Concentration bounds for MH (CLT might not hold)
Bound for probability of wrong decision
For uniformly ergodic original kernel
Approximate kernel converges
Bound for TV distance of approximation and target
Limitations:
Still approximate
Only random walk
Uses all data on hard (?) problems
Firefly MCMC (Maclaurin & Adams 2014)
First asymptotically exact MCMC kernel using sub-sampling
Augment state space with binary indicator variables
Only a few data points are "bright"
Dark points approximated by a lower bound on likelihood
Limitations:
Bound might not be available
Loose bounds → worse than standard MCMC → need MAP estimate
Linear in N: likelihood evaluations at least q_{dark→bright} · N
Mixing time cannot be better than 1/q_{dark→bright}
Alternative transition kernels
Existing methods construct alternative transition kernels.
(Welling & Teh 2011), (Korattikara, Chen, Welling 2014), (Bardenet, Doucet, Holmes 2014),
(Maclaurin & Adams 2014), (Chen, Fox, Guestrin 2014).
They
use mini-batches
inject noise
augment the state space
make clever use of approximations
Problem: Most methods
are biased
have no convergence guarantees
mix badly
Reminder: Where we came from – expectations
E_{p(θ|D)} {ϕ(θ)},   ϕ : Θ → R
Idea: Assuming the goal is estimation, give up on simulation.
Outline
Literature Overview
Partial Posterior Path Estimators
Experiments & Extensions
Discussion
Idea Outline
1. Construct partial posterior distributions
2. Compute partial expectations (biased)
3. Remove bias
Note:
No simulation from p(θ|D)
Partial posterior expectations less challenging
Exploit standard MCMC methodology & engineering
But not restricted to MCMC
Disclaimer
Goal is not to replace posterior sampling, but to provide a ...
different perspective when the goal is estimation
Method does not do uniformly better than MCMC, but ...
we show cases where computational gains can be achieved
Partial Posterior Paths
Model p(x, θ) = p(x|θ)p(θ), data D = {x1, . . . , xN}
Full posterior πN := p(θ|D) ∝ p(x1, . . . , xN|θ)p(θ)
L subsets Dl of sizes |Dl| = nl
Here: n1 = a, n2 = 2a, n3 = 4a, . . . , nL = 2^{L−1} a
Partial posterior ˜πl := p(θ|Dl) ∝ p(Dl|θ) p(θ)
Path from prior to full posterior:
p(θ) = ˜π0 → ˜π1 → ˜π2 → · · · → ˜πL = πN = p(θ|D)
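A small sketch of how such a path of subsets could be built; whether the subsets are nested and drawn from a single permutation is an implementation choice assumed here, not something the slides specify:

```python
import numpy as np

def partial_posterior_subsets(N, a, rng=None):
    """Index sets D_1 ⊂ D_2 ⊂ ... ⊂ D_L with |D_l| = min(a * 2**(l-1), N) (sketch)."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(N)
    sizes, n = [], a
    while n < N:
        sizes.append(n)
        n *= 2
    sizes.append(N)           # the last subset is the full data set
    return [perm[:s] for s in sizes]
```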
Gaussian Mean, Conjugate Prior
[Figure: partial posteriors over (µ1, µ2), contracting from the prior through subsets containing 1/100, 2/100, 4/100, 8/100, 16/100, 32/100, 64/100 of the data to the full posterior 100/100]
Partial posterior path statistics
For partial posterior paths
p(θ) = ˜π0 → ˜π1 → ˜π2 → · · · → ˜πL = πN = p(θ|D)
define a sequence {φt}_{t=1}^{∞} as
φt := ˆE_{˜πt}{ϕ(θ)} for t < L
φt := φ := ˆE_{πN}{ϕ(θ)} for t ≥ L
This gives
φ1 → φ2 → · · · → φL = φ
ˆE_{˜πt}{ϕ(θ)} is an empirical estimate. Not necessarily MCMC.
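One way to obtain such an empirical estimate is to run any MCMC kernel on the partial posterior and average; a minimal sketch, where the transition kernel is passed in as a callable (e.g. the mh_step sketch above with log_target set to log p(Dt|θ) + log p(θ)):

```python
import numpy as np

def partial_posterior_expectation(phi, mcmc_step, theta0, n_samples, n_burnin=100):
    """Empirical estimate of E_{pi_t}{phi(theta)} from a Markov chain (sketch).

    mcmc_step(theta): one transition targeting the partial posterior pi_t.
    """
    theta, draws = theta0, []
    for i in range(n_burnin + n_samples):
        theta = mcmc_step(theta)
        if i >= n_burnin:
            draws.append(phi(theta))
    return np.mean(draws, axis=0)
```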
Debiasing Lemma (Rhee & Glynn 2012, 2014)
Let φ and {φt}_{t=1}^{∞} be real-valued random variables. Assume
lim_{t→∞} E|φt − φ|² = 0
and let T be an integer-valued random variable with P[T ≥ t] > 0 for t ∈ N. Assume further
∑_{t=1}^{∞} E|φt−1 − φ|² / P[T ≥ t] < ∞
Then
φ*_T = ∑_{t=1}^{T} (φt − φt−1) / P[T ≥ t]
is an unbiased estimator of E{φ}.
Here: we may set P[T ≥ t] = 0 for t > L, since φt+1 − φt = 0 for t ≥ L
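A minimal sketch of a single debiasing replicate φ*_T under the slides' choices (geometric truncation P(T = t) ∝ 2^{−αt} restricted to t ≤ L, and the φ0 = 0 convention of the telescoping sum); phi(t) stands for any routine returning the empirical partial posterior expectation for subset size nt:

```python
import numpy as np

def debiased_estimate(phi, L, alpha, rng=None):
    """One replicate phi*_T of the Rhee & Glynn estimator over a partial
    posterior path (sketch)."""
    rng = rng or np.random.default_rng()
    lam = 2.0 ** (-alpha * np.arange(1, L + 1))
    lam /= lam.sum()                       # P(T = t), t = 1..L
    tail = lam[::-1].cumsum()[::-1]        # tail[t-1] = P(T >= t)
    T = rng.choice(L, p=lam) + 1           # random truncation time
    est, phi_prev = 0.0, 0.0               # phi_0 = 0 convention
    for t in range(1, T + 1):
        phi_t = phi(t)                     # only subsets up to size n_T are touched
        est += (phi_t - phi_prev) / tail[t - 1]
        phi_prev = phi_t
    return est
```

Averaging R independent replicates gives the running estimate (1/R) ∑_{r=1}^{R} ϕ*_r used in the experiments below.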
Algorithm illustration
[Figure sequence over (µ1, µ2): individual debiasing estimates ϕ*_r accumulate around the true posterior mean, starting from the prior mean; the running average (1/R) ∑_{r=1}^{R} ϕ*_r converges to the true posterior mean]
Computational complexity
Assume geometric batch size increase nt and truncation probabilities
Λt := P(T = t) ∝ 2^{−αt},   α ∈ (0, 1)
Average computational cost is sub-linear:
O( a (N/a)^{1−α} )
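The order follows from swapping the expectation with the sum of per-level costs; a short derivation, assuming the cost of level t is proportional to nt:

```latex
\mathbb{E}[\mathrm{cost}]
  = \mathbb{E}\Big[\sum_{t=1}^{T} n_t\Big]
  = \sum_{t=1}^{L} n_t\, \Pr[T \ge t]
  = \Theta\Big( \sum_{t=1}^{L} a\,2^{t-1}\, 2^{-\alpha t} \Big)
  = \Theta\big( a\, 2^{(1-\alpha)L} \big)
  = O\!\Big( a \big(\tfrac{N}{a}\big)^{1-\alpha} \Big),
\qquad \text{using } n_L = a\,2^{L-1} \approx N .
```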
Variance-computation tradeoffs in Big Data
Variance:
E{(φ*_T)²} = ∑_{t=1}^{∞} [ E{|φt−1 − φ|²} − E{|φt − φ|²} ] / P[T ≥ t]
If we assume that for all t ≤ L there are a constant c and β > 0 such that
E|φt−1 − φ|² ≤ c / nt^β
and furthermore α < β, then
∑_{t=1}^{L} E|φt−1 − φ|² / P[T ≥ t] = O(1)
and the variance stays bounded as N → ∞.
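The boundedness claim is a one-line geometric-series argument; with nt = a·2^{t−1} and P[T ≥ t] ∝ 2^{−αt} as above:

```latex
\sum_{t=1}^{L} \frac{\mathbb{E}\,|\phi_{t-1}-\phi|^{2}}{\Pr[T \ge t]}
  \;\lesssim\; \sum_{t=1}^{L} \frac{c\, (a\,2^{t-1})^{-\beta}}{2^{-\alpha t}}
  \;=\; \frac{c\,2^{\beta}}{a^{\beta}} \sum_{t=1}^{L} 2^{(\alpha-\beta)t}
  \;=\; O(1) \quad \text{whenever } \alpha < \beta .
```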
Outline
Literature Overview
Partial Posterior Path Estimators
Experiments  Extensions
Discussion
Synthetic log-Gaussian
[Figure: log N(0, σ²) model, estimates of the posterior mean of σ vs. number of data nt, from 10¹ to 10⁵]
(Bardenet, Doucet, Holmes 2014) – all data
(Korattikara, Chen, Welling 2014) – wrong result
Synthetic log-Gaussian – debiasing
[Figure: log N(µ, σ²) model, running average (1/R) ∑_{r=1}^{R} ϕ*_r of the posterior mean of σ over 300 replications, together with the data used per replication, ∑_{t=1}^{Tr} nt]
Truly large-scale version: N ≈ 10⁸
Sum of likelihood evaluations: ≈ 0.25 N
Non-factorising likelihoods
No need for
p(D|θ) = ∏_{i=1}^{N} p(xi|θ)
Example: Approximate Gaussian Process regression
Estimate predictive mean k* (K + λI)^{−1} y
No MCMC (!)
No MCMC (!)
Toy example
N = 10⁴, D = 1
m = 100 random Fourier features (Rahimi, Recht, 2007)
Predictive mean on 1000 test data
[Figure: GP regression predictive mean; MSE vs. number of data nt (convergence) and MSE of the debiasing estimate over 200 replications; average cost: 469]
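To make the toy example concrete, here is a sketch of an approximate-GP predictive mean via random Fourier features, which can serve as the statistic φt computed on each subset; the kernel bandwidth, regulariser and all names are illustrative choices, not those of the talk:

```python
import numpy as np

def rff_predictive_mean(X, y, X_test, m=100, sigma=1.0, lam=1e-3, rng=None):
    """Predictive mean of approximate GP regression with m random Fourier
    features for an RBF kernel (Rahimi & Recht 2007); a sketch."""
    rng = rng or np.random.default_rng()
    D = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(D, m))     # spectral frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)          # random phases
    feat = lambda Z: np.sqrt(2.0 / m) * np.cos(Z @ W + b)
    Phi = feat(X)
    # primal form of k*(K + lam I)^{-1} y:  Phi_* (Phi^T Phi + lam I)^{-1} Phi^T y
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)
    return feat(X_test) @ w
```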
Gaussian Processes for Big Data
(Hensman, Fusi, Lawrence, 2013): SVI & inducing variables
Airtime delays, N = 700,000, D = 8
Estimate predictive mean on 100,000 test data
[Figure: RMSE of the debiased predictive mean over 100 replications; average cost: 2773]
Outline
Literature Overview
Partial Posterior Path Estimators
Experiments & Extensions
Discussion
Conclusions
If goal is estimation rather than simulation, we arrive at
1. No bias
2. Finite & controllable variance
3. Data complexity sub-linear in N
4. No problems with transition kernel design
Practical:
Not limited to MCMC
Not limited to factorising likelihoods
Competitive initial results
Parallelisable, re-uses existing engineering effort
Still biased?
MCMC and finite time
MCMC estimator ˆE˜πt{ϕ(θ)} is not unbiased
Could imagine two-stage process
Apply debiasing to MC estimator
Use to debias partial posterior path
Need conditions on MC convergence to control variance,
(Agapiou, Roberts, Vollmer, 2014)
Memory restrictions
Partial posterior expectations need be computable
Memory limitations cause bias
e.g. large-scale GMRF (Lyne et al, 2014)
Free lunch? Not uniformly better than MCMC
Need P[T ≥ t] > 0 for all t
Negative example: a9a dataset (Welling & Teh, 2011)
N ≈ 32, 000
Converges, but full posterior sampling likely
[Figure: debiasing estimates of β1 over 200 replications]
Useful for very large (redundant) datasets
Xi'an's 'Og, Feb 2015
Discussion of M. Betancourt's note on HMC and subsampling.
...the information provided by the whole data is only available
when looking at the whole data.
See http://goo.gl/bFQvd6
Thank you
Questions?
