
Bayesian Structural Time Series Models

Steven L. Scott

August 10, 2015


Welcome!

The goal for the day is to introduce you to:


I basic ideas in structural time series modeling,
I regression modeling with spike and slab priors, and
I the bsts R package.
Course notes and materials at https://goo.gl/VUWUC9



Some good books
For structural time series, and time series in general.

Harvey; Durbin and Koopman; West and Harrison;
Chatfield; Brockwell and Davis; Petris et al.


Introduction to time series modeling

Outline

Introduction to time series modeling

Structural time series models

MCMC and the Kalman filter

Bayesian regression and spike-and-slab priors

Applications

Extensions



Introduction to time series modeling

Strategies for time series models

I Regression
I ARMA
I Smoothing
I Structural time series



Introduction to time series modeling

Regression models

Introductory statistics courses teach students to fit models like

    y_t = β_0 + β_1 t + β_2 x_t + ε_t        (the β_1 t term is the linear time trend)

1. The trend probably won’t follow a parametric form.


2. Even if it does, the residuals will be autocorrelated.



Introduction to time series modeling

Airline passengers
An example from elementary textbooks

[Figure: two panels, AirPassengers on the raw scale and log10(AirPassengers), both plotted against Time, 1950-1958.]


Introduction to time series modeling

Linear time trend doesn’t quite fit


See air-passengers-bsts.R

air <- log10(AirPassengers)
time <- 1:length(air)
months <- time %% 12
months[months == 0] <- 12
months <- factor(months, label = month.name)
reg <- lm(air ~ time + months)

[Figure: residuals vs. fitted values from this regression; the residuals range from about -0.06 to 0.06.]


Quadratic time trend
Misses serial correlation

reg <- lm(air ~ poly(time, 2) + months)
plot(reg$residuals)
acf(reg$residuals)

[Figure: residuals plotted against time, and the ACF of the residuals.]

I Predictions between months 80 - 100 predictably too low.
I Between months 100 - 120 predictably too high.

Serial correlation is cured by locality.
Introduction to time series modeling

ARMA models

ARMA(P,Q) models have the form

    y_t = Σ_{p=1}^{P} φ_p y_{t-p} + Σ_{q=0}^{Q} θ_q ε_{t-q}.

Some features that make ARMA models difficult:


1. yt must be stationary. If non-stationary then take differences until it
becomes stationary.
2. If yt contains a seasonal component, then seasonal differencing is also
required.
3. Harder to think about. (Regression of y on x vs of ∆52 ∆2 y on x).
ARMA models can be written as a special case of state space models.



Introduction to time series modeling

Stationary vs Nonstationary
See code in stationary.R

sample.size <- 1000


number.of.series <- 1000
many.ar1 <- matrix(nrow = sample.size, ncol =number.of.series)
for (i in 1:number.of.series) {
many.ar1[, i] <- arima.sim(model = list(ar = .95),
n = sample.size)
}
many.random.walk <- matrix(nrow = sample.size,
ncol = number.of.series)
for (i in 1:number.of.series) {
many.random.walk[, i] <- cumsum(rnorm(sample.size))
}
par(mfrow = c(1, 2))
plot.ts(many.ar1, plot.type = "single")
plot.ts(many.random.walk, plot.type = "single")



Introduction to time series modeling

What it looks like


Single series

[Figure: a single simulated series of length 1000 from each model.]

AR1:  y_t = 0.95 y_{t-1} + ε_t          Random walk:  y_t = y_{t-1} + ε_t


Introduction to time series modeling

What it looks like


Many series

[Figure: all 1000 simulated series overlaid, for each model.]

AR1:  y_t = 0.95 y_{t-1} + ε_t          Random walk:  y_t = y_{t-1} + ε_t


Introduction to time series modeling

Variance
AR1

    y_t = φ y_{t-1} + ε_t
        = φ(φ y_{t-2} + ε_{t-1}) + ε_t
        = ...
        = Σ_{i=0}^{t} φ^i ε_{t-i}.

If |φ| < 1 then as t → ∞, Var(y_t) → Var(ε_t)/(1 − φ²).

Random walk

    y_t = Σ_{i=0}^{t} ε_i
    Var(y_t) = σ² t

Variance diverges to ∞.
Introduction to time series modeling

Smoothing
Exponential smoothing
st = αyt + (1 − α)st−1
turns out to be the Kalman filter for the “local level” model.
Holt-Winters or “double exponential smoothing” captures a trend.

st = αyt + (1 − α)(st−1 + bt−1 )


bt = β(st − st−1 ) + (1 − β)bt−1

This is the Kalman filter for the “local linear trend” model.
“Triple” exponential smoothing can handle seasonality as well, but
the formulas are getting ridiculous!

I What happens if you want to include a regression component?


I Confidence about the “smoothed” estimate?
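As a quick illustration (not part of the course code), base R's HoltWinters() implements these smoothing recursions; a minimal sketch fitting all three variants to the airline data:

air <- log10(AirPassengers)
hw.level <- HoltWinters(air, beta = FALSE, gamma = FALSE)  # simple exponential smoothing
hw.trend <- HoltWinters(air, gamma = FALSE)                # "double" smoothing: adds a trend
hw.full  <- HoltWinters(air)                               # "triple" smoothing: adds seasonality
plot(hw.full)   # fitted values overlaid on the data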



Introduction to time series modeling

Advantages of structural time series models

I All the flexibility of regression models.

I The locality of ARMA models and smoothing.

I Can handle non-stationarity.

I Modular, so easy to combine with other additive components.
I All those “smoothing parameters” become variances that can be
  estimated from data.



Structural time series models

Outline

Introduction to time series modeling

Structural time series models


Models for trend
Modeling seasonality

MCMC and the Kalman filter

Bayesian regression and spike-and-slab priors

Applications

Extensions



Structural time series models

Structural time series models


State space form

There are two pieces to a structural time series model


Observation equation

    y_t = Z_t'α_t + ε_t,        ε_t ∼ N(0, H_t)

I y_t is the observed data at time t.
I α_t is a vector of latent variables (the “state”).
I Z_t and H_t are structural parameters (partly known).

Transition equation

    α_{t+1} = T_t α_t + R_t η_t,        η_t ∼ N(0, Q_t)

I T_t, R_t, and Q_t are structural parameters (partly known).
I η_t may be of lower dimension than α_t.


Structural time series models

Structural time series models are modular


Add your favorite trend, seasonal, regression, holiday, etc. models to the mix

[Table: each state component (trend, seasonal, regression) contributes its own block to the state vector and to Z_t and T_t.]


Structural time series models

Example:
The “basic structural model” with a regression effect and S seasons can be
written

    y_t = µ_t + γ_t + β'x_t + ε_t        (trend + seasonal + regression)
    µ_t = µ_{t-1} + δ_{t-1} + u_t
    δ_t = δ_{t-1} + v_t
    γ_t = − Σ_{s=1}^{S-1} γ_{t-s} + w_t

I Local linear trend: “level” µ_t + “slope” δ_t.
I Seasonal: S − 1 dummy variables with time-varying coefficients.
  Sums to zero in expectation.


Structural time series models Models for trend

Some models for trend

I Local level
I Local linear trend
I Generalized local linear trend
I Autoregressive models



Structural time series models Models for trend

Understanding the local level model


I The local level model is

    y_t = µ_t + ε_t,              ε_t ∼ N(0, σ²)
    µ_t = µ_{t-1} + η_{t-1},      η_t ∼ N(0, τ²)

I A compromise between the random walk model (when σ² = 0) and
  the constant mean model (when τ² = 0).
I In the random walk model, your forecast of the future (given data to
  time t) is y_t.
I In the constant mean model, your forecast is ȳ.
I The larger the ratio σ²/τ², the closer this model is to the “constant
  mean model”.
I In “state space form”
    T_t = 1, Z_t = 1, R_t = 1, H_t = σ², Q_t = τ².
Structural time series models Models for trend

Simulating the local level model


local-level.R
[Figure: three simulated local level series of length 100: tau = 1, sigma = 0 (a pure random walk); tau = 0, sigma = 1 (a constant mean plus noise); and tau = 1, sigma = 0.5.]
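A minimal sketch of the kind of simulation local-level.R performs (assumed, not the file’s actual contents):

set.seed(8675309)
SimulateLocalLevel <- function(n, tau, sigma) {
  mu <- cumsum(rnorm(n, 0, tau))   # state mu_t: a random walk with innovation sd tau
  mu + rnorm(n, 0, sigma)          # observation y_t: state plus noise with sd sigma
}
par(mfrow = c(3, 1))
plot.ts(SimulateLocalLevel(100, tau = 1, sigma = 0))    # pure random walk
plot.ts(SimulateLocalLevel(100, tau = 0, sigma = 1))    # constant mean plus noise
plot.ts(SimulateLocalLevel(100, tau = 1, sigma = 0.5))  # the compromise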
Structural time series models Models for trend

Local linear trend


local-linear-trend.R

I The model is

    y_t = µ_t + ε_t,                        ε_t ∼ N(0, σ²)
    µ_t = µ_{t-1} + δ_{t-1} + η_{µ,t-1},    η_{µ,t} ∼ N(0, τ_µ²)
    δ_t = δ_{t-1} + η_{δ,t-1},              η_{δ,t} ∼ N(0, τ_δ²)

I We normally think of a “linear trend” as y = β_0 + β_1 t + ε_t.
I With change ∆t, the expected increase in y is β_1 ∆t.
I Now each ∆t = 1, and β_1 = δ_t is a changing slope.
I Neat fact! The posterior mean of the local linear trend model is a
  smoothing spline.
Simulating local linear trend
Three simulations with σ_level = 1, σ_slope = 0.25, σ_obs = 0.5

[Figure: three simulated local linear trend series of length 100.]
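A corresponding sketch for the local linear trend (again assumed; local-linear-trend.R has the actual code):

set.seed(12345)
n <- 100
slope <- cumsum(rnorm(n, 0, 0.25))        # delta_t: the slope is itself a random walk
level <- cumsum(slope + rnorm(n, 0, 1))   # mu_t: accumulates the slope plus level noise
y <- level + rnorm(n, 0, 0.5)             # y_t: observation noise on top
plot.ts(y)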
Structural time series models Modeling seasonality

Modeling seasonality
I In the “classroom regression model”
I We used a dummy variable for each “season.”
I Left one season out (set its coefficient to zero).
I In state space models
    γ_t = − Σ_{s=1}^{S-1} γ_{t-s} + η_{t-1}

I γ_summer = −(γ_spring + γ_winter + γ_fall) + η_{t-1}
I Mean over the year is zero.
I State is S − 1 dimensional.
I Only one dimension of randomness.

For S = 4 seasons (state dimension 3):

    Z_t = (1, 0, 0)'      T_t = [ −1 −1 −1 ;  1 0 0 ;  0 1 0 ]      R_t = (1, 0, 0)'
Structural time series models Modeling seasonality

Example
Modeling the air passengers data

data(AirPassengers)
y <- log10(AirPassengers)
ss <- AddLocalLinearTrend(
list(), ## No previous state specification.
y) ## Peek at the data to specify default priors.
ss <- AddSeasonal(
ss, ## Adding state to ss.
y, ## Peeking at the data.
nseasons = 12) ## 12 "seasons"
model <- bsts(y, state.specification = ss, niter = 1000)
plot(model)
plot(model, "help")
plot(model, "comp") ## "components"
plot(model, "resid") ## "residuals"



Structural time series models Modeling seasonality

Posterior distribution of state

[Figure: pointwise posterior distribution of the state plotted over 1950-1960, with the observed values as dots.]

plot(model)
I “Fuzzy line” shows posterior distribution of state at time t.
I Blue dots are actual observations.



Structural time series models Modeling seasonality

Contributions from each component

[Figure: contributions of the trend and seasonal.12.1 components to the state mean, plotted on a common scale.]

plot(model, "comp") ## "components"



Structural time series models Modeling seasonality

Contributions from each component

[Figure: the same component contributions, with each panel drawn on its own scale.]

plot(model, "comp", same.scale = FALSE) ## "components"



Evolution of the seasonal component
[Figure: twelve panels, Season 1 through Season 12, each showing the posterior distribution of that season’s effect over 1950-1958.]


Structural time series models Modeling seasonality

Setting priors

AddLocalLinearTrend(
state.specification = NULL,
y,
level.sigma.prior = NULL, # SdPrior
slope.sigma.prior = NULL, # SdPrior
initial.level.prior = NULL, # NormalPrior
initial.slope.prior = NULL, # NormalPrior
sdy,
initial.y)



Structural time series models Modeling seasonality

Priors on standard deviations

SdPrior(sigma.guess,
sample.size = .01,
initial.value = sigma.guess,
fixed = FALSE,
upper.limit = Inf)

I This puts a gamma prior on 1/σ 2 .


I Shape (α) = sigma.guess2 × sample.size/2
I Scale (β) = sample.size/2
I If you specify an upper limit on σ then the support will be truncated.
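For example, to express a strong belief that the trend innovations are small, the prior objects can be passed straight into the state specification (a sketch; SdPrior is provided by the Boom package, on which bsts depends):

ss <- AddLocalLinearTrend(
  list(), y,    # y is the series being modeled
  level.sigma.prior = SdPrior(sigma.guess = 0.01, sample.size = 32),
  slope.sigma.prior = SdPrior(sigma.guess = 0.001, sample.size = 32,
                              upper.limit = 0.01))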



Structural time series models Modeling seasonality

What’s in the model object


Varies depending on how the function was called.
> names(model)
[1] "sigma.obs" "sigma.trend.level"
[3] "sigma.trend.slope" "sigma.seasonal.12"
[5] "final.state" "state.contributions"
[7] "one.step.prediction.errors" "has.regression"
[9] "state.specification" "original.series"

I MCMC draws of model parameters (each one is named).


I Draws of the “final” state vector (used for forecasting).
I Draws of each component’s contributions to the state mean.
I Draws of the one-step-ahead prediction errors (from the Kalman filter).
I A logical value indicating whether the model has a (static) regression
component.
I The state specification you used to call the model.
I A copy of the original data series.
Structural time series models Modeling seasonality

Prediction
### Predict the next 24 periods.
pred <- predict(model, horizon = 24)

### Plot prediction along with last 36 observations


### from training series.
plot(pred, plot.original = 36)
[Figure: the last 36 observations of the training series followed by the 24-period forecast, 1958-1963.]
MCMC and the Kalman filter

Outline

Introduction to time series modeling

Structural time series models

MCMC and the Kalman filter

Bayesian regression and spike-and-slab priors

Applications

Extensions



MCMC and the Kalman filter

Gibbs sampling for state space models


1. Simulate α ∼ p(α|θ, y) using the Kalman filter and simulation
smoother.
2. Simulate θ ∼ p(θ|α, y).
3. Goto 1.
Simulating p(θ|α, y) is done on a model-by-model basis, but for most models it
is trivial.
I For models with only variance parameters, compute the right “sum of
squared errors” and draw the variances.
I Example: local level model:

    p(1/σ_α² | α) ∝ p(1/σ_α²) σ_α^{-T} exp( − Σ_t (α_t − α_{t-1})² / (2σ_α²) )

  If the prior is p(1/σ_α²) = Ga(df/2, ss/2), then

    p(1/σ_α² | α) = Ga( (df + T)/2 ,  (ss + Σ_t (α_t − α_{t-1})²)/2 ).
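In code, this full conditional amounts to a single rgamma() call; a sketch with hypothetical names:

## Draw sigma.alpha given the sampled state alpha, under a Ga(df/2, ss/2)
## prior on 1/sigma.alpha^2.
DrawSigmaAlpha <- function(alpha, df = 1, ss = 0.1) {
  sse <- sum(diff(alpha)^2)                         # sum_t (alpha_t - alpha_{t-1})^2
  precision <- rgamma(1, shape = (df + length(alpha)) / 2,
                      rate = (ss + sse) / 2)
  sqrt(1 / precision)
}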
MCMC and the Kalman filter

The Kalman filter

[Graph: α_{t-2} → α_{t-1} → α_t, with each α_s emitting the corresponding y_s.]

I The graph shows the conditional independence relationships among


the latent and observed variables in the model.



MCMC and the Kalman filter

The Kalman filter

[Graph: α_{t-2} → α_{t-1} → α_t, with each α_s emitting the corresponding y_s.]

I At time t − 1 we start off knowing the mean and variance of αt−1


given y1 , . . . , yt−2 . (recursion)



MCMC and the Kalman filter

The Kalman filter

[Graph: α_{t-2} → α_{t-1} → α_t, with each α_s emitting the corresponding y_s.]

I At time t − 1 we start off knowing the mean and variance of αt−1


given y1 , . . . , yt−2 . (recursion)
I Then we observe yt−1 .



MCMC and the Kalman filter

The Kalman filter


[Graph: α_{t-2} → α_{t-1} → α_t, with each α_s emitting the corresponding y_s.]

I At time t − 1 we start off knowing the mean and variance of αt−1


given y1 , . . . , yt−2 . (recursion)
I Then we observe yt−1 .
I The Kalman filter computes p(αt |y1 , . . . , yt−1 ),
and the incremental likelihood: p(yt−1 |y1 , . . . , yt−2 ).
MCMC and the Kalman filter

The Kalman equations


Recall the state space form of the model is

    y_t = Z_t'α_t + ε_t,            ε_t ∼ N(0, H_t)
    α_{t+1} = T_t α_t + R_t η_t,    η_t ∼ N(0, Q_t)

The Kalman filter recursively computes p(α_{t+1} | y_{1,...,t}) = N(a_{t+1}, P_{t+1}).

    v_t = y_t − Z_t'a_t                          (1-step prediction error)
    F_t = Z_t'P_t Z_t + H_t                      (forecast variance)
    K_t = T_t P_t Z_t F_t⁻¹                      (Kalman gain . . .
    a_{t+1} = T_t a_t + K_t v_t                  . . . is a regression coefficient)
    P_{t+1} = T_t P_t (T_t − K_t Z_t')' + R_t Q_t R_t'

The derivation is tedious, but elementary.
You can use Bayes’ rule, or properties of the multivariate normal.
See [Durbin and Koopman(2012)] or many other sources.
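For the scalar local level model (Z_t = T_t = R_t = 1, H_t = σ², Q_t = τ²) these equations collapse to a few lines; a sketch of one filtering step (a hypothetical helper, not the bsts internals):

KalmanStepLocalLevel <- function(y, a, P, sigma2, tau2) {
  v <- y - a                   # 1-step prediction error
  F <- P + sigma2              # forecast variance
  K <- P / F                   # Kalman gain
  list(a = a + K * v,          # E(alpha_{t+1} | y_{1:t})
       P = P * (1 - K) + tau2, # Var(alpha_{t+1} | y_{1:t})
       loglik = dnorm(y, mean = a, sd = sqrt(F), log = TRUE))  # incremental likelihood
}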
MCMC and the Kalman filter

Forward and backward

I The Kalman filter marches forward through the data, collecting


information.

I There are corresponding algorithms that march backward through the


data, distributing information.
I Kalman smoother (useful for EM algorithm) computes p(αt |y).
I Simulation smoother
[Carter and Kohn(1994), Frühwirth-Schnatter(1995),
de Jong and Shepard(1995), Durbin and Koopman(2002)]

I The output of the Kalman filter + simulation smoother is an exact


draw from p(α|y, θ).



MCMC and the Kalman filter

Simulation smoother
[Durbin and Koopman(2002)] thought of a clever way to simulate p(α|y).
1. Simulate data with the wrong mean, but the right variance.
2. Subtract off the wrong mean, and put in the right one.

The argument goes like this:


1. For multivariate normal (α, y), Var (α|y) is independent of y.
2. Simulate fake data (α, ỹ) from a structural time series model. The
conditional distribution (α|ỹ) has the same variance as (α|y).
3. Subtract E (α|ỹ) from your simulated α’s, and add E (α|y).

[Durbin and Koopman(2012)] (Section 4.6.2) have a “fast state smoother” that
can quickly compute E (αt |y) (without computing each Pt ).
I The DK simulation smoother requires two Kalman filters (for y and ỹ) and
two “fast state smoothers.”
I Works even if Rt is not full rank.
MCMC and the Kalman filter

Break time!

Let’s take 15 minutes.



Bayesian regression and spike-and-slab priors

Outline

Introduction to time series modeling

Structural time series models

MCMC and the Kalman filter

Bayesian regression and spike-and-slab priors

Applications

Extensions



Bayesian regression and spike-and-slab priors

Linear regression

I “Bayesian regression” is just the ordinary linear model

    y_{n×1} ∼ N( X_{n×k} β_{k×1}, σ² I_{n×n} )

  with a prior on β and σ.

I A convenient prior distribution is p(β, σ²) = p(β|σ²) p(σ⁻²), where

    β | σ² ∼ N(b, σ²Ω)        1/σ² ∼ Γ(df/2, ss/2).

I This prior is conjugate to the regression likelihood (i.e. prior and
  posterior are from the same model family).


Bayesian regression and spike-and-slab priors

Posterior distribution

I Write (prior) × (likelihood), do some algebra, and you get

    β | σ², y ∼ N(β̃, σ²V)        1/σ² | y ∼ Γ(DF/2, SS/2)

  where

    V⁻¹ = X'X + Ω⁻¹              β̃ = V(X'y + Ω⁻¹b)
    DF = df + n                  SS = ss + y'y + b'Ω⁻¹b − β̃'V⁻¹β̃
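These formulas are a few lines of R (a sketch; X, y, and the prior quantities b, Omega.inv, df, ss are assumed inputs):

PosteriorRegression <- function(X, y, b, Omega.inv, df, ss) {
  V.inv <- crossprod(X) + Omega.inv                             # X'X + Omega^{-1}
  beta.tilde <- solve(V.inv, crossprod(X, y) + Omega.inv %*% b)
  DF <- df + length(y)
  SS <- ss + sum(y^2) + t(b) %*% Omega.inv %*% b -
    t(beta.tilde) %*% V.inv %*% beta.tilde
  list(beta.tilde = drop(beta.tilde), V.inv = V.inv,
       DF = DF, SS = drop(SS))
}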


Bayesian regression and spike-and-slab priors

Some useful facts about the posterior distribution

I The posterior mean

β̃ = V (XT y + Ω−1 b)

is the information-weighted average of the OLS estimate and the prior


mean. (XT y = XT Xβ̂)

I The (scaled) posterior information

V −1 = XT X + Ω−1

is the sum of the information in the prior (Ω−1 ) and data (XT X).

I If Ω⁻¹ is positive definite then so is V⁻¹ (and thus V). Saves you
  from perfect collinearity, k > n, etc.



Bayesian regression and spike-and-slab priors

Using default values makes prior specification easier

b = 0             (Helpful to cheat a tiny bit and set b_0 = ȳ.)

Ω⁻¹ = κ X'X/n     X'X/n is the average information in a single
                  observation. κ is the “number of prior observations”
                  worth of weight given to the prior.

E(σ²) ≈ ss/df     df is the weight (number of prior observations) given to
                  your guess at σ².

I Now “specifying the prior” means supplying 3 numbers: κ, df, and
  your guess at σ².
I If you don’t want to guess at σ², peek at the sample variance of y,
  and guess at R², where σ² = (1 − R²) × (sample variance).

Some useful default values: κ = 1, df = 1, R² = 0.5.


Bayesian regression and spike-and-slab priors

The marginal distribution of the data.

Because regression models are Gaussian, we can do some of the hard
integrals we can’t do in other models.

    p(β, σ⁻² | y) = p(y | β, σ²) p(β | σ²) p(σ⁻²) / p(y)
                  = p(β | σ², y) p(σ⁻² | y)

Solve for

    p(y) = p(y | β, σ²) p(β | σ²) p(σ⁻²) / [ p(β | σ², y) p(σ⁻² | y) ].


Bayesian regression and spike-and-slab priors

Sparse modeling

I If there are many predictors, one could expect many of them to have
zero coefficients.

I Machine learning people like to use “penalized log likelihood.”


I Lasso, elastic net, Dantzig selector, etc.
I Penalties to log likelihood can be interpreted as log prior distributions.
I These induce sparsity at the mode, but not in the distribution
(zero probability mass at zero).

I Spike and slab priors set some coefficients to zero with positive
probability.



The “lasso” (and related priors) are not sparse
They induce sparsity at the mode, but not in the posterior distribution

    p(β) ∝ exp( − Σ_j |β_j| )

[Figure: two panels showing the prior, the likelihood, and the resulting posterior for a single coefficient, under a weak likelihood (left) and a stronger likelihood (right).]


Bayesian regression and spike-and-slab priors

Why is this important?


I Penalized methods make a single decision about which variables are
included / excluded.
I With 100 predictors there are 2^100 models, which is about 10^30.
I Avogadro’s number is 6 × 10^23, so if each model was an atom, the
  space of models would have 1.66 million moles of mass.
I A mole of carbon is (by definition) 12g, so that’s about 20,000 kg, or
  22 (US) tons.
I So finding the best model in a space of 100 predictors is analogous to
  finding the best atom in a 22 ton block of carbon.

I This argument absurdly overstates the case (because not all predictors
are exchangeable), but any algorithm that claims to find “the right”
model with this many candidates should be viewed with suspicion.
I Some variables are obviously helpful.
I Some are obviously garbage.
I With some you’re not sure. This is where the win comes from.
Bayesian regression and spike-and-slab priors

Spike and slab priors


[George and McCulloch (1997)]

I We think most elements of β are zero.


I Let γ_j = 1 if β_j ≠ 0 and γ_j = 0 if β_j = 0.

γ = (1, 0, 0, 1, · · · , 1, 0, 0)
I Now factor the prior distribution

p(β, γ, σ −2 ) = p(βγ |γ, σ 2 )p(σ 2 |γ)p(γ)



Bayesian regression and spike-and-slab priors

A useful parameterization
This prior is conditionally conjugate given γ.

Notation
I b_γ means the elements of b where γ = 1.
I Ω⁻¹_γ means the rows and columns of Ω⁻¹ where γ = 1.

    γ ∼ Π_j π_j^{γ_j} (1 − π_j)^{1−γ_j}            “Spike”
    β_γ | γ, σ² ∼ N( b_γ, σ² (Ω⁻¹_γ)⁻¹ )           “Slab”
    1/σ² ∼ Γ( df/2, ss/2 )


Bayesian regression and spike-and-slab priors

Prior elicitation

    π_j = “expected model size” / number of predictors
    b = 0 (vector)
    Ω⁻¹ = κ { α X'X + (1 − α) diag(X'X) } / n
    ss/df = (1 − R²_expected) s_y²
    df = 1

I The Ω−1 expression is κ observations worth of prior information.


I It can help to average Ω−1 with its diagonal.
I Prior elicitation is 4 numbers: expected model size, expected R 2 , beta
weight (κ), and sigma weight (df ).
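In the BoomSpikeSlab package these four numbers map onto arguments of SpikeSlabPrior; a sketch (argument names as I understand that package, so treat them as assumptions):

library(BoomSpikeSlab)
prior <- SpikeSlabPrior(x = X, y = y,                 # design matrix and response
                        expected.model.size = 5,      # sets the pi_j
                        expected.r2 = 0.5,            # guess at R^2
                        prior.df = 1,                 # sigma weight (df)
                        prior.information.weight = 1, # beta weight (kappa)
                        diagonal.shrinkage = 0.5)     # averaging of X'X with its diagonal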



Bayesian regression and spike-and-slab priors

Gibbs sampling for spike and slab regression


For each variable j, draw γ_j | γ_{−j}, y.

    γ | y ∼ C(y) · |Ω⁻¹_γ|^{1/2} p(γ) / ( |V⁻¹_γ|^{1/2} SS_γ^{DF/2 − 1} )

I Each γ_j only assumes the values 0 or 1, so the full conditional of γ_j
  only has 2 values. Compute them both and normalize.
I The symbols in this formula are the same as on the “Posterior
  distribution” slide, but with γ subscripts.
I V_γ is the posterior variance of β_γ.
I SS_γ is a “sum of squares”.
I A |γ| × |γ| matrix needs to be inverted to compute p(γ|y).
  Cheap! (if there are lots of 0’s).
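This sampler is exposed directly through lm.spike in BoomSpikeSlab; a sketch on simulated data (the expected.model.size argument is assumed to be forwarded to the spike-and-slab prior):

library(BoomSpikeSlab)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] - 2 * X[, 7] + rnorm(n)        # only two nonzero coefficients
model <- lm.spike(y ~ X, niter = 1000, expected.model.size = 3)
plot(model)      # marginal inclusion probabilities (the default plot)
summary(model)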


Bayesian regression and spike-and-slab priors

A regression component in a structural time series model

The Kalman filter requires matrix-matrix multiplication at each step.


I With T time points and latent state dimension m the complexity is
O(Tm3 ).
I With pain, the exponent on m can be lowered, but it is still important
to keep the dimension down, where possible.

A regression component β T xt can be added to the Kalman filter at the


cost of a single dimension.

    α_t = 1,  Z_t = β'x_t,  T_t = 1,  R_t = 0



Bayesian regression and spike-and-slab priors

MCMC for spike-and-slab bsts

I Time series parameters: θ


I Regression coefficients β.

1. Simulate α ∼ p(α|θ, β, y).
   I Note the conditioning on β.
   I You’re effectively subtracting off the regression component, then fitting
     the “state space model” to the residuals.
2. Set y*_t = y_t − Z_t'α_t
   I Simulate p(θ|α)
   I Simulate β|y*
The components are independent so the simulation could be done in
parallel, but θ is so trivial it usually isn’t worth the effort.



Bayesian regression and spike-and-slab priors

“Orthogonal Data Augmentation”


A neat trick!

I If p(β|σ²) were diagonal and independent of σ², then the only thing
  keeping p(β|σ, y) from being diagonal is X'X.
I What if you happened to have a set of x’s lying around which, if added
  to your data, would make X'X diagonal?
I You’d need to know the y’s that go along with these x’s, so you’d
  have a missing data problem.
Step 1 Find the x’s needed to diagonalize X'X.
Step 2 Repeat the following steps:
  1. Simulate the missing y’s given β and σ².
  2. Simulate β and σ² given complete data.



Bayesian regression and spike-and-slab priors

Pros and cons of ODA

Pro I The γi ’s can be sampled independently, as can βi |γi .


I This can be done in parallel, and is really really fast.
Cons I You have to decompose the whole XT X matrix once at
the beginning of the algorithm to find the necessary x’s.
I Some of the x’s can have high leverage.
I High leverage points determine where the line goes.
I If the missing data determine the line, and the line
determines the missing data then you have slow mixing.
bsts includes support for ODA, but it is still experimental at this point.



Applications

Outline

Introduction to time series modeling

Structural time series models

MCMC and the Kalman filter

Bayesian regression and spike-and-slab priors

Applications
Nowcasting with Google Trends
Causal Impact

Extensions



Applications Nowcasting with Google Trends

Outline

Introduction to time series modeling

Structural time series models

MCMC and the Kalman filter

Bayesian regression and spike-and-slab priors

Applications
Nowcasting with Google Trends
Causal Impact

Extensions



Applications Nowcasting with Google Trends

Nowcasting
Maintaining “real time” estimates of infrequently observed time series.

I US weekly initial claims for unemployment (ICNSA).
I Recession leading indicator.
I Can we learn this week’s number before it is released?
I We’d need a real time signal correlated with the outcome.

[Figure: ICNSA (thousands of claims), weekly, 2004-2012.]


Google searches are a real time indicator of public interest
Applications Nowcasting with Google Trends

Google trends public interface


I Get it from google.com/trends
I Click the little gear to “download as CSV.”
I Data are percentage of all search traffic, normalized so the maximum
is 100.
I You can restrict by type of search, time range, geo, or search category
(“vertical”).

There are ∼600 verticals


I Hierarchical: “Automotive” vertical has “Hybrid and Alternative
Vehicles” subvertical.
I If you compare your search to a “vertical” and then “download as
CSV” then you get the vertical’s series too.
I That’s ∼600 “public interest indices” you can use to predict YOUR
  time series!
Applications Nowcasting with Google Trends

Individual search queries


Google correlate can provide the most highly correlated individual queries (up to 100)



Applications Nowcasting with Google Trends

Posterior inclusion probabilities


With expected model size = 3, and the top 100 predictors from correlate

plot(model, "coef", inc = .1)

[Figure: posterior inclusion probabilities for the top predictors: unemployment.office, filing.for.unemployment, idaho.unemployment, sirius.internet.radio, and sirius.internet.]

I Only showing inclusion probabilities ≥ .1.
I Shading shows Pr(β_j > 0 | y).
  I White: positive coefficients
  I Black: negative coefficients
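A sketch of how such a model can be set up through bsts’s regression interface (the data frame claims, holding the ICNSA series and the Google Correlate predictors, is hypothetical; extra arguments are assumed to be forwarded to the spike-and-slab prior):

ss <- AddLocalLinearTrend(list(), claims$y)
ss <- AddSeasonal(ss, claims$y, nseasons = 52)       # weekly data
model <- bsts(y ~ ., state.specification = ss, data = claims,
              niter = 1000, expected.model.size = 3)
plot(model, "coef", inc = .1)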



Applications Nowcasting with Google Trends

What got chosen?


plot(model, "predictors", inc = .1)

[Figure: the actual series together with the selected predictors (inclusion probabilities: unemployment.office 1.00, filing.for.unemployment 0.94, idaho.unemployment 0.47, sirius.internet.radio 0.14, sirius.internet 0.11), all as scaled values, 2004-2012.]

I Solid blue line: actual
I Remaining lines shaded by inclusion probability.


Applications Nowcasting with Google Trends

How much explaining got done?


Dynamic distribution plot shows evolving pointwise posterior distribution of state
components.

plot(model, "components")

[Figure: dynamic distribution plots of the trend, seasonal.52.1, and regression components, 2004-2012.]


Applications Nowcasting with Google Trends

Did it help?
CompareBstsModels(list("pure time series" = model1,
"with Google Trends" = model2))
[Figure: top panel, cumulative absolute one-step-ahead prediction error for the “pure time series” and “with Google Trends” models; bottom panel, the series in scaled values, 2004-2012.]

I Plot shows cumulative absolute one-step-ahead prediction error.
I The regressors are not very helpful during normal times.
I They help the model to quickly adapt to the recession.


Applications Causal Impact

Outline

Introduction to time series modeling

Structural time series models

MCMC and the Kalman filter

Bayesian regression and spike-and-slab priors

Applications
Nowcasting with Google Trends
Causal Impact

Extensions



Applications Causal Impact

Measuring advertising effectiveness is a tricky business


I know that half my advertising dollars are wasted.
I just don’t know which half.
John Wanamaker

I One of the basic promises of online advertising is measurement.


I It is supposed to be easy.
I Change something (e.g. increase bid on Google).
I Look to see how many incremental ad clicks you get.
I Life is never easy.
I Ad clicks and native search clicks interact in complicated ways.
I Tough to get “incremental clicks” attributable to the ad campaign.
- Ad clicks can cannibalize native search clicks.
- Ads have a branding effect that can:
1. Be hard to measure,
2. Drive native search clicks,
3. Outlast the campaign.
Applications Causal Impact

Example
Real Google advertiser. 6-week ad campaign. Random shift added to both axes.

[Figure: the advertiser’s daily clicks, April through June.]
Applications Causal Impact

Problem statement

I An actor engages in a market intervention.


I Has a sale.
I Begins (or modifies) an advertising campaign.
I Introduces (or adopts) a new product.
I Other similar actors don’t engage in the intervention.
I We have data on both the actor and the similar actors prior to the
intervention.

I Question: What was the effect of the intervention?


I Total change to the bottom line.
I How quickly did changes begin to occur?
I How quickly did the effect begin to die out?



Difference in differences
An old trick from econometrics. Only measures at two points.
Applications Causal Impact

Synthetic controls
A more realistic counterfactual model than DnD

Abadie et al. (2003, 2010) suggested synthetic controls as counterfactuals.


I Weighted averages of untreated actors used to forecast actor of
interest.
I Weights (0 ≤ wi ≤ 1) estimated so that “synthetic control” series
matches actor’s series in pre-treatment period.
I Difference from forecast is estimated treatment effect.
Good Allows multiple controls, captures temporal effects.
Bad Scaling issues (California vs. Rhode Island), sign constraints
(negative correlations?), other time series?
Especially problematic for marketing. You know your sales, but not your
competitor’s sales.



Applications Causal Impact

CausalImpact
Extends DnD and synthetic controls using BSTS

I Use data in the pre-treatment period to build a flexible time series
  model for the series of interest.
I Forecast the time series over the intervention period given data from
the pre-treatment period.
I Can use contemporaneous regressors in the forecast.
I Model fit is based on pre-treatment data.
I Deviations from the forecast are the “treatment effect.”

I Assumes “no interference between units.” Often violated. Benign if


effect on untreated is small relative to effect on treated.
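This approach is packaged in the CausalImpact R package (built on bsts); a minimal sketch on simulated data:

library(CausalImpact)
set.seed(1)
x <- 100 + arima.sim(model = list(ar = 0.99), n = 100)  # control series
y <- 1.2 * x + rnorm(100)                               # response tracks the control
y[71:100] <- y[71:100] + 10                             # intervention effect after t = 70
impact <- CausalImpact(data = cbind(y, x),
                       pre.period = c(1, 70),
                       post.period = c(71, 100))
plot(impact)
summary(impact)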



The picture
Simulated data

[Figure: simulated data with the pre-intervention and post-intervention periods marked.]
Applications Causal Impact

Potential outcomes

I Let yjst denote the value of Y for unit j under treatment s at time t.
T is the time of the market intervention.

I What we observe:
Before T We observe yj0t for everyone
After T We observe yj1t for the actor and yk0t for the potential
controls k ≠ j.

I If we could also observe yj0t for the actor then yj1t − yj0t would be
the treatment effect.

I For t > T we have a model for yj0t |yk0t .



Applications Causal Impact

Case study
A Google advertiser ran a marketing experiment.

I Google search ads ran 6 weeks.


I Response is total search related visits to the site.
I Native search clicks.
I Ad clicks.

I 95 of 190 “designated marketing areas” received the ads. (DMA’s are


areas that can receive distinct TV ads).



Applications Causal Impact

This particular advertiser ran an experiment


Plot shows clicks from treated vs untreated geos. Each dot is a time point.
[Figure: scatterplot of clicks in the treatment region vs. clicks in the control region, with points marked as before, during, or after the campaign.]
Applications Causal Impact

Case study
Google advertiser. Treated vs. Untreated regions
[Figure: week-by-week panels (week -4 through week 7), covering the pre-intervention, intervention, and post-intervention periods.]
Applications Causal Impact

Case study
Google advertiser. Competitor’s clicks as predictors
[Figure: week-by-week panels (week -4 through week 7), covering the pre-intervention, intervention, and post-intervention periods.]
Applications Causal Impact

Case study
Google advertiser. Untreated regions. Competitor’s sales as predictors
[Figure: week-by-week panels (week -4 through week 7), covering the pre-intervention, intervention, and post-intervention periods.]
Applications Causal Impact

Case study
Summary

                        Clicks     %     95% Interval
vs. Untreated (1)       84,100    20     (15, 26)%
vs. Competitors (2)     84,800    21     (13, 26)%
A-A (placebo) test       8,000     2     (−5, 6)%

I Need experimental data to do analysis 1.
I Analysis 2 is observational, but replicates the experimental results.
I Using Google trends (instead of competitor information) gets about
the same results.
I Google trends are publicly available, while competitor clicks are not.
I Many more potential controls for Google trends. Spike and slab
variable selection / model averaging is useful for selecting appropriate
control groups.



Extensions

Outline

Introduction to time series modeling

Structural time series models

MCMC and the Kalman filter

Bayesian regression and spike-and-slab priors

Applications

Extensions
Normal mixtures
Longer term forecasting



Extensions Normal mixtures

Outline

Introduction to time series modeling

Structural time series models

MCMC and the Kalman filter

Bayesian regression and spike-and-slab priors

Applications

Extensions
Normal mixtures
Longer term forecasting



Extensions Normal mixtures

Relaxing Gaussian assumptions

I All “model matrices” in state space models are subscripted by t.
I We can replace the Gaussian assumptions with conditionally
  Gaussian assumptions, where there is a latent variable at each t
  determining the means and variances.

The MCMC now looks like this
I Draw latent variables w = (w_1, . . . , w_T) given α and θ.
I Draw α ∼ p(α | y, θ, w)
I Draw θ ∼ p(θ | y, α, w).

Example: The T distribution is a mixture of normals
(a normal divided by a chi-square).
I w ∼ Ga(ν/2, ν/2)
I y | w ∼ N(µ, σ²/w)



Extensions Normal mixtures

Example: retail sales


RSXFS: retail sales, excluding food service

[Figure: Retail Sales (Excluding Food Service), RSXFS / 1000, monthly, 1992-2012.]

I Monthly data, already seasonally adjusted.
I Catastrophic drop in 2008.
I Shift in slope of local linear trend is too large to be handled
  by Gaussian assumption.


Extensions Normal mixtures

Local linear trend with student T errors


rsxfs-analysis.R

    y_t = µ_t + ε_t,                        ε_t ∼ N(0, σ²)
    µ_t = µ_{t-1} + δ_{t-1} + η_{µ,t-1},    η_{µ,t} ∼ T_ν(0, τ_µ²)
    δ_t = δ_{t-1} + η_{δ,t-1},              η_{δ,t} ∼ T_ν(0, τ_δ²)

I This is an old Bayesian trick to ensure “robustness.”


I If you tell the model that occasional large errors are possible, it is not
surprised by occasional large errors.
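In bsts the Student-t trend is a drop-in replacement for the Gaussian one; a sketch of what rsxfs-analysis.R presumably does (the object rsxfs holding the series is assumed):

library(bsts)
y <- rsxfs                                   # RSXFS series, e.g. downloaded from FRED
ss.gaussian <- AddLocalLinearTrend(list(), y)
ss.student  <- AddStudentLocalLinearTrend(list(), y)
model.gaussian <- bsts(y, state.specification = ss.gaussian, niter = 1000)
model.student  <- bsts(y, state.specification = ss.student,  niter = 1000)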



Extensions Normal mixtures

Comparing dispersion parameters


Under the Gaussian and T models

[Figure: posterior draws of the level and slope standard deviations under the Student and Gaussian models.]

I Because the model is aware that occasional large errors can occur, the
  “standard deviation” parameters can be smaller.



Extensions Normal mixtures

Impact on predictions

[Figure: three panels, the original RSXFS data and predictions for 2009-2014 under the Gaussian and Student models.]

I The extreme quantiles of the predictions under the Student model are
  wider than under the Gaussian model.
I The central (e.g. 95%, 90%) intervals are narrower.


Extensions Normal mixtures

More normal mixtures

Similar tricks can be used to model probit, logit, and Poisson responses,
and even dynamic support vector machines by expressing these
distributions as normal mixtures.



Extensions Longer term forecasting

Outline

Introduction to time series modeling

Structural time series models

MCMC and the Kalman filter

Bayesian regression and spike-and-slab priors

Applications

Extensions
Normal mixtures
Longer term forecasting



Extensions Longer term forecasting

Long term predictions


I The local linear trend model is focused on detecting short term
changes in the trend.
I Very flexible, but it forgets the past quickly.
I A less flexible, but more robust trend model is

    y_t = µ_t + ε_t,                        ε_t ∼ N(0, σ²)
    µ_t = µ_{t-1} + δ_{t-1} + η_{µ,t-1},    η_{µ,t} ∼ N(0, τ_µ²)
    δ_t = D + ρ(δ_{t-1} − D) + η_{δ,t},     η_{δ,t} ∼ N(0, τ_δ²)
I Now the slopes δt follow an AR(1) process instead of a random walk.
I If |ρ| < 1 then AR(1) is stationary, so it does not entirely forget the
past.
I D is the “long run” trend of the series.
I The slope can locally deviate from D, but it will eventually return.
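In bsts this trend is provided (in recent versions, as I understand it) by AddSemilocalLinearTrend, previously called AddGeneralizedLocalLinearTrend; a sketch:

library(bsts)
ss <- AddSemilocalLinearTrend(list(), y)     # y holds the series to forecast
model <- bsts(y, state.specification = ss, niter = 1000)
pred <- predict(model, horizon = 60)         # long-horizon forecast
plot(pred)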



Extensions Longer term forecasting

[Figure: three panels, the original RSXFS data, predictions under the local linear trend, and long term predictions under the AR(1)-slope trend, 2009-2014.]


References

Carter, C. K. and Kohn, R. (1994).
On Gibbs sampling for state space models.
Biometrika 81, 541–553.
de Jong, P. and Shepard, N. (1995).
The simulation smoother for time series models.
Biometrika 82, 339–350.
Durbin, J. and Koopman, S. J. (2002).
A simple and efficient simulation smoother for state space time series analysis.
Biometrika 89, 603–616.
Durbin, J. and Koopman, S. J. (2012).
Time Series Analysis by State Space Methods.
Oxford University Press.
Frühwirth-Schnatter, S. (1995).
Bayesian model discrimination and Bayes factors for linear Gaussian state space models.
Journal of the Royal Statistical Society, Series B: Methodological 57, 237–246.
George, E. and McCulloch, R. (1997).
Approaches for Bayesian Variable Selection.
Statistica Sinica 7, 339–374.

