Stan Reference Manual

Version 2.28

Contents

Overview
Language
1. Character Encoding
1.1 Content characters
1.2 Comment characters
1.3 String literals
2. Includes
2.1 Recursive includes
2.2 Include paths
3. Comments
3.1 Line-based comments
3.2 Bracketed comments
4. Whitespace
4.1 Whitespace characters
4.2 Whitespace neutrality
4.3 Whitespace location
5. Data Types and Declarations
6. Expressions
6.1 Numeric literals
6.2 Variables
6.3 Vector, matrix, and array expressions
6.4 Parentheses for grouping
6.5 Arithmetic and matrix operations on expressions
6.6 Conditional operator
6.7 Indexing
6.8 Multiple indexing and range indexing
6.9 Function application
6.10 Type inference
6.11 Higher-order functions
6.12 Chain rule and derivatives
7. Statements
7.1 Statement block contexts
7.2 Assignment statements
7.3 Increment log density
7.4 Sampling statements
7.5 For loops
7.6 Foreach loops
7.7 Conditional statements
7.8 While statements
7.9 Statement blocks and local variable declarations
7.10 Break and continue statements
7.11 Print statements
7.12 Reject statements
Algorithms
Usage
References
Overview
This is the official reference manual for Stan’s programming language for coding
probability models, inference algorithms for fitting models and making predictions,
and posterior analysis tools for evaluating the results. This manual applies to all Stan
interfaces.
There are two additional interface-neutral manuals, the Stan Functions Reference
(https://mc-stan.org/docs/functions-reference/index.html) listing all the
built-in functions and their signatures, and the Stan User's Guide
(https://mc-stan.org/docs/stan-users-guide/index.html) providing examples and programming
techniques. There is also a separate installation and getting started guide for each
interface.
Web resources
Stan is an open-source software project, resources for which are hosted on various
web sites:
• Stan web site: links to the official Stan releases, source code, installation in-
structions, and full documentation, including the latest version of this manual,
the user’s guide and the getting started guide for each interface, tutorials, case
studies, and reference materials for developers.
• Stan forum: message board for questions, discussion, and announcements
related to Stan for both users and developers.
• Stan GitHub organization: version controlled code and document repositories,
issue trackers for bug reports and feature requests, code review, and wikis;
includes all of Stan’s source code, documentation, and web pages.
Licensing
1. Character Encoding
2. Includes
Stan allows one file to be included within another file using a syntax similar to that
of C++. For example, suppose the file my-std-normal.stan defines the standard
normal log probability density function (up to an additive constant).
functions {
real my_std_normal_lpdf(vector y) {
return -0.5 * y' * y;
}
}
Suppose we also have a file containing a Stan program with an include statement.
#include my-std-normal.stan
parameters {
real y;
}
model {
y ~ my_std_normal();
}
This Stan program behaves as if the contents of the file my-std-normal.stan
replaced the line with the #include statement, yielding the following single Stan
program.
functions {
real my_std_normal_lpdf(vector y) {
return -0.5 * y' * y;
}
}
parameters {
real y;
}
model {
y ~ my_std_normal();
}
There are no restrictions on where include statements may be placed within a file or
what the contents are of the included file or files.
Line comments are discarded when the entire line is replaced with the contents of
the included file.
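Consider a pair of mutually recursive files (a hedged reconstruction; the file names a.stan and b.stan come from the following paragraph). Suppose a.stan consists of the single line

#include b.stan

and b.stan in turn consists of the single line

#include a.stan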
The result of processing this file will be empty, because a.stan will include b.stan,
from which the include of a.stan is ignored and a warning printed.
4. Whitespace
5. Data Types and Declarations
This chapter covers the data types for expressions in Stan. Every variable used
in a Stan program must have a declared data type. Only values of that type will
be assignable to the variable (except for temporary states of transformed data
and transformed parameter values). This follows the convention of programming
languages like C++, not the conventions of scripting languages like Python or
statistical languages such as R or BUGS.
The motivation for strong, static typing is threefold.
1. Strong typing forces the programmer’s intent to be declared with the vari-
able, making programs easier to comprehend and hence easier to debug and
maintain.
2. Strong typing allows programming errors relative to the declared intent to
be caught sooner (at compile time) rather than later (at run time). The Stan
compiler (called through an interface such as CmdStan, RStan, or PyStan) will
flag any type errors and indicate the offending expressions quickly when the
program is compiled.
3. Constrained types will catch runtime data, initialization, and intermediate
value errors as soon as they occur rather than allowing them to propagate and
potentially pollute final results.
Strong typing disallows assigning the same variable to objects of different types at
different points in the program or in different invocations of the program.
5.1. Overview of data types
Primitive types
Stan provides two primitive data types, real for continuous values and int for
integer values.
Complex types
Stan provides a complex number data type complex, where a complex number
contains both a real and an imaginary component, both of which are of type real.
Array types
Any type (including the constrained types discussed in the next section) can be made
into an array type by declaring array arguments. For example,
array[10] real x;
array[6, 7] matrix[3, 3] m;
array[12, 8, 15] complex z;
Constrained data types

Declarations of variables other than local variables may include constraints. For
example,

int<lower=1> N;
real<upper=0> log_p;
vector<lower=-1, upper=1>[3] rho;

declares N to be a positive integer, log_p to be a non-positive real value, and rho
to be a 3-vector with values between -1 and 1.
There are also special data types for structured vectors and matrices. There are
four constrained vector data types, simplex for unit simplexes, unit_vector for
unit-length vectors, ordered for ordered vectors of scalars and positive_ordered
for vectors of positive ordered scalars. There are specialized matrix data types
corr_matrix and cov_matrix for correlation matrices (symmetric, positive definite,
unit diagonal) and covariance matrices (symmetric, positive definite). The type
cholesky_factor_cov is for Cholesky factors of covariance matrices (lower trian-
gular, positive diagonal, product with own transpose is a covariance matrix). The
type cholesky_factor_corr is for Cholesky factors of correlation matrices (lower
triangular, positive diagonal, unit-length rows).
Constraints provide error checking for variables defined in the data, transformed
data, transformed parameters, and generated quantities blocks. Constraints
are critical for variables declared in the parameters block, where they determine the
transformation from constrained variables (those satisfying the declared constraint)
to unconstrained variables (those ranging over all of R^n).
It is worth calling out the most important aspect of constrained data types:
The model must have support (non-zero density, equivalently finite log density) at
parameter values that satisfy the declared constraints.
If this condition is violated with parameter values that satisfy declared constraints
but do not have finite log density, then the samplers and optimizers may have any of
a number of pathologies including just getting stuck, failure to initialize, excessive
Metropolis rejection, or biased draws due to inability to explore the tails of the
distribution.
Integers
Stan uses 32-bit (4-byte) integers for all of its integer representations. The maximum
value that can be represented as an integer is 2^31 - 1; the minimum value is -(2^31).
When integers overflow, their values wrap. Thus it is up to the Stan programmer
to make sure the integer values in their programs stay in range. In particular, every
intermediate expression must have an integer value that is in range.
Integer arithmetic works in the expected way for addition, subtraction, and multipli-
cation, but rounds the result of division (see section for more information).
Reals
Stan uses 64-bit (8-byte) floating point representations of real numbers. Stan
roughly¹ follows the IEEE 754 standard for floating-point computation. The range of
a 64-bit number is roughly ±2^1022, which is slightly larger than ±10^307. It is a good
idea to stay well away from such extreme values in Stan models as they are prone to
cause overflow.
64-bit floating point representations have roughly 15 decimal digits of accuracy.
But when they are combined, the result often has less accuracy. In some cases, the
difference in accuracy between two operands and their result is large.
There are three special real values used to represent (1) not-a-number value for error
conditions, (2) positive infinity for overflow, and (3) negative infinity for overflow.
The behavior of these special numbers follows standard IEEE 754 behavior.
Not-a-number
Infinite values
Positive infinity is greater than all numbers other than itself and not-a-number;
negative infinity is similarly smaller. Adding an infinite value to a finite value returns
the infinite value. Dividing a finite number by an infinite value returns zero; dividing
an infinite number by a finite number returns the infinite number of appropriate
sign. Dividing a finite number by zero returns positive infinity. Dividing two infinite
numbers produces a not-a-number value as does subtracting two infinite numbers.
¹ Stan compiles integers to int and reals to double types in C++. Precise details of rounding will
depend on the compiler and hardware architecture on which the code is run.
Some functions are sensitive to infinite values; for example, the exponential function
returns zero if given negative infinity and positive infinity if given positive infinity.
Often the gradients will break down when values are infinite, making these boundary
conditions less useful than they may appear at first.
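The following sketch reconstructs the kind of example the next sentence describes, using Stan's complex-number getter functions get_real and get_imag (the literal values are illustrative assumptions).

complex z = to_complex(1.3, -2.9);
real re = get_real(z);  // 1.3
real im = get_imag(z);  // -2.9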
The getter functions then extract the real and imaginary components of z and assign
them to re and im respectively.
The function to_complex constructs a complex number from its real and imaginary
components. The functional form needs to be used whenever the components are
not literal numerals, as in the following example.
vector[K] re;
vector[K] im;
// ...
for (k in 1:K) {
complex z = to_complex(re[k], im[k]);
// ...
}
The real number assigned to a complex number determines the complex number's
real component, with the imaginary component set to zero.
Assignability is transitive, so that expressions of type int may also be assigned to
variables of type complex, as in the following example.
int n = 2;
complex z = n;
Unconstrained integer
Unconstrained integers are declared using the int keyword. For example, the
variable N is declared to be an integer as follows.
int N;
Constrained integer
Integer data types may be constrained to allow values only in a specified interval by
providing a lower bound, an upper bound, or both. For instance, to declare N to be a
positive integer, use the following.

int<lower=1> N;
Unconstrained real
Unconstrained real variables are declared using the keyword real. The following
example declares theta to be an unconstrained continuous value.
real theta;
Constrained real
Real variables may be bounded using the same syntax as integers. In theory (that is,
with arbitrary-precision arithmetic), the bounds on real values would be exclusive.
Unfortunately, finite-precision arithmetic rounding errors will often lead to values
on the boundaries, so they are allowed in Stan.
The variable sigma may be declared to be non-negative as follows.
real<lower=0> sigma;
To ensure rho takes on values between −1 and 1, use the following declaration.
real<lower=-1, upper=1> rho;
Infinite constraints
Lower bounds that are negative infinity or upper bounds that are positive infinity are
ignored. Stan provides constants positive_infinity() and negative_infinity()
which may be used for this purpose, or they may be read as data in the dump format.
Finally, we can combine both declarations to declare a variable with offset 1 and
multiplier 2.
real<offset=1, multiplier=2> x;
The default offset is 0 and the default multiplier is 1, so declaring a parameter as
plain real x; is equivalent to
parameters {
real<offset=0, multiplier=1> x;
}
model {
x ~ normal(mu, sigma);
}
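The next paragraph describes a declaration whose bound is itself a variable; a minimal sketch consistent with it (the names lb and phi come from the surrounding text):

data {
  real lb;
}
parameters {
  real<lower=lb> phi;
}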
This declares a real-valued parameter phi to take values greater than the value of
the real-valued data variable lb. Constraints may be complex expressions, but must
be of type int for integer variables and of type real for real variables (including
constraints on vectors, row vectors, and matrices). Variables used in constraints
can be any variable that has been defined at the point the constraint is used. For
instance,
data {
int<lower=1> N;
array[N] real y;
}
parameters {
real<lower=min(y), upper=max(y)> phi;
}
This program declares a positive integer N, an array y of real values of
length N, and then a parameter ranging between the minimum and maximum value
of y. As shown in the example code, the functions min() and max() may be applied
to containers such as arrays.
A more subtle case involves declarations of parameters or transformed parameters
based on parameters declared previously. For example, the following program will
work as intended.
parameters {
real a;
real<lower=a> b; // enforces a < b
}
transformed parameters {
real c;
real<lower=c> d;
c = a;
d = b;
}
The parameters instance works because all parameters are defined externally before
the block is executed. The transformed parameters case works even though c isn’t
defined at the point it is used, because constraints on transformed parameters are
only validated at the end of the block. Data variables work like parameter variables,
whereas transformed data and generated quantity variables work like transformed
parameter variables.
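Size declarations may also be used to switch parameters on and off. A sketch consistent with the following discussion (the names include_alpha, N, and alpha come from the text; the exact original declarations are assumptions):

data {
  int<lower=0, upper=1> include_alpha;
  int<lower=0> N;
}
parameters {
  vector[include_alpha ? N : 0] alpha;
}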
If include_alpha is true, the model will include the vector alpha; if the flag is false,
the model will not include alpha (technically, it will include alpha of size 0, which
means it won’t contain any values and won’t be included in any output).
This technique is not just useful for containers. If the value of N is set to 1, then
the vector alpha will contain a single element and thus alpha[1] behaves like an
optional scalar, the existence of which is controlled by include_alpha.
This coding pattern allows a single Stan program to define different models based on
the data provided as input. This strategy is used extensively in the implementation
of the RStanArm package.
Indexing from 1
Vectors and matrices, as well as arrays, are indexed starting from one (1) in Stan.
This follows the convention in statistics and linear algebra as well as their implemen-
tations in the statistical software packages R, MATLAB, BUGS, and JAGS. General
computer programming languages, on the other hand, such as C++ and Python,
index arrays starting from zero.
² This may change if Stan is called upon to do complicated integer matrix operations or boolean matrix
operations. Integers are not appropriate inputs for linear algebra functions.
Vectors
Vectors in Stan are column vectors; see the next subsection for information on row
vectors. Vectors are declared with a size (i.e., a dimensionality). For example, a
3-dimensional vector is declared with the keyword vector, as follows.
vector[3] u;
Similarly, they may be declared with an offset and/or multiplier, as in the following
example.
vector<offset=42, multiplier=3>[3] u;
Unit simplexes
A unit simplex is a vector with non-negative values whose entries sum to 1. For
instance, [0.2, 0.3, 0.4, 0.1]⊤ is a unit 4-simplex. Unit simplexes are most often used
as parameters in categorical or multinomial distributions, and they are also the
sampled variate in a Dirichlet distribution. Simplexes are declared with their full
dimensionality. For instance, theta is declared to be a unit 5-simplex by
simplex[5] theta;
Unit simplexes are implemented as vectors and may be assigned to other vectors
and vice-versa. Simplex variables, like other constrained variables, are validated to
ensure they contain simplex values; for simplexes, this is only done up to a statically
specified accuracy threshold to account for errors arising from floating-point
imprecision.
In high dimensional problems, simplexes may require smaller step sizes in the in-
ference algorithms in order to remain stable; this can be achieved through higher
target acceptance rates for samplers and longer warmup periods, tighter tolerances
for optimization with more iterations, and in either case, with less dispersed param-
eter initialization or custom initialization if there are informative priors for some
parameters.
Unit vectors
A unit vector is a vector with a norm of one. For instance, [0.5, 0.5, 0.5, 0.5]⊤ is a unit
4-vector. Unit vectors are sometimes used in directional statistics. Unit vectors are
declared with their full dimensionality. For instance, theta is declared to be a unit
5-vector by
unit_vector[5] theta;
Unit vectors are implemented as vectors and may be assigned to other vectors and
vice-versa. Unit vector variables, like other constrained variables, are validated
to ensure that they are indeed unit length; for unit vectors, this is only done up
to a statically specified accuracy threshold to account for errors arising from
floating-point imprecision.
Ordered vectors
An ordered vector type in Stan represents a vector whose entries are sorted in
ascending order. For instance, (-1.3, 2.7, 2.71)⊤ is an ordered 3-vector. Ordered
vectors are most often employed as cut points in ordered logistic regression models
(see section).
The variable c is declared as an ordered 5-vector by
ordered[5] c;
After their declaration, ordered vectors, like unit simplexes, may be assigned to other
vectors and other vectors may be assigned to them. Constraints will be checked after
executing the block in which the variables were declared.
Like ordered vectors, after their declaration, positive ordered vectors may be assigned
to other vectors and other vectors may be assigned to them. Constraints will be
checked after executing the block in which the variables were declared.
Row vectors
Row vectors are declared with the keyword row_vector. Like (column) vectors, they
are declared with a size. For example, a 1093-dimensional row vector u would be
declared as
row_vector[1093] u;
An offset and multiplier may also be declared, as in the following 3-dimensional row vector with offset
-42 and multiplier 3.
row_vector<offset=-42, multiplier=3>[3] u;
Row vectors may not be assigned to column vectors, nor may column vectors be
assigned to row vectors. If assignments are required, they may be accommodated
through the transposition operator.
Matrices
Matrices are declared with the keyword matrix along with a number of rows and
number of columns. For example,
matrix[3, 3] A;
matrix[M, N] B;
Similarly, matrices can be declared to have a set offset and/or multiplier, as in this
matrix with multiplier 5.
matrix<multiplier=5>[3, 4] B;
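A sketch of the assignment the next paragraph describes (assuming a matrix a and a row vector b whose column counts match):

matrix[M, N] a;
row_vector[N] b;
// ...
a[1] = b;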
This copies the values from row vector b to a[1], which is the first row of the matrix
a. If the number of columns in a is not the same as the size of b, a run-time error is
raised; the number of columns of a is N, which is also the number of columns of b.
Assignment works by copying values in Stan. That means any subsequent assignment
to a[1] does not affect b, nor does an assignment to b affect a.
Covariance matrices
Matrix variables may be constrained to represent covariance matrices. A matrix is a
covariance matrix if it is symmetric and positive definite. Like correlation matrices,
covariance matrices only need a single dimension in their declaration. For instance,
cov_matrix[K] Omega;
Correlation matrices
Matrix variables may be constrained to represent correlation matrices. A matrix is a
correlation matrix if it is symmetric and positive definite, has entries between −1
and 1, and has a unit diagonal. Because correlation matrices are square, only one
dimension needs to be declared. For example,
corr_matrix[3] Sigma;
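Cholesky factors of covariance matrices

Matrix variables may also be constrained to be Cholesky factors of covariance matrices. A minimal declaration sketch, assuming a 4 x 4 factor:

cholesky_factor_cov[4] L;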
In general, two dimensions may be declared, with the above being equal to
cholesky_factor_cov[4, 4]. The type cholesky_factor_cov[M, N] may be
used for the general M × N case to produce positive semi-definite matrices of
rank M .
Constraints are not part of Stan's underlying type system, but basic data types are. For
instance, a variable declared to be real<lower=0,
upper=1> could be assigned to a variable declared as real and vice-versa. Similarly,
a variable declared as matrix[3, 3] may be assigned to a variable declared as
cov_matrix[3] or cholesky_factor_cov[3], and vice-versa.
Checks are carried out at the end of each relevant block of statements to ensure
constraints are enforced. This includes run-time size checks. The Stan compiler
isn’t able to catch the fact that an attempt may be made to assign a matrix of one
dimensionality to a matrix of mismatching dimensionality.
The type of m[2] is row_vector because it is the second row of m. Thus it is possible
to write m[2][3] instead of m[2, 3] to access the third element in the second row.
When given a choice, the form m[2, 3] is preferred.
The form m[2, 3] is more efficient because it does not require the creation and use
of an intermediate expression template for m[2]. In later versions, explicit calls to
m[2][3] may be optimized to be as efficient as m[2, 3] by the Stan compiler.
Any integer-denoting expression may be used for the size declaration, provided
all variables involved are either data, transformed data, or local variables. That is,
expressions used for size declarations may not include parameters, transformed
parameters, or generated quantities. For example, a one-dimensional array of five
integers is declared as follows.
array[5] int n;
A two-dimensional array of real values with three rows and four columns is declared
with the following.
array[3, 4] real a;
A three-dimensional array z of positive reals with five rows, four columns, and two
shelves can be declared as follows.
array[5, 4, 2] real<lower=0> z;
Assigning
Subarrays may be manipulated and assigned just like any other variables. Similar to
the behavior of matrices, Stan allows blocks such as
array[9, 10, 11] real w;
array[10, 11] real x;
array[11] real y;
real z;
// ...
x = w[5];
y = x[4]; // y == w[5][4] == w[5, 4]
z = y[3]; // z == w[5][4][3] == w[5, 4, 3]
Row vectors and other derived vector types (simplex and ordered) behave the same
way in terms of indexing.
Consider the following matrix, vector and scalar declarations.
array[3, 4] matrix[6, 5] d;
array[4] matrix[6, 5] e;
matrix[6, 5] f;
row_vector[5] g;
real x;
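A sketch of single-index assignments consistent with these declarations (reconstructing the kind of example the next sentence refers to):

e = d[1];   // e is an array of matrices
f = e[2];   // f is a matrix
g = f[2];   // g is a row vector: row 2 of matrix f
x = g[5];   // x is a real value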
As shown, the result f[2] of supplying a single index to a matrix is the indexed row,
here row 2 of matrix f.
array[4] real a;
vector[4] b;
row_vector[4] c;
// ...
a = b; // illegal assignment of vector to array
b = a; // illegal assignment of array to vector
a = c; // illegal assignment of row vector to array
c = a; // illegal assignment of array to row vector
It is not even legal to assign row vectors to column vectors or vice versa.
vector[4] b;
row_vector[4] c;
// ...
b = c; // illegal assignment of row vector to column vector
c = b; // illegal assignment of column vector to row vector
The same holds for matrices, where 2-dimensional arrays may not be assigned to
matrices or vice-versa.
array[3, 4] real a;
matrix[3, 4] b;
// ...
a = b; // illegal assignment of matrix to array
b = a; // illegal assignment of array to matrix
If an array is declared with one of its dimensions of size zero, for example
array[3, 0] real a;, then the resulting size of a is zero and querying any of its
dimensions at run time will result in the value zero. Declared as above, a[1] will be
a size-zero one-dimensional
array. For comparison, declaring
array[0, 3] real b;
also produces an array with an overall size of zero, but in this case, there is no way
to index legally into b, because b[0] is undefined. The array will behave at run time
as if it’s a 0 × 0 array. For example, the result of to_matrix(b) will be a 0 × 0 matrix,
not a 0 × 3 matrix.
For example, the declaration array[3] real a;
declares the variable a to be an array. The fact that it was declared to have size 3 is
part of its declaration, but not part of its underlying type.
Sizes are determined dynamically (at run time) and thus cannot be type-checked
statically when the program is compiled. As a result, any conformance error on size
will raise a run-time error. For example, trying to assign an array of size 5 to an array
of size 6 will cause a run-time error. Similarly, multiplying an N × M matrix by a
J × K matrix will raise a run-time error if M ≠ J.
For arguments to functions, constraints are sometimes, but not always, checked
when the function is called. Exclusions include C++ standard library functions.
All probability functions and cumulative distribution functions check that their
arguments are appropriate at run time as the function is called.
For data variables, constraints are checked after the variable is read from a data file or
other source. For transformed data variables, the check is done after the statements
in the transformed data block have executed. Thus it is legal for intermediate values
of variables to not satisfy declared constraints.
For parameters, constraints are enforced by the transform applied and do not need
to be checked. For transformed parameters, the check is done after the statements
in the transformed parameter block have executed.
For all blocks defining variables (transformed data, transformed parameters, gener-
ated quantities), real values are initialized to NaN and integer values are initialized
to the smallest legal integer (i.e., a large absolute value negative number).
For generated quantities, constraints are enforced after the statements in the gener-
ated quantities block have executed.
In the following table, the leftmost column is a list of the unconstrained and undi-
mensioned basic types; these are used as function return types and argument types.
The middle column is of unconstrained types with dimensions; these are used as
local variable types. The variables M and N indicate number of columns and rows,
respectively. The variable K is used for square matrices, i.e., K denotes both the
number of rows and columns. The rightmost column lists the corresponding con-
strained types. An expression of any right-hand column type may be assigned to
its corresponding left-hand column basic type. At runtime, dimensions are checked
for consistency for all variables; containers of any size may be assigned to function
arguments. The constrained matrix types cov_matrix[K], corr_matrix[K],
cholesky_factor_cov[K], and cholesky_factor_corr[K] are only assignable to
variables of type matrix[K, K].
| Function Argument (unsized) | Local (unconstrained) | Block (constrained) |
|---|---|---|
| int | int | int |
| | | int<lower=L> |
| | | int<upper=U> |
| | | int<lower=L, upper=U> |
| | | int<offset=O> |
| | | int<multiplier=M> |
| | | int<offset=O, multiplier=M> |
| real | real | real |
| | | real<lower=L> |
| | | real<upper=U> |
| | | real<lower=L, upper=U> |
| | | real<offset=O> |
| | | real<multiplier=M> |
| | | real<offset=O, multiplier=M> |
| vector | vector[N] | vector[N] |
| | | vector<lower=L>[N] |
| | | vector<upper=U>[N] |
| | | vector<lower=L, upper=U>[N] |
| | | vector<offset=O>[N] |
| | | vector<multiplier=M>[N] |
| | | vector<offset=O, multiplier=M>[N] |
| | | ordered[N] |
| | | positive_ordered[N] |
| | | simplex[N] |
| | | unit_vector[N] |
| row_vector | row_vector[N] | row_vector[N] |
| | | row_vector<lower=L>[N] |
| | | row_vector<upper=U>[N] |
| | | row_vector<lower=L, upper=U>[N] |
| | | row_vector<offset=O>[N] |
| | | row_vector<multiplier=M>[N] |
| | | row_vector<offset=O, multiplier=M>[N] |
| matrix | matrix[M, N] | matrix[M, N] |
| | | matrix<lower=L>[M, N] |
| | | matrix<upper=U>[M, N] |
| | | matrix<lower=L, upper=U>[M, N] |
| | | matrix<offset=O>[M, N] |
| | | matrix<multiplier=M>[M, N] |
| | | matrix<offset=O, multiplier=M>[M, N] |
| matrix | matrix[K, K] | corr_matrix[K] |
| matrix | matrix[K, K] | cov_matrix[K] |
| matrix | matrix[K, K] | cholesky_factor_corr[K] |
| matrix | matrix[K, K] | cholesky_factor_cov[K] |
| array[] vector | array[M] vector[N] | array[M] vector[N] |
| | | array[M] vector<lower=L>[N] |
| | | array[M] vector<upper=U>[N] |
| | | array[M] vector<lower=L, upper=U>[N] |
| | | array[M] vector<offset=O>[N] |
| | | array[M] vector<multiplier=M>[N] |
| | | array[M] vector<offset=O, multiplier=M>[N] |
| | | array[M] ordered[N] |
| | | array[M] positive_ordered[N] |
| | | array[M] simplex[N] |
| | | array[M] unit_vector[N] |
Additional array types follow the same basic template as the final example in the
table and can contain any of the previous types. The unsized version of arrays with
more than one dimension is specified by using commas, e.g. array[ , ] is a 2-D
array.
For more on how function arguments and return types are declared, consult the
User’s Guide chapter on functions.
5.9. Compound variable declaration and definition

Variables may be declared and defined in a single compound statement. For
example, the statement

int N = 5;

declares the variable N to be an integer scalar type and at the same time defines it to
be the value of the expression 5.
Assignment typing
The type of the expression on the right-hand side of the assignment must be
assignable to the type of the variable being declared. For example, it is legal
to have
real sum = 0;
even though 0 is of type int and sum is of type real, because integer-typed scalar
expressions can be assigned to real-valued scalar variables. In all other cases, the
type of the expression on the right-hand side of the assignment must be identical to
the type of the variable being declared.
Any type may be assigned. For example,
matrix[3, 2] a = b;
declares a matrix variable a and assigns it to the value of b, which must be of type
matrix for the compound statement to be well formed. The sizes of matrices are not
part of their static typing and cannot be validated until run time.
For example, the statement

matrix[3, 2] a = 0.5 * (b + c);

assigns the matrix variable a to half of the sum of b and c. The only requirement
on b and c is that the expression b + c be of type matrix. For example, b could be
of type matrix and c of type real, because adding a matrix to a scalar produces a
matrix, and multiplying a matrix by a scalar produces another matrix.
The right-hand side expression can be a call to a user defined function, allowing
general algorithms to be applied that might not be otherwise expressible as simple
expressions (e.g., iterative or recursive algorithms).
Stan also allows multiple variables to be declared on a single line. The compound
declaration real x, y; is equivalent to
real x;
real y;
As a result, all declarations on the same line must be of the same type.
Constrained data types can also be declared together, so long as the constraint for
each variable is the same:
real<lower=0> x, y;
6. Expressions
An expression is the syntactic unit in a Stan program that denotes a value. Every
expression in a well-formed Stan program has a type that is determined statically (at
compile time), based only on the type of its variables and the types of the functions
used in it. If an expression's type cannot be determined statically, the Stan compiler
will report the location of the problem.
This chapter covers the syntax, typing, and usage of the various forms of expressions
in Stan.
6.1. Numeric literals

Integer literals
Integer literals represent integers of type int. Integer literals are written in base 10
without any separators. Integer literals may contain a single negative sign. (The
expression --1 is interpreted as the negation of the literal -1.)
The following list contains well-formed integer literals.
0, 1, -1, 256, -127098, 24567898765
Integer literals must have values that fall within the bounds for integer values (see
section).
Integer literals may not contain decimal points (.). Thus the expressions 1. and 1.0
are of type real and may not be used where a value of type int is required.
Real literals
A number written with a period or with scientific notation is assigned to the
continuous numeric type real. Real literals are written in base 10 with a period
(.) as a separator and optionally an exponent with optional sign. Examples of
well-formed real literals include the following.
0.0, 1.0, 3.14, -217.9387, 2.7e3, -2E-5, 1.23e+3.
Imaginary literals
A number followed by the character i denotes an imaginary number and is assigned
to the numeric type complex. The number preceding i may be either a real or
integer literal and determines the magnitude of the imaginary number. Examples of
well-formed imaginary literals include the following.
1i, 2i, -325.786i, 1e10i, 2.87e-10i.
Note that the character i by itself is not a well-formed imaginary literal. The unit
imaginary number must be written as 1i.
Complex literals
Stan does not include complex literals directly, but a real or integer literal can be
added to an imaginary literal to derive an expression that behaves like a complex
literal. Examples include the following.
1 + 2i, -3.2e9 + 1e10i
These will be assigned the type complex, which is the result of adding a real or
integer and a complex number. They will also function like literals in the sense that
the C++ compiler is able to reduce them to a single complex constant at compile
time.
6.2. Variables
A variable by itself is a well-formed expression of the same type as the variable.
Variables in Stan consist of ASCII strings containing only the basic lower-case and
upper-case Roman letters, digits, and the underscore (_) character. Variables must
start with a letter (a-z and A-Z) and may not end with two underscores (__).
Examples of legal variable identifiers are as follows.
a, a3, a_3, Sigma, my_cpp_style_variable, myCamelCaseVariable
Unlike in R and BUGS, variable identifiers in Stan may not contain a period character.
Reserved names
Stan reserves many strings for internal use and these may not be used as the name
of a variable. An attempt to name a variable after an internal string results in the
stanc translator halting with an error message indicating which reserved name was
used and its location in the model code.
Model name
The name of the model cannot be used as a variable within the model. This is usually
not a problem because the default in bin/stanc is to append _model to the name
of the file containing the model specification. For example, if the model is in file
foo.stan, it would not be legal to have a variable named foo_model when using the
default model name through bin/stanc. With user-specified model names, variables
cannot match the model name.
The following list contains reserved words for Stan’s programming language. Not all
of these features are implemented in Stan yet, but the tokens are reserved for future
use.
for, in, while, repeat, until, if, then, else,
true, false, target
Variables should not be named after types, either, and thus may not be any of the
following.
int, real, vector, simplex, unit_vector, ordered,
positive_ordered, row_vector, matrix,
cholesky_factor_corr, cholesky_factor_cov,
corr_matrix, cov_matrix
The following block identifiers are reserved and cannot be used as variable names:
functions, model, data, parameters, quantities,
transformed, generated
Some variable names are reserved because they are used within Stan’s C++ imple-
mentation. These are
var, fvar, STAN_MAJOR, STAN_MINOR, STAN_PATCH,
STAN_MATH_MAJOR, STAN_MATH_MINOR, STAN_MATH_PATCH
Variable names will also conflict with the names of distributions suffixed with _lpdf,
_lpmf, _lcdf, _lccdf, _cdf, and _ccdf, such as normal_lcdf; this also
holds for the deprecated forms _log, _cdf_log, and _ccdf_log. No user-defined
variable can take a name ending in _lupdf or _lupmf even if a corresponding _lpdf
or _lpmf is not defined.
Using any of these variable names causes the stanc translator to halt and report the
name and location of the variable causing the conflict.
Finally, variable names, including the names of models, should not conflict with any
of the C++ keywords.
alignas, alignof, and, and_eq, asm, auto, bitand, bitor, bool,
break, case, catch, char, char16_t, char32_t, class, compl,
const, constexpr, const_cast, continue, decltype, default,
delete, do, double, dynamic_cast, else, enum, explicit,
export, extern, false, float, for, friend, goto, if,
inline, int, long, mutable, namespace, new, noexcept,
not, not_eq, nullptr, operator, or, or_eq, private,
protected, public, register, reinterpret_cast, return,
short, signed, sizeof, static, static_assert, static_cast,
struct, switch, template, this, thread_local, throw, true,
try, typedef, typeid, typename, union, unsigned, using,
virtual, void, volatile, wchar_t, while, xor, xor_eq
Legal characters
The legal characters for variable identifiers are given in the identifier characters
table.
Identifier Characters Table. The alphanumeric characters and underscore in base ASCII
are the only legal characters in Stan identifiers.
Although not the most expressive character set, ASCII is the most portable and least
prone to corruption through improper character encodings or decodings. Sticking to
this range of ASCII makes Stan compatible with Latin-1 or UTF-8 encodings of these
characters, which are byte-for-byte identical to ASCII.
Within comments, Stan can work with any ASCII-compatible character encoding,
such as ASCII itself, UTF-8, or Latin1. It is up to user shells and editors to display
them properly.
6.3. Vector, matrix, and array expressions

Vector expressions
Square brackets may be wrapped around a sequence of comma separated primitive
expressions to produce a row vector expression. For example, the expression [ 1,
10, 100 ] denotes a row vector of three elements with real values 1.0, 10.0, and
100.0. Applying the transpose operator to a row vector expression produces a vector
expression. This syntax provides a way to declare and define small vectors in a single line,
as follows.
row_vector[2] rv2 = [ 1, 2 ];
vector[3] v3 = [ 3, 4, 5 ]';
Matrix expressions
A matrix expression consists of square brackets wrapped around a sequence of
comma separated row vector expressions. This syntax provides a way to declare and
define a matrix in a single line, as follows.
matrix[3, 2] m1 = [ [ 1, 2 ], [ 3, 4 ], [ 5, 6 ] ];
Any expression denoting a row vector can be used in a matrix expression. For
example, the following code is valid:
vector[2] vX = [ 1, 10 ]';
row_vector[2] vY = [ 100, 1000 ];
matrix[3, 2] m2 = [ vX', vY, [ 1, 2 ] ];
The empty expression [ ] is ambiguous and therefore is not allowed and similarly
expressions such as [ [ ] ] or [ [ ], [ ] ] are not allowed.
Array expressions
Curly braces may be wrapped around a sequence of expressions to produce an array
expression. For example, the expression { 1, 10, 100 } denotes an integer array
of three elements with values 1, 10, and 100. This syntax is particularly convenient
to define small arrays in a single line, as follows.
array[3] int a = { 1, 10, 100 };
The elements may also be a mixture of int and real typed expressions, in which
case the result is an array of real values.
array[2] real b = { 1, 1.9 };
Restrictions on values
There are some restrictions on how array expressions may be used that arise from
their types being calculated bottom up and the basic data type and assignment rules
of Stan.
Although it is tempting to try to define a ragged array expression, all Stan data
types are rectangular (or boxes or other higher-dimensional generalizations). Thus
the following nested array expression will cause an error when it tries to create a
non-rectangular array.
{ { 1, 2, 3 }, { 4, 5 } } // compile time error: size mismatch
This expression attempts to create a two-dimensional integer array
(array[,] int) out of two one-dimensional integer arrays (array[] int).
But it is not allowed because the two one-dimensional arrays are not the same size.
If the elements are array expressions, this can be diagnosed at compile time. If one
or both expressions is a variable, then that won’t be caught until runtime.
{ { 1, 2, 3 }, m } // runtime error if m not size 3
Because there is no way to infer the type of the result, the empty array expression ({
}) is not allowed. This does not sacrifice expressive power, because a declaration is
sufficient to initialize a zero-element array.
array[0] int a; // a is fully defined as zero element array
Integer arrays may not be assigned to variables of real array type. However, this
problem is easily sidestepped by using real literal expressions.

array[2] real a = { -3.0, 12.0 };
Stan provides elementwise matrix multiplication (e.g., a .* b) and division (e.g., a
./ b) operations. These provide a shorthand to replace loops, but are not intrinsically
more efficient than a version programmed with elementwise calculations and
assignments in a loop. For example, given declarations,
vector[N] a;
vector[N] b;
vector[N] c;
the assignment,
c = a .* b;
produces the same result with roughly the same efficiency as the loop
for (n in 1:N) {
c[n] = a[n] * b[n];
}
Stan supports exponentiation (^) of integer and real-valued expressions. The return
type of exponentiation is always real. For example, assuming n and m are
integer variables and x and y real variables, the following expressions are legal.

3 ^ 2
3.0 ^ -2
3.0 ^ 0.14
x ^ n
n ^ x
n ^ m
x ^ y
The operator precedence tables list the precedence of the built-in operators along
with function application and array, matrix, and vector indexing. The
operators are listed in order of precedence, from least tightly binding to most tightly
binding. The full set of legal arguments and corresponding result types are provided
in the function documentation for the operators (i.e., operator*(int, int):int
indicates the application of the multiplication operator to two integers, which returns
an integer). Parentheses may be used to group expressions explicitly rather than relying
on precedence and associativity.
The precedence and associativity determine how expressions are interpreted. Be-
cause addition is left associative, the expression a + b + c is interpreted as (a +
b) + c. Similarly, a / b * c is interpreted as (a / b) * c.
Because multiplication has higher precedence than addition, the expression a * b
+ c is interpreted as (a * b) + c and the expression a + b * c is interpreted as a
+ (b * c). Similarly, 2 * x + 3 * - y is interpreted as (2 * x) + (3 * (-y)).
Transposition and exponentiation bind more tightly than any other arithmetic or
logical operation. For vectors, row vectors, and matrices, -u' is interpreted as
-(u'), u * v' as u * (v'), and u' * v as (u') * v. For integers and reals, -n ^ 3
is interpreted as -(n ^ 3).
The conditional operator is the most loosely binding operator, so its arguments rarely
require parentheses for disambiguation. For example,
a > 0 || b < 0 ? c + d : e - f
parses as the explicitly parenthesized form

(a > 0 || b < 0) ? (c + d) : (e - f)

The latter is easier to read even if the parentheses are not strictly necessary.
The key property of the conditional operator that makes it so useful in high-
performance computing is that it only evaluates the returned subexpression, not the
alternative expression. In other words, it is not like a typical function that evaluates
its argument expressions eagerly in order to pass their values to the function. As
usual, the saving is mostly in the derivatives that do not get computed rather than
the unnecessary function evaluation itself.
Promotion to parameter
If one return expression is a data value (an expression involving only constants and
variables defined in the data or transformed data block), and the other is not, then
the ternary operator will promote the data value to a parameter value. This can
cause needless work calculating derivatives in some cases and be less efficient than
a full if-then conditional statement. For example,
data {
array[10] real x;
// ...
}
parameters {
array[10] real z;
// ...
}
model {
y ~ normal(cond ? x : z, sigma);
// ...
}
The conditional statement, like the conditional operator, only evaluates one of the
result statements. In this case, the variable x will not be promoted to a parameter
and thus not cause any needless work to be carried out when propagating the chain
rule during derivative calculations.
6.7. Indexing
Stan arrays, matrices, vectors, and row vectors are all accessed using the same
array-like notation. For instance, if x is a variable of type array [] real (a one-
dimensional array of reals) then x[1] is the value of the first element of the array.
Subscripting has higher precedence than any of the arithmetic operations. For
example, alpha * x[1] is equivalent to alpha * (x[1]).
Multiple subscripts may be provided within a single pair of square brackets. If x is of
type array[,] real, a two-dimensional array, then x[2, 501] is of type real.
Accessing subarrays
The subscripting operator also returns subarrays of arrays. For example, if x is of
type array[,,] real, then x[2] is of type array[,] real, and x[2, 3] is of type
array[] real. As a result, the expressions x[2, 3] and x[2][3] have the same
meaning.
To evaluate an index position in the result, the index is first passed to the multiple
index, and the resulting index is used.
a[idxs, ...][i, ...] = a[idxs[i], ...][...]
On the other hand, if idx is a single index, it reduces the dimensionality of the
output, so that
a[idx, ...] = a[idx][...]
The only issue is what happens with matrices and vectors. Vectors work just like
arrays. Matrices with multiple row indexes and multiple column indexes produce
matrices. Matrices with multiple row indexes and a single column index become
(column) vectors. Matrices with a single row index and multiple column indexes
become row vectors. The types are summarized in the matrix indexing table.
Matrix Indexing Table. Special rules for reducing matrices based on whether the
argument is a single or multiple index. Examples are for a matrix a, with integer single
indexes i and j and integer array multiple indexes is and js. The same typing rules
apply for all multiple indexes.
Constants
Constants in Stan are nothing more than nullary (no-argument) functions. For
instance, the mathematical constants π and e are represented as nullary functions
named pi() and e(). See the built-in constants section for a list of built-in constants.
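A minimal usage sketch (the variable name tau is an illustrative assumption):

real tau = 2 * pi();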
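For the following discussion, suppose there are exactly two function signatures named foo (a sketch reconstructed from the resolutions described below):

real foo(real, real);
int foo(int, int);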
The use of foo in the expression foo(1.0, 1.0) resolves to foo(real, real), and
thus the expression foo(1.0, 1.0) itself is assigned a type of real.
Because integers may be promoted to real values, the expression foo(1, 1) could
potentially match either foo(real, real) or foo(int, int). The former requires
two type promotions and the latter requires none, so foo(1, 1) is resolved to
function foo(int, int) and is thus assigned the type int.
The expression foo(1, 1.0) has argument types (int, real) and thus does not
explicitly match either function signature. By promoting the integer expression 1 to
type real, it is able to match foo(real, real), and hence the type of the function
expression foo(1, 1.0) is real.
In some cases (though not for any built-in Stan functions), a situation may arise in
which the function referred to by an expression remains ambiguous. For example,
consider a situation in which there are exactly two functions named bar with the
following signatures.
real bar(real, int);
real bar(int, real);
With these signatures, the expression bar(1.0, 1) and bar(1, 1.0) resolve to the
first and second of the above functions, respectively. The expression bar(1.0, 1.0)
is illegal because real values may not be demoted to integers. The expression bar(1,
1) is illegal for a different reason. If the first argument is promoted to a real value,
it matches the first signature, whereas if the second argument is promoted to a
real value, it matches the second signature. The problem is that these both require
one promotion, so the function name bar is ambiguous. If there is not a unique
function requiring fewer promotions than all others, as with bar(1, 1) given the
two declarations above, the Stan compiler will flag the expression as illegal.
Posterior predictive checks typically use the parameters of the model to generate
simulated data (at the individual and optionally at the group level for hierarchical
models), which can then be compared, informally using plots and formally by means
of test statistics, to the actual data in order to assess the suitability of the model;
see Chapter 6 of (Gelman et al. 2013) for more information on posterior predictive
checks.
Implementation types
The primitive implementation types for Stan are
int, real, vector, row_vector, matrix.
Every basic declared type corresponds to a primitive type; see the primitive type
table for the mapping from types to their primitive types.
Primitive Type Table. The table shows the variable declaration types of Stan and
their corresponding primitive implementation type. Stan functions, operators, and
probability functions have argument and result types declared in terms of primitive
types plus array dimensionality.
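A declaration sketch consistent with the indexing described next (the sizes are illustrative assumptions):

array[4, 5, 6] matrix[7, 8] a;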
The matrix a is indexed as a[i, j, k, m, n] with the array indices first, followed
by the matrix indices, with a[i, j, k] being a matrix and a[i, j, k, m] being a
row vector.
Promotion
There are two promotion rules. First, integer expressions of type int may be used
anywhere an expression of type real is used. An integer is promoted to real by
casting it in the underlying C++.
The second promotion rule is that expressions of type real may be used anywhere
an expression of type complex is required. A real value is promoted to a complex
number with that real component and a zero imaginary component.
Promotion is transitive, so that integers may be promoted to complex numbers in
two stages, first converting the integer to real, then converting the real value to a
complex type.
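A minimal sketch of the two promotion rules chained together (an illustrative example, with assumed variable names):

int n = 2;       // int
real x = n;      // int promoted to real
complex z = x;   // real promoted to complex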
Literals
An integer literal expression such as 42 is of type int. Real literals such as 42.0 are
of type real.
Variables
Indexing
Function application
Higher-order Functions Table. Higher-order functions in Stan with the types of their
argument functions. The first group of arguments may involve parameters or data; the
second group must be data only.

| function | parameter or data args | data args | return type |
|---|---|---|---|
| algebra_solver | vector, vector | array[] real, array[] int | vector |
| algebra_solver_newton | vector, vector | array[] real, array[] int | vector |
| integrate_1d | real, real, array[] real | array[] real, array[] int | real |
| integrate_ode_X | real, array[] real, array[] real | array[] real, array[] int | array[] real |
| map_rect | vector, vector | array[] real, array[] int | vector |
The function argument is foo, the name of the user-defined function; as shown
in the higher-order functions table, integrate_ode_rk45 takes a real array, a real,
three more real arrays, and an integer array as arguments and returns a 2D real array.
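A call sketch under these conventions (the names foo, y0, t0, ts, theta, x_r, and x_i, as well as the sizes T and S, are assumptions):

array[T, S] real y_hat = integrate_ode_rk45(foo, y0, t0, ts, theta, x_r, x_i);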
Variadic Higher-order Functions Table. Variadic Higher-order functions in Stan
with their argument function types. The first group of arguments are restricted in type.
The sequence of trailing arguments can be of any length with any types.
The function argument is foo, the name of the user-defined function. As shown in
the variadic higher-order functions table, ode_rk45 takes a real, a vector, a real, a
real array, and a sequence of arguments whose types match those at the end of foo
and returns an array of vectors.
Data-restricted arguments
Some of the arguments to higher-order functions are restricted to data. This means
they must be expressions containing only data variables, transformed data variables,
or literals; they may contain arbitrary functions applied to data variables or literals,
but must not contain parameters, transformed parameters, or local variables from
any block other than transformed data.
For user-defined functions the qualifier data may be prepended to the type to restrict
the argument to data-only variables.
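6.12. Chain rule and derivatives

A hedged reconstruction of the kind of program this section discusses, whose essential feature is the sqrt(x - x) subexpression analyzed below:

parameters {
  real x;
}
model {
  x ~ normal(sqrt(x - x), 1);
}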
The expression x - x is zero for every value of x, so it would seem the model should
produce unit normal draws for x. But rather
than canceling, the expression sqrt(x - x) causes a problem for derivatives. The
cause is the mechanistic evaluation of the chain rule,
$$
\frac{d}{dx} \sqrt{x - x}
  = \frac{1}{2\sqrt{x - x}} \times \frac{d}{dx}(x - x)
  = \frac{1}{0} \times (1 - 1)
  = \infty \times 0
  = \textrm{NaN}.
$$
Rather than the x − x canceling out, it introduces a 0 into the numerator and
denominator of the chain-rule evaluation.
The only way to avoid this kind of problem is to be careful to do the necessary algebraic
reductions as part of the model and not introduce expressions like sqrt(x - x) for
which the chain rule produces not-a-number values.
7.2. Assignment statements

For example, consider the following assignment statement.¹

n = 0;

¹ In versions of Stan before 2.18.0, the operator <- was used for assignment rather than the
equal sign =. The old operator <- is now deprecated and will print a warning. In the future, it will be
removed.

Executing this statement assigns the value of the expression 0, which is the integer
zero, to the variable n. For an assignment to be well formed, the type of the
expression on the right-hand side should be compatible with the type of the (indexed)
variable on the left-hand side. For the above example, because 0 is an expression of
type int, the variable n must be declared as being of type int or of type real. If the
variable is of type real, the integer zero is promoted to a floating-point zero and
assigned to the variable. After the assignment statement executes, the variable n will
have the value zero (either as an integer or a floating-point value, depending on its
type).
Syntactically, every assignment statement must be followed by a semicolon. Oth-
erwise, whitespace between the tokens does not matter (the tokens here being
the left-hand-side (indexed) variable, the assignment operator, the right-hand-side
expression and the semicolon).
Because the right-hand side is evaluated first, it is possible to increment a variable in
Stan just as in C++ and other programming languages by writing
n = n + 1;
Such self assignments are not allowed in BUGS, because they induce a cycle into the
directed graphical model.
The left-hand side of an assignment may contain indices for array, matrix, or vector
data structures. For instance, if Sigma is of type matrix, then
Sigma[1, 1] = 1.0;
sets the value in the first column of the first row of Sigma to one.
Assignments to subcomponents of larger multi-variate data structures are supported
by Stan. For example, if a is an array of type array[,] real and b is an array of type
array[] real, then the following two statements are both well-formed.
a[3] = b;
b = a[4];
Lvalue summary
The expressions that are legal left-hand sides of assignment statements are known
as “lvalues.” In Stan, there are only two kinds of legal lvalues,
• a variable, or
• a variable with one or more indices.
To be used as an lvalue, an indexed variable must have at least as many dimensions
as the number of indices provided. An array of real or integer types has as many
dimensions as it is declared for. A matrix has two dimensions and a vector or row
vector one dimension; this also holds for the constrained types, covariance and
correlation matrices and their Cholesky factors and ordered, positive ordered, and
simplex vectors. An array of matrices has two more dimensions than the array and
an array of vectors or row vectors has one more dimension than the array. Note that
the number of indices can be less than the number of dimensions of the variable,
meaning that the right hand side must itself be multidimensional to match the
remaining dimensions.
Multiple indexes
Multiple indexes, as described in the multi-indexing section, are also permitted on
the left-hand side of assignments. Indexing on the left side works exactly as it does
for expressions, with multiple indexes preserving index positions and single indexes
reducing them. The type on the left side must still match the type on the right side.
Aliasing
All assignment is carried out as if the right-hand side is copied before the assignment.
This resolves any potential aliasing issues arising from the right-hand side changing
in the middle of an assignment statement’s execution.
A compound arithmetic and assignment statement of the form x op= y;
will be equivalent to
x = x op y;
The compound statement will be legal whenever the long form is legal. This requires
that the operation x op y must itself be well formed and that the result of the
operation be assignable to x. For the expression x to be assignable, it must be an
indexed variable where the variable is defined in the current block. For example,
the following compound addition and assignment statement will increment a single
element of a vector by two.
vector[N] x;
x[3] += 2;
The supported compound arithmetic and assignment operations are listed in the
compound arithmetic/assignment table; they are also listed in the index prefaced by
operator, e.g., operator+=.
Compound Arithmetic/Assignment Table. Stan allows compound arithmetic and
assignment statements of the forms listed in the table. The compound form is legal
whenever the corresponding long form would be legal and it has the same effect.

operation                      compound form    long form
addition                       x += y           x = x + y
subtraction                    x -= y           x = x - y
multiplication                 x *= y           x = x * y
division                       x /= y           x = x / y
elementwise multiplication     x .*= y          x = x .* y
elementwise division           x ./= y          x = x ./ y
Increment log density
The statement for incrementing the log density accumulator has the form target += u,
where u is an arbitrary expression. For example,
target += -0.5 * y * y;
The keyword target here is actually not a variable, and may not be accessed as such
(though see below on how to access the value of target through a special function).
In this example, the unnormalized log probability of a unit normal variable y is
added to the total log probability. In the general case, the argument can be any
expression.3
An entire Stan model can be implemented this way. For instance, the following
model will draw a single variable according to a unit normal probability.
parameters {
  real y;
}
model {
  target += -0.5 * y * y;
}
2 The current notation replaces two previous versions. Originally, a variable lp__ was directly exposed and manipulated; this is no longer allowed. The original statement syntax for target += u was increment_log_prob(u), but this form has been deprecated and will be removed in Stan 3.
3 Writing this model with the expression -0.5 * y * y is more efficient than with the equivalent expression y * y / -2 because multiplication is more efficient than division; in both cases, the negation is rolled into the numeric literal (-0.5 and -2). Writing square(y) instead of y * y would be even more efficient because the derivatives can be precomputed, reducing the memory and number of operations required for automatic differentiation.
This model implicitly defines the log density of a unit normal up to an additive constant,
\[ \log p(y) = -\frac{y^2}{2} - \log Z. \]
Stan only requires models to be defined up to a constant that does not depend on the
parameters. This is convenient because often the normalizing constant Z is either
time-consuming to compute or intractable to evaluate.
Built in distributions
The built in distribution functions in Stan are all available in normalized and unnor-
malized form. The normalized forms include all of the terms in the log density, and
the unnormalized forms drop terms which are not directly or indirectly a function of
the model parameters.
For instance, the normal_lpdf function returns the log density of a normal distribution,
\[ \textsf{normal\_lpdf}(x \mid \mu, \sigma) = -\log\!\left(\sigma \sqrt{2\pi}\right) - \frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}. \]
Dropping the constant terms (including the sigma term when sigma is data) to make
the calculation fast, normal_lupdf would be equivalent to
\[ \textsf{normal\_lupdf}(x \mid \mu, \sigma) = -\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}. \]
All functions ending in _lpdf have a corresponding _lupdf version which evaluates
and returns the unnormalized density. The same is true for _lpmf and _lupmf.
The increment log density statement looks syntactically like compound addition
and assignment (see the compound arithmetic/assignment section), but it is treated as a
primitive statement because target is not itself a variable. So, even though
target += lp;
is a legal statement, the corresponding long form is not legal, because target may
not be used as a standalone expression:
target = target + lp; // BAD, target is not a variable
Vectorization
The target += ... statement accepts an argument in place of ... for any expres-
sion type, including integers, reals, vectors, row vectors, matrices, and arrays of any
dimensionality, including arrays of vectors and matrices. For container arguments,
their sum will be added to the total log density.
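For instance, in the following minimal sketch, the vector's elements are summed and the sum is added to the target.
vector[3] lp = [-1.5, -2.0, -0.5]';
target += lp;  // same effect as target += sum(lp), here adding -4.0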
Sampling statements
In a sampling statement such as
y ~ normal(mu, sigma);
the variable y, which may be an unknown parameter or known, modeled data, is
being declared to have the distribution indicated by the right-hand side of the
sampling statement.
Executing such a statement does not perform any sampling. In Stan, a sampling
statement is merely a notational convenience. The above sampling statement could
be expressed as a direct increment on the total log probability as
target += normal_lpdf(y | mu, sigma);
In general, a sampling statement of the form
y ~ dist(theta1, ..., thetaN);
involving subexpressions y and theta1 through thetaN (including the case where N
is zero) will be well formed if and only if the corresponding assignment statement is
well-formed. For densities allowing real y values, the log probability density function
is used,
is used,
target += dist_lpdf(y | theta1, ..., thetaN);
For those restricted to integer y values, the log probability mass function is used,
target += dist_lpmf(y | theta1, ..., thetaN);
This will be well formed if and only if dist_lpdf(y | theta1, ..., thetaN) or
dist_lpmf(y | theta1, ..., thetaN) is a well-formed expression of type real.
The sampling statement drops all the terms in the log probability function that
are constant, whereas the explicit call to normal_lpdf adds all of the terms in
the definition of the log normal probability function, including all of the constant
normalizing terms. Therefore, the explicit increment form can be used to recreate
the exact log probability values for the model. On the other hand, the sampling
statement form will be faster if any of the input expressions, y, mu, or sigma, involve
only constants, data variables, and transformed data variables.
User-transformed variables
The left-hand side of a sampling statement may be a complex expression. For
instance, it is legal syntactically to write
parameters {
real<lower=0> beta;
}
// ...
model {
log(beta) ~ normal(mu, sigma);
}
Unfortunately, this alone is not enough to properly model the distribution of beta,
because log is a nonlinear transform of the parameter; the log absolute Jacobian
determinant of the transform, here -log(beta), must also be added to the target log
density.4
Truncated distributions
Stan supports truncating distributions with lower bounds, upper bounds, or both.
A probability density function p(x) for a continuous distribution may be truncated to
the closed interval [a, b] by defining
\[ p_{[a,b]}(x) = \frac{p(x)}{\int_a^b p(u) \, du}. \]
A probability mass function p(x) for a discrete distribution may be truncated to the
closed interval [a, b] by
4 Because \( \log \left| \frac{d}{dy} \log y \right| = \log |1/y| = -\log |y| \).
\[ p_{[a,b]}(x) = \frac{p(x)}{\sum_{u=a}^{b} p(u)}. \]
A probability density function can be truncated to provide only a lower bound by
defining
\[ p_{[a,\infty]}(x) = \frac{p(x)}{\int_a^\infty p(u) \, du}. \]
A probability mass function is truncated from below by
\[ p_{[a,\infty]}(x) = \frac{p(x)}{\sum_{a \le u} p(u)}. \]
A probability density function can be truncated to provide only an upper bound by
defining
\[ p_{[-\infty,b]}(x) = \frac{p(x)}{\int_{-\infty}^b p(u) \, du}. \]
A probability mass function is truncated from above by
\[ p_{[-\infty,b]}(x) = \frac{p(x)}{\sum_{u \le b} p(u)}. \]
Given a probability function p_X(x) for a random variable X, its cumulative distribution
function (cdf) F_X(x) is defined to be the probability that X ≤ x,
\[ F_X(x) = \Pr[X \le x]. \]
The upper-case variable X is the random variable whereas the lower-case variable x
is just an ordinary bound variable. For continuous random variables, the definition
of the cdf works out to
\[ F_X(x) = \int_{-\infty}^{x} p_X(u) \, du. \]
For discrete variables, the cdf is defined to include the upper bound given by the
argument,
\[ F_X(x) = \sum_{u \le x} p_X(u). \]
The complementary cumulative distribution function (ccdf) is defined to be the
probability that X > x,
\[ F^C_X(x) = \Pr[X > x] = 1 - F_X(x). \]
Unlike the cdf, the ccdf is exclusive of the bound, hence the event X > x rather than
the cdf’s event X ≤ x.
For continuous distributions, the ccdf works out to
\[ F^C_X(x) = 1 - \int_{-\infty}^{x} p_X(u) \, du = \int_{x}^{\infty} p_X(u) \, du. \]
The lower boundary can be included in the integration bounds because it is a single
point on a line and hence has no probability mass. For the discrete case, the lower
bound must be excluded in the summation explicitly by summing over u > x,
\[ F^C_X(x) = 1 - \sum_{u \le x} p_X(u) = \sum_{u > x} p_X(u). \]
In Stan, truncation is implemented using cumulative distribution functions. For a
continuous distribution truncated to [a, b], the denominator is defined by
\[ \int_a^b p(u) \, du = F_X(b) - F_X(a). \]
For a discrete distribution, the probability mass at the lower bound must be added
back, because the difference of cdfs excludes it, yielding
\[ p_{[a,b]}(x) = \frac{p_X(x)}{F_X(b) - F_X(a) + p_X(a)}. \]
Stan allows probability functions to be truncated. For example, a truncated unit nor-
mal distributions restricted to [−0.5, 2.1] can be coded with the following sampling
statement.
y ~ normal(0, 1) T[-0.5, 2.1];
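This truncated sampling statement behaves the same as the following sketch of a direct implementation.
y ~ normal(0, 1);
if (y < -0.5 || y > 2.1) {
  target += negative_infinity();
} else {
  target += -log_diff_exp(normal_lcdf(2.1 | 0, 1),
                          normal_lcdf(-0.5 | 0, 1));
}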
Because a Stan program defines a log density function, all calculations are on the log
scale. The function normal_lcdf is the log of the cumulative normal distribution
function and the function log_diff_exp(a, b) is a more arithmetically stable form
of log(exp(a) - exp(b)).
For a discrete distribution, another term is necessary in the denominator to add back
the probability mass at the lower bound, which the difference of cdfs excludes.
Consider the truncated discrete distribution
y ~ poisson(3.7) T[2, 10];
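This statement behaves the same as the following sketch, in which the probability mass at the lower bound is added back into the denominator with log_sum_exp.
y ~ poisson(3.7);
if (y < 2 || y > 10) {
  target += negative_infinity();
} else {
  target += -log_sum_exp(poisson_lpmf(2 | 3.7),
                         log_diff_exp(poisson_lcdf(10 | 3.7),
                                      poisson_lcdf(2 | 3.7)));
}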
For truncating with only a lower bound, the upper limit is left blank.
y ~ normal(0, 1) T[-0.5, ];
This truncated sampling statement has the same behavior as the following code.
y ~ normal(0, 1);
if (y < -0.5) {
target += negative_infinity();
} else {
target += -normal_lccdf(-0.5 | 0, 1);
}
As with lower and upper truncation, the discrete case requires a more complicated
denominator to add back in the probability mass for the lower bound. Thus
y ~ poisson(3.7) T[2, ];
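behaves the same as the following sketch, which combines the mass at the lower bound with the complementary cdf above it.
y ~ poisson(3.7);
if (y < 2) {
  target += negative_infinity();
} else {
  target += -log_sum_exp(poisson_lpmf(2 | 3.7),
                         poisson_lccdf(2 | 3.7));
}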
To truncate with only an upper bound, the lower bound is left blank. The upper
truncated sampling statement
y ~ normal(0, 1) T[ , 2.1];
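behaves the same as the following sketch.
y ~ normal(0, 1);
if (y > 2.1) {
  target += negative_infinity();
} else {
  target += -normal_lcdf(2.1 | 0, 1);
}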
With only an upper bound, the discrete case does not need a boundary adjustment.
The upper-truncated sampling statement
y ~ poisson(3.7) T[ , 10];
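behaves the same as the following sketch, using only the log cdf.
y ~ poisson(3.7);
if (y > 10) {
  target += negative_infinity();
} else {
  target += -poisson_lcdf(10 | 3.7);
}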
In all cases, the truncation is only well formed if the appropriate log density or mass
function and the necessary log cumulative distribution functions are defined. Not
every distribution built into Stan has log cdfs and log ccdfs defined, nor will every
user-defined distribution. The distribution documentation describes the available
discrete and continuous cumulative distribution functions; most univariate
distributions have log cdf and log ccdf functions.
For a truncated sampling statement, if the value sampled is not within the bounds
specified by the truncation expression, the result is zero probability and the entire
statement adds −∞ to the total log probability, which in turn results in the sample
being rejected.
Stan does not (yet) support vectorization of distribution functions with truncation.
For loops
Suppose N is an integer, y is an N-element array of real values, and mu and sigma
are real values. Then the following is a well-formed for loop.
for (n in 1:N) {
y[n] ~ normal(mu, sigma);
}
The loop variable is n, the loop bounds are the values in the range 1:N, and the body
is the statement following the loop bounds.
Unlike BUGS, Stan allows variables to be reassigned inside loops. For example, the
variable theta is reassigned on each iteration in the following program fragment.
for (n in 1:N) {
  theta = inv_logit(alpha + x[n] * beta);
  y[n] ~ bernoulli(theta);
}
Such reassignment is not permitted in BUGS. In BUGS, for loops are declarative,
defining plates in directed graphical model notation, which can be thought of as
repeated substructures in the graphical model. Therefore, it is illegal in BUGS or
JAGS to have a for loop that repeatedly reassigns a value to a variable.5
In Stan, assignments are executed in the order they are encountered. As a conse-
quence, the following Stan program has a very different interpretation than the
previous one.
for (n in 1:N) {
  y[n] ~ bernoulli(theta);
  theta = inv_logit(alpha + x[n] * beta);
}
In this program, theta is used in the sampling statement before it is assigned. This
presupposes theta was defined before the first loop iteration (otherwise behavior is
undefined), and each subsequent iteration uses the assignment from the previous
iteration.
5 A programming idiom in BUGS code simulates a local variable by replacing theta in the above example with theta[n], effectively creating N different variables, theta[1], ..., theta[N]. Of course, this is not a hack if the value of theta[n] is required for all n.
Stan loops may be used to accumulate values. Thus it is possible to sum the values
of an array directly using code such as the following.
total = 0.0;
for (n in 1:N) {
total = total + x[n];
}
After the for loop is executed, the variable total will hold the sum of the elements
in the array x. This example was purely pedagogical; it is easier and more efficient
to write
total = sum(x);
A variable inside (or outside) a loop may even be reassigned multiple times, as in
the following legal code.
for (n in 1:100) {
y += y * epsilon;
epsilon = 0.5 * epsilon;
y += y * epsilon;
}
Foreach loops
Stan supports foreach-style loops of the form
for (y in ys)
  statement
where ys is an expression denoting a container (a vector, row vector, matrix, or
array) and the loop variable y is assigned each element of the container in turn. The
order in which elements of ys are visited is defined for container types as follows.
• vector, row_vector: elements visited in order, y is of type real
• matrix: elements visited in column-major order, y is of type real
• array[] T: elements visited in order, y is of type T.
Consequently, if ys is a two-dimensional array of type array[,] real, y will be a
one-dimensional array of real values (type array[] real). If ys is a matrix, then y
will be a real value (type real). Looping over all values of a two-dimensional array
with foreach statements requires a doubly-nested loop,
array[2, 3] real yss;
for (ys in yss) {
for (y in ys) {
// ... do something with y ...
}
}
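By contrast, all the values of a matrix of the same shape can be visited with a single foreach loop, as in this sketch.
matrix[2, 3] m;
for (y in m) {
  // ... do something with y ...
}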
In both cases, the loop variable y is of type real. The elements of the matrix are
visited in column-major order (e.g., m[1, 1], m[2, 1], m[1, 2], ..., m[2, 3]),
whereas the elements of the two-dimensional array are visited in row-major order
(e.g., yss[1, 1], yss[1, 2], yss[1, 3], yss[2, 1], ..., yss[2, 3]).
Conditional statements
Stan supports full conditional statements using the same if-then-else syntax as C++.
The general format is
if (condition1)
  statement1
else if (condition2)
  statement2
// ...
else if (conditionN-1)
  statementN-1
else
  statementN
There must be a single leading if clause, which may be followed by any number of
else if clauses, all of which may be optionally followed by an else clause. Each
condition must be a real or integer value, with non-zero values interpreted as true
and the zero value as false.
The entire sequence of if-then-else clauses forms a single conditional statement
for evaluation. The conditions are evaluated in order until one of the conditions
evaluates to a non-zero value, at which point its corresponding statement is executed
and the conditional statement finishes execution. If none of the conditions evaluate
to a non-zero value and there is a final else clause, its statement is executed.
While statements
Stan supports standard while loops, with syntax
while (condition)
  body
The condition must be an integer or real expression and the body can be any
statement (or sequence of statements in curly braces).
Evaluation of a while loop starts by evaluating the condition. If the condition
evaluates to a false (zero) value, the execution of the loop terminates and control
moves to the position after the loop. If the loop’s condition evaluates to a true
(non-zero) value, the body statement is executed, then the whole loop is executed
again. Thus the loop is continually executed as long as the condition evaluates to a
true value.
The rest of the body of a while loop may be skipped using a continue. The loop will
be exited with a break statement. See the section on continue and break statements
for more details.
Statement blocks and local variable declarations
To put multiple statements inside the body of a for loop, a block is used, as in the
following example.
for (n in 1:N) {
lambda[n] ~ gamma(alpha, beta);
y[n] ~ poisson(lambda[n]);
}
The open curly bracket ({) is the first character of the block and the close curly
bracket (}) is the last character.
Because whitespace is ignored in Stan, the following program will not compile.
for (n in 1:N)
y[n] ~ normal(mu, sigma);
z[n] ~ normal(mu, sigma); // ERROR!
The problem is that the body of the for loop is taken to be the statement directly
following it, which is y[n] ~ normal(mu, sigma). This leaves the probability
statement for z[n] hanging, as is clear from the following equivalent program.
for (n in 1:N) {
y[n] ~ normal(mu, sigma);
}
z[n] ~ normal(mu, sigma); // ERROR!
Neither of these programs will compile. If the loop variable n was defined before the
for loop, the for-loop declaration will raise an error. If the loop variable n was not
defined before the for loop, then the use of the expression z[n] will raise an error.
For instance, the for loop example of repeated assignment should use a local variable
for maximum clarity and efficiency, as in the following example.
for (n in 1:N) {
real theta;
theta = inv_logit(alpha + x[n] * beta);
y[n] ~ bernoulli(theta);
}
The local variable theta is declared here inside the for loop. The scope of a local
variable is just the block in which it is defined. Thus theta is available for use inside
the for loop, but not outside of it. As in other situations, Stan does not allow variable
hiding. So it is illegal to declare a local variable theta if the variable theta is already
defined in the scope of the for loop. For instance, the following is not legal.
for (m in 1:M) {
  real theta;
  for (n in 1:N) {
    real theta; // ERROR!
    theta = inv_logit(alpha + x[m, n] * beta);
    y[m, n] ~ bernoulli(theta);
    // ...
  }
}
The compiler will flag the second declaration of theta with a message that it is
already defined.
The same variable name may, however, be reused in non-overlapping scopes. For
example, the loop index n below is unrelated to the local variable n declared in the
embedded block.
for (m in 1:M) {
  {
    int n;
    n = 2 * m;
    sum += n;
  }
for (n in 1:N) {
sum += x[m, n];
}
}
The variable declaration int n; is the first element of an embedded block and so
has scope within that block. The for loop defines its own local block implicitly over
the statement following it in which the loop variable is defined. As far as Stan is
concerned, these two uses of n are unrelated.
Break statements
When a break statement is executed, the most deeply nested loop currently being
executed is ended and execution picks up with the next statement after the loop. For
example, consider the following program:
while (1) {
if (n < 0) {
break;
}
foo(n);
n = n - 1;
}
The while (1) loop is a “forever” loop, because 1 is the true value, so the test
always succeeds. Within the loop, if the value of n is less than 0, the loop terminates,
otherwise it executes foo(n) and then decrements n. The statement above does
exactly the same thing as
while (n >= 0) {
foo(n);
n = n - 1;
}
This case is simply illustrative of the behavior; it is not a case where a break simplifies
the loop.
Continue statements
The continue statement ends the current operation of the loop and returns to the
condition at the top of the loop. Such loops are typically used to exclude some values
from calculations. For example, we could use the following loop to sum the positive
values in the array x,
real sum;
sum = 0;
for (n in 1:size(x)) {
if (x[n] <= 0) {
continue;
}
sum += x[n];
}
When the continue statement is executed, control jumps back to the conditional part
of the loop. With while loops, this causes control to return to the condition of the
loop; with for loops, it also advances the loop variable, so the above program will
not go into an infinite loop when faced with an x[n] less than or equal to zero. Thus
the above program could be rewritten with deeper nesting by reversing the conditional,
real sum;
sum = 0;
for (n in 1:size(x)) {
if (x[n] > 0) {
sum += x[n];
}
}
While the latter form may seem more readable in this simple case, the former has
the main line of execution nested one level less deep. Instead, the conditional at
the top finds cases to exclude and doesn't require the same level of nesting for code
that's not excluded. When there are several such exclusion conditions, the break or
continue versions tend to be much easier to read.
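The following sketch (cond1, cond2, and cond3 stand for arbitrary integer or real conditions) illustrates a break inside a nested loop.
while (cond1) {
  // ...
  while (cond2) {
    // ...
    if (cond3) {
      break;
    }
    // ...
  }
  // execution continues here after the inner loop ends or breaks
}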
If the break is triggered by cond3 being true, execution will continue after the nested
loop.
As with break statements, continue statements go back to the top of the most deeply
nested loop in which the continue appears.
Although break and continue must appear within loops, they may appear in nested
statements within loops, such as within the conditionals shown above or within
nested statements. The break and continue statements jump past any control
structure other than while-loops and for-loops.
Print statements
Stan provides print statements for debugging and monitoring. For example, a print
statement may be placed inside a loop body.
for (n in 1:N) {
  print("loop iteration: ", n);
  // ...
}
The print statement will execute every time the body of the loop does. Each time
the loop body is executed, it will print the string “loop iteration: ” (with the trailing
space), followed by the value of the expression n, followed by a new line.
Print content
The text printed by a print statement varies based on its content. A literal (i.e.,
quoted) string in a print statement always prints exactly that string (without the
quotes). Expressions in print statements result in the value of the expression being
printed. But how the value of the expression is formatted will depend on its type.
Printing a simple real or int typed variable always prints the variable’s value.6
For array, vector, and matrix variables, the print format uses brackets. For example,
a 3-vector will print as
[1, 2, 3]
and a 2 × 3 matrix as
[[1, 2, 3], [4, 5, 6]]
A complex value prints as a parenthesized pair; for example, a complex number
with real part 1.2 and imaginary part -3.5 will print as (1.2,-3.5), with no space
after the comma or within the parentheses.
Printing a more readable version of arrays or matrices can be done with loops. An
example is the print statement in the following transformed data block.
transformed data {
matrix[2, 2] u;
u[1, 1] = 1.0; u[1, 2] = 4.0;
u[2, 1] = 9.0; u[2, 2] = 16.0;
for (n in 1:2) {
print("u[", n, "] = ", u[n]);
}
}
This print statement executes twice, printing the following two lines of output.
u[1] = [1, 4]
u[2] = [9, 16]
6 The adjoint component is always zero during execution for the algorithmic differentiation variables
used to implement parameters, transformed parameters, and local variables in the model.
Non-void input
The input type to a print function cannot be void. In particular, it can’t be the result
of a user-defined void function. All other types are allowed as arguments to the print
function.
Print frequency
Printing for a print statement happens every time it is executed. The transformed
data block is executed once per chain, the transformed parameter and model
blocks once per leapfrog step, and the generated quantities block once per itera-
tion.
String literals
String literals begin and end with a double quote character ("). The characters
between the double quote characters may be any byte sequence, with the exception
of the double quote character.
The Stan interfaces preserve the byte sequences which they receive. The encoding of
these byte sequences as characters and their rendering as glyphs will be handled by
whatever display mechanism is being used to monitor Stan’s output (e.g., a terminal,
a Jupyter notebook, RStudio, etc.). Stan does not enforce a character encoding for
strings, and no attempt is made to validate the bytes as legal ASCII, UTF-8, etc.
Debug by print
Because Stan is an imperative language, print statements can be very useful for
debugging. They can be used to display the values of variables or expressions at
various points in the execution of a program. They are particularly useful for spotting
problematic not-a-number or infinite values, both of which will be printed.
It is particularly useful to print the value of the target log density accumulator
(through the target() function), as in the following example.
vector[2] y;
y[1] = 1;
print("log density before =", target());
y ~ normal(0,1); // bug! y[2] not defined
print("log density after =", target());
The example has a bug in that y[2] is not defined before the vector y is used in the
sampling statement. By printing the value of the log probability accumulator before
and after each sampling statement, it’s possible to isolate where the log probability
becomes ill-defined (i.e., becomes not-a-number).
Rejections in functions
Rejections in user-defined functions are just passed to the calling function or program
block. Reject statements can be used in functions to validate the function arguments,
allowing user-defined functions to fully emulate built-in function behavior. It is
better to find out earlier rather than later when there is a problem.
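For example, a user-defined function might validate its arguments as in the following minimal sketch (the function my_log is hypothetical).
functions {
  real my_log(real x) {
    if (x <= 0) {
      reject("my_log(x): x must be positive; found x = ", x);
    }
    return log(x);
  }
}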
In both the transformed data block and generated quantities block, rejections are
fatal. This is because if initialization fails or if generating output fails, there is no
way to recover and continue sampling.
Rejections in the transformed parameters and model blocks are not in and of them-
selves instantly fatal. The result has the same effect as assigning a −∞ log probability,
which causes rejection of the current proposal in MCMC samplers and adjustment of
search parameters in optimization.
If the log probability function results in a rejection every time it is called, the con-
taining application (MCMC sampler or optimization) should diagnose this problem
and terminate with an appropriate error message. To aid in diagnosing problems,
the message for each reject statement will be printed as a result of executing it.
Rejection is not for constraints
Reject statements should be used for error handling, not for enforcing constraints
on parameters. Consider the following problematic program, which attempts to
impose parameter-dependent truncation bounds on theta through its declared
constraints.
parameters {
  real a;
  real<lower=a> b;
  real<lower=a, upper=b> theta;
  // ...
}
model {
  theta ~ normal(0, 1); // **wrong**
  // ...
}
This program is wrong because its truncation bounds on theta depend on pa-
rameters, and thus need to be accounted for using an explicit truncation on the
distribution. This is the right way to do it.
theta ~ normal(0, 1) T[a, b];
The conceptual issue is that the prior does not integrate to one over the admissible
parameter space; it integrates to one over all real numbers and integrates to some-
thing less than one over [a, b]; in these simple univariate cases, we can overcome that
with the T[ , ] notation, which essentially divides by whatever the prior integrates
to over [a, b].
This problem is exactly the same as the one that arises when reject statements are
used to enforce complicated inequalities on multivariate functions. It is likewise
wrong to try to deal with truncation through rejection, as in the following example.
if (theta < a || theta > b) {
reject("theta not in (a, b)");
}
// still **wrong**, needs T[a,b]
theta ~ normal(0, 1);
In this case, the prior integrates to something less than one over the region of the
parameter space where the complicated inequalities are satisfied. But we don’t
generally know what value the prior integrates to, so we can’t increment the log
probability function to compensate.
Even if the missing adjustment to a proper probability model seems minor in
particular models, where the amount of truncated posterior density is negligible or
constant, Stan cannot sample from the rejection-truncated posterior efficiently.
Programs should instead use one-to-one mappings that guarantee the constraints are
satisfied, and should only use reject statements to raise errors or help with debugging.
8. Program Blocks
A Stan program is organized into a sequence of named blocks, the bodies of which
consist of variable declarations followed, in some blocks, by statements.
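In full, the blocks appear in the following order; all blocks are optional, but those present must appear in this order.
functions {
  // ... function declarations and definitions ...
}
data {
  // ... declarations ...
}
transformed data {
  // ... declarations ... statements ...
}
parameters {
  // ... declarations ...
}
transformed parameters {
  // ... declarations ... statements ...
}
model {
  // ... declarations ... statements ...
}
generated quantities {
  // ... declarations ... statements ...
}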
The function-definition block contains user-defined functions. The data block de-
clares the required data for the model. The transformed data block allows the
definition of constants and transforms of the data. The parameters block declares
the model’s parameters — the unconstrained version of the parameters is what’s
sampled or optimized. The transformed parameters block allows variables to be
defined in terms of data and parameters that may be used later and will be saved.
The model block is where the log probability function is defined. The generated
quantities block allows derived quantities based on parameters, data, and optionally
(pseudo) random number generation.
Variable scope
The variables declared in each block have scope over all subsequent statements.
Thus a variable declared in the transformed data block may be used in the model
block. But a variable declared in the generated quantities block may not be used
in any earlier block, including the model block. The exception to this rule is that
variables declared in the model block are always local to the model block and may
not be accessed in the generated quantities block; to make a variable accessible in
the model and generated quantities block, it must be declared as a transformed
parameter.
Variables declared as function parameters have scope only within that function
definition’s body, and may not be assigned to (they are constant).
Function scope
Functions defined in the function block may be used in any appropriate block. Most
functions can be used in any block and applied to a mixture of parameters and data
(including constants or program literals).
Random-number-generating functions are restricted to the transformed data and
generated quantities blocks, and to the bodies of user-defined functions with names
ending in _rng; such functions are suffixed with _rng. Log-probability-modifying
functions are restricted to blocks where the log probability accumulator is in scope
(transformed parameters and model); such functions are suffixed with _lp.
Density functions defined in the program may be used in sampling statements.
Transformed variables
The transformed data and transformed parameters block behave similarly to
each other. Both allow new variables to be declared and then defined through a
sequence of statements. Because variables scope over every statement that follows
them, transformed data variables may be defined in terms of the data variables.
Before generating any draws, data variables are read in, then the transformed data
variables are declared and the associated statements executed to define them. This
means the statements in the transformed data block are only ever evaluated once.1
Transformed parameters work the same way, being defined in terms of the param-
eters, transformed data, and data variables. The difference is the frequency of
evaluation. Parameters are read in and (inverse) transformed to constrained repre-
sentations on their natural scales once per log probability and gradient evaluation.
This means the inverse transforms and their log absolute Jacobian determinants are
evaluated once per leapfrog step. Transformed parameters are then declared and
their defining statements executed once per leapfrog step.
Generated quantities
The generated quantity variables are defined once per sample after all the leapfrog
steps have been completed. These may be random quantities, so the block must
be rerun even if the Metropolis adjustment of HMC or NUTS rejects the update
proposal.
1 If the C++ code is configured for concurrent threads, the data and transformed data blocks can be executed once per set of chains rather than once per chain.
Variable Declaration Table. This table indicates where variables that are not basic
data or parameters should be declared, based on whether it is defined in terms of
parameters, whether it is used in the log probability function defined in the model block,
and whether it is printed. The two lines marked with asterisks (∗) should not be used
as there is no need to print a variable every iteration that does not depend on the value
of any parameters.
Another way to look at the variables is in terms of their function. To decide which
variable to use, consult the charts in the variable declaration table. The last line has
no corresponding location, as there is no need to print a variable every iteration that
does not depend on parameters.2
The rest of this chapter provides full details on when and how the variables and
statements in each block are executed.
Page 366 of (Gelman and Hill 2007) provides a taxonomy of the kinds of variables
used in Bayesian models. The table of kinds of variables contains Gelman and Hill's
taxonomy, along with a missing-data kind, and the corresponding locations of
declarations and definitions in Stan.
Constants can be built into a model as literals, data variables, or transformed data
variables. If specified as data variables, their values must be included in data files. If
they are specified as transformed data variables, they cannot be used to specify the
sizes of elements in the data block.
2 It is possible to print a variable every iteration that does not depend on parameters—just define it (or redefine it if necessary) in the generated quantities block.
The following program illustrates various variables kinds, listing the kind of each
variable next to its declaration.
data {
int<lower=0> N; // unmodeled data
array[N] real y; // modeled data
real mu_mu; // config. unmodeled param
real<lower=0> sigma_mu; // config. unmodeled param
}
transformed data {
real<lower=0> alpha; // const. unmodeled param
real<lower=0> beta; // const. unmodeled param
alpha = 0.1;
beta = 0.1;
}
parameters {
real mu_y; // modeled param
real<lower=0> tau_y; // modeled param
}
transformed parameters {
real<lower=0> sigma_y; // derived quantity (param)
sigma_y = pow(tau_y, -0.5);
}
model {
tau_y ~ gamma(alpha, beta);
mu_y ~ normal(mu_mu, sigma_mu);
for (n in 1:N) {
y[n] ~ normal(mu_y, sigma_y);
}
}
generated quantities {
real variance_y; // derived quantity (transform)
variance_y = sigma_y * sigma_y;
}
In this example, y[N] is a modeled data vector. Although it is specified in the data
block, and thus must have a known value before the program may be run, it is
modeled as if it were generated randomly as described by the model.
The variable N is a typical example of unmodeled data. It is used to indicate a size
that is not part of the model itself.
The other variables declared in the data and transformed data block are examples of
unmodeled parameters, also known as hyperparameters. Unmodeled parameters are
parameters to probability densities that are not themselves modeled probabilistically.
In Stan, unmodeled parameters that appear in the data block may be specified on a
per-model execution basis as part of the data read. In the above model, mu_mu and
sigma_mu are configurable unmodeled parameters.
Unmodeled parameters that are hard coded in the model must be declared in the
transformed data block. For example, the unmodeled parameters alpha and beta
are both hard coded to the value 0.1. To allow such variables to be configurable
based on data supplied to the program at run time, they must be declared in the
data block, like the variables mu_mu and sigma_mu.
This program declares two modeled parameters, mu_y and tau_y. These are the
location and precision used in the normal model of the values in y. The heart of
the model will be sampling the values of these parameters from their posterior
distribution.
The modeled parameter tau_y is transformed from a precision to a scale param-
eter and assigned to the variable sigma_y in the transformed parameters block.
Thus the variable sigma_y is considered a derived quantity — its value is entirely
determined by the values of other variables.
The generated quantities block defines a value variance_y, which is defined
as a transform of the scale or deviation parameter sigma_y. It is defined in the
generated quantities block because it is not used in the model. Making it a generated
quantity allows it to be monitored for convergence (being a non-linear transform,
it will have different autocorrelation and hence convergence properties than the
deviation itself).
In later versions of Stan which have random number generators for the distributions,
the generated quantities block will be usable to generate replicated data for
model checking.
Finally, the variable n is used as a loop index in the model block.
Statements
The data block does not allow statements.
Statements
The statements in a transformed data block are used to define (provide values for)
variables declared in the transformed data block. Assignments are only allowed to
variables declared in the transformed data block.
3 With multiple threads, or even running chains sequentially in a single thread, data could be read
only once per set of chains. Stan was designed to be thread safe and future versions will provide a
multithreading option for Markov chains.
These statements are executed once, in order, right after the data is read into the
data variables. This means they are executed once per chain.
Variables declared in the data block may be used in statements in the transformed
data block.
The statements in the transformed data block are designed to be executed once
and have a deterministic result. Therefore, log probability is not accumulated and
sampling statements may not be used.
Because Stan's samplers and optimizers operate over unconstrained spaces, the
parameters defined in the parameters block must be transformed so they are
unconstrained.
In practice, the samplers keep an unconstrained parameter vector in memory repre-
senting the current state of the sampler. The model defined by the compiled Stan
program defines an (unnormalized) log probability function over the unconstrained
parameters. In order to do this, the log probability function must apply the inverse
transform to the unconstrained parameters to calculate the constrained parameters
defined in Stan’s parameters program block. The log Jacobian of the inverse trans-
form is then added to the accumulated log probability function. This then allows the
Stan model to be defined in terms of the constrained parameters.
In some cases, the number of parameters is reduced in the unconstrained space.
For instance, a K-simplex only requires K − 1 unconstrained parameters, and a
K-correlation matrix only requires \(\binom{K}{2}\) unconstrained parameters. This means that
the probability function defined by the compiled Stan program may have fewer
parameters than it would appear from looking at the declarations in the parameters
program block.
The probability function on the unconstrained parameters is defined in such a way
that the order of the parameters in the vector corresponds to the order of the
variables defined in the parameters program block. The details of the specific
transformations are provided in the variable transforms chapter.
Gradient calculation
Hamiltonian Monte Carlo requires the gradient of the (unnormalized) log probability
function with respect to the unconstrained parameters to be evaluated during every
leapfrog step. There may be one leapfrog step per sample or hundreds, with more
being required for models with complex posterior distribution geometries.
Gradients are calculated behind the scenes using Stan’s algorithmic differentiation
library. The time to compute the gradient does not depend directly on the number
of parameters, only on the number of subexpressions in the calculation of the log
probability. This includes the expressions added from the transforms’ Jacobians.
The amount of work done by the sampler does depend on the number of uncon-
strained parameters, but this is usually dwarfed by the gradient calculations.
Writing draws
In the basic Stan compiled program, there is a file to which the values of variables
are written for each draw. The constrained versions of the variables are written in the
order they are defined in the parameters block. In order to do this, the transformed
parameter, model, and generated quantities statements must also be executed.
Within the generated quantities block, variables declared in nested local blocks are
not included in the output. For example,
generated quantities {
  int a; // added to the output
  {
    int b; // not added to the output
  }
}
9. User-Defined Functions
Stan allows users to define their own functions. The basic syntax is a simplified
version of that used in C and C++. This chapter specifies how functions are declared,
defined, and used in Stan.
Function definitions and declarations may appear in any order, subject to the con-
dition that a function must be declared before it is used. Forward declarations are
allowed in order to support recursive functions.
For example, the declaration
real foo(real mu, real sigma);
declares a function named foo with two argument variables of types real and real.
The arguments are named mu and sigma, but that is not part of the declaration.
Two user-defined functions may not have the same name even if they have different
sequences of argument types.
Functions as expressions
Functions with non-void return types are called just like any other built-in function in
Stan—they are applied to appropriately typed arguments to produce an expression,
which has a value when executed.
Functions as statements
Functions with void return types may be applied to arguments and used as state-
ments. These act like sampling statements or print statements. Such uses are only
appropriate for functions that act through side effects, such as incrementing the log
probability accumulator, printing, or raising exceptions.
Restrictions on placement
Functions of certain types are restricted on scope of usage. Functions whose names
end in _lp assume access to the log probability accumulator and are only available
in the transformed parameter and model blocks.
Functions whose names end in _rng assume access to the random number generator
and may only be used within the generated quantities block, transformed data block,
and within user-defined functions ending in _rng.
Functions whose names end in _lpdf and _lpmf can be used anywhere. However,
_lupdf and _lupmf functions can only be used in the model block or user-defined
probability functions.
See the section on function bodies for more information on these special types of
function.
The type void may not be used as an argument type, only a return type for a function
with side effects.
Dimensionality declaration
Arguments and return types may be arrays, and these are indicated with optional
brackets and commas as would be used for indexing. For example, int denotes a
single integer argument or return, whereas array[] real indicates a one-dimensional
array of reals, array[,] real a two-dimensional array, and array[,,] real a three-
dimensional array; whitespace is optional, as usual.
The dimensions for vectors and matrices are not included, so that matrix is the type
of a single matrix argument or return type. Thus if a variable is declared as matrix
a, then a has two indexing dimensions, so that a[1] is a row vector and a[1, 1] a
real value. Matrices implicitly have two indexing dimensions. The type declaration
array[ , ] matrix b specifies that b is a two-dimensional array of matrices, for a total of
four indexing dimensions, with b[1, 1, 1, 1] picking out a real value.
Data-only qualifiers
Some of Stan’s built-in functions, like the differential equation solvers, have argu-
ments that must be data. Such data-only arguments must be expressions involving
only data, transformed data, and generated quantity variables.
In user-defined functions, the qualifier data may be placed before an argument type
declaration to indicate that the argument must be data only. For example,
real foo(data real x) {
  return x^2;
}
Note that for function definitions, the comma is used to separate the variate from
the parameters, rather than the vertical bar used in function applications.
For every custom _lpdf and _lpmf defined there is a corresponding _lupdf and
_lupmf defined automatically. The _lupdf and _lupmf versions of the functions
cannot be defined directly (to do so will produce an error). The difference between
the _lpdf and _lpmf forms and the corresponding _lupdf and _lupmf forms lies in
their treatment of unnormalized density functions used inside the user-defined
function: the _lpdf and _lpmf forms will force these densities to be normalized,
while the _lupdf and _lupmf forms will allow them to drop additive constants.
The sampling shorthand
z ~ foo(phi);
will have the same effect as incrementing the target with the log of the unnormalized
density:
target += foo_lupdf(z | phi);
Other _lupdf and _lupmf functions used in the definition of foo_lpdf will drop
additive constants when foo_lupdf is called and will not drop additive constants
when foo_lpdf is called.
If there are _lupdf and _lupmf functions used inside the following call to foo_lpdf,
they will be forced to normalize (return the equivalent of their _lpdf and _lpmf
forms):
target += foo_lpdf(z | phi);
If there are no _lupdf or _lupmf functions used in the definition of foo_lpdf, then
there will be no difference between a foo_lpdf or foo_lupdf call.
The unnormalized _lupdf and _lupmf functions can only be used in the model block
or in user-defined probability functions (those ending in _lpdf or _lpmf).
The same syntax and shorthand that works for _lpdf also works for log probability
mass functions with suffixes _lpmf.
A function that is going to be accessed as a distribution must return the log of the
density or mass function it defines.
because the return statement is not the last statement in the while loop. A bogus
dummy return could be placed after the while loop in this case. The rules for returns
allow
real log_fancy(real x) {
if (x < 1e-30) {
return x;
} else if (x < 1e-14) {
return x * x;
} else {
return log(x);
}
}
because there’s a default else clause and each condition body has return as its final
statement.
Usage as statement
A void function may be used as a statement after the function is declared; see the
section on forward declarations for rules on declaration.
Because there is no return, such a usage is only for side effects, such as incrementing
the log probability function, printing, or raising an error.
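The following minimal sketch (the function print_with_label is hypothetical) shows a void function applied as a statement.
functions {
  void print_with_label(real x) {
    print("value = ", x);  // side effect only; nothing is returned
  }
}
model {
  print_with_label(3.5);  // void call used as a statement
}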
9.9. Declarations
In general, functions must be declared before they are used. Stan supports forward
declarations, which look like function definitions without bodies. For example,
real unit_normal_lpdf(real y);
A user-defined Stan function may be declared and then later defined, or just defined
without being declared. No other combination of declaration and definition is legal,
so that, for instance, a function may not be declared more than once, nor may it be
defined more than once. If there is a declaration, there must be a definition. These
rules together ensure that all the declared functions are eventually defined.
Recursive functions
Forward declarations allow the definition of self-recursive or mutually recursive
functions. For instance, consider the following code to compute Fibonacci numbers.
int fib(int n);
int fib(int n) {
if (n < 2) {
return n;
} else {
return fib(n-1) + fib(n-2);
}
}
Without the forward declaration in the first line, the body of the definition would
not compile.
10. Constraint Transforms
To avoid having to deal with constraints while simulating the Hamiltonian dynamics
during sampling, every (multivariate) parameter in a Stan model is transformed to
an unconstrained variable behind the scenes by the model compiler. The transform
is based on the constraints, if any, in the parameter’s definition. Scalars or the scalar
values in vectors, row vectors or matrices may be constrained with lower and/or
upper bounds. Vectors may alternatively be constrained to be ordered, positive
ordered, or simplexes. Matrices may be constrained to be correlation matrices or
covariance matrices. This chapter provides a definition of the transforms used for
each type of variable.
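For reference, each of the following parameter declarations (an illustrative sketch) triggers one of the transforms described in this chapter.
parameters {
  real<lower=0> sigma;          // lower bound: log transform
  real<lower=0, upper=1> theta; // lower and upper bounds: logit transform
  ordered[4] c;                 // ordered vector transform
  simplex[5] phi;               // unit simplex (stick-breaking) transform
  corr_matrix[3] Omega;         // correlation matrix transform
  cov_matrix[3] Sigma;          // covariance matrix transform
}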
Stan converts models to C++ classes which define probability functions with support
on all of RK , where K is the number of unconstrained parameters needed to define
the constrained parameters defined in the program. The C++ classes also include
code to transform the parameters from unconstrained to constrained and apply the
appropriate Jacobians.
Changes of variables
In the univariate case, if X is a random variable with density p_X(x) and f is a
monotonic, differentiable function with Y = f(X), then the density of Y is
\[ p_Y(y) = p_X\!\left(f^{-1}(y)\right) \left| \frac{d}{dy} f^{-1}(y) \right|. \]
The absolute derivative of the inverse transform measures how the scale of the
transformed variable changes with respect to the underlying variable.
In the multivariate case, if f is a one-to-one, differentiable transform with Y = f(X),
then
\[ p_Y(y) = p_X\!\left(f^{-1}(y)\right) \left| \det J_{f^{-1}}(y) \right|, \]
where det is the matrix determinant operation and \(J_{f^{-1}}(y)\) is the Jacobian matrix
of \(f^{-1}\) evaluated at y. Taking \(x = f^{-1}(y)\), the Jacobian matrix is defined by
\[ J_{f^{-1}}(y) = \begin{bmatrix} \dfrac{\partial x_1}{\partial y_1} & \cdots & \dfrac{\partial x_1}{\partial y_K} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial x_K}{\partial y_1} & \cdots & \dfrac{\partial x_K}{\partial y_K} \end{bmatrix}. \]
If the Jacobian matrix is triangular, the determinant reduces to the product of the
diagonal entries,
\[ \det J_{f^{-1}}(y) = \prod_{k=1}^{K} \frac{\partial x_k}{\partial y_k}. \]
Triangular matrices naturally arise in situations where the variables are ordered, for
instance by dimension, and each variable’s transformed value depends on the previ-
ous variable’s transformed values. Diagonal matrices, a simple form of triangular
matrix, arise if each transformed variable only depends on a single untransformed
variable.
Lower bounded scalar
Stan uses a logarithmic transform for lower bounds. For a variable X with lower
bound a, the transformed variable is
\[ Y = \log(X - a). \]
The inverse transform is
\[ X = \exp(Y) + a, \]
with absolute derivative
\[ \left| \frac{d}{dy}\left(\exp(y) + a\right) \right| = \exp(y), \]
so the density of the unconstrained variable is \( p_Y(y) = p_X(\exp(y) + a) \cdot \exp(y) \).
Upper bounded scalar
For a variable X with upper bound b, Stan uses the transform
\[ Y = \log(b - X). \]
The inverse transform is
\[ X = b - \exp(Y), \]
with absolute derivative
\[ \left| \frac{d}{dy}\left(b - \exp(y)\right) \right| = \exp(y). \]
Lower and upper bounded scalar
For a variable with both a lower and an upper bound, Stan uses a log odds transform.
The log odds function is defined for u ∈ (0, 1) by
\[ \text{logit}(u) = \log \frac{u}{1 - u}. \]
The inverse of the log odds function is the logistic sigmoid, defined for v ∈ (−∞, ∞)
by
\[ \text{logit}^{-1}(v) = \frac{1}{1 + \exp(-v)}. \]
The derivative of the inverse logit is
\[ \frac{d}{dy}\,\text{logit}^{-1}(y) = \text{logit}^{-1}(y)\left(1 - \text{logit}^{-1}(y)\right). \]
For a variable X constrained to the interval (a, b), the inverse transform is
\[ X = a + (b - a) \cdot \text{logit}^{-1}(Y), \]
with derivative
\[ \frac{d}{dy}\left(a + (b - a) \cdot \text{logit}^{-1}(y)\right) = (b - a) \cdot \text{logit}^{-1}(y)\left(1 - \text{logit}^{-1}(y)\right). \]
Despite the apparent complexity of this expression, most of the terms are repeated
and thus only need to be evaluated once. Most importantly, logit−1 (y) only needs to
be evaluated once, so there is only one call to exp(−y).
Affine transform
For variables with expected offset µ and/or (positive) multiplier σ, Stan uses an
affine transform. Such a variable X is transformed to a new variable Y , where
\[ Y = \frac{X - \mu}{\sigma}. \]
If only one of the offset µ or multiplier σ is specified, the other takes its default
value: 0 for µ and 1 for σ.
The inverse transform is
\[ X = \mu + \sigma \cdot Y, \]
with derivative
\[ \frac{d}{dy}(\mu + \sigma \cdot y) = \sigma, \]
so the density of the transformed variable is
\[ p_Y(y) = p_X(\mu + \sigma \cdot y) \cdot \sigma. \]
Ordered vector
For some parameters, such as cut points in ordered logistic regression, the values
must be strictly increasing,
\[ x_k < x_{k+1}, \]
for k with 1 ≤ k < K.
Ordered transform
Stan’s transform follows the constraint directly. It maps an increasing vector x ∈ RK
to an unconstrained vector y ∈ RK by setting
\[ y_k = \begin{cases} x_1 & \text{if } k = 1, \\ \log(x_k - x_{k-1}) & \text{if } 1 < k \le K. \end{cases} \]
The inverse transform is
\[ x_k = \begin{cases} y_1 & \text{if } k = 1, \\ x_{k-1} + \exp(y_k) & \text{if } 1 < k \le K, \end{cases} \]
or, expanding the recursion,
\[ x_k = y_1 + \sum_{k'=2}^{k} \exp(y_{k'}). \]
The Jacobian of the inverse transform is lower triangular, with diagonal entries
\[ J_{k,k} = \begin{cases} 1 & \text{if } k = 1, \\ \exp(y_k) & \text{if } 1 < k \le K, \end{cases} \]
so its absolute determinant is
\[ |\det J| = \prod_{k=1}^{K} J_{k,k} = \prod_{k=2}^{K} \exp(y_k). \]
Putting this all together, if pX is the density of X, then the transformed variable Y
has density pY given by
\[ p_Y(y) = p_X\!\left(f^{-1}(y)\right) \prod_{k=2}^{K} \exp(y_k). \]
Unit simplex
A unit K-simplex is a K-vector x with non-negative entries,
\[ x_k \ge 0, \]
that sum to one,
\[ \sum_{k=1}^{K} x_k = 1. \]
An alternative definition is to take the convex closure of the vertices. For instance,
in 2-dimensions, the simplex vertices are the extreme values (0, 1), and (1, 0) and
the unit 2-simplex is the line connecting these two points; values such as (0.3, 0.7)
and (0.99, 0.01) lie on the line. In 3-dimensions, the basis is (0, 0, 1), (0, 1, 0) and
(1, 0, 0) and the unit 3-simplex is the boundary and interior of the triangle with
these vertices. Points in the 3-simplex include (0.5, 0.5, 0), (0.2, 0.7, 0.1) and all other
triplets of non-negative values summing to 1.
As these examples illustrate, the simplex always picks out a subspace of K − 1
dimensions from RK . Therefore a point x in the K-simplex is fully determined by its
first K − 1 elements x1 , x2 , . . . , xK−1 , with
\[ x_K = 1 - \sum_{k=1}^{K-1} x_k. \]
Stan's inverse transform for the unit simplex can be understood as a stick-breaking
process.
1. Start with a stick of unit length.
2. Break off a piece, label it x1, and set it aside, keeping what's left.
3. Next, break a piece off what's left, label it x2, and set it aside, keeping what's
left.
4. Continue breaking off pieces of what's left, labeling them, and setting them
aside for pieces x3, ..., x_{K−1}.
5. Label what's left x_K.
The resulting vector \( x = (x_1, \ldots, x_K)^\top \) is a unit simplex because each piece has
non-negative length and the sum of the stick lengths is one by construction.
This full inverse mapping requires the breaks to be represented as the fraction in
(0, 1) of the original stick that is broken off. These break ratios are themselves
derived from unconstrained values in (−∞, ∞) using the inverse logit transform as
described above for unidimensional variables with lower and upper bounds.
More formally, an intermediate vector \( z \in \mathbb{R}^{K-1} \), whose coordinates z_k represent the
proportion of the stick broken off in step k, is defined elementwise for 1 ≤ k < K by
\[ z_k = \text{logit}^{-1}\!\left( y_k + \log \frac{1}{K - k} \right). \]
The logit term \( \log \frac{1}{K-k} \) (i.e., \( \text{logit}\frac{1}{K-k+1} \)) in the above definition adjusts the
transform so that a zero vector y is mapped to the simplex x = (1/K, ..., 1/K). For
instance, if y1 = 0, then z1 = 1/K; if y2 = 0, then z2 = 1/(K − 1); and if y_{K−1} = 0,
then z_{K−1} = 1/2.
The break proportions z are applied to determine the stick sizes and resulting value
of xk for 1 ≤ k < K by
\[ x_k = \left( 1 - \sum_{k'=1}^{k-1} x_{k'} \right) z_k. \]
The summation term represents the length of the original stick left at stage k. This
is multiplied by the break proportion zk to yield xk . Only K − 1 unconstrained
parameters are required, with the last dimension’s value xK set to the length of the
remaining piece of the original stick,
\[ x_K = 1 - \sum_{k=1}^{K-1} x_k. \]
Because the transform has triangular structure, the absolute Jacobian determinant
of the inverse transform is the product of the diagonal terms,
\[ |\det J| = \prod_{k=1}^{K-1} \frac{\partial x_k}{\partial z_k} \, \frac{\partial z_k}{\partial y_k}, \]
where
\[ \frac{\partial z_k}{\partial y_k} = \frac{\partial}{\partial y_k} \, \text{logit}^{-1}\!\left( y_k + \log \frac{1}{K - k} \right) = z_k (1 - z_k), \]
and
\[ \frac{\partial x_k}{\partial z_k} = 1 - \sum_{k'=1}^{k-1} x_{k'}. \]
Putting this together, the density of the unconstrained variable is
\[ p_Y(y) = p_X\!\left(f^{-1}(y)\right) \prod_{k=1}^{K-1} z_k (1 - z_k) \left( 1 - \sum_{k'=1}^{k-1} x_{k'} \right). \]
The break proportions zk are defined to be the ratio of xk to the length of stick left
after the first k − 1 pieces have been broken off,
\[ z_k = \frac{x_k}{1 - \sum_{k'=1}^{k-1} x_{k'}}. \]
Unit vector
An n-dimensional unit vector x satisfies
\[ \lVert x \rVert = \sqrt{x^\top x} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = 1. \]
Stan obtains a unit vector from an unconstrained vector \( y \in \mathbb{R}^n \) by normalization,
\[ x = \frac{y}{\lVert y \rVert}. \]
The above mapping from \( \mathbb{R}^n \) to \( S^n \) is not defined at zero. While this point outcome
has measure zero during sampling, and may thus be ignored, it is the default
initialization point and thus unit vector parameters cannot be initialized at zero. A
simple workaround is to initialize from a very small interval around zero, which is
an option built into all of the Stan interfaces.
Correlation matrices
A K × K correlation matrix x must be symmetric,
\[ x_{k,k'} = x_{k',k} \]
for all k, k' ∈ {1, ..., K}, have a unit diagonal,
\[ x_{k,k} = 1 \]
for all k ∈ {1, . . . , K}, and it must be positive definite, so that for every non-zero
K-vector a,
\[ a^\top x a > 0. \]
To generate a correlation matrix from \(\binom{K}{2}\) unconstrained values, Stan first maps
each value into the range (−1, 1) with the hyperbolic tangent function,
\[ \tanh y = \frac{\exp(2y) - 1}{\exp(2y) + 1}. \]
Then, define a K × K matrix z, the upper triangular entries of which are filled with
the transformed values. For example, in the 4 × 4 case, there are \(\binom{4}{2} = 6\) values,
arranged as
\[ z = \begin{bmatrix} 0 & \tanh y_1 & \tanh y_2 & \tanh y_4 \\ 0 & 0 & \tanh y_3 & \tanh y_5 \\ 0 & 0 & 0 & \tanh y_6 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \]
Lewandowski, Kurowicka and Joe (LKJ) show how to bijectively map the array z to
a correlation matrix x. The entry \( z_{i,j} \) for i < j is interpreted as the canonical partial
correlation (CPC) between i and j, which is the correlation between i's residuals and
j's residuals when both i and j are regressed on all variables i' such that i' < i. In
the case of i = 1, there are no earlier variables, so \( z_{1,j} \) is just the Pearson correlation
between i and j.
In Stan, the LKJ transform is reformulated in terms of a Cholesky factor w of the
final correlation matrix, defined for 1 ≤ i, j ≤ K by
\[ w_{i,j} = \begin{cases} 0 & \text{if } i > j, \\ 1 & \text{if } 1 = i = j, \\ \prod_{i'=1}^{i-1} \left( 1 - z_{i',j}^{2} \right)^{1/2} & \text{if } 1 < i = j, \\ z_{i,j} & \text{if } 1 = i < j, \\ z_{i,j} \prod_{i'=1}^{i-1} \left( 1 - z_{i',j}^{2} \right)^{1/2} & \text{if } 1 < i < j. \end{cases} \]
This does not require as much computation per matrix entry as it may appear;
calculating the rows in terms of earlier rows yields the more manageable expression
\[ w_{i,j} = \begin{cases} 0 & \text{if } i > j, \\ 1 & \text{if } 1 = i = j, \\ z_{i,j} & \text{if } 1 = i < j, \\ z_{i,j} \, \dfrac{w_{i-1,j}}{z_{i-1,j}} \left( 1 - z_{i-1,j}^{2} \right)^{1/2} & \text{if } 1 < i \le j. \end{cases} \]
The final correlation matrix is
\[ x = w^\top w. \]
Lewandowski, Kurowicka, and Joe (2009) show that the determinant of the correla-
tion matrix can be defined in terms of the canonical partial correlations as
\[ \det x = \prod_{i=1}^{K-1} \prod_{j=i+1}^{K} \left( 1 - z_{i,j}^{2} \right) = \prod_{1 \le i < j \le K} \left( 1 - z_{i,j}^{2} \right). \]
The absolute Jacobian determinant of the inverse transform works out to
\[ \sqrt{\prod_{i=1}^{K-1} \prod_{j=i+1}^{K} \left( 1 - z_{i,j}^{2} \right)^{K-i-1}} \;\times\; \prod_{i=1}^{K-1} \prod_{j=i+1}^{K} \frac{\partial z_{i,j}}{\partial y_{i,j}}. \]
The inverse transform begins by computing the Cholesky factor of the correlation
matrix, w = chol(x).
The next step from the Cholesky factor w back to the array z of canonical partial
correlations (CPCs) is simplified by the ordering of the elements in the definition of
w, which when inverted yields
\[ z_{i,j} = \begin{cases} 0 & \text{if } i \ge j, \\ w_{i,j} & \text{if } 1 = i < j, \\ w_{i,j} \prod_{i'=1}^{i-1} \left( 1 - z_{i',j}^{2} \right)^{-1/2} & \text{if } 1 < i < j. \end{cases} \]
The final stage of the transform reverses the hyperbolic tangent transform, which is
defined by
\[ y = \tanh^{-1} z = \frac{1}{2} \log \frac{1 + z}{1 - z}. \]
The inverse hyperbolic tangent function, \( \tanh^{-1} \), is also called the Fisher transformation.
Covariance matrices
A covariance matrix x is expressed in terms of its lower-triangular Cholesky factor z
as
\[ x = z z^\top. \]
2 An alternative to the transform in this section, which can be coded directly in Stan, is to parameterize a covariance matrix as a scaled correlation matrix, with separate scale and correlation parameters.
The off-diagonal entries of the Cholesky factor z are unconstrained, but the diagonal
entries zk,k must be positive for 1 ≤ k ≤ K.
To complete the transform, the diagonal is log-transformed to produce a fully
unconstrained lower-triangular matrix y defined by
\[ y_{m,n} = \begin{cases} 0 & \text{if } m < n, \\ \log z_{m,m} & \text{if } m = n, \\ z_{m,n} & \text{if } m > n. \end{cases} \]
The inverse transform exponentiates the diagonal to recover z and then reconstructs
the covariance matrix as
\[ x = z z^\top. \]
The absolute Jacobian determinant of the first (diagonal exponentiation) transform is
\[ \prod_{k=1}^{K} \frac{\partial}{\partial y_{k,k}} \exp(y_{k,k}) = \prod_{k=1}^{K} \exp(y_{k,k}) = \prod_{k=1}^{K} z_{k,k}. \]
The Jacobian matrix of the second transform from the Cholesky factor z to the
covariance matrix x is also triangular, with diagonal entries corresponding to pairs
(m, n) with m ≥ n, defined by
\[ \frac{\partial}{\partial z_{m,n}} \left( z z^\top \right)_{m,n} = \frac{\partial}{\partial z_{m,n}} \sum_{k=1}^{K} z_{m,k} z_{n,k} = \begin{cases} 2\,z_{n,n} & \text{if } m = n, \\ z_{n,n} & \text{if } m > n. \end{cases} \]
The product of these diagonal entries gives the absolute Jacobian determinant of the
second transform,
\[ \prod_{m=1}^{K} \prod_{n=1}^{m} \frac{\partial}{\partial z_{m,n}} \left( z z^\top \right)_{m,n} = 2^{K} \prod_{n=1}^{K} \prod_{m=n}^{K} z_{n,n} = 2^{K} \prod_{k=1}^{K} z_{k,k}^{K-k+1}. \]
Finally, the full absolute Jacobian determinant of the inverse of the covariance
matrix transform from the unconstrained lower-triangular y to a symmetric, positive
definite matrix x is the product of the Jacobian determinants of the exponentiation
and product transforms,
\[ \left( \prod_{k=1}^{K} z_{k,k} \right) \left( 2^{K} \prod_{k=1}^{K} z_{k,k}^{K-k+1} \right) = 2^{K} \prod_{k=1}^{K} z_{k,k}^{K-k+2}. \]
If p_X is the density of X, the transformed variable Y therefore has density
\[ p_Y(y) = p_X\!\left(f^{-1}(y)\right) \, 2^{K} \prod_{k=1}^{K} z_{k,k}^{K-k+2}. \]
Cholesky factors of covariance matrices
A Cholesky factor x of a covariance matrix is lower triangular with strictly positive
diagonal entries. Only the diagonal needs to be transformed, using a logarithm, so
the unconstrained form y satisfies \( y_{n,n} = \log x_{n,n} \) and is otherwise equal to x. The
absolute Jacobian determinant of the inverse transform is
\[ \prod_{n=1}^{N} \frac{\partial}{\partial y_{n,n}} \exp(y_{n,n}) = \prod_{n=1}^{N} \exp(y_{n,n}) = \prod_{n=1}^{N} x_{n,n}, \]
so if p_X is the density of X, the transformed variable Y has density
\[ p_Y(y) = p_X\!\left(f^{-1}(y)\right) \prod_{n=1}^{N} x_{n,n}. \]
Cholesky factors of correlation matrices
A Cholesky factor L of a K × K correlation matrix Ω is lower triangular with positive
diagonal, with \( \Omega = L L^\top \). Because
\[ \Omega_{k,k} = L_k L_k^\top = 1, \]
each row vector L_k of the Cholesky factor is of unit length. The length and positivity
constraint allow the diagonal elements of L to be calculated from the off-diagonal
elements, so that a Cholesky factor for a K × K correlation matrix requires only \(\binom{K}{2}\)
unconstrained parameters.
As for correlation matrices, the unconstrained values are first mapped into (−1, 1)
with the hyperbolic tangent function and filled into a strictly lower-triangular matrix
z; for example, in the 3 × 3 case,
\[ z = \begin{bmatrix} 0 & 0 & 0 \\ \tanh y_1 & 0 & 0 \\ \tanh y_2 & \tanh y_3 & 0 \end{bmatrix}. \]
The matrix z, with entries in the range (−1, 1), is then transformed to the Cholesky
factor x, by taking3
3 For convenience, a summation with no terms, such as \( \sum_{j'<1} x_{i,j'} \), is defined to be 0. This implies \( x_{1,1} = 1 \) and that \( x_{i,1} = z_{i,1} \) for i > 1.
\[ x_{i,j} = \begin{cases} 0 & \text{if } i < j \quad \text{[above diagonal]} \\ \sqrt{1 - \sum_{j'<j} x_{i,j'}^{2}} & \text{if } i = j \quad \text{[on diagonal]} \\ z_{i,j} \sqrt{1 - \sum_{j'<j} x_{i,j'}^{2}} & \text{if } i > j \quad \text{[below diagonal]} \end{cases} \]
For example, in the 3 × 3 case,
\[ x = \begin{bmatrix} 1 & 0 & 0 \\ z_{2,1} & \sqrt{1 - x_{2,1}^{2}} & 0 \\ z_{3,1} & z_{3,2}\sqrt{1 - x_{3,1}^{2}} & \sqrt{1 - \left( x_{3,1}^{2} + x_{3,2}^{2} \right)} \end{bmatrix}. \]
The inverse transform recovers z from the Cholesky factor x by
\[ z_{i,j} = \frac{x_{i,j}}{\sqrt{1 - \sum_{j'<j} x_{i,j'}^{2}}}, \]
and then recovers y by inverting the hyperbolic tangent,
\[ y = \tanh^{-1} z = \frac{1}{2} \left( \log(1 + z) - \log(1 - z) \right). \]
The derivative of the hyperbolic tangent is
\[ \frac{d}{dy} \tanh y = \frac{1}{(\cosh y)^{2}}. \]
The Jacobian of the transform from z to x is triangular, so its absolute determinant
is the product of diagonal terms,
\[ |\det J| = \prod_{i>j} \frac{d}{dz_{i,j}} x_{i,j}, \]
where
\[ \frac{d}{dz_{i,j}} x_{i,j} = \sqrt{1 - \sum_{j'<j} x_{i,j'}^{2}}. \]
Putting this all together, if p_X is the density of X, the transformed variable Y has
density
\[ p_Y(y) = p_X\!\left(f^{-1}(y)\right) \prod_{n \le \binom{K}{2}} \frac{1}{(\cosh y_n)^{2}} \prod_{i>j} \left( 1 - \sum_{j'<j} x_{i,j'}^{2} \right)^{1/2}, \]
where \( x = f^{-1}(y) \) is used for notational convenience. The log Jacobian determinant
of the complete inverse transform \( x = f^{-1}(y) \) is given by
\[ \log |\det J| = -2 \sum_{n \le \binom{K}{2}} \log \cosh y_n + \frac{1}{2} \sum_{i>j} \log\!\left( 1 - \sum_{j'<j} x_{i,j'}^{2} \right). \]
11. Language Syntax
This chapter defines the basic syntax of the Stan modeling language using a Backus-
Naur form (BNF) grammar plus extra-grammatical constraints on function typing
and operator precedence and associativity.
Programs
<program> ::= [<function_block>] [<data_block>] [<transformed_data_block>]
[<parameters_block>] [<transformed_parameters_block>]
[<model_block>] [<generated_quantities_block>] EOF
Expressions
<expression> ::= <lhs>
| <non_lhs>
| IDIVIDE
| MODULO
| LDIVIDE
| ELTTIMES
| ELTDIVIDE
| HAT
| ELTPOW
<logicalBinOp> ::= OR
| AND
| EQUALS
| NEQUALS
| LABRACK
| LEQ
| RABRACK
| GEQ
Statements
<statement> ::= <atomic_statement>
| <nested_statement>
| PLUSASSIGN
| MINUSASSIGN
| TIMESASSIGN
| DIVIDEASSIGN
| ELTTIMESASSIGN
| ELTDIVIDEASSIGN
STRINGLITERAL = ".*"
Forms of numbers
Integer literals longer than one digit may not start with 0 and real literals cannot
consist of only a period or only an exponent.
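For example, the following declare-define statements illustrate legal literals, with illegal variants noted in comments (a sketch):

int a = 0;        // a single 0 is legal
int b = 42;       // legal; 042 would be illegal (leading 0)
real x = 12.5;
real y = 0.25;    // legal; a bare "." is not a number
real z = 2.5e-3;  // legal; a bare exponent such as "e-3" is not a number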
Conditional arguments
Both the conditional if-then-else statement and while-loop statement require the
expression denoting the condition to be a primitive type, integer or real.
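For example (a sketch):

int n = 3;
real x = 0.1;
while (n) {      // integer condition: true while n is nonzero
  n -= 1;
}
if (x > 0.05) {  // a comparison also yields an integer condition
  x = 0;
}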
Print arguments
The arguments to a print statement cannot be void.
The final free arguments to the ODE integrators, which control relative tolerance, absolute tolerance, and maximum number of steps, must be assignable to types real, real, and int, respectively.
Indexes
Standalone expressions used as indexes must denote either an integer (int) or an
integer array (array[] int). Expressions participating in range indexes (e.g., a and
b in a : b) must denote integers (int).
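For example (a sketch):

vector[10] v;
array[3] int idx = {2, 5, 7};
vector[3] u = v[idx];   // multiple indexing with an integer array
vector[4] w = v[3:6];   // range indexing; both bounds must denote ints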
A second condition is that there not be more indexes provided than dimensions of
the underlying expression (in general) or variable (on the left side of assignments)
being indexed. A vector or row vector adds 1 to the array dimension and a matrix
adds 2. That is, the type array[ , , ] matrix, a three-dimensional array of matrices, has
five index positions: three for the array, one for the row of the matrix, and one for
the column.
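For example (a sketch):

array[2, 3, 4] matrix[5, 6] m;
real e = m[1, 2, 3, 4, 5];      // three array indexes, one row, one column
matrix[5, 6] mm = m[1, 2, 3];   // fewer indexes yield a matrix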
12. Program Execution
This chapter provides a sketch of how a compiled Stan model is executed using
sampling. Optimization shares the same data reading and initialization steps, but
then does optimization rather than sampling.
This sketch is elaborated in the following chapters of this part, which cover variable
declarations, expressions, statements, and blocks in more detail.
Read data
The first step of execution is to read data into memory. Data may be read from a file (in CmdStan) or from memory (RStan and PyStan); see their respective manuals for details.1
All of the variables declared in the data block will be read. If a variable cannot be
read, the program will halt with a message indicating which data variable is missing.
After each variable is read, if it has a declared constraint, the constraint is validated.
For example, if a variable N is declared as int<lower=0>, after N is read, it will
be tested to make sure it is greater than or equal to zero. If a variable violates its
declared constraint, the program will halt with a warning message indicating which
variable contains an illegal value, the value that was read, and the constraint that
was declared.
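For example, the following data block declares constraints that are validated as each variable is read (a sketch):

data {
  int<lower=0> N;            // halts with a warning if N < 0
  vector<lower=0>[N] sigma;  // each element is checked against its bound
}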
1 Stan's interface from R, for instance, can be configured to read data from file or directly from R's memory.
Next, the statements of the transformed data block are executed to define the transformed data variables. After the statements are executed, all declared constraints on transformed data variables are validated. If the validation fails, execution halts and the variable's name, value and constraints are displayed.
12.2. Initialization
Initialization is the same for sampling, optimization, and diagnosis.
Because of the way Stan defines its transforms from the constrained to the un-
constrained space, initializing parameters on the boundaries of their constraints is
usually problematic. For instance, with a constraint
parameters {
real<lower=0, upper=1> theta;
// ...
}
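The boundary values 0 and 1 for theta correspond to −∞ and +∞ on the unconstrained log-odds scale, so an initial value of theta on the boundary fails; initial values must lie in the interior of the constrained interval.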
12.3. Sampling
Sampling is based on simulating the Hamiltonian of a particle with a starting position
equal to the current parameter values and an initial momentum (kinetic energy)
generated randomly. The potential energy at work on the particle is taken to be the
negative log (unnormalized) total probability function defined by the model. In the
usual approach to implementing HMC, the Hamiltonian dynamics of the particle is
simulated using the leapfrog integrator, which discretizes the smooth path of the
particle into a number of small time steps called leapfrog steps.
Leapfrog steps
For each leapfrog step, the negative log probability function and its gradient need
to be evaluated at the position corresponding to the current parameter values (a
more detailed sketch is provided in the next section). These are used to update the
momentum based on the gradient and the position based on the momentum.
For simple models, only a few leapfrog steps with large step sizes are needed. For
models with complex posterior geometries, many small leapfrog steps may be needed
to accurately model the path of the parameters.
If the user specifies the number of leapfrog steps (i.e., chooses to use standard HMC),
that number of leapfrog steps are simulated. If the user has not specified the number
of leapfrog steps, the No-U-Turn sampler (NUTS) will determine the number of
leapfrog steps adaptively (Hoffman and Gelman 2014).
To compute gradients, Stan uses reverse-mode automatic differentiation: evaluating the log density builds an expression tree that includes all of the transformed parameter operations and all of the Jacobian adjustments. This tree is then used to evaluate the gradients by propagating partial derivatives backward along the expression graph. The gradient calculations account for the majority of the cycles consumed by a Stan program.
Metropolis accept/reject
A standard Metropolis accept/reject step is required to retain detailed balance and
ensure draws are marginally distributed according to the probability function defined
by the model. This Metropolis adjustment is based on comparing log probabilities,
here defined by the Hamiltonian, which is the sum of the potential (negative log
probability) and kinetic (squared momentum) energies. In theory, the Hamiltonian
is invariant over the path of the particle and rejection should never occur. In
practice, the probability of rejection is determined by the accuracy of the leapfrog
approximation to the true trajectory of the parameters.
If step sizes are small, very few updates will be rejected, but many steps will be
required to move the same distance. If step sizes are large, more updates will be
rejected, but fewer steps will be required to move the same distance. Thus a balance
between effort and rejection rate is required. If the user has not specified a step size,
Stan will tune the step size during warmup sampling to achieve a desired rejection
rate (thus balancing rejection versus number of steps).
If the proposal is accepted, the parameters are updated to their new values. Other-
wise, the sample is the current set of parameter values.
12.4. Optimization
Optimization runs very much like sampling in that it starts by reading the data and
then initializing parameters. Unlike sampling, it produces a deterministic output
which requires no further analysis other than to verify that the optimizer itself
converged to a posterior mode. The output for optimization is also similar to that
for sampling.
Posterior analysis tools that work on the result of Stan's sampling routines can also be used for variational inference.
12.7. Output
For each final draw (not counting draws during warmup or draws that are thinned),
there is an output stage of writing the draw.
Generated quantities
Before generating any output, the statements in the generated quantities block are
executed. This can be used for any forward simulation based on parameters of the
model. Or it may be used to transform parameters to an appropriate form for output.
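For example, a posterior predictive simulation might be coded along the following lines (a sketch assuming a model with parameters mu and sigma and data size N):

generated quantities {
  array[N] real y_rep;
  for (n in 1:N) {
    y_rep[n] = normal_rng(mu, sigma);  // forward simulation given the current draw
  }
}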
After the generated quantities statements execute, the constraints declared on gener-
ated quantities variables are validated. If these constraints are violated, the program
will terminate with a diagnostic message.
Write
The final step is to write the actual values. The values of all variables declared as
parameters, transformed parameters, or generated quantities are written. Local
variables are not written, nor is the data or transformed data. All values are written
in their constrained forms, that is, the form used in the model definition.
In the executable form of a Stan model, parameters, transformed parameters,
and generated quantities are written to a file in comma-separated value (CSV)
notation with a header defining the names of the parameters (including indices for
multivariate parameters).2
2 In the R version of Stan, the values may either be written to a CSV file or directly back to R’s memory.
13. Deprecated Features
This chapter lists currently deprecated functionality along with how to replace it.
These deprecated features are likely to be removed in the next major release.
Replacement: The new syntax uses the operator = for assignment in place of the deprecated <-, e.g.,

a = b;
Replacement: Replace the _log suffix with _lpdf for density functions or _lpmf for
mass functions in the user-defined function.
Replacement: Replace the deprecated lkj_cov distribution, e.g., Sigma ~ lkj_cov(mu, tau, eta);, with
corr_matrix[K] Omega;
vector<lower=0>[K] sigma;
// ...
Omega ~ lkj_corr(eta);
sigma ~ lognormal(mu, tau);
// ...
cov_matrix[K] Sigma;
Sigma = quad_form_diag(Omega, sigma);
The variable Sigma may be defined as a local variable in the model block or as a
transformed parameter. An even more efficient transform would use Cholesky factors
rather than full correlation matrix types.
Replacement: The conditional operator, replacing x = if_else(a, b, c); with

x = a ? b : c;
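Deprecated: array declarations that place brackets after the variable name, as in the old syntax

int n[5];
real a[3, 4];
real<lower=0> z[5, 4, 2];
vector[7] mu[3];
matrix[7, 2] mu[15, 12];
cholesky_factor_cov[5, 6] mu[2, 3, 4];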
Replacement: The use of the array keyword, which replaces the above examples
with
array[5] int n;
array[3, 4] real a;
array[5, 4, 2] real<lower=0> z;
array[3] vector[7] mu;
array[15, 12] matrix[7, 2] mu;
array[2, 3, 4] cholesky_factor_cov[5, 6] mu;
14. Removed Features
This chapter lists functionality that was once present in the language but has since
been removed, along with how to replace it.
Algorithms
This part of the manual specifies the inference algorithms and posterior inference
tools.
15. MCMC Sampling
This chapter presents the two Markov chain Monte Carlo (MCMC) algorithms used
in Stan, the Hamiltonian Monte Carlo (HMC) algorithm and its adaptive variant
the no-U-turn sampler (NUTS), along with details of their implementation and
configuration.
Target density
The goal of sampling is to draw from a density p(θ) for parameters θ. This is typically
a Bayesian posterior p(θ|y) given data y, and in particular, a Bayesian posterior
coded as a Stan program.
Hamiltonian Monte Carlo augments the parameters θ with auxiliary momentum variables ρ and draws from the joint density

p(ρ, θ) = p(ρ|θ) p(θ).

In Stan, the momentum is drawn independently of the current parameter values from a multivariate normal distribution,

ρ ∼ MultiNormal(0, M),

where M is the mass matrix (the metric).
The Hamiltonian
The joint density p(ρ, θ) defines a Hamiltonian

H(ρ, θ) = −log p(ρ, θ) = T(ρ|θ) + V(θ),

where the term T(ρ|θ) = −log p(ρ|θ) is called the “kinetic energy” and the term

V(θ) = −log p(θ)

is called the “potential energy.” The potential energy is specified by the Stan program
through its definition of a log density.
Generating transitions
Starting from the current value of the parameters θ, a transition to a new state is
generated in two stages before being subjected to a Metropolis accept step.
First, a value for the momentum is drawn independently of the current parameter
values,
ρ ∼ MultiNormal(0, M ).
Second, the joint system (θ, ρ) evolves according to Hamilton's equations,

dθ/dt = +∂H/∂ρ = +∂T/∂ρ
dρ/dt = −∂H/∂θ = −∂T/∂θ − ∂V/∂θ.
With the momentum density being independent of the target density, i.e., p(ρ|θ) =
p(ρ), the first term in the momentum time derivative, ∂T/∂θ, is zero, yielding the
pair of time derivatives

dθ/dt = +∂T/∂ρ
dρ/dt = −∂V/∂θ.
Leapfrog integrator
The last section leaves a two-state differential equation to solve. Stan, like most other
HMC implementations, uses the leapfrog integrator, which is a numerical integration
algorithm that’s specifically adapted to provide stable results for Hamiltonian systems
of equations.
Like most numerical integrators, the leapfrog algorithm takes discrete steps of some
small time interval ε. The leapfrog algorithm begins by drawing a fresh momentum
term independently of the parameter values θ or previous momentum value.
ρ ∼ MultiNormal(0, M ).
It then alternates half-step updates of the momentum and full-step updates of the
position.
ρ ← ρ − (ε/2) ∂V/∂θ
θ ← θ + ε M⁻¹ ρ
ρ ← ρ − (ε/2) ∂V/∂θ.
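For instance, with a standard normal target, V(θ) = θ²/2 so that ∂V/∂θ = θ, unit metric M = 1, step size ε = 0.1, and current state (θ, ρ) = (1, 0.5), a single leapfrog step computes

ρ ← 0.5 − 0.05 · 1 = 0.45
θ ← 1 + 0.1 · 0.45 = 1.045
ρ ← 0.45 − 0.05 · 1.045 ≈ 0.398.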
If the proposal is not accepted, the previous parameter value is returned for the next
draw and used to initialize the next iteration.
Algorithm summary
The Hamiltonian Monte Carlo algorithm starts at a specified initial set of parameters
θ; in Stan, this value is either user-specified or generated randomly. Then, for a given
number of iterations, a new momentum vector is sampled and the current value of
the parameter θ is updated using the leapfrog integrator with discretization time ε
and number of steps L according to the Hamiltonian dynamics. Then a Metropolis
acceptance step is applied, and a decision is made whether to update to the new
state (θ∗, ρ∗) or keep the existing state.
If L is too small, the trajectory traced out in each iteration will be too short and
sampling will devolve to a random walk. If L is too large, the algorithm will do too
much work on each iteration.
If the inverse metric M⁻¹ is a poor estimate of the posterior covariance, the step size ε
must be kept small to maintain arithmetic precision. This would lead to a large L
to compensate.
Integration time
The actual integration time is L ε, a function of the number of steps. Some interfaces to
Stan set an approximate integration time t and the discretization interval (step size)
ε. In these cases, the number of steps is rounded down as

L = ⌊t / ε⌋.
When adaptation is engaged (it may be turned off by fixing a step size and metric),
the warmup period is split into three stages, as illustrated in the warmup adaptation
figure, with two fast intervals surrounding a series of growing slow intervals. Here
fast and slow refer to parameters that adapt using local and global information,
respectively; the Hamiltonian Monte Carlo samplers, for example, define the step
size as a fast parameter and the (co)variance as a slow parameter. The size of the
initial and final fast intervals and the initial size of the slow interval are all
customizable, although user-specified values may be modified slightly in order to
ensure alignment with the warmup period.
The motivation behind this partitioning of the warmup period is to allow for more
robust adaptation. The stages are as follows.
1. In the initial fast interval the chain is allowed to converge towards the typical
set,1 with only parameters that can learn from local information adapted.
2. After this initial stage parameters that require global information, for example
(co)variances, are estimated in a series of expanding, memoryless windows;
often fast parameters will be adapted here as well.
3. Lastly, the fast parameters are allowed to adapt to the final update of the slow
parameters.
These intervals may be controlled through the following configuration parameters,
all of which must be positive integers:
Adaptation Parameters Table. The parameters controlling adaptation and their
default values.
1 The typical set is a concept borrowed from information theory and refers to the neighborhood (or
neighborhoods in multimodal models) of substantial posterior probability mass through which the Markov
chain will travel in equilibrium.
By setting the target acceptance parameter δ to a value closer to 1 (its value must
be strictly less than 1 and its default value is 0.8), adaptation will be forced to use
smaller step sizes. This can improve sampling efficiency (effective sample size per
iteration) at the cost of increased iteration times. Raising the value of δ will also
allow some models that would otherwise get stuck to overcome their blockages.
Step-size jitter
All implementations of HMC use numerical integrators requiring a step size (equiva-
lently, discretization time interval). Stan allows the step size to be adapted or set
explicitly. Stan also allows the step size to be “jittered” randomly during sampling to
avoid any poor interactions with a fixed step size and regions of high curvature. The
jitter is a proportion that may be added or subtracted, so the maximum amount of
jitter is 1, which will cause step sizes to be selected in the range of 0 to twice the
adapted step size. The default value is 0, producing no jitter.
2 This optimization of step size during adaptation of the sampler should not be confused with running Stan's optimization method.
Small step sizes can free HMC samplers that would otherwise get stuck
at larger step sizes. The downside is that jittering below the adapted value
will increase the number of leapfrog steps required and thus slow down iterations,
whereas jittering above the adapted value can cause premature rejection due to
simulation error in the Hamiltonian dynamics calculation. See Neal (2011) for
further discussion of step-size jittering.
Euclidean metric
All HMC implementations in Stan utilize quadratic kinetic energy functions which
are specified up to the choice of a symmetric, positive-definite matrix known as a
mass matrix or, more formally, a metric (Betancourt 2017).
If the metric is constant then the resulting implementation is known as Euclidean
HMC. Stan allows a choice among three Euclidean HMC implementations,
• a unit metric (diagonal matrix of ones),
• a diagonal metric (diagonal matrix with positive diagonal entries), and
• a dense metric (a dense, symmetric positive definite matrix)
to be configured by the user.
If the metric is specified to be diagonal, then regularized variances are estimated
based on the iterations in each slow-stage block (labeled II in the warmup adaptation
stages figure). Each of these estimates is based only on the iterations in that block.
This allows early estimates to be used to help guide warmup and then be forgotten
later so that they do not influence the final covariance estimate.
If the metric is specified to be dense, then regularized covariance estimates will be
carried out, regularizing the estimate to a diagonal matrix, which is itself regularized
toward a unit matrix.
Variances or covariances are estimated using Welford accumulators to avoid a loss of
precision over many floating point operations.
The metric can compensate for linear (i.e. global) correlations in the posterior which
can dramatically improve the performance of HMC in some problems. This requires
knowing the global correlations.
In complex models, the global correlations are usually difficult, if not impossible, to
derive analytically; for example, nonlinear model components convolve the scales of
the data, so standardizing the data does not always help. Therefore, Stan estimates
these correlations online with an adaptive warmup. In models with strong nonlinear
(i.e. local) correlations this learning can be slow, even with regularization. This is
ultimately why warmup in Stan often needs to be so long, and why a sufficiently
long warmup can yield such substantial performance improvements.
Nonlinearity
Statistical models for which sampling is problematic are not typically dominated
by linear correlations for which a dense metric can adjust. Rather, they are gov-
erned by more complex nonlinear correlations that are best tackled with better
parameterizations or more advanced algorithms, such as Riemannian HMC.
Poor behavior in the tails is the kind of pathology that can be uncovered by running
only a few warmup iterations. Looking at the acceptance probabilities and
step sizes of the first few iterations provides an idea of how bad the problem is
and whether it must be addressed with modeling efforts such as tighter priors or
reparameterizations.
The sampler reports the number of leapfrog steps taken and the tree depth, respectively. Because the final subtree may only be partially constructed, these two will
always satisfy

2^(treedepth − 1) − 1 < N_leapfrog ≤ 2^treedepth − 1.
Tree depth is an important diagnostic tool for NUTS. For example, a tree depth of
zero occurs when the first leapfrog step is immediately rejected and the initial state
returned, indicating extreme curvature and poorly-chosen step size (at least relative
to the current position). On the other hand, a tree depth equal to the maximum
depth indicates that NUTS is taking many leapfrog steps and being terminated
prematurely to avoid excessively long execution time. Taking very many steps may
be a sign of poor adaptation, may be due to targeting a very high acceptance rate, or
may simply indicate a difficult posterior from which to sample. In the latter case,
reparameterization may help with efficiency. But in the rare cases where the model
is correctly specified and a large number of steps is necessary, the maximum depth
should be increased to ensure that the NUTS tree can grow as large as necessary.
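Some Stan programs declare no parameters at all, with all random quantities produced by RNG calls in the generated quantities block; a minimal sketch, with theta supplied as data, is

data {
  real<lower=0, upper=1> theta;
}
generated quantities {
  int y_sim = bernoulli_rng(theta);  // pure forward simulation; no parameters
}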
For this model, the sampler must be configured to use the fixed-parameters setting
because there are no parameters. Without parameter sampling there is no need for
adaptation and the number of warmup iterations should be set to zero.
Most models that are written to be sampled without parameters will not declare any
parameters, instead putting anything parameter-like in the data block. Nevertheless,
it is possible to include parameters for fixed-parameters sampling and initialize them
in any of the usual ways (randomly, fixed to zero on the unconstrained scale, or with
user-specified values). For example, theta in the example above could be declared
as a parameter and initialized like any other parameter.
Replication
Together, the seed and chain identifier determine the behavior of the underlying
random number generator. For complete reproducibility, every aspect of the environ-
ment needs to be locked down from the OS and version to the C++ compiler and
version to the version of Stan and all dependent libraries.
Initialization
The initial parameter values for Stan’s algorithms (MCMC, optimization, or diag-
nostic) may be either specified by the user or generated randomly. If user-specified
values are provided, all parameters must be given initial values or Stan will abort
with an error message.
User-defined initialization
If the user specifies initial values, they must satisfy the constraints declared in the
model (i.e., they are on the constrained scale).
Random initialization by default initializes the parameter values with values drawn
at random from a Uniform(−2, 2) distribution. Alternatively, a value other than 2
may be specified for the absolute bounds. These values are on the unconstrained
scale, so must be inverse transformed back to satisfy the constraints declared for
parameters.
Because zero is chosen to be a reasonable default initial value for most parameters,
the interval around zero provides a fairly diffuse starting point. For instance, un-
constrained variables are initialized randomly in (−2, 2), variables constrained to be
positive are initialized roughly in (0.14, 7.4), variables constrained to fall between 0
and 1 are initialized with values roughly in (0.12, 0.88).
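These intervals follow from the constraining transforms: exp(−2) ≈ 0.14 and exp(2) ≈ 7.4 for the log transform of positive variables, and logit⁻¹(−2) ≈ 0.12 and logit⁻¹(2) ≈ 0.88 for the log-odds transform of (0, 1)-constrained variables.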
properly, the divergences will be around 10⁻⁷ and do not compound due to the symplectic nature of the
leapfrog integrator.
imprecise; both are defined by the way they cause traditional stepwise algorithms to
diverge from where they should be.
The primary cause of divergent transitions in Euclidean HMC (other than bugs in
the code) is highly varying posterior curvature, for which small step sizes are too
inefficient in some regions and diverge in other regions. If the step size is too small,
the sampler becomes inefficient and halts before making a U-turn (hits the maximum
tree depth in NUTS); if the step size is too large, the Hamiltonian simulation diverges.
16. Posterior Analysis

A Markov chain is a sequence of random variables θ^(1), θ^(2), … whose joint density factors as

p(θ^(1), …, θ^(N)) = p(θ^(1)) ∏_{n=2}^{N} p(θ^(n) | θ^(n−1)).
Stan uses Hamiltonian Monte Carlo to generate a next state in a manner described
in the Hamiltonian Monte Carlo chapter.
The Markov chains Stan and other MCMC samplers generate are ergodic in the sense
required by the Markov chain central limit theorem, meaning roughly that there is a
reasonable chance of reaching one value of θ from another. The Markov chains are
also stationary, meaning that the transition probabilities do not change at different
positions in the chain, so that for n, n′ ≥ 0, the probability function p(θ^(n+1) | θ^(n)) is
the same as p(θ^(n′+1) | θ^(n′)) (following the convention of overloading random and
bound variables and picking out a probability function by its arguments).
Stationary Markov chains have an equilibrium distribution on states in which each
has the same marginal probability function, so that p(θ(n) ) is the same probability
function as p(θ(n+1) ). In Stan, this equilibrium distribution p(θ(n) ) is the target
density p(θ) defined by a Stan program, which is typically a proper Bayesian posterior
density p(θ|y) defined on the log scale up to a constant.
Using MCMC methods introduces two difficulties that are not faced by independent
sample Monte Carlo methods. The first problem is determining when a randomly
initialized Markov chain has converged to its equilibrium distribution. The second
problem is that the draws from a Markov chain may be correlated or even anti-
correlated, and thus the central limit theorem’s bound on estimation error no longer
applies. These problems are addressed in the next two sections.
Stan’s posterior analysis tools compute a number of summary statistics, estimates,
and diagnostics for Markov chain Monte Carlo (MCMC) samples. Stan’s estimators
and diagnostics are more robust in the face of non-convergence, antithetical sampling,
and long-term Markov chain correlations than most of the other tools available. The
algorithms Stan uses to achieve this are described in this chapter.
16.2. Convergence
By definition, a Markov chain generates samples from the target distribution only
after it has converged to equilibrium (i.e., equilibrium is defined as being achieved
when p(θ(n) ) is the target density). The following point cannot be expressed strongly
enough:
• In theory, convergence is only guaranteed asymptotically as the number of draws
grows without bound.
• In practice, diagnostics must be applied to monitor convergence for the finite
number of draws actually available.
Potential scale reduction

Suppose there are M chains, each with N draws θ_m^(n). The between-chain variance estimate is

B = (N / (M − 1)) ∑_{m=1}^{M} (θ̄_m^(•) − θ̄_•^(•))²,

where

θ̄_m^(•) = (1/N) ∑_{n=1}^{N} θ_m^(n)

and

θ̄_•^(•) = (1/M) ∑_{m=1}^{M} θ̄_m^(•).

The within-chain variance is averaged over the chains,

W = (1/M) ∑_{m=1}^{M} s_m²,

where

s_m² = (1/(N − 1)) ∑_{n=1}^{N} (θ_m^(n) − θ̄_m^(•))².

The variance estimator is a mixture of the within-chain and cross-chain sample variances,

var̂⁺(θ|y) = ((N − 1)/N) W + (1/N) B.

Finally, the potential scale reduction statistic is defined by

R̂ = √( var̂⁺(θ|y) / W ).
Before Stan calculates the potential-scale-reduction statistic R̂, each chain is split
into two halves. This provides an additional means to detect non-stationarity in
the individual chains. If one chain involves gradually increasing values and one
involves gradually decreasing values, they have not mixed well, but they can have R̂
values near unity. In this case, splitting each chain into two parts leads to R̂ values
substantially greater than 1 because the first half of each chain has not mixed with
the second half.
Convergence is global
A question that often arises is whether it is acceptable to monitor convergence of
only a subset of the parameters or generated quantities. The short answer is “no,”
but this is elaborated further in this section.
For example, consider the value lp__, which is the log posterior density (up to a
constant).3
It is thus a mistake to declare convergence in any practical sense if lp__ has not
converged, because different chains are really in different parts of the space. Yet
measuring convergence for lp__ is particularly tricky, as noted below.
3 The lp__ value also represents the potential energy in the Hamiltonian system and is rate bounded
by the randomly supplied kinetic energy each iteration, which follows a Chi-square distribution in the
number of parameters.
Markov chain convergence is a global property in the sense that it does not depend
on the choice of function of the parameters that is monitored. There is no hard
cutoff between pre-convergence “transience” and post-convergence “equilibrium.”
What happens is that as the number of states in the chain approaches infinity, the
distribution of possible states in the chain approaches the target distribution and in
that limit the expected value of the Monte Carlo estimator of any integrable function
converges to the true expectation. There is nothing like warmup here, because in
the limit, the effects of initial state are completely washed out.
The R̂ statistic considers the composition of a Markov chain and a function, and if
the Markov chain has converged then each Markov chain and function composition
will have converged. Multivariate functions converge when all of their margins have
converged, by the Cramér-Wold theorem.
The transformation from unconstrained space to constrained space is just another
function, so it does not affect convergence.
Different functions may have different autocorrelations, but if the Markov chain has
equilibrated then all Markov chain plus function compositions should be consistent
with convergence. Formally, any function that appears inconsistent is of concern and
although it would be unreasonable to test every function, lp__ and other measured
quantities should at least be consistent.
The obvious difference in lp__ is that it tends to vary quickly with position and is
consequently susceptible to outliers.
The question is what happens for finite numbers of states? If we can prove a
strong geometric ergodicity property (which depends on the sampler and the target
distribution), then one can show that there exists a finite time after which the chain
forgets its initial state with a large probability. This is both the autocorrelation time
and the warmup time. But even if you can show it exists and is finite (which is nigh
impossible) you can’t compute an actual value analytically.
So what we do in practice is hope that the finite number of draws is large enough for
the expectations to be reasonably accurate. Removing warmup iterations improves
the accuracy of the expectations but there is no guarantee that removing any finite
number of samples will be enough.
Firstly, as noted above, for any finite number of draws, there will always be some
residual effect of the initial state, which typically manifests as some small (or large
if the autocorrelation time is huge) probability of having a large outlier. Functions
robust to such outliers (say, quantiles) will appear more stable and have better R̂.
Functions vulnerable to such outliers may show fragility.
Secondly, use of the R̂ statistic makes very strong assumptions. In particular, it
assumes that the functions being considered are Gaussian or it only uses the first
two moments and assumes some kind of independence. The point is that strong
assumptions are made that do not always hold. In particular, the distribution for the
log posterior density (lp__) almost never looks Gaussian, instead it features long
tails that can lead to large R̂ even in the large N limit. Tweaks to R̂, such as using
quantiles in place of raw values, have the flavor of making the samples of interest
more Gaussian and hence the R̂ statistic more accurate.
“Convergence” is a global property and holds for all integrable functions at once, but
employing the R̂ statistic requires additional assumptions and thus may not work
for all functions equally well.
Note that if you just compare the expectations between chains then we can rely on
the Markov chain asymptotics for Gaussian distributions and can apply the standard
tests.
Stan estimates an effective sample size for each parameter, which plays the role in
the Markov chain Monte Carlo central limit theorem (MCMC CLT) that the number of
independent draws plays in the standard central limit theorem (CLT).
Unlike most packages, the particular calculations used by Stan follow those for
split-R̂, which involve both cross-chain (mean) and within-chain calculations (auto-
correlation); see Gelman et al. (2013).
This is the correlation between draws in the chain offset by t positions (i.e., a lag in
time-series terminology). Because we know θ^(n) and θ^(n+t) have the same marginal
distribution in an MCMC setting, multiplying the two difference terms and reducing
yields
ρ_t = (1/σ²) ∫_Θ θ^(n) θ^(n+t) p(θ) dθ − µ²/σ².
The effective sample size N_eff can be larger than N in the case of antithetic Markov chains,
which have negative autocorrelations on odd lags. The no-U-turn sampling (NUTS)
algorithm used in Stan can produce N_eff > N for parameters which have close to
Gaussian posteriors and little dependency on other parameters.
The autocorrelation estimate combines the within-chain autocorrelation estimates ρ̂_{t,m} with the multi-chain variance estimate,

ρ̂_t = 1 − (W − (1/M) ∑_{m=1}^{M} s_m² ρ̂_{t,m}) / var̂⁺.
If the chains have not converged, the variance estimator var̂⁺ will overestimate variance,
leading to an overestimate of autocorrelation and an underestimate of effective
sample size.
Because of the noise in the correlation estimates ρ̂t as t increases, a typical truncated
sum of ρ̂t is used. Negative autocorrelations may occur only on odd lags and by
summing over pairs starting from lag 0, the paired autocorrelation is guaranteed to
be positive, monotone and convex modulo estimator noise (Geyer 1992, 2011). Stan uses Geyer's initial monotone sequence criterion. The
effective sample size estimator is defined as
N̂_eff = M · N / τ̂,
where
τ̂ = 1 + 2 ∑_{t=1}^{2m+1} ρ̂_t = −1 + 2 ∑_{t′=0}^{m} P̂_{t′},

where P̂_{t′} = ρ̂_{2t′} + ρ̂_{2t′+1}. The initial positive sequence estimator is obtained by choosing
the largest m such that P̂_{t′} > 0 for all t′ = 1, …, m. The initial monotone sequence is
obtained by further reducing P̂_{t′} to the minimum of the preceding ones so that the
estimated sequence is monotone.
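For example, if the autocorrelation estimates begin ρ̂ = (1.0, 0.5, 0.3, −0.1, −0.35, 0.2, …), the paired sums are P̂_0 = 1.0 + 0.5 = 1.5, P̂_1 = 0.3 − 0.1 = 0.2, and P̂_2 = −0.35 + 0.2 = −0.15; the first non-positive pair truncates the sum at m = 1, giving τ̂ = −1 + 2 (1.5 + 0.2) = 2.4 and hence N̂_eff = M · N / 2.4.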
The previous section showed how to estimate N_eff for a parameter θ_n based on
multiple chains of posterior draws. The mean of the posterior draws of θ_n is used as
the point estimate; its Monte Carlo standard error is the posterior standard deviation
scaled down by the square root of the effective sample size, sd(θ_n) / √N̂_eff.
Thinning samples
In the typical situation, the autocorrelation, ρt , decreases as the lag, t, increases.
When this happens, thinning the samples will reduce the autocorrelation.
For instance, consider generating one thousand posterior draws in one of the follow-
ing two ways.
• Generate 1000 draws after convergence and save all of them.
• Generate 10,000 draws after convergence and save every tenth draw.
Even though both produce a sample consisting of one thousand draws, the second
approach with thinning can produce a higher effective sample size. That's because
the autocorrelation ρt for the thinned sequence is equivalent to ρ10t in the unthinned
sequence, so the sum of the autocorrelations will be lower and thus the effective
sample size higher.
Now contrast the second approach above with the unthinned alternative,
• Generate 10,000 draws after convergence and save every draw.
This will have a higher effective sample size than the thinned sample consisting of every
tenth draw. It should therefore be emphasized that the only reason to thin a sample
is to reduce memory requirements.
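To make the trade-off concrete, suppose the autocorrelation decays as ρ_t = 0.9^t. Then τ = 1 + 2 ∑_{t≥1} 0.9^t = 19, so 10,000 saved draws yield N̂_eff ≈ 10,000/19 ≈ 526 and 1,000 saved draws yield only about 53. Thinning by 10 gives ρ′_t = 0.9^(10t) ≈ 0.35^t with τ′ ≈ 2.07, so 1,000 thinned draws yield N̂_eff ≈ 483: far better than 1,000 unthinned draws, but still fewer than keeping all 10,000.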
17. Optimization
Stan provides optimization algorithms which find modes of the density specified by
a Stan program. Such modes may be used as parameter estimates or as the basis of
approximations to a Bayesian posterior.
Stan provides three different optimizers: a Newton optimizer and two related quasi-
Newton algorithms, BFGS and L-BFGS; see Nocedal and Wright (2006) for a thorough
description and analysis of all of these algorithms. The L-BFGS algorithm is the
default optimizer. Newton's method is the least efficient of the three, but has the
advantage of setting its own step size.
Parameter convergence

The parameters θᵢ in iteration i are considered to have converged with respect to tolerance tol_param if

‖θᵢ − θᵢ₋₁‖ < tol_param.
Density convergence

The (unnormalized) log density log p(θᵢ|y) for the parameters θᵢ in iteration i given
data y is considered to have converged with respect to tolerance tol_obj if

|log p(θᵢ|y) − log p(θᵢ₋₁|y)| < tol_obj.
Gradient convergence

The gradient is considered to have converged to 0 relative to a specified tolerance tol_rel_grad if

gᵢ⊤ Ĥᵢ⁻¹ gᵢ / max(|log p(θᵢ|y)|, 1.0) < tol_rel_grad · ε,

where Ĥᵢ is the estimate of the Hessian at iteration i, |u| is the absolute value (L1
norm) of u, ‖u‖ is the vector length (L2 norm) of u, and ε ≈ 2e−16 is machine
precision.
The initial step size α for the quasi-Newton optimizers may be too large or too small depending on the objective function and initialization. Being
too big or too small just means that the first iteration will take longer (i.e., require
more gradient evaluations) before the line search finds a good step length. It’s not a
critical parameter, but for optimizing the same model multiple times (as you tweak
things or with different data), being able to tune α can save some real time.
and using the default random initialization, which is Uniform(−2, 2) on the uncon-
strained scale, means that there is only a 2^(−M) chance that the initialization will be
within support.
For any given optimization problem, it is probably worthwhile trying the program
both ways, with and without the constraint, to see which one is more efficient.
18. Variational Inference
Stan implements an automatic variational inference algorithm, called Automatic
Differentiation Variational Inference (ADVI) (Kucukelbir et al. 2015). In this chapter,
we describe the specifics of how ADVI maximizes the variational objective.
Assessing convergence
ADVI tracks the progression of the ELBO through the stochastic optimization. Specif-
ically, ADVI heuristically determines a rolling window over which it computes the
average and the median change of the ELBO. Should either number fall below a
threshold, denoted by tol_rel_obj, we consider the algorithm to have converged.
The change in ELBO is calculated the same way as in Stan’s optimization module.
19. Diagnostic Mode
Stan’s diagnostic mode runs a Stan program with data, initializing parameters either
randomly or with user-specified initial values, and then evaluates the log probability
and its gradients. The gradients computed by the Stan program are compared to
values calculated by finite differences.
Diagnostic mode may be configured with two parameters.
Diagnostic Mode Configuration Table. The diagnostic mode configuration parameters, constraints, and default values.
If the difference between the Stan program’s gradient value and that calculated by
finite difference is higher than the specified threshold, the argument will be flagged.
Unconstrained scale
The output of the variable values and their gradients is on the unconstrained
scale, which means each variable is represented as a vector whose size corresponds to
the number of unconstrained variables required to define it. For example, an N × N
correlation matrix requires (N choose 2) unconstrained parameters. The transformations from
constrained to unconstrained forms are described in the chapter on constraint transforms.
Includes Jacobian
The log density includes the Jacobian adjustment implied by the constraints declared
on variables. The Jacobian adjustment for constrained parameter transforms will be
turned off if optimization is used in practice, but there is as yet no way to turn it
off in diagnostic mode.
20. Reproducibility
Floating point operations on modern computers are notoriously difficult to replicate
because the fundamental arithmetic operations, right down to the IEEE 754 encoding
level, are not fully specified. The primary problem is that the precision of operations
varies across different hardware platforms and software implementations.
Stan is designed to allow full reproducibility. However, this is only possible up to the
external constraints imposed by floating point arithmetic.
Stan results will only be exactly reproducible if all of the following components are
identical:
• Stan version
• Stan interface (RStan, PyStan, CmdStan) and version, plus version of interface
language (R, Python, shell)
• versions of included libraries (Boost and Eigen)
• operating system version
• computer hardware including CPU, motherboard and memory
• C++ compiler, including version, compiler flags, and linked libraries
• same configuration of call to Stan, including random seed, chain ID, initializa-
tion and data
It doesn’t matter if you use a stable release version of Stan or the version with a
particular Git hash tag. The same goes for all of the interfaces, compilers, and so on.
The point is that if any of these moving parts changes in some way, floating point
results may change.
Concretely, if you compile a single Stan program using the same CmdStan code
base but change the optimization flag (-O3 vs. -O2 or -O0), the two programs
may not return the identical stream of results. Thus it is very hard to guarantee
reproducibility on externally managed hardware, like in a cluster or even a desktop
managed by an IT department or with automatic updates turned on.
If, however, you compiled a Stan program today using one set of flags, took the
computer away from the internet and didn’t allow it to update anything, then came
back in a decade and recompiled the Stan program in the same way, you should get
the same results.
The data needs to be the same down to the bit level. For example, if you are running
in RStan, Rcpp handles the conversion between R’s floating point numbers and C++
doubles. If Rcpp changes the conversion process or uses different types, the results
are not guaranteed to be the same down to the bit level.
The compiler and compiler settings can also be an issue. There is a nice discussion
of the issues and how to control reproducibility in Intel’s proprietary compiler by
Corden and Kreitzer (2014).
21. Licenses and Dependencies
Stan and its dependent libraries, are distributed under generous, freedom-respecting
licenses approved by the Open Source Initiative.
In particular, the licenses for Stan and its dependent libraries have no “copyleft”
provisions requiring applications of Stan to be open source if they are redistributed.
This chapter specifies the licenses for the libraries on which Stan’s math library,
language, and algorithms depend. The last tool mentioned, Google Test, is only used
for testing and is not needed to run Stan.
References
Kucukelbir, Alp, Rajesh Ranganath, Andrew Gelman, and David M. Blei. 2015.
“Automatic Variational Inference in Stan.” arXiv 1506.03431. https://arxiv.org/abs/1506.03431.
Leimkuhler, Benedict, and Sebastian Reich. 2004. Simulating Hamiltonian Dynamics.
Cambridge: Cambridge University Press.
Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random
Correlation Matrices Based on Vines and Extended Onion Method.” Journal of
Multivariate Analysis 100: 1989–2001.
Marsaglia, George. 1972. “Choosing a Point from the Surface of a Sphere.” The
Annals of Mathematical Statistics 43 (2): 645–46.
Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. 1953. “Equation
of State Calculations by Fast Computing Machines.” Journal of Chemical
Physics 21: 1087–92.
Neal, Radford. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov
Chain Monte Carlo, edited by Steve Brooks, Andrew Gelman, Galin L. Jones, and
Xiao-Li Meng, 116–62. Chapman; Hall/CRC.
Nesterov, Y. 2009. “Primal-Dual Subgradient Methods for Convex Problems.” Mathematical
Programming 120 (1): 221–59.
Nocedal, Jorge, and Stephen J. Wright. 2006. Numerical Optimization. Second.
Berlin: Springer-Verlag.
Roberts, G. O., Andrew Gelman, and Walter R. Gilks. 1997. “Weak Convergence
and Optimal Scaling of Random Walk Metropolis Algorithms.” Annals of Applied
Probability 7 (1): 110–20.