Probability and Statistics Cookbook
Copyright (c) Matthias Vallentin, 2011
[email protected]
19th July, 2011

This cookbook integrates a variety of topics in probability theory and
statistics. It is based on literature [1, 6, 3] and in-class material from
courses of the statistics department at the University of California in
Berkeley, but is also influenced by other sources [4, 5]. If you find errors
or have suggestions for further topics, I would appreciate it if you sent me
an email. The most recent version of this document is available at
http://matthias.vallentin.net/probability-and-statistics-cookbook/. To
reproduce, please contact me.
Contents
1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview
1.1 Discrete Distributions
Notation: for each distribution we list the CDF $F_X(x)$, the PMF $f_X(x)$, the mean $E[X]$, the variance $V[X]$, and the MGF $M_X(s)$.¹

Uniform $\mathrm{Unif}\{a,\ldots,b\}$:
  $F_X(x) = \begin{cases} 0 & x < a \\ \frac{\lfloor x\rfloor - a + 1}{b - a + 1} & a \le x \le b \\ 1 & x > b \end{cases}$
  $f_X(x) = \frac{I(a \le x \le b)}{b - a + 1}$
  $E[X] = \frac{a+b}{2}$,  $V[X] = \frac{(b-a+1)^2 - 1}{12}$,  $M_X(s) = \frac{e^{as} - e^{(b+1)s}}{(b-a+1)(1 - e^{s})}$

Bernoulli $\mathrm{Bern}(p)$:
  $f_X(x) = p^x (1-p)^{1-x}$, $x \in \{0, 1\}$
  $E[X] = p$,  $V[X] = p(1-p)$,  $M_X(s) = 1 - p + p e^{s}$

Binomial $\mathrm{Bin}(n, p)$:
  $F_X(x) = I_{1-p}(n - x, x + 1)$
  $f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}$
  $E[X] = np$,  $V[X] = np(1-p)$,  $M_X(s) = (1 - p + p e^{s})^n$

Multinomial $\mathrm{Mult}(n, p)$:
  $f_X(x) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$ with $\sum_{i=1}^{k} x_i = n$
  $E[X_i] = n p_i$,  $V[X_i] = n p_i (1 - p_i)$,  $M_X(s) = \left(\sum_{i=1}^{k} p_i e^{s_i}\right)^n$

Hypergeometric $\mathrm{Hyp}(N, m, n)$:
  $F_X(x) \approx \Phi\!\left(\frac{x - np}{\sqrt{np(1-p)}}\right)$ with $p = m/N$
  $f_X(x) = \frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}$
  $E[X] = \frac{nm}{N}$,  $V[X] = \frac{nm(N-n)(N-m)}{N^2 (N-1)}$

Negative Binomial $\mathrm{NBin}(r, p)$:
  $F_X(x) = I_p(r, x + 1)$
  $f_X(x) = \binom{x + r - 1}{r - 1} p^r (1-p)^x$
  $E[X] = r\,\frac{1-p}{p}$,  $V[X] = r\,\frac{1-p}{p^2}$,  $M_X(s) = \left(\frac{p}{1 - (1-p)e^{s}}\right)^r$

Geometric $\mathrm{Geo}(p)$:
  $F_X(x) = 1 - (1-p)^x$, $x \in \mathbb{N}^+$
  $f_X(x) = p (1-p)^{x-1}$, $x \in \mathbb{N}^+$
  $E[X] = \frac{1}{p}$,  $V[X] = \frac{1-p}{p^2}$,  $M_X(s) = \frac{p e^{s}}{1 - (1-p)e^{s}}$

Poisson $\mathrm{Po}(\lambda)$:
  $F_X(x) = e^{-\lambda} \sum_{i=0}^{\lfloor x\rfloor} \frac{\lambda^i}{i!}$
  $f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}$
  $E[X] = \lambda$,  $V[X] = \lambda$,  $M_X(s) = e^{\lambda(e^{s} - 1)}$
[Figure: PMFs of the discrete uniform, binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), geometric (p = 0.2, 0.5, 0.8), and Poisson (lambda = 1, 4, 10) distributions.]
¹ We use $\Gamma(s, x)$ and $\Gamma(x)$ to refer to the Gamma functions (see 22.1), and $B(x, y)$ and $I_x$ to refer to the Beta functions (see 22.2).
1.2 Continuous Distributions
Notation: CDF $F_X(x)$, PDF $f_X(x)$, mean $E[X]$, variance $V[X]$, MGF $M_X(s)$.

Uniform $\mathrm{Unif}(a, b)$:
  $F_X(x) = \begin{cases} 0 & x < a \\ \frac{x-a}{b-a} & a < x < b \\ 1 & x > b \end{cases}$
  $f_X(x) = \frac{I(a < x < b)}{b - a}$
  $E[X] = \frac{a+b}{2}$,  $V[X] = \frac{(b-a)^2}{12}$,  $M_X(s) = \frac{e^{sb} - e^{sa}}{s(b-a)}$

Normal $\mathcal{N}(\mu, \sigma^2)$:
  $F_X(x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right)$ where $\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt$
  $f_X(x) = \phi(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
  $E[X] = \mu$,  $V[X] = \sigma^2$,  $M_X(s) = \exp\!\left(\mu s + \frac{\sigma^2 s^2}{2}\right)$

Log-Normal $\ln\mathcal{N}(\mu, \sigma^2)$:
  $F_X(x) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{\ln x - \mu}{\sqrt{2\sigma^2}}\right)$
  $f_X(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$
  $E[X] = e^{\mu + \sigma^2/2}$,  $V[X] = (e^{\sigma^2} - 1)\, e^{2\mu + \sigma^2}$

Multivariate Normal $\mathrm{MVN}(\mu, \Sigma)$:
  $f_X(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\!\left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1} (x-\mu)\right)$
  $E[X] = \mu$,  $V[X] = \Sigma$,  $M_X(s) = \exp\!\left(\mu^{T} s + \frac{1}{2} s^{T} \Sigma s\right)$

Student's t $\mathrm{Student}(\nu)$:
  $F_X(x) = I_x\!\left(\frac{\nu}{2}, \frac{\nu}{2}\right)$
  $f_X(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$
  $E[X] = 0$ ($\nu > 1$),  $V[X] = \frac{\nu}{\nu - 2}$ ($\nu > 2$)

Chi-square $\chi^2_k$:
  $F_X(x) = \frac{1}{\Gamma(k/2)}\, \gamma\!\left(\frac{k}{2}, \frac{x}{2}\right)$
  $f_X(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}$
  $E[X] = k$,  $V[X] = 2k$,  $M_X(s) = (1 - 2s)^{-k/2}$, $s < 1/2$

F $\mathrm{F}(d_1, d_2)$:
  $F_X(x) = I_{\frac{d_1 x}{d_1 x + d_2}}\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)$
  $f_X(x) = \frac{\sqrt{\frac{(d_1 x)^{d_1}\, d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x\, B\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)}$
  $E[X] = \frac{d_2}{d_2 - 2}$ ($d_2 > 2$),  $V[X] = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$ ($d_2 > 4$)

Exponential $\mathrm{Exp}(\beta)$:
  $F_X(x) = 1 - e^{-x/\beta}$
  $f_X(x) = \frac{1}{\beta}\, e^{-x/\beta}$
  $E[X] = \beta$,  $V[X] = \beta^2$,  $M_X(s) = \frac{1}{1 - \beta s}$ ($s < 1/\beta$)

Gamma $\mathrm{Gamma}(\alpha, \beta)$:
  $F_X(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}$
  $f_X(x) = \frac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, x^{\alpha - 1} e^{-x/\beta}$
  $E[X] = \alpha\beta$,  $V[X] = \alpha\beta^2$,  $M_X(s) = \left(\frac{1}{1 - \beta s}\right)^{\alpha}$ ($s < 1/\beta$)

Inverse Gamma $\mathrm{InvGamma}(\alpha, \beta)$:
  $F_X(x) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}$
  $f_X(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{-\alpha - 1} e^{-\beta/x}$
  $E[X] = \frac{\beta}{\alpha - 1}$ ($\alpha > 1$),  $V[X] = \frac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)}$ ($\alpha > 2$),
  $M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)}\, K_{\alpha}\!\left(\sqrt{-4\beta s}\right)$

Dirichlet $\mathrm{Dir}(\alpha)$:
  $f_X(x) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$
  $E[X_i] = \frac{\alpha_i}{\sum_{i=1}^{k}\alpha_i}$,  $V[X_i] = \frac{E[X_i]\,(1 - E[X_i])}{\sum_{i=1}^{k}\alpha_i + 1}$

Beta $\mathrm{Beta}(\alpha, \beta)$:
  $F_X(x) = I_x(\alpha, \beta)$
  $f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$
  $E[X] = \frac{\alpha}{\alpha + \beta}$,  $V[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$,
  $M_X(s) = 1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{\alpha + r}{\alpha + \beta + r}\right) \frac{s^k}{k!}$

Weibull $\mathrm{Weibull}(\lambda, k)$:
  $F_X(x) = 1 - e^{-(x/\lambda)^k}$
  $f_X(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}$
  $E[X] = \lambda\,\Gamma\!\left(1 + \frac{1}{k}\right)$,  $V[X] = \lambda^2\,\Gamma\!\left(1 + \frac{2}{k}\right) - E[X]^2$,
  $M_X(s) = \sum_{n=0}^{\infty} \frac{s^n \lambda^n}{n!}\, \Gamma\!\left(1 + \frac{n}{k}\right)$

Pareto $\mathrm{Pareto}(x_m, \alpha)$:
  $F_X(x) = 1 - \left(\frac{x_m}{x}\right)^{\alpha}$, $x \ge x_m$
  $f_X(x) = \frac{\alpha\, x_m^{\alpha}}{x^{\alpha+1}}$, $x \ge x_m$
  $E[X] = \frac{\alpha x_m}{\alpha - 1}$ ($\alpha > 1$),  $V[X] = \frac{x_m^2\, \alpha}{(\alpha-1)^2(\alpha-2)}$ ($\alpha > 2$),
  $M_X(s) = \alpha (-x_m s)^{\alpha}\, \Gamma(-\alpha, -x_m s)$ ($s < 0$)
[Figure: PDFs of the continuous uniform, normal, log-normal, Student's t, chi-square, F, exponential, gamma, inverse gamma, beta, Weibull, and Pareto distributions for various parameter settings.]
2 Probability Theory
Definitions

  Sample space $\Omega$
  Outcome (point or element) $\omega \in \Omega$
  Event $A \subseteq \Omega$
  $\sigma$-algebra $\mathcal{A}$:
    1. $\emptyset \in \mathcal{A}$
    2. $A_1, A_2, \ldots \in \mathcal{A} \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
    3. $A \in \mathcal{A} \Rightarrow \neg A \in \mathcal{A}$
  Probability distribution $P$:
    1. $P[A] \ge 0$ for every $A$
    2. $P[\Omega] = 1$
    3. If $A_1, A_2, \ldots$ are disjoint, then $P\!\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} P[A_i]$
  Probability space $(\Omega, \mathcal{A}, P)$

Properties

  $P[\emptyset] = 0$
  $B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
  $P[\neg A] = 1 - P[A]$
  $P[B] = P[A \cap B] + P[\neg A \cap B]$
  $P[\Omega] = 1$,  $P[\emptyset] = 0$
  $\neg\left(\bigcup_n A_n\right) = \bigcap_n \neg A_n$ and $\neg\left(\bigcap_n A_n\right) = \bigcup_n \neg A_n$  (DeMorgan)
  $P\!\left[\bigcup_n A_n\right] = 1 - P\!\left[\bigcap_n \neg A_n\right]$
  $P[A \cup B] = P[A] + P[B] - P[A \cap B]$
    $\Rightarrow P[A \cup B] \le P[A] + P[B]$
  $P[A \cup B] = P[A \cap \neg B] + P[\neg A \cap B] + P[A \cap B]$
  $P[A \cap \neg B] = P[A] - P[A \cap B]$

Continuity of Probabilities

  $A_1 \subset A_2 \subset \cdots \Rightarrow \lim_{n\to\infty} P[A_n] = P[A]$ where $A = \bigcup_{i=1}^{\infty} A_i$
  $A_1 \supset A_2 \supset \cdots \Rightarrow \lim_{n\to\infty} P[A_n] = P[A]$ where $A = \bigcap_{i=1}^{\infty} A_i$

Independence

  $A \perp B \iff P[A \cap B] = P[A]\, P[B]$

Conditional Probability

  $P[A \mid B] = \frac{P[A \cap B]}{P[B]}$,  $P[B] > 0$

Law of Total Probability

  $P[B] = \sum_{i=1}^{n} P[B \mid A_i]\, P[A_i]$ where $\Omega = \bigcup_{i=1}^{n} A_i$ (disjoint)

Bayes' Theorem

  $P[A_i \mid B] = \frac{P[B \mid A_i]\, P[A_i]}{\sum_{j=1}^{n} P[B \mid A_j]\, P[A_j]}$ where $\Omega = \bigcup_{i=1}^{n} A_i$ (disjoint)

Inclusion-Exclusion Principle

  $\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{r=1}^{n} (-1)^{r-1} \sum_{i_1 < \cdots < i_r \le n} \left|A_{i_1} \cap \cdots \cap A_{i_r}\right|$
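As a quick numerical sanity check of Bayes' theorem and the law of total probability, the short Python sketch below uses made-up numbers for a diagnostic-test setting (all values are illustrative assumptions, not from the cookbook).

    # Bayes' theorem for a binary partition {A, not-A}; illustrative numbers only.
    p_A = 0.01             # prior P[A]
    p_B_given_A = 0.95     # P[B | A]
    p_B_given_notA = 0.05  # P[B | not-A]

    # Law of total probability: P[B] = P[B|A]P[A] + P[B|not-A]P[not-A]
    p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

    # Bayes' theorem: P[A|B] = P[B|A]P[A] / P[B]
    p_A_given_B = p_B_given_A * p_A / p_B
    print(p_A_given_B)     # ~0.161: a positive result is still mostly a false alarm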
3 Random Variables
Random Variable (RV)
X : R
Probability Mass Function (PMF)
f
X
(x) = P [X = x] = P [{ : X() = x}]
Probability Density Function (PDF)
P [a X b] =
b
a
f(x) dx
Cumulative Distribution Function (CDF)
F
X
: R [0, 1] F
X
(x) = P [X x]
1. Nondecreasing: x
1
< x
2
= F(x
1
) F(x
2
)
2. Normalized: lim
x
= 0 and lim
x
= 1
3. Right-Continuous: lim
yx
F(y) = F(x)
P [a Y b | X = x] =
b
a
f
Y |X
(y | x)dy a b
f
Y |X
(y | x) =
f(x, y)
f
X
(x)
Independence
1. P [X x, Y y] = P [X x] P [Y y]
2. f
X,Y
(x, y) = f
X
(x)f
Y
(y)
6
3.1 Transformations
Transformation function
  $Z = \varphi(X)$

Discrete
  $f_Z(z) = P[\varphi(X) = z] = P[\{x : \varphi(x) = z\}] = P[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f(x)$

Continuous
  $F_Z(z) = P[\varphi(X) \le z] = \int_{A_z} f(x)\,dx$ with $A_z = \{x : \varphi(x) \le z\}$

Special case if $\varphi$ strictly monotone
  $f_Z(z) = f_X(\varphi^{-1}(z)) \left|\frac{d}{dz}\varphi^{-1}(z)\right| = f_X(x)\left|\frac{dx}{dz}\right| = f_X(x)\,\frac{1}{|J|}$

The Rule of the Lazy Statistician
  $E[Z] = \int \varphi(x)\,dF_X(x)$
  $E[I_A(X)] = \int I_A(x)\,dF_X(x) = \int_A dF_X(x) = P[X \in A]$

Convolution
  $Z := X + Y$:  $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\,dx \stackrel{X,Y \ge 0}{=} \int_0^z f_{X,Y}(x, z - x)\,dx$
  $Z := |X - Y|$:  $f_Z(z) = 2\int_0^{\infty} f_{X,Y}(x, z + x)\,dx$
  $Z := \frac{X}{Y}$:  $f_Z(z) = \int_{-\infty}^{\infty} |x|\, f_{X,Y}(x, xz)\,dx \stackrel{X \perp Y}{=} \int |x|\, f_X(x)\, f_Y(xz)\,dx$
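A minimal Monte Carlo sketch (assuming NumPy is available; parameters and seed are arbitrary) that checks the convolution formula: for independent Exp(1) variables, $f_Z(z) = \int_0^z e^{-x}e^{-(z-x)}\,dx = z e^{-z}$, the Gamma(2, 1) density.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    x = rng.exponential(1.0, n)   # X ~ Exp(1)
    y = rng.exponential(1.0, n)   # Y ~ Exp(1), independent of X
    z = x + y                     # Z = X + Y

    # Compare a histogram of Z with the closed-form convolution z * exp(-z).
    hist, edges = np.histogram(z, bins=50, range=(0, 10), density=True)
    mid = 0.5 * (edges[:-1] + edges[1:])
    print(np.max(np.abs(hist - mid * np.exp(-mid))))  # small, e.g. ~0.01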
4 Expectation
Definition and properties
  $E[X] = \mu_X = \int x\,dF_X(x) = \begin{cases}\sum_x x f_X(x) & X \text{ discrete} \\ \int x f_X(x)\,dx & X \text{ continuous}\end{cases}$
  $P[X = c] = 1 \Rightarrow E[c] = c$
  $E[cX] = c\,E[X]$
  $E[X + Y] = E[X] + E[Y]$
  $E[XY] = \int\!\!\int_{X,Y} x y\, f_{X,Y}(x, y)\,dx\,dy$
  $E[\varphi(X)] \ne \varphi(E[X])$ in general (cf. Jensen inequality)
  $P[X \ge Y] = 1 \Rightarrow E[X] \ge E[Y]$;  $P[X = Y] = 1 \Rightarrow E[X] = E[Y]$
  $E[X] = \sum_{x=1}^{\infty} P[X \ge x]$  (nonnegative integer-valued $X$)

Sample mean
  $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$

Conditional expectation
  $E[Y \mid X = x] = \int y\, f(y \mid x)\,dy$
  $E[X] = E[E[X \mid Y]]$
  $E[\varphi(X, Y) \mid X = x] = \int_{-\infty}^{\infty} \varphi(x, y)\, f_{Y|X}(y \mid x)\,dy$
  $E[\varphi(Y, Z) \mid X = x] = \int\!\!\int \varphi(y, z)\, f_{(Y,Z)|X}(y, z \mid x)\,dy\,dz$
  $E[Y + Z \mid X] = E[Y \mid X] + E[Z \mid X]$
  $E[\varphi(X)\, Y \mid X] = \varphi(X)\, E[Y \mid X]$
  $E[Y \mid X] = c \Rightarrow \mathrm{Cov}[X, Y] = 0$
5 Variance
Definition and properties
  $V[X] = \sigma_X^2 = E[(X - E[X])^2] = E[X^2] - E[X]^2$
  $V\!\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} V[X_i] + 2\sum_{i \ne j} \mathrm{Cov}[X_i, X_j]$
  $V\!\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} V[X_i]$ if the $X_i$ are independent

Standard deviation
  $\mathrm{sd}[X] = \sqrt{V[X]} = \sigma_X$

Covariance
  $\mathrm{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$
  $\mathrm{Cov}[X, a] = 0$
  $\mathrm{Cov}[X, X] = V[X]$
  $\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$
  $\mathrm{Cov}[aX, bY] = ab\,\mathrm{Cov}[X, Y]$
  $\mathrm{Cov}[X + a, Y + b] = \mathrm{Cov}[X, Y]$
  $\mathrm{Cov}\!\left[\sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j\right] = \sum_{i=1}^{n}\sum_{j=1}^{m} \mathrm{Cov}[X_i, Y_j]$

Correlation
  $\rho[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{V[X]\,V[Y]}}$

Independence
  $X \perp Y \Rightarrow \rho[X, Y] = 0 \iff \mathrm{Cov}[X, Y] = 0 \iff E[XY] = E[X]\,E[Y]$

Sample variance
  $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$

Conditional variance
  $V[Y \mid X] = E[(Y - E[Y \mid X])^2 \mid X] = E[Y^2 \mid X] - E[Y \mid X]^2$
  $V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]$
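A minimal sketch (assuming NumPy; model and seed are arbitrary) that checks the law of total variance $V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]$ for a simple hierarchy.

    import numpy as np

    # X ~ Unif(0,1) and Y | X = x ~ N(x, 1), so E[Y|X] = X and V[Y|X] = 1.
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 500_000)
    y = rng.normal(loc=x, scale=1.0)

    lhs = y.var()
    rhs = 1.0 + x.var()     # E[V[Y|X]] + V[E[Y|X]] = 1 + V[X] = 1 + 1/12
    print(lhs, rhs)         # both close to 1.083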
6 Inequalities
Cauchy-Schwarz
  $E[XY]^2 \le E[X^2]\, E[Y^2]$

Markov
  $P[\varphi(X) \ge t] \le \frac{E[\varphi(X)]}{t}$  ($\varphi \ge 0$, $t > 0$)

Chebyshev
  $P[|X - E[X]| \ge t] \le \frac{V[X]}{t^2}$

Chernoff
  $P[X \ge (1+\delta)\mu] \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}$,  $\delta > -1$  ($X$ a sum of independent Bernoulli rvs, $\mu = E[X]$)

Jensen
  $E[\varphi(X)] \ge \varphi(E[X])$ for convex $\varphi$
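A small numerical illustration (assuming NumPy; the Exp(1) example is an arbitrary choice) of how loose the Markov and Chebyshev bounds can be compared with an exact tail probability.

    import numpy as np

    # X ~ Exp(1) has E[X] = 1 and V[X] = 1. Compare the exact tail P[X >= t]
    # with Markov's bound E[X]/t and Chebyshev's bound V[X]/(t-1)^2 applied
    # to the deviation |X - E[X]| >= t - 1 (valid for t > 1).
    for t in (2.0, 4.0, 8.0):
        exact = np.exp(-t)
        markov = 1.0 / t
        chebyshev = 1.0 / (t - 1.0) ** 2
        print(t, exact, markov, chebyshev)   # bounds hold but are loose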
7 Distribution Relationships
Binomial
  $X_i \sim \mathrm{Bern}(p) \Rightarrow \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p)$
  $X \sim \mathrm{Bin}(n, p),\ Y \sim \mathrm{Bin}(m, p),\ X \perp Y \Rightarrow X + Y \sim \mathrm{Bin}(n + m, p)$
  $\lim_{n\to\infty} \mathrm{Bin}(n, p) = \mathrm{Po}(np)$  ($n$ large, $p$ small)
  $\lim_{n\to\infty} \mathrm{Bin}(n, p) = \mathcal{N}(np, np(1-p))$  ($n$ large, $p$ far from 0 and 1)

Negative Binomial
  $X \sim \mathrm{NBin}(1, p) = \mathrm{Geo}(p)$
  $X \sim \mathrm{NBin}(r, p) \Rightarrow X = \sum_{i=1}^{r} Y_i$ with $Y_i \stackrel{iid}{\sim} \mathrm{Geo}(p)$
  $X_i \sim \mathrm{NBin}(r_i, p) \Rightarrow \sum_i X_i \sim \mathrm{NBin}\!\left(\sum_i r_i, p\right)$
  $X \sim \mathrm{NBin}(r, p),\ Y \sim \mathrm{Bin}(s + r, p) \Rightarrow P[X \le s] = P[Y \ge r]$

Poisson
  $X_i \sim \mathrm{Po}(\lambda_i),\ X_i \perp X_j \Rightarrow \sum_{i=1}^{n} X_i \sim \mathrm{Po}\!\left(\sum_{i=1}^{n} \lambda_i\right)$
  $X_i \sim \mathrm{Po}(\lambda_i),\ X_i \perp X_j \Rightarrow X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \mathrm{Bin}\!\left(\sum_{j=1}^{n} X_j,\ \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}\right)$

Exponential
  $X_i \sim \mathrm{Exp}(\beta),\ X_i \perp X_j \Rightarrow \sum_{i=1}^{n} X_i \sim \mathrm{Gamma}(n, \beta)$
  Memoryless property: $P[X > x + y \mid X > y] = P[X > x]$

Normal
  $X \sim \mathcal{N}(\mu, \sigma^2) \Rightarrow \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
  $X \sim \mathcal{N}(\mu, \sigma^2),\ Z = aX + b \Rightarrow Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
  $X \sim \mathcal{N}(\mu_1, \sigma_1^2),\ Y \sim \mathcal{N}(\mu_2, \sigma_2^2),\ X \perp Y \Rightarrow X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
  $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ independent $\Rightarrow \sum_i X_i \sim \mathcal{N}\!\left(\sum_i \mu_i, \sum_i \sigma_i^2\right)$
  $P[a < X \le b] = \Phi\!\left(\frac{b - \mu}{\sigma}\right) - \Phi\!\left(\frac{a - \mu}{\sigma}\right)$
  $\Phi(-x) = 1 - \Phi(x)$,  $\phi'(x) = -x\,\phi(x)$,  $\phi''(x) = (x^2 - 1)\,\phi(x)$
  Upper quantile of $\mathcal{N}(0, 1)$: $z_{\alpha} = \Phi^{-1}(1 - \alpha)$

Gamma
  $X \sim \mathrm{Gamma}(\alpha, \beta) \iff X/\beta \sim \mathrm{Gamma}(\alpha, 1)$
  $\mathrm{Gamma}(\alpha, \beta) \stackrel{\alpha \in \mathbb{N}}{=} \sum_{i=1}^{\alpha} \mathrm{Exp}(\beta)$
  $X_i \sim \mathrm{Gamma}(\alpha_i, \beta),\ X_i \perp X_j \Rightarrow \sum_i X_i \sim \mathrm{Gamma}\!\left(\sum_i \alpha_i, \beta\right)$
  $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx$

Beta
  $\frac{1}{B(\alpha, \beta)}\, x^{\alpha-1}(1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$
  $E[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1}\, E[X^{k-1}]$
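A minimal simulation sketch (assuming NumPy; sample size and seed are arbitrary) that checks one of the relationships above: the sum of $n$ iid Exp($\beta$) draws should have the Gamma($n, \beta$) mean $n\beta$ and variance $n\beta^2$.

    import numpy as np

    rng = np.random.default_rng(2)
    n, beta = 5, 2.0
    sums = rng.exponential(beta, size=(100_000, n)).sum(axis=1)
    print(sums.mean(), n * beta)        # ~10.0 vs 10.0
    print(sums.var(), n * beta ** 2)    # ~20.0 vs 20.0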
8 Probability and Moment Generating Functions

  $G_X(t) = E[t^X]$,  $|t| < 1$
  $M_X(t) = G_X(e^{t}) = E[e^{Xt}] = E\!\left[\sum_{i=0}^{\infty} \frac{(Xt)^i}{i!}\right] = \sum_{i=0}^{\infty} \frac{E[X^i]}{i!}\, t^i$
  $P[X = 0] = G_X(0)$
  $P[X = 1] = G_X'(0)$
  $P[X = i] = \frac{G_X^{(i)}(0)}{i!}$
  $E[X] = G_X'(1^-)$
  $E[X^k] = M_X^{(k)}(0)$
  $E\!\left[\frac{X!}{(X-k)!}\right] = G_X^{(k)}(1^-)$
  $V[X] = G_X''(1^-) + G_X'(1^-) - \left(G_X'(1^-)\right)^2$
  $G_X(t) = G_Y(t) \Rightarrow X \stackrel{d}{=} Y$
9 Multivariate Distributions
9.1 Standard Bivariate Normal
Let $X, Z \sim \mathcal{N}(0, 1)$ with $X \perp Z$, and let $Y = \rho X + \sqrt{1 - \rho^2}\, Z$.

Joint density
  $f(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\!\left(-\frac{x^2 + y^2 - 2\rho x y}{2(1 - \rho^2)}\right)$

Conditionals
  $(Y \mid X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)$ and $(X \mid Y = y) \sim \mathcal{N}(\rho y, 1 - \rho^2)$

Independence
  $X \perp Y \iff \rho = 0$
9.2 Bivariate Normal
Let $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$ with correlation $\rho$.

  $f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\!\left(-\frac{z}{2(1-\rho^2)}\right)$
  $z = \left(\frac{x - \mu_x}{\sigma_x}\right)^2 + \left(\frac{y - \mu_y}{\sigma_y}\right)^2 - 2\rho\left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{y - \mu_y}{\sigma_y}\right)$

Conditional mean and variance
  $E[X \mid Y] = \mu_x + \rho\frac{\sigma_x}{\sigma_y}(Y - E[Y])$
  $V[X \mid Y] = \sigma_x^2 (1 - \rho^2)$
9.3 Multivariate Normal
Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)
  $\Sigma = \begin{pmatrix} V[X_1] & \cdots & \mathrm{Cov}[X_1, X_k] \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}[X_k, X_1] & \cdots & V[X_k] \end{pmatrix}$

If $X \sim \mathcal{N}(\mu, \Sigma)$,
  $f_X(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

Properties
  $Z \sim \mathcal{N}(0, 1)$ and $X = \mu + \Sigma^{1/2} Z \Rightarrow X \sim \mathcal{N}(\mu, \Sigma)$
  $X \sim \mathcal{N}(\mu, \Sigma) \Rightarrow \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, 1)$
  $X \sim \mathcal{N}(\mu, \Sigma) \Rightarrow AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$
  $X \sim \mathcal{N}(\mu, \Sigma)$, $a$ a vector of length $k$ $\Rightarrow a^T X \sim \mathcal{N}(a^T\mu, a^T \Sigma a)$
10 Convergence
Let $\{X_1, X_2, \ldots\}$ be a sequence of rvs and let $X$ be another rv. Let $F_n$ denote the cdf of $X_n$ and let $F$ denote the cdf of $X$.

Types of convergence
  1. In distribution (weakly, in law): $X_n \stackrel{D}{\to} X$
     $\lim_{n\to\infty} F_n(t) = F(t)$ at all $t$ where $F$ is continuous
  2. In probability: $X_n \stackrel{P}{\to} X$
     $(\forall \varepsilon > 0)\ \lim_{n\to\infty} P[|X_n - X| > \varepsilon] = 0$
  3. Almost surely (strongly): $X_n \stackrel{as}{\to} X$
     $P\!\left[\lim_{n\to\infty} X_n = X\right] = P\!\left[\left\{\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right\}\right] = 1$
  4. In quadratic mean ($L_2$): $X_n \stackrel{qm}{\to} X$
     $\lim_{n\to\infty} E\!\left[(X_n - X)^2\right] = 0$

Relationships
  $X_n \stackrel{qm}{\to} X \Rightarrow X_n \stackrel{P}{\to} X \Rightarrow X_n \stackrel{D}{\to} X$
  $X_n \stackrel{as}{\to} X \Rightarrow X_n \stackrel{P}{\to} X$
  $X_n \stackrel{D}{\to} X$ and $(\exists c \in \mathbb{R})\ P[X = c] = 1 \Rightarrow X_n \stackrel{P}{\to} X$
  $X_n \stackrel{P}{\to} X$ and $Y_n \stackrel{P}{\to} Y \Rightarrow X_n + Y_n \stackrel{P}{\to} X + Y$
  $X_n \stackrel{qm}{\to} X$ and $Y_n \stackrel{qm}{\to} Y \Rightarrow X_n + Y_n \stackrel{qm}{\to} X + Y$
  $X_n \stackrel{P}{\to} X$ and $Y_n \stackrel{P}{\to} Y \Rightarrow X_n Y_n \stackrel{P}{\to} XY$
  $X_n \stackrel{P}{\to} X \Rightarrow \varphi(X_n) \stackrel{P}{\to} \varphi(X)$  (continuous $\varphi$)
  $X_n \stackrel{D}{\to} X \Rightarrow \varphi(X_n) \stackrel{D}{\to} \varphi(X)$  (continuous $\varphi$)
  $X_n \stackrel{qm}{\to} b \iff \lim_{n\to\infty} E[X_n] = b$ and $\lim_{n\to\infty} V[X_n] = 0$
  $X_1, \ldots, X_n$ iid, $E[X] = \mu$, $V[X] < \infty \Rightarrow \bar{X}_n \stackrel{qm}{\to} \mu$

Slutsky's Theorem
  $X_n \stackrel{D}{\to} X$ and $Y_n \stackrel{P}{\to} c \Rightarrow X_n + Y_n \stackrel{D}{\to} X + c$
  $X_n \stackrel{D}{\to} X$ and $Y_n \stackrel{P}{\to} c \Rightarrow X_n Y_n \stackrel{D}{\to} cX$
  In general: $X_n \stackrel{D}{\to} X$ and $Y_n \stackrel{D}{\to} Y \nRightarrow X_n + Y_n \stackrel{D}{\to} X + Y$
10.1 Law of Large Numbers (LLN)
Let $\{X_1, \ldots, X_n\}$ be a sequence of iid rvs with $E[X_1] = \mu$ and $V[X_1] < \infty$.

Weak law (WLLN)
  $\bar{X}_n \stackrel{P}{\to} \mu$ as $n \to \infty$

Strong law (SLLN)
  $\bar{X}_n \stackrel{as}{\to} \mu$ as $n \to \infty$
10.2 Central Limit Theorem (CLT)
Let $\{X_1, \ldots, X_n\}$ be a sequence of iid rvs with $E[X_1] = \mu$ and $V[X_1] = \sigma^2$.

  $Z_n := \frac{\bar{X}_n - \mu}{\sqrt{V[\bar{X}_n]}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \stackrel{D}{\to} Z$ where $Z \sim \mathcal{N}(0, 1)$
  $\lim_{n\to\infty} P[Z_n \le z] = \Phi(z)$,  $z \in \mathbb{R}$

CLT notations
  $Z_n \approx \mathcal{N}(0, 1)$
  $\bar{X}_n \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)$
  $\bar{X}_n - \mu \approx \mathcal{N}\!\left(0, \frac{\sigma^2}{n}\right)$
  $\sqrt{n}(\bar{X}_n - \mu) \approx \mathcal{N}(0, \sigma^2)$
  $\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \approx \mathcal{N}(0, 1)$

Continuity correction
  $P[\bar{X}_n \le x] \approx \Phi\!\left(\frac{x + \frac{1}{2n} - \mu}{\sigma/\sqrt{n}}\right)$
  $P[\bar{X}_n \ge x] \approx 1 - \Phi\!\left(\frac{x - \frac{1}{2n} - \mu}{\sigma/\sqrt{n}}\right)$

Delta method
  $Y_n \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \Rightarrow \varphi(Y_n) \approx \mathcal{N}\!\left(\varphi(\mu), (\varphi'(\mu))^2\, \frac{\sigma^2}{n}\right)$
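A minimal CLT illustration (assuming NumPy; the Exp(1) population, n, and seed are arbitrary choices): standardized means of skewed data are approximately standard normal.

    import numpy as np

    # Exp(1) has mu = sigma = 1, so Z_n = sqrt(n)(xbar - 1)/1 should be ~ N(0,1).
    rng = np.random.default_rng(3)
    n, reps = 50, 100_000
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    z = np.sqrt(n) * (xbar - 1.0)
    print(z.mean(), z.std())          # ~0 and ~1
    print(np.mean(z <= 1.96))         # close to Phi(1.96) = 0.975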
11 Statistical Inference
Let $X_1, \ldots, X_n \stackrel{iid}{\sim} F$ if not otherwise noted.
11.1 Point Estimation
Point estimator $\hat\theta_n$ of $\theta$ is a rv: $\hat\theta_n = g(X_1, \ldots, X_n)$

  $\mathrm{bias}(\hat\theta_n) = E[\hat\theta_n] - \theta$
  Consistency: $\hat\theta_n \stackrel{P}{\to} \theta$
  Sampling distribution: $F(\hat\theta_n)$
  Standard error: $\mathrm{se}(\hat\theta_n) = \sqrt{V[\hat\theta_n]}$
  Mean squared error: $\mathrm{mse} = E[(\hat\theta_n - \theta)^2] = \mathrm{bias}(\hat\theta_n)^2 + V[\hat\theta_n]$
  $\lim_{n\to\infty} \mathrm{bias}(\hat\theta_n) = 0$ and $\lim_{n\to\infty} \mathrm{se}(\hat\theta_n) = 0 \Rightarrow \hat\theta_n$ is consistent
  Asymptotic normality: $\frac{\hat\theta_n - \theta}{\mathrm{se}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
  Slutsky's theorem often lets us replace $\mathrm{se}(\hat\theta_n)$ by some (weakly) consistent estimator $\hat\sigma_n$.
11.2 Normal-Based Confidence Interval
Suppose $\hat\theta_n \approx \mathcal{N}(\theta, \widehat{\mathrm{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, i.e., $P[Z > z_{\alpha/2}] = \alpha/2$ and $P[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$. Then

  $C_n = \hat\theta_n \pm z_{\alpha/2}\, \widehat{\mathrm{se}}$
11.3 Empirical distribution
Empirical Distribution Function (ECDF)
  $\hat{F}_n(x) = \frac{\sum_{i=1}^{n} I(X_i \le x)}{n}$,  $I(X_i \le x) = \begin{cases} 1 & X_i \le x \\ 0 & X_i > x \end{cases}$

Properties (for any fixed $x$)
  $E[\hat{F}_n(x)] = F(x)$
  $V[\hat{F}_n(x)] = \frac{F(x)(1 - F(x))}{n}$
  $\mathrm{mse} = \frac{F(x)(1 - F(x))}{n} \to 0$
  $\hat{F}_n(x) \stackrel{P}{\to} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ($X_1, \ldots, X_n \sim F$)
  $P\!\left[\sup_x \left|F(x) - \hat{F}_n(x)\right| > \varepsilon\right] \le 2e^{-2n\varepsilon^2}$

Nonparametric $1 - \alpha$ confidence band for $F$
  $L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}$
  $U(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}$
  $\varepsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}$
  $P[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$
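A minimal ECDF-with-DKW-band sketch (assuming NumPy; sample and alpha are arbitrary), following the band formula above.

    import numpy as np

    def ecdf_band(x, alpha=0.05):
        x = np.sort(x)
        n = x.size
        F_hat = np.arange(1, n + 1) / n                 # ECDF at the order statistics
        eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))  # DKW half-width
        lower = np.clip(F_hat - eps, 0.0, 1.0)
        upper = np.clip(F_hat + eps, 0.0, 1.0)
        return x, F_hat, lower, upper

    rng = np.random.default_rng(4)
    x, F_hat, lo, hi = ecdf_band(rng.normal(size=200))
    # The true N(0,1) cdf lies inside [lo, hi] with probability >= 0.95.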
11.4 Statistical Functionals

Statistical functional: $T(F)$
Plug-in estimator of $\theta = T(F)$: $\hat\theta_n = T(\hat{F}_n)$
Linear functional: $T(F) = \int \varphi(x)\,dF_X(x)$
Plug-in estimator for a linear functional:
  $T(\hat{F}_n) = \int \varphi(x)\,d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \varphi(X_i)$
Often: $T(\hat{F}_n) \approx \mathcal{N}(T(F), \widehat{\mathrm{se}}^2) \Rightarrow T(\hat{F}_n) \pm z_{\alpha/2}\, \widehat{\mathrm{se}}$

Examples
  $p^{\text{th}}$ quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
  $\hat\mu = \bar{X}_n$
  $\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$
  $\hat\kappa = \frac{\frac{1}{n}\sum_{i=1}^{n} (X_i - \hat\mu)^3}{\hat\sigma^3}$  (skewness)
  $\hat\rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2}}$
12 Parametric Inference
Let $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subseteq \mathbb{R}^k$ and parameter $\theta = (\theta_1, \ldots, \theta_k)$.

12.1 Method of Moments

$j^{\text{th}}$ moment
  $\alpha_j(\theta) = E[X^j] = \int x^j\,dF_X(x)$

$j^{\text{th}}$ sample moment
  $\hat\alpha_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j$

Method of moments estimator (MoM): solve the system
  $\alpha_1(\theta) = \hat\alpha_1,\ \alpha_2(\theta) = \hat\alpha_2,\ \ldots,\ \alpha_k(\theta) = \hat\alpha_k$

Properties of the MoM estimator
  $\hat\theta_n$ exists with probability tending to 1
  Consistency: $\hat\theta_n \stackrel{P}{\to} \theta$
  Asymptotic normality: $\sqrt{n}(\hat\theta_n - \theta) \stackrel{D}{\to} \mathcal{N}(0, \Sigma)$
  where $\Sigma = g\,E[Y Y^T]\,g^T$, $Y = (X, X^2, \ldots, X^k)^T$, $g = (g_1, \ldots, g_k)$ and $g_j = \frac{\partial}{\partial\theta}\,\alpha_j^{-1}(\theta)$
12.2 Maximum Likelihood
Likelihood: $\mathcal{L}_n : \Theta \to [0, \infty)$
  $\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)$

Log-likelihood
  $\ell_n(\theta) = \log \mathcal{L}_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)$

Maximum likelihood estimator (mle)
  $\mathcal{L}_n(\hat\theta_n) = \sup_{\theta} \mathcal{L}_n(\theta)$

Score function
  $s(X; \theta) = \frac{\partial}{\partial\theta}\log f(X; \theta)$

Fisher information
  $I(\theta) = V_{\theta}[s(X; \theta)]$
  $I_n(\theta) = n\,I(\theta)$

Fisher information (exponential family)
  $I(\theta) = E_{\theta}\!\left[s(X; \theta)^2\right]$

Observed Fisher information
  $I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^{n}\log f(X_i; \theta)$

Properties of the mle
  Consistency: $\hat\theta_n \stackrel{P}{\to} \theta$
  Equivariance: $\hat\theta_n$ is the mle $\Rightarrow \varphi(\hat\theta_n)$ is the mle of $\varphi(\theta)$
  Asymptotic normality:
    1. $\mathrm{se} \approx \sqrt{1/I_n(\theta)}$ and $\frac{\hat\theta_n - \theta}{\mathrm{se}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
    2. $\widehat{\mathrm{se}} \approx \sqrt{1/I_n(\hat\theta_n)}$ and $\frac{\hat\theta_n - \theta}{\widehat{\mathrm{se}}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
  Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde\theta_n$ is any other estimator, the asymptotic relative efficiency is
    $\mathrm{are}(\tilde\theta_n, \hat\theta_n) = \frac{V[\hat\theta_n]}{V[\tilde\theta_n]} \le 1$
  Approximately the Bayes estimator
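A minimal numerical mle sketch (assuming NumPy; the Bernoulli model, true p, n, and seed are arbitrary illustrations): for Bern(p), the mle is the sample mean and $I(p) = 1/(p(1-p))$, giving the normal-based interval of Section 11.2.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.binomial(1, 0.3, size=500)
    n = x.size

    p_hat = x.mean()                                # closed-form mle
    se_hat = np.sqrt(p_hat * (1 - p_hat) / n)       # 1 / sqrt(I_n(p_hat))
    z = 1.959963984540054                           # z_{0.025}
    print(p_hat, (p_hat - z * se_hat, p_hat + z * se_hat))  # approximate 95% CI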
12.2.1 Delta Method
If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:
  $\frac{\hat\tau_n - \tau}{\widehat{\mathrm{se}}(\hat\tau)} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where $\hat\tau_n = \varphi(\hat\theta_n)$ is the mle of $\tau$ and $\widehat{\mathrm{se}}(\hat\tau) = |\varphi'(\hat\theta_n)|\,\widehat{\mathrm{se}}(\hat\theta_n)$.
12.3 Multiparameter Models
Let $\theta = (\theta_1, \ldots, \theta_k)$ and let $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ be the mle.

  $H_{jj} = \frac{\partial^2 \ell_n}{\partial\theta_j^2}$,   $H_{jk} = \frac{\partial^2 \ell_n}{\partial\theta_j\,\partial\theta_k}$

Fisher information matrix
  $I_n(\theta) = -\begin{pmatrix} E_{\theta}[H_{11}] & \cdots & E_{\theta}[H_{1k}] \\ \vdots & \ddots & \vdots \\ E_{\theta}[H_{k1}] & \cdots & E_{\theta}[H_{kk}] \end{pmatrix}$

Under regularity conditions
  $(\hat\theta - \theta) \approx \mathcal{N}(0, J_n)$
with $J_n(\theta) = I_n^{-1}(\theta)$. Further, if $\hat\theta_j$ is the $j^{\text{th}}$ component of $\hat\theta$, then
  $\frac{\hat\theta_j - \theta_j}{\widehat{\mathrm{se}}_j} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where $\widehat{\mathrm{se}}_j^2 = J_n(j, j)$ and $\mathrm{Cov}[\hat\theta_j, \hat\theta_k] = J_n(j, k)$.
12.3.1 Multiparameter delta method
Let $\tau = \varphi(\theta_1, \ldots, \theta_k)$ and let the gradient of $\varphi$ be
  $\nabla\varphi = \left(\frac{\partial\varphi}{\partial\theta_1}, \ldots, \frac{\partial\varphi}{\partial\theta_k}\right)^T$
Suppose $\nabla\varphi\big|_{\theta=\hat\theta} \ne 0$ and let $\hat\tau = \varphi(\hat\theta)$. Then
  $\frac{\hat\tau - \tau}{\widehat{\mathrm{se}}(\hat\tau)} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where
  $\widehat{\mathrm{se}}(\hat\tau) = \sqrt{\left(\hat\nabla\varphi\right)^T \hat{J}_n \left(\hat\nabla\varphi\right)}$,  $\hat{J}_n = J_n(\hat\theta)$ and $\hat\nabla\varphi = \nabla\varphi\big|_{\theta=\hat\theta}$.
12.4 Parametric Bootstrap
Sample from $f(x; \hat\theta_n)$ instead of from $\hat{F}_n$, where $\hat\theta_n$ could be the mle or the method of moments estimator.
13 Hypothesis Testing
$H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$

Definitions
  Null hypothesis $H_0$
  Alternative hypothesis $H_1$
  Simple hypothesis: $\theta = \theta_0$
  Composite hypothesis: $\theta > \theta_0$ or $\theta < \theta_0$
  Two-sided test: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
  One-sided test: $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$
  Critical value $c$
  Test statistic $T$
  Rejection region $R = \{x : T(x) > c\}$
  Power function $\beta(\theta) = P_{\theta}[X \in R]$
  Power of a test: $1 - P[\text{Type II error}] = 1 - \beta = \inf_{\theta \in \Theta_1} \beta(\theta)$
  Test size: $\alpha = P[\text{Type I error}] = \sup_{\theta \in \Theta_0} \beta(\theta)$

             Retain H_0            Reject H_0
  H_0 true   ok                    Type I error (alpha)
  H_1 true   Type II error (beta)  ok (power)

p-value
  $\text{p-value} = \sup_{\theta \in \Theta_0} P_{\theta}[T(X^\star) \ge T(X)] = 1 - F(T(X))$, since $T(X^\star) \sim F$
  $\text{p-value} = \inf\{\alpha : T(X) \in R_{\alpha}\}$

  p-value       evidence
  < 0.01        very strong evidence against H_0
  0.01 - 0.05   strong evidence against H_0
  0.05 - 0.1    weak evidence against H_0
  > 0.1         little or no evidence against H_0

Wald test
  Two-sided test: reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat\theta - \theta_0}{\widehat{\mathrm{se}}}$
  $P[|W| > z_{\alpha/2}] \to \alpha$
  p-value $= P_{\theta_0}[|W| > |w|] \approx P[|Z| > |w|] = 2\Phi(-|w|)$

Likelihood ratio test (LRT)
  $T(X) = \frac{\sup_{\theta \in \Theta} \mathcal{L}_n(\theta)}{\sup_{\theta \in \Theta_0} \mathcal{L}_n(\theta)} = \frac{\mathcal{L}_n(\hat\theta_n)}{\mathcal{L}_n(\hat\theta_{n,0})}$
  $\lambda(X) = 2\log T(X) \stackrel{D}{\to} \chi^2_{r-q}$, where $r = \dim(\Theta)$, $q = \dim(\Theta_0)$, $\chi^2_k = \sum_{i=1}^{k} Z_i^2$ and $Z_1, \ldots, Z_k \stackrel{iid}{\sim} \mathcal{N}(0, 1)$
  p-value $= P_{\theta_0}[\lambda(X) > \lambda(x)] \approx P[\chi^2_{r-q} > \lambda(x)]$

Multinomial LRT
  mle: $\hat{p}_n = \left(\frac{X_1}{n}, \ldots, \frac{X_k}{n}\right)$
  $T(X) = \frac{\mathcal{L}_n(\hat{p}_n)}{\mathcal{L}_n(p_0)} = \prod_{j=1}^{k}\left(\frac{\hat{p}_j}{p_{0j}}\right)^{X_j}$
  $\lambda(X) = 2\sum_{j=1}^{k} X_j \log\left(\frac{\hat{p}_j}{p_{0j}}\right) \stackrel{D}{\to} \chi^2_{k-1}$
  The approximate size-$\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$

Pearson chi-square test
  $T = \sum_{j=1}^{k}\frac{(X_j - E[X_j])^2}{E[X_j]}$ where $E[X_j] = n p_{0j}$ under $H_0$
  $T \stackrel{D}{\to} \chi^2_{k-1}$
  p-value $= P[\chi^2_{k-1} > T(x)]$
  Faster convergence to $\chi^2_{k-1}$ than the LRT, hence preferable for small $n$.

Independence testing
  $I$ rows, $J$ columns, $X$ a multinomial sample of size $n = I \cdot J$
  mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
  mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\frac{X_{\cdot j}}{n}$
  LRT: $\lambda = 2\sum_{i=1}^{I}\sum_{j=1}^{J} X_{ij}\log\left(\frac{n X_{ij}}{X_{i\cdot} X_{\cdot j}}\right)$
  Pearson chi-square: $T = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(X_{ij} - E[X_{ij}])^2}{E[X_{ij}]}$
  Both LRT and Pearson statistics $\stackrel{D}{\to} \chi^2_{\nu}$, where $\nu = (I - 1)(J - 1)$.
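A minimal Wald-test sketch (assuming NumPy; the true p, n, and seed are arbitrary), using only the normal approximation described above.

    import numpy as np
    from math import erf, sqrt

    # Wald test of H0: p = 0.5 in a Bernoulli model.
    rng = np.random.default_rng(6)
    x = rng.binomial(1, 0.56, size=1000)
    n, p0 = x.size, 0.5

    p_hat = x.mean()
    se_hat = np.sqrt(p_hat * (1 - p_hat) / n)
    W = (p_hat - p0) / se_hat

    Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal cdf
    p_value = 2 * Phi(-abs(W))
    print(W, p_value)   # |W| > 1.96  <=>  p-value < 0.05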
14 Bayesian Inference
Bayes' Theorem
  $f(\theta \mid x^n) = \frac{f(x^n \mid \theta)\, f(\theta)}{f(x^n)} = \frac{f(x^n \mid \theta)\, f(\theta)}{\int f(x^n \mid \theta)\, f(\theta)\, d\theta} \propto \mathcal{L}_n(\theta)\, f(\theta)$

Definitions
  $X^n = (X_1, \ldots, X_n)$,  $x^n = (x_1, \ldots, x_n)$
  Prior density $f(\theta)$
  Likelihood $f(x^n \mid \theta)$: joint density of the data
    In particular, $X^n$ iid $\Rightarrow f(x^n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = \mathcal{L}_n(\theta)$
  Posterior density $f(\theta \mid x^n)$
  Normalizing constant $c_n = f(x^n) = \int f(x^n \mid \theta)\, f(\theta)\, d\theta$
  Kernel: part of a density that depends on $\theta$
  Posterior mean $\bar\theta_n = \int \theta\, f(\theta \mid x^n)\, d\theta = \frac{\int \theta\, \mathcal{L}_n(\theta)\, f(\theta)\, d\theta}{\int \mathcal{L}_n(\theta)\, f(\theta)\, d\theta}$
14.1 Credible Intervals
Posterior interval
  $P[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\, d\theta = 1 - \alpha$

Equal-tail credible interval
  $\int_{-\infty}^{a} f(\theta \mid x^n)\, d\theta = \int_b^{\infty} f(\theta \mid x^n)\, d\theta = \alpha/2$

Highest posterior density (HPD) region $R_n$:
  1. $P[\theta \in R_n] = 1 - \alpha$
  2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$
  $R_n$ is unimodal $\Rightarrow R_n$ is an interval
14.2 Function of parameters
Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$
  $H(\tau \mid x^n) = P[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\, d\theta$

Posterior density
  $h(\tau \mid x^n) = H'(\tau \mid x^n)$

Bayesian delta method
  $\tau \mid X^n \approx \mathcal{N}\!\left(\varphi(\hat\theta),\ \widehat{\mathrm{se}}^2\,\varphi'(\hat\theta)^2\right)$
14.3 Priors
Choice
  Subjective bayesianism.
  Objective bayesianism.
  Robust bayesianism.

Types
  Flat: $f(\theta) \propto$ constant
  Proper: $\int_{-\infty}^{\infty} f(\theta)\, d\theta = 1$
  Improper: $\int_{-\infty}^{\infty} f(\theta)\, d\theta = \infty$
  Jeffreys' prior (transformation-invariant):
    $f(\theta) \propto \sqrt{I(\theta)}$,   $f(\theta) \propto \sqrt{\det(I(\theta))}$ (multiparameter case)
  Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family
14.3.1 Conjugate Priors
Discrete likelihood
  $\mathrm{Bern}(p)$: prior $\mathrm{Beta}(\alpha, \beta)$; posterior hyperparameters $\alpha + \sum_{i=1}^{n} x_i,\ \beta + n - \sum_{i=1}^{n} x_i$
  $\mathrm{Bin}(p)$: prior $\mathrm{Beta}(\alpha, \beta)$; posterior $\alpha + \sum_{i=1}^{n} x_i,\ \beta + \sum_{i=1}^{n} N_i - \sum_{i=1}^{n} x_i$
  $\mathrm{NBin}(p)$: prior $\mathrm{Beta}(\alpha, \beta)$; posterior $\alpha + rn,\ \beta + \sum_{i=1}^{n} x_i$
  $\mathrm{Po}(\lambda)$: prior $\mathrm{Gamma}(\alpha, \beta)$; posterior $\alpha + \sum_{i=1}^{n} x_i,\ \beta + n$
  $\mathrm{Multinomial}(p)$: prior $\mathrm{Dir}(\alpha)$; posterior $\alpha + \sum_{i=1}^{n} x^{(i)}$
  $\mathrm{Geo}(p)$: prior $\mathrm{Beta}(\alpha, \beta)$; posterior $\alpha + n,\ \beta + \sum_{i=1}^{n} x_i$

Continuous likelihood (subscript $c$ denotes a known constant)
  $\mathrm{Unif}(0, \theta)$: prior $\mathrm{Pareto}(x_m, k)$; posterior $\max\{x_{(n)}, x_m\},\ k + n$
  $\mathrm{Exp}(\lambda)$: prior $\mathrm{Gamma}(\alpha, \beta)$; posterior $\alpha + n,\ \beta + \sum_{i=1}^{n} x_i$
  $\mathcal{N}(\mu, \sigma_c^2)$: prior $\mathcal{N}(\mu_0, \sigma_0^2)$; posterior
    $\left(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma_c^2}\right)\Big/\left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2}\right),\quad \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2}\right)^{-1}$
  $\mathcal{N}(\mu_c, \sigma^2)$: prior Scaled Inverse Chi-square$(\nu, \sigma_0^2)$; posterior $\nu + n,\ \frac{\nu\sigma_0^2 + \sum_{i=1}^{n}(x_i - \mu_c)^2}{\nu + n}$
  $\mathcal{N}(\mu, \sigma^2)$: prior Normal-scaled Inverse Gamma$(\lambda, \nu, \alpha, \beta)$; posterior
    $\frac{\nu\lambda + n\bar{x}}{\nu + n},\ \nu + n,\ \alpha + \frac{n}{2},\ \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{n\nu(\bar{x} - \lambda)^2}{2(n + \nu)}$
  $\mathrm{MVN}(\mu, \Sigma_c)$: prior $\mathrm{MVN}(\mu_0, \Sigma_0)$; posterior
    $\left(\Sigma_0^{-1} + n\Sigma_c^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + n\Sigma_c^{-1}\bar{x}\right),\quad \left(\Sigma_0^{-1} + n\Sigma_c^{-1}\right)^{-1}$
  $\mathrm{MVN}(\mu_c, \Sigma)$: prior Inverse-Wishart$(\nu, \Psi)$; posterior $n + \nu,\ \Psi + \sum_{i=1}^{n}(x_i - \mu_c)(x_i - \mu_c)^T$
  $\mathrm{Pareto}(x_{mc}, k)$: prior $\mathrm{Gamma}(\alpha, \beta)$; posterior $\alpha + n,\ \beta + \sum_{i=1}^{n}\log\frac{x_i}{x_{mc}}$
  $\mathrm{Pareto}(x_m, k_c)$: prior $\mathrm{Pareto}(x_0, k_0)$; posterior $x_0,\ k_0 - kn$ where $k_0 > kn$
  $\mathrm{Gamma}(\alpha_c, \beta)$: prior $\mathrm{Gamma}(\alpha_0, \beta_0)$; posterior $\alpha_0 + n\alpha_c,\ \beta_0 + \sum_{i=1}^{n} x_i$
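A minimal conjugate-update sketch for the first row of the table (Beta prior, Bernoulli data); NumPy, the prior Beta(2, 2), the true p, and the seed are arbitrary assumptions for illustration.

    import numpy as np

    # Beta-Bernoulli: posterior is Beta(a + sum(x), b + n - sum(x)).
    rng = np.random.default_rng(7)
    x = rng.binomial(1, 0.7, size=40)
    a, b = 2.0, 2.0                        # prior hyperparameters
    a_post = a + x.sum()
    b_post = b + x.size - x.sum()
    print(a_post, b_post, a_post / (a_post + b_post))  # posterior mean ~0.7

    # Equal-tail 95% credible interval via Monte Carlo posterior quantiles.
    draws = rng.beta(a_post, b_post, size=100_000)
    print(np.quantile(draws, [0.025, 0.975]))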
14.4 Bayesian Testing
If $H_0 : \theta \in \Theta_0$:
  Prior probability $P[H_0] = \int_{\Theta_0} f(\theta)\, d\theta$
  Posterior probability $P[H_0 \mid x^n] = \int_{\Theta_0} f(\theta \mid x^n)\, d\theta$

Let $H_0, \ldots, H_{K-1}$ be $K$ hypotheses. Suppose $\theta \sim f(\theta \mid H_k)$. Then
  $P[H_k \mid x^n] = \frac{f(x^n \mid H_k)\, P[H_k]}{\sum_{k=1}^{K} f(x^n \mid H_k)\, P[H_k]}$

Marginal likelihood
  $f(x^n \mid H_i) = \int_{\Theta} f(x^n \mid \theta, H_i)\, f(\theta \mid H_i)\, d\theta$

Posterior odds (of $H_i$ relative to $H_j$)
  $\frac{P[H_i \mid x^n]}{P[H_j \mid x^n]} = \underbrace{\frac{f(x^n \mid H_i)}{f(x^n \mid H_j)}}_{\text{Bayes factor } BF_{ij}} \cdot \underbrace{\frac{P[H_i]}{P[H_j]}}_{\text{prior odds}}$

Bayes factor interpretation
  log10 BF_10    BF_10       evidence
  0 - 0.5        1 - 1.5     Weak
  0.5 - 1        1.5 - 10    Moderate
  1 - 2          10 - 100    Strong
  > 2            > 100       Decisive

  $p^* = \frac{\frac{p}{1-p}\, BF_{10}}{1 + \frac{p}{1-p}\, BF_{10}}$  where $p = P[H_1]$ and $p^* = P[H_1 \mid x^n]$
15 Exponential Family
Scalar parameter
  $f_X(x \mid \theta) = h(x)\, \exp\{\eta(\theta)\, T(x) - A(\theta)\} = h(x)\, g(\theta)\, \exp\{\eta(\theta)\, T(x)\}$

Vector parameter
  $f_X(x \mid \theta) = h(x)\, \exp\left\{\sum_{i=1}^{s} \eta_i(\theta)\, T_i(x) - A(\theta)\right\} = h(x)\, g(\theta)\, \exp\left\{\eta(\theta)^T\, T(x)\right\}$
16 Sampling Methods
16.1 The Bootstrap
Let $T_n = g(X_1, \ldots, X_n)$ be a statistic.

1. Estimate $V_F[T_n]$ with $V_{\hat{F}_n}[T_n]$.
2. Approximate $V_{\hat{F}_n}[T_n]$ using simulation:
   (a) Repeat the following $B$ times to get $T^*_{n,1}, \ldots, T^*_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$:
       i. Sample uniformly with replacement $X^*_1, \ldots, X^*_n \sim \hat{F}_n$.
       ii. Compute $T^*_n = g(X^*_1, \ldots, X^*_n)$.
   (b) Then
       $v_{boot} = \hat{V}_{\hat{F}_n} = \frac{1}{B}\sum_{b=1}^{B}\left(T^*_{n,b} - \frac{1}{B}\sum_{r=1}^{B} T^*_{n,r}\right)^2$
16.1.1 Bootstrap Confidence Intervals
Normal-based interval
  $T_n \pm z_{\alpha/2}\, \widehat{\mathrm{se}}_{boot}$

Pivotal interval
  1. Location parameter $\theta = T(F)$
  2. Pivot $R_n = \hat\theta_n - \theta$
  3. Let $H(r) = P[R_n \le r]$ be the cdf of $R_n$
  4. Let $R^*_{n,b} = \hat\theta^*_{n,b} - \hat\theta_n$. Approximate $H$ using the bootstrap:
     $\hat{H}(r) = \frac{1}{B}\sum_{b=1}^{B} I(R^*_{n,b} \le r)$
  5. $\theta^*_{\beta}$ = $\beta$ sample quantile of $(\hat\theta^*_{n,1}, \ldots, \hat\theta^*_{n,B})$
  6. $r^*_{\beta}$ = $\beta$ sample quantile of $(R^*_{n,1}, \ldots, R^*_{n,B})$, i.e., $r^*_{\beta} = \theta^*_{\beta} - \hat\theta_n$
  7. Approximate $1 - \alpha$ confidence interval $C_n = (\hat{a}, \hat{b})$ where
     $\hat{a} = \hat\theta_n - \hat{H}^{-1}\!\left(1 - \frac{\alpha}{2}\right) = \hat\theta_n - r^*_{1-\alpha/2} = 2\hat\theta_n - \theta^*_{1-\alpha/2}$
     $\hat{b} = \hat\theta_n - \hat{H}^{-1}\!\left(\frac{\alpha}{2}\right) = \hat\theta_n - r^*_{\alpha/2} = 2\hat\theta_n - \theta^*_{\alpha/2}$

Percentile interval
  $C_n = \left(\theta^*_{\alpha/2},\ \theta^*_{1-\alpha/2}\right)$
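A minimal nonparametric bootstrap sketch (assuming NumPy; the statistic, B, sample, and seed are arbitrary): bootstrap standard error and percentile interval for the sample median.

    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.exponential(1.0, size=100)
    B = 2000

    idx = rng.integers(0, x.size, size=(B, x.size))   # resample indices with replacement
    t_star = np.median(x[idx], axis=1)                # T*_{n,1}, ..., T*_{n,B}

    se_boot = t_star.std(ddof=1)
    percentile_ci = np.quantile(t_star, [0.025, 0.975])
    print(np.median(x), se_boot, percentile_ci)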
16.2 Rejection Sampling
Setup
  We can easily sample from $g(\theta)$.
  We want to sample from $h(\theta)$, but it is difficult.
  We know $h(\theta)$ up to a proportionality constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\, d\theta}$
  Envelope condition: we can find $M > 0$ such that $k(\theta) \le M g(\theta)$ for all $\theta$.

Algorithm
  1. Draw $\theta^{cand} \sim g(\theta)$
  2. Generate $u \sim \mathrm{Unif}(0, 1)$
  3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{M g(\theta^{cand})}$
  4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

Example
  We can easily sample from the prior $g(\theta) = f(\theta)$.
  Target is the posterior $h(\theta) \propto k(\theta) = f(x^n \mid \theta)\, f(\theta)$.
  Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat\theta_n) = \mathcal{L}_n(\hat\theta_n) \equiv M$
  Algorithm
    1. Draw $\theta^{cand} \sim f(\theta)$
    2. Generate $u \sim \mathrm{Unif}(0, 1)$
    3. Accept $\theta^{cand}$ if $u \le \frac{\mathcal{L}_n(\theta^{cand})}{\mathcal{L}_n(\hat\theta_n)}$
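A minimal rejection-sampling sketch (assuming NumPy; the Beta(3, 2) target and Unif(0, 1) proposal are an arbitrary illustration): $k(t) = t^2(1 - t)$ is the target kernel and $M = \max_t k(t)$ is the envelope constant.

    import numpy as np

    rng = np.random.default_rng(9)
    k = lambda t: t ** 2 * (1 - t)       # unnormalized Beta(3, 2) kernel
    M = k(2.0 / 3.0)                     # max of k on [0, 1], attained at t = 2/3

    samples = []
    while len(samples) < 10_000:
        cand = rng.uniform()             # proposal draw from g = Unif(0, 1)
        u = rng.uniform()
        if u <= k(cand) / M:             # accept with probability k/(M g)
            samples.append(cand)

    print(np.mean(samples), 3 / 5)       # Beta(3, 2) mean is 0.6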
16.3 Importance Sampling
Sample from an importance function $g$ rather than the target density $h$.

Algorithm to obtain an approximation to $E[q(\theta) \mid x^n]$:
  1. Sample from the prior: $\theta_1, \ldots, \theta_B \stackrel{iid}{\sim} f(\theta)$
  2. $w_i = \frac{\mathcal{L}_n(\theta_i)}{\sum_{i=1}^{B}\mathcal{L}_n(\theta_i)}$,  $i = 1, \ldots, B$
  3. $E[q(\theta) \mid x^n] \approx \sum_{i=1}^{B} q(\theta_i)\, w_i$
17 Decision Theory
Definitions
  Unknown quantity affecting our decision: $\theta \in \Theta$
  Decision rule: synonymous with an estimator $\hat\theta$
  Action $a$: a possible value of the decision rule (in estimation, an estimate of $\theta$)

Loss functions
  Squared error loss: $L(\theta, a) = (\theta - a)^2$
  Linear loss: $L(\theta, a) = \begin{cases} K_1(\theta - a) & a - \theta < 0 \\ K_2(a - \theta) & a - \theta \ge 0 \end{cases}$
  Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2 = 1$)
  $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
  Zero-one loss: $L(\theta, a) = \begin{cases} 0 & a = \theta \\ 1 & a \ne \theta \end{cases}$
17.1 Risk
Posterior risk
  $r(\hat\theta \mid x) = \int L(\theta, \hat\theta(x))\, f(\theta \mid x)\, d\theta = E_{\theta|X}\!\left[L(\theta, \hat\theta(x))\right]$

(Frequentist) risk
  $R(\theta, \hat\theta) = \int L(\theta, \hat\theta(x))\, f(x \mid \theta)\, dx = E_{X|\theta}\!\left[L(\theta, \hat\theta(X))\right]$

Bayes risk
  $r(f, \hat\theta) = \int\!\!\int L(\theta, \hat\theta(x))\, f(x, \theta)\, dx\, d\theta = E_{\theta,X}\!\left[L(\theta, \hat\theta(X))\right]$
  $r(f, \hat\theta) = E_{\theta}\!\left[E_{X|\theta}\!\left[L(\theta, \hat\theta(X))\right]\right] = E_{\theta}\!\left[R(\theta, \hat\theta)\right]$
  $r(f, \hat\theta) = E_{X}\!\left[E_{\theta|X}\!\left[L(\theta, \hat\theta(X))\right]\right] = E_{X}\!\left[r(\hat\theta \mid X)\right]$
17.2 Admissibility
$\hat\theta'$ dominates $\hat\theta$ if
  $\forall\theta : R(\theta, \hat\theta') \le R(\theta, \hat\theta)$ and
  $\exists\theta : R(\theta, \hat\theta') < R(\theta, \hat\theta)$
$\hat\theta$ is inadmissible if there is at least one other estimator $\hat\theta'$ that dominates it. Otherwise it is called admissible.
17.3 Bayes Rule
Bayes rule (or Bayes estimator)
  $r(f, \hat\theta) = \inf_{\tilde\theta} r(f, \tilde\theta)$
  $\hat\theta(x) = \inf_a r(a \mid x)$ for all $x$ $\Rightarrow r(f, \hat\theta) = \int r(\hat\theta \mid x)\, f(x)\, dx$

Theorems
  Squared error loss: posterior mean
  Absolute error loss: posterior median
  Zero-one loss: posterior mode
17.4 Minimax Rules
Maximum risk
  $\bar{R}(\hat\theta) = \sup_{\theta} R(\theta, \hat\theta)$,   $\bar{R}(a) = \sup_{\theta} R(\theta, a)$

Minimax rule
  $\sup_{\theta} R(\theta, \hat\theta) = \inf_{\tilde\theta} \bar{R}(\tilde\theta) = \inf_{\tilde\theta}\sup_{\theta} R(\theta, \tilde\theta)$

Least favorable prior
  $\hat\theta^{f}$ = Bayes rule for prior $f$ and $R(\theta, \hat\theta^{f}) \le r(f, \hat\theta^{f})$ for all $\theta$ $\Rightarrow \hat\theta^{f}$ is minimax.
18 Linear Regression
Definitions
Response variable Y
Covariate X (aka predictor variable or feature)
18.1 Simple Linear Regression
Model
  $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$,   $E[\varepsilon_i \mid X_i] = 0$,  $V[\varepsilon_i \mid X_i] = \sigma^2$

Fitted line
  $\hat{r}(x) = \hat\beta_0 + \hat\beta_1 x$

Predicted (fitted) values
  $\hat{Y}_i = \hat{r}(X_i)$

Residuals
  $\hat\varepsilon_i = Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)$

Residual sum of squares (rss)
  $\mathrm{rss}(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^{n} \hat\varepsilon_i^2$

Least squares estimates
  $\hat\beta^T = (\hat\beta_0, \hat\beta_1)^T$: minimize $\mathrm{rss}$ over $\beta_0, \beta_1$
  $\hat\beta_0 = \bar{Y}_n - \hat\beta_1 \bar{X}_n$
  $\hat\beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$
  $E[\hat\beta \mid X^n] = \begin{pmatrix}\beta_0 \\ \beta_1\end{pmatrix}$
  $V[\hat\beta \mid X^n] = \frac{\sigma^2}{n s_X^2}\begin{pmatrix}\frac{1}{n}\sum_{i=1}^{n} X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1\end{pmatrix}$
  $\widehat{\mathrm{se}}(\hat\beta_0) = \frac{\hat\sigma}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}}$,   $\widehat{\mathrm{se}}(\hat\beta_1) = \frac{\hat\sigma}{s_X\sqrt{n}}$
  where $s_X^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ and $\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat\varepsilon_i^2$ (unbiased estimate).

Further properties
  Consistency: $\hat\beta_0 \stackrel{P}{\to} \beta_0$ and $\hat\beta_1 \stackrel{P}{\to} \beta_1$
  Asymptotic normality: $\frac{\hat\beta_0 - \beta_0}{\widehat{\mathrm{se}}(\hat\beta_0)} \stackrel{D}{\to} \mathcal{N}(0, 1)$ and $\frac{\hat\beta_1 - \beta_1}{\widehat{\mathrm{se}}(\hat\beta_1)} \stackrel{D}{\to} \mathcal{N}(0, 1)$
  Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$:
    $\hat\beta_0 \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_0)$ and $\hat\beta_1 \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1)$
  Wald test for $H_0 : \beta_1 = 0$ vs. $H_1 : \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat\beta_1 / \widehat{\mathrm{se}}(\hat\beta_1)$.

$R^2$
  $R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n}\hat\varepsilon_i^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{rss}}{\mathrm{tss}}$
Likelihood
  $\mathcal{L} = \prod_{i=1}^{n} f(X_i, Y_i) = \prod_{i=1}^{n} f_X(X_i)\prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) = \mathcal{L}_1 \mathcal{L}_2$
  $\mathcal{L}_1 = \prod_{i=1}^{n} f_X(X_i)$
  $\mathcal{L}_2 = \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2\right\}$

Under the assumption of normality, the least squares estimator is also the mle, and
  $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\hat\varepsilon_i^2$
18.2 Prediction
Observe $X = x_*$ of the covariate and predict the response $Y_*$.
  $\hat{Y}_* = \hat\beta_0 + \hat\beta_1 x_*$
  $V[\hat{Y}_*] = V[\hat\beta_0] + x_*^2\, V[\hat\beta_1] + 2 x_*\,\mathrm{Cov}[\hat\beta_0, \hat\beta_1]$

Prediction interval
  $\hat\xi_n^2 = \hat\sigma^2\left(\frac{\sum_{i=1}^{n}(X_i - x_*)^2}{n\sum_{i}(X_i - \bar{X})^2} + 1\right)$
  $\hat{Y}_* \pm z_{\alpha/2}\,\hat\xi_n$
18.3 Multiple Regression
$Y = X\beta + \varepsilon$
where
  $X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix}$,  $\beta = \begin{pmatrix}\beta_1 \\ \vdots \\ \beta_k\end{pmatrix}$,  $\varepsilon = \begin{pmatrix}\varepsilon_1 \\ \vdots \\ \varepsilon_n\end{pmatrix}$

Likelihood
  $\mathcal{L}(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\,\mathrm{rss}\right\}$
  $\mathrm{rss} = (y - X\beta)^T(y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{N}(Y_i - x_i^T\beta)^2$

If the $(k \times k)$ matrix $X^T X$ is invertible,
  $\hat\beta = (X^T X)^{-1} X^T Y$
  $V[\hat\beta \mid X^n] = \sigma^2 (X^T X)^{-1}$
  $\hat\beta \approx \mathcal{N}\!\left(\beta, \sigma^2 (X^T X)^{-1}\right)$

Estimated regression function
  $\hat{r}(x) = \sum_{j=1}^{k}\hat\beta_j x_j$

Unbiased estimate for $\sigma^2$
  $\hat\sigma^2 = \frac{1}{n - k}\sum_{i=1}^{n}\hat\varepsilon_i^2$,   $\hat\varepsilon = X\hat\beta - Y$
  mle of $\sigma^2$: $\frac{n - k}{n}\,\hat\sigma^2$

$1 - \alpha$ confidence interval
  $\hat\beta_j \pm z_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_j)$
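A minimal sketch of the normal-equations estimator above (assuming NumPy; the design, coefficients, and seed are made up; lstsq is used instead of an explicit inverse for the solve).

    import numpy as np

    rng = np.random.default_rng(10)
    n, k = 200, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept + 2 covariates
    beta = np.array([1.0, 2.0, -0.5])
    Y = X @ beta + rng.normal(scale=0.3, size=n)

    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)                # (X^T X)^{-1} X^T Y
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - k)                            # unbiased sigma^2 estimate
    se_hat = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
    print(beta_hat, se_hat)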
18.4 Model Selection
Consider predicting a new observation $Y^*$ for covariates $X^*$ and let $S \subseteq J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

Issues
  Underfitting: too few covariates yields high bias.
  Overfitting: too many covariates yields high variance.

Procedure
  1. Assign a score to each model.
  2. Search through all models to find the one with the best score.

Hypothesis testing
  $H_0 : \beta_j = 0$ vs. $H_1 : \beta_j \ne 0$,  $j \in J$

Mean squared prediction error (mspe)
  $\mathrm{mspe} = E[(\hat{Y}(S) - Y^*)^2]$

Prediction risk
  $R(S) = \sum_{i=1}^{n}\mathrm{mspe}_i = \sum_{i=1}^{n} E[(\hat{Y}_i(S) - Y_i^*)^2]$

Training error
  $\hat{R}_{tr}(S) = \sum_{i=1}^{n}(\hat{Y}_i(S) - Y_i)^2$

$R^2$
  $R^2(S) = 1 - \frac{\mathrm{rss}(S)}{\mathrm{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\mathrm{tss}} = 1 - \frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i(S))^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$

The training error is a downward-biased estimate of the prediction risk:
  $E[\hat{R}_{tr}(S)] < R(S)$
  $\mathrm{bias}(\hat{R}_{tr}(S)) = E[\hat{R}_{tr}(S)] - R(S) = -2\sum_{i=1}^{n}\mathrm{Cov}[\hat{Y}_i, Y_i]$

Adjusted $R^2$
  $\bar{R}^2(S) = 1 - \frac{n - 1}{n - k}\,\frac{\mathrm{rss}}{\mathrm{tss}}$

Mallows' $C_p$ statistic
  $\hat{R}(S) = \hat{R}_{tr}(S) + 2k\hat\sigma^2 = \text{lack of fit} + \text{complexity penalty}$

Akaike Information Criterion (AIC)
  $\mathrm{AIC}(S) = \ell_n(\hat\beta_S, \hat\sigma^2_S) - k$

Bayesian Information Criterion (BIC)
  $\mathrm{BIC}(S) = \ell_n(\hat\beta_S, \hat\sigma^2_S) - \frac{k}{2}\log n$

Validation and training
  $\hat{R}_V(S) = \sum_{i=1}^{m}(\hat{Y}_i^*(S) - Y_i^*)^2$,   $m = |\{\text{validation data}\}|$, often $\frac{n}{4}$ or $\frac{n}{2}$

Leave-one-out cross-validation
  $\hat{R}_{CV}(S) = \sum_{i=1}^{n}(Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^{n}\left(\frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)}\right)^2$
  $U(S) = X_S (X_S^T X_S)^{-1} X_S^T$  (the hat matrix)
19 Non-parametric Function Estimation
19.1 Density Estimation
Estimate $f(x)$, where $P[X \in A] = \int_A f(x)\, dx$.

Integrated square error (ise)
  $L(f, \hat{f}_n) = \int\left(f(x) - \hat{f}_n(x)\right)^2 dx = J(h) + \int f^2(x)\, dx$

Frequentist risk
  $R(f, \hat{f}_n) = E[L(f, \hat{f}_n)] = \int b^2(x)\, dx + \int v(x)\, dx$
  $b(x) = E[\hat{f}_n(x)] - f(x)$
  $v(x) = V[\hat{f}_n(x)]$
19.1.1 Histograms
Definitions
  Number of bins $m$
  Binwidth $h = \frac{1}{m}$
  Bin $B_j$ has $\nu_j$ observations
  Define $\hat{p}_j = \nu_j / n$ and $p_j = \int_{B_j} f(u)\, du$

Histogram estimator
  $\hat{f}_n(x) = \sum_{j=1}^{m}\frac{\hat{p}_j}{h}\, I(x \in B_j)$
  $E[\hat{f}_n(x)] = \frac{p_j}{h}$
  $V[\hat{f}_n(x)] = \frac{p_j(1 - p_j)}{nh^2}$
  $R(\hat{f}_n, f) \approx \frac{h^2}{12}\int(f'(u))^2\, du + \frac{1}{nh}$
  $h^* = \frac{1}{n^{1/3}}\left(\frac{6}{\int(f'(u))^2\, du}\right)^{1/3}$
  $R^*(\hat{f}_n, f) \approx \frac{C}{n^{2/3}}$,  $C = \left(\frac{3}{4}\right)^{2/3}\left(\int(f'(u))^2\, du\right)^{1/3}$

Cross-validation estimate of $E[J(h)]$
  $\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\, dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_{j=1}^{m}\hat{p}_j^2$
19.1.2 Kernel Density Estimator (KDE)
Kernel $K$
  $K(x) \ge 0$,  $\int K(x)\, dx = 1$,  $\int x K(x)\, dx = 0$,  $\int x^2 K(x)\, dx \equiv \sigma_K^2 > 0$

KDE
  $\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\!\left(\frac{x - X_i}{h}\right)$
  $R(f, \hat{f}_n) \approx \frac{1}{4}(h\sigma_K)^4\int(f''(x))^2\, dx + \frac{1}{nh}\int K^2(x)\, dx$
  $h^* = \frac{c_1^{-2/5}\, c_2^{1/5}\, c_3^{-1/5}}{n^{1/5}}$,  $c_1 = \sigma_K^2$, $c_2 = \int K^2(x)\, dx$, $c_3 = \int(f''(x))^2\, dx$
  $R^*(f, \hat{f}_n) = \frac{c_4}{n^{4/5}}$,  $c_4 = \frac{5}{4}\,C(K)\left(\int(f'')^2\, dx\right)^{1/5}$,  $C(K) = (\sigma_K^2)^{2/5}\left(\int K^2(x)\, dx\right)^{4/5}$

Epanechnikov kernel
  $K(x) = \begin{cases}\frac{3}{4\sqrt{5}}\left(1 - \frac{x^2}{5}\right) & |x| < \sqrt{5} \\ 0 & \text{otherwise}\end{cases}$

Cross-validation estimate of $E[J(h)]$
  $\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\, dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) \approx \frac{1}{hn^2}\sum_{i=1}^{n}\sum_{j=1}^{n}K^*\!\left(\frac{X_i - X_j}{h}\right) + \frac{2}{nh}K(0)$
  $K^*(x) = K^{(2)}(x) - 2K(x)$,   $K^{(2)}(x) = \int K(x - y)K(y)\, dy$
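A minimal Gaussian-kernel KDE sketch (assuming NumPy; the data, grid, and Silverman-style rule-of-thumb bandwidth are illustrative assumptions rather than the cross-validation bandwidth described above).

    import numpy as np

    rng = np.random.default_rng(11)
    data = rng.normal(size=300)

    def kde(x, data, h):
        u = (x[:, None] - data[None, :]) / h
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
        return K.mean(axis=1) / h                        # (1/n) sum_i K((x-X_i)/h)/h

    h = 1.06 * data.std() * data.size ** (-1 / 5)        # rule-of-thumb bandwidth
    grid = np.linspace(-3, 3, 7)
    print(kde(grid, data, h))                            # rough bell shape around 0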
19.2 Non-parametric Regression
Estimate $r(x) = E[Y \mid X = x]$. Consider pairs of points $(x_1, Y_1), \ldots, (x_n, Y_n)$ related by
  $Y_i = r(x_i) + \varepsilon_i$,  $E[\varepsilon_i] = 0$,  $V[\varepsilon_i] = \sigma^2$

k-nearest-neighbor estimator
  $\hat{r}(x) = \frac{1}{k}\sum_{i : x_i \in N_k(x)} Y_i$ where $N_k(x) = \{k$ values of $x_1, \ldots, x_n$ closest to $x\}$

Nadaraya-Watson kernel estimator
  $\hat{r}(x) = \sum_{i=1}^{n} w_i(x)\, Y_i$,   $w_i(x) = \frac{K\!\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{n}K\!\left(\frac{x - x_j}{h}\right)} \in [0, 1]$
  $R(\hat{r}_n, r) \approx \frac{h^4}{4}\left(\int x^2 K(x)\, dx\right)^2\int\left(r''(x) + 2r'(x)\frac{f'(x)}{f(x)}\right)^2 dx + \frac{\sigma^2\int K^2(x)\, dx}{nh}\int\frac{dx}{f(x)}$
  $h^* \approx \frac{c_1}{n^{1/5}}$,  $R^*(\hat{r}_n, r) \approx \frac{c_2}{n^{4/5}}$

Cross-validation estimate of $E[J(h)]$
  $\hat{J}_{CV}(h) = \sum_{i=1}^{n}(Y_i - \hat{r}_{(-i)}(x_i))^2 = \sum_{i=1}^{n}\frac{(Y_i - \hat{r}(x_i))^2}{\left(1 - \frac{K(0)}{\sum_{j=1}^{n}K\!\left(\frac{x_i - x_j}{h}\right)}\right)^2}$
19.3 Smoothing Using Orthogonal Functions
Approximation
  $r(x) = \sum_{j=1}^{\infty}\beta_j\phi_j(x) \approx \sum_{j=1}^{J}\beta_j\phi_j(x)$

Multivariate regression form
  $Y = \Phi\beta + \eta$
  where $\eta_i = \varepsilon_i$ and $\Phi = \begin{pmatrix}\phi_0(x_1) & \cdots & \phi_J(x_1) \\ \vdots & \ddots & \vdots \\ \phi_0(x_n) & \cdots & \phi_J(x_n)\end{pmatrix}$

Least squares estimator
  $\hat\beta = (\Phi^T\Phi)^{-1}\Phi^T Y \approx \frac{1}{n}\Phi^T Y$  (for equally spaced observations only)

Cross-validation estimate of the risk
  $\hat{R}_{CV}(J) = \sum_{i=1}^{n}\left(Y_i - \sum_{j=1}^{J}\hat\beta_{j,(-i)}\,\phi_j(x_i)\right)^2$
20 Stochastic Processes
Stochastic process
  $\{X_t : t \in T\}$,  $T = \begin{cases}\{0, 1, \ldots\} & \text{discrete} \\ [0, \infty) & \text{continuous}\end{cases}$
  Notations: $X_t$, $X(t)$
  State space $\mathcal{X}$
  Index set $T$
20.1 Markov Chains
Markov chain
  $P[X_n = x \mid X_0, \ldots, X_{n-1}] = P[X_n = x \mid X_{n-1}]$ for all $n \in T$, $x \in \mathcal{X}$

Transition probabilities
  $p_{ij} \equiv P[X_{n+1} = j \mid X_n = i]$
  $p_{ij}(n) \equiv P[X_{m+n} = j \mid X_m = i]$  ($n$-step)

Transition matrix $P$ ($n$-step: $P_n$)
  $(i, j)$ element is $p_{ij}$
  $p_{ij} > 0$ and $\sum_j p_{ij} = 1$

Chapman-Kolmogorov
  $p_{ij}(m + n) = \sum_k p_{ik}(m)\, p_{kj}(n)$
  $P_{m+n} = P_m P_n$
  $P_n = P \cdots P = P^n$

Marginal probability
  $\mu_n = (\mu_n(1), \ldots, \mu_n(N))$ where $\mu_n(i) = P[X_n = i]$
  $\mu_0$: initial distribution
  $\mu_n = \mu_0 P^n$
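A minimal two-state Markov chain sketch (assuming NumPy; the transition matrix is made up): it propagates the marginal $\mu_n = \mu_0 P^n$ and compares it with the stationary distribution obtained as a left eigenvector of $P$.

    import numpy as np

    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])          # rows sum to 1
    mu0 = np.array([1.0, 0.0])

    mu = mu0.copy()
    for _ in range(50):
        mu = mu @ P                      # mu_{n+1} = mu_n P
    print(mu)                            # ~[0.8, 0.2]

    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    print(pi / pi.sum())                 # stationary distribution [0.8, 0.2]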
20.2 Poisson Processes
Poisson process
  $\{X_t : t \in [0, \infty)\}$ = number of events up to and including time $t$
  $X_0 = 0$
  Independent increments: for all $t_0 < \cdots < t_n$: $X_{t_1} - X_{t_0} \perp \cdots \perp X_{t_n} - X_{t_{n-1}}$
  Intensity function $\lambda(t)$:
    $P[X_{t+h} - X_t = 1] = \lambda(t)h + o(h)$
    $P[X_{t+h} - X_t \ge 2] = o(h)$
  $X_{s+t} - X_s \sim \mathrm{Po}(m(s + t) - m(s))$ where $m(t) = \int_0^t \lambda(s)\, ds$

Homogeneous Poisson process
  $\lambda(t) \equiv \lambda \Rightarrow X_t \sim \mathrm{Po}(\lambda t)$,  $\lambda > 0$

Waiting times
  $W_t$ := time at which $X_t$ occurs
  $W_t \sim \mathrm{Gamma}\!\left(t, \frac{1}{\lambda}\right)$

Interarrival times
  $S_t = W_{t+1} - W_t$
  $S_t \sim \mathrm{Exp}\!\left(\frac{1}{\lambda}\right)$
21 Time Series
Mean function
  $\mu_{xt} = E[x_t] = \int_{-\infty}^{\infty} x\, f_t(x)\, dx$

Autocovariance function
  $\gamma_x(s, t) = E[(x_s - \mu_s)(x_t - \mu_t)] = E[x_s x_t] - \mu_s\mu_t$
  $\gamma_x(t, t) = E[(x_t - \mu_t)^2] = V[x_t]$

Autocorrelation function (ACF)
  $\rho(s, t) = \frac{\mathrm{Cov}[x_s, x_t]}{\sqrt{V[x_s]\, V[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}}$

Cross-covariance function (CCV)
  $\gamma_{xy}(s, t) = E[(x_s - \mu_{xs})(y_t - \mu_{yt})]$

Cross-correlation function (CCF)
  $\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\,\gamma_y(t, t)}}$

Backshift operator
  $B^k(x_t) = x_{t-k}$

Difference operator
  $\nabla^d = (1 - B)^d$

White noise
  $w_t \sim \mathrm{wn}(0, \sigma_w^2)$
  Gaussian: $w_t \stackrel{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$
  $E[w_t] = 0$ and $V[w_t] = \sigma_w^2$ for all $t \in T$
  $\gamma_w(s, t) = 0$ for $s \ne t$

Random walk
  Drift $\delta$
  $x_t = \delta t + \sum_{j=1}^{t} w_j$
  $E[x_t] = \delta t$

Symmetric moving average
  $m_t = \sum_{j=-k}^{k} a_j x_{t-j}$ where $a_j = a_{-j} \ge 0$ and $\sum_{j=-k}^{k} a_j = 1$
21.1 Stationary Time Series
Strictly stationary
  $P[x_{t_1} \le c_1, \ldots, x_{t_k} \le c_k] = P[x_{t_1 + h} \le c_1, \ldots, x_{t_k + h} \le c_k]$
  for all $k \in \mathbb{N}$, $t_k$, $c_k$, $h \in \mathbb{Z}$

Weakly stationary
  $E[x_t^2] < \infty$ for all $t \in \mathbb{Z}$
  $E[x_t] = m$ for all $t \in \mathbb{Z}$
  $\gamma_x(s, t) = \gamma_x(s + r, t + r)$ for all $r, s, t \in \mathbb{Z}$

Autocovariance function
  $\gamma(h) = E[(x_{t+h} - \mu)(x_t - \mu)]$,  $h \in \mathbb{Z}$
  $\gamma(0) = E[(x_t - \mu)^2]$
  $\gamma(0) \ge 0$ and $\gamma(0) \ge |\gamma(h)|$
  $\gamma(h) = \gamma(-h)$

Autocorrelation function (ACF)
  $\rho_x(h) = \frac{\mathrm{Cov}[x_{t+h}, x_t]}{\sqrt{V[x_{t+h}]\, V[x_t]}} = \frac{\gamma(t + h, t)}{\sqrt{\gamma(t+h, t+h)\gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}$

Jointly stationary time series
  $\gamma_{xy}(h) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)]$
  $\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\,\gamma_y(0)}}$

Linear process
  $x_t = \mu + \sum_{j=-\infty}^{\infty}\psi_j w_{t-j}$ where $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$
  $\gamma(h) = \sigma_w^2\sum_{j=-\infty}^{\infty}\psi_{j+h}\psi_j$
21.2 Estimation of Correlation
Sample mean
  $\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t$

Sample variance of the mean
  $V[\bar{x}] = \frac{1}{n}\sum_{h=-n}^{n}\left(1 - \frac{|h|}{n}\right)\gamma_x(h)$

Sample autocovariance function
  $\hat\gamma(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(x_t - \bar{x})$

Sample autocorrelation function
  $\hat\rho(h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}$

Sample cross-covariance function
  $\hat\gamma_{xy}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(y_t - \bar{y})$

Sample cross-correlation function
  $\hat\rho_{xy}(h) = \frac{\hat\gamma_{xy}(h)}{\sqrt{\hat\gamma_x(0)\,\hat\gamma_y(0)}}$

Properties
  $\sigma_{\hat\rho_x(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ is white noise
  $\sigma_{\hat\rho_{xy}(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise
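A minimal sketch of the sample autocovariance and autocorrelation defined above (assuming NumPy; note the 1/n convention, not 1/(n - h); the white-noise input and seed are arbitrary).

    import numpy as np

    def sample_acf(x, max_lag):
        x = np.asarray(x, dtype=float)
        n, xbar = x.size, x.mean()
        gamma = np.array([np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n
                          for h in range(max_lag + 1)])
        return gamma / gamma[0]                     # rho_hat(h) = gamma_hat(h)/gamma_hat(0)

    rng = np.random.default_rng(12)
    w = rng.normal(size=1000)                       # white noise
    print(sample_acf(w, 5))                         # lags > 0 near 0, within ~1/sqrt(n)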
21.3 Non-Stationary Time Series
Classical decomposition model
  $x_t = \mu_t + s_t + w_t$
  $\mu_t$ = trend,  $s_t$ = seasonal component,  $w_t$ = random noise term
21.3.1 Detrending
Least squares
  1. Choose a trend model, e.g., $\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2$.
  2. Minimize rss to obtain the trend estimate $\hat\mu_t = \hat\beta_0 + \hat\beta_1 t + \hat\beta_2 t^2$.
  3. Residuals $\approx$ noise $w_t$.

Moving average
  The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j = \frac{1}{2k+1}$:
  $v_t = \frac{1}{2k+1}\sum_{i=-k}^{k} x_{t-i}$
  If $\frac{1}{2k+1}\sum_{i=-k}^{k} w_{t-j} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1 t$ passes without distortion.

Differencing
  $\mu_t = \beta_0 + \beta_1 t \Rightarrow \nabla x_t = \beta_1 + \nabla w_t$
21.4 ARIMA models
Autoregressive polynomial
  $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$,  $z \in \mathbb{C}$, $\phi_p \ne 0$

Autoregressive operator
  $\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$

Autoregressive model of order p, AR(p)
  $x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t \iff \phi(B)x_t = w_t$

AR(1)
  $x_t = \phi^k x_{t-k} + \sum_{j=0}^{k-1}\phi^j w_{t-j} \stackrel{k\to\infty,\ |\phi| < 1}{=} \sum_{j=0}^{\infty}\phi^j w_{t-j}$  (linear process)
  $E[x_t] = \sum_{j=0}^{\infty}\phi^j E[w_{t-j}] = 0$
  $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \frac{\sigma_w^2\,\phi^h}{1 - \phi^2}$
  $\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h$
  $\rho(h) = \phi\,\rho(h - 1)$,  $h = 1, 2, \ldots$

Moving average polynomial
  $\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q$,  $z \in \mathbb{C}$, $\theta_q \ne 0$

Moving average operator
  $\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$

MA(q) (moving average model of order q)
  $x_t = w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff x_t = \theta(B)w_t$
  $E[x_t] = \sum_{j=0}^{q}\theta_j E[w_{t-j}] = 0$
  $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \begin{cases}\sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h} & 0 \le h \le q \\ 0 & h > q\end{cases}$

MA(1)
  $x_t = w_t + \theta w_{t-1}$
  $\gamma(h) = \begin{cases}(1 + \theta^2)\sigma_w^2 & h = 0 \\ \theta\sigma_w^2 & h = 1 \\ 0 & h > 1\end{cases}$
  $\rho(h) = \begin{cases}\frac{\theta}{1 + \theta^2} & h = 1 \\ 0 & h > 1\end{cases}$

ARMA(p, q)
  $x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff \phi(B)x_t = \theta(B)w_t$

Partial autocorrelation function (PACF)
  $x_i^{h-1}$: regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \ldots, x_1\}$
  $\phi_{hh} = \mathrm{corr}(x_h - x_h^{h-1}, x_0 - x_0^{h-1})$,  $h \ge 2$
  E.g., $\phi_{11} = \mathrm{corr}(x_1, x_0) = \rho(1)$

ARIMA(p, d, q)
  $\nabla^d x_t = (1 - B)^d x_t$ is ARMA(p, q)
  $\phi(B)(1 - B)^d x_t = \theta(B)w_t$

Exponentially Weighted Moving Average (EWMA)
  $x_t = x_{t-1} + w_t - \lambda w_{t-1}$
  $x_t = \sum_{j=1}^{\infty}(1 - \lambda)\lambda^{j-1} x_{t-j} + w_t$ when $|\lambda| < 1$
  $\tilde{x}_{n+1} = (1 - \lambda)x_n + \lambda\tilde{x}_n$

Seasonal ARIMA
  Denoted by ARIMA$(p, d, q) \times (P, D, Q)_s$
  $\Phi_P(B^s)\,\phi(B)\,\nabla^D_s\,\nabla^d x_t = \delta + \Theta_Q(B^s)\,\theta(B)\, w_t$
21.4.1 Causality and Invertibility
ARMA(p, q) is causal (future-independent) $\iff \exists\{\psi_j\} : \sum_{j=0}^{\infty}|\psi_j| < \infty$ such that
  $x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j} = \psi(B)w_t$

ARMA(p, q) is invertible $\iff \exists\{\pi_j\} : \sum_{j=0}^{\infty}|\pi_j| < \infty$ such that
  $\pi(B)x_t = \sum_{j=0}^{\infty}\pi_j x_{t-j} = w_t$

Properties
  ARMA(p, q) causal $\iff$ roots of $\phi(z)$ lie outside the unit circle
    $\psi(z) = \sum_{j=0}^{\infty}\psi_j z^j = \frac{\theta(z)}{\phi(z)}$,  $|z| \le 1$
  ARMA(p, q) invertible $\iff$ roots of $\theta(z)$ lie outside the unit circle
    $\pi(z) = \sum_{j=0}^{\infty}\pi_j z^j = \frac{\phi(z)}{\theta(z)}$,  $|z| \le 1$

Behavior of the ACF and PACF for causal and invertible ARMA models
          AR(p)                   MA(q)                   ARMA(p, q)
  ACF     tails off               cuts off after lag q    tails off
  PACF    cuts off after lag p    tails off               tails off
21.5 Spectral Analysis
Periodic process
  $x_t = A\cos(2\pi\omega t + \varphi) = U_1\cos(2\pi\omega t) + U_2\sin(2\pi\omega t)$
  Frequency index $\omega$ (cycles per unit time), period $1/\omega$
  Amplitude $A$
  Phase $\varphi$
  $U_1 = A\cos\varphi$ and $U_2 = -A\sin\varphi$, often normally distributed rvs

Periodic mixture
  $x_t = \sum_{k=1}^{q}\left(U_{k1}\cos(2\pi\omega_k t) + U_{k2}\sin(2\pi\omega_k t)\right)$
  $U_{k1}, U_{k2}$, for $k = 1, \ldots, q$, are independent zero-mean rvs with variances $\sigma_k^2$
  $\gamma(h) = \sum_{k=1}^{q}\sigma_k^2\cos(2\pi\omega_k h)$
  $\gamma(0) = E[x_t^2] = \sum_{k=1}^{q}\sigma_k^2$

Spectral representation of a periodic process
  $\gamma(h) = \sigma^2\cos(2\pi\omega_0 h) = \frac{\sigma^2}{2}e^{-2\pi i\omega_0 h} + \frac{\sigma^2}{2}e^{2\pi i\omega_0 h} = \int_{-1/2}^{1/2}e^{2\pi i\omega h}\, dF(\omega)$

Spectral distribution function
  $F(\omega) = \begin{cases}0 & \omega < -\omega_0 \\ \sigma^2/2 & -\omega_0 \le \omega < \omega_0 \\ \sigma^2 & \omega \ge \omega_0\end{cases}$
  $F(-\infty) = F(-1/2) = 0$,  $F(\infty) = F(1/2) = \gamma(0)$

Spectral density
  $f(\omega) = \sum_{h=-\infty}^{\infty}\gamma(h)e^{-2\pi i\omega h}$,  $-\frac{1}{2} \le \omega \le \frac{1}{2}$
  Needs $\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty \Rightarrow \gamma(h) = \int_{-1/2}^{1/2}e^{2\pi i\omega h}f(\omega)\, d\omega$,  $h = 0, \pm 1, \ldots$
  $f(\omega) \ge 0$
  $f(\omega) = f(-\omega)$
  $f(\omega) = f(1 - \omega)$
  $\gamma(0) = V[x_t] = \int_{-1/2}^{1/2}f(\omega)\, d\omega$
  White noise: $f_w(\omega) = \sigma_w^2$
  ARMA(p, q), $\phi(B)x_t = \theta(B)w_t$:
    $f_x(\omega) = \sigma_w^2\,\frac{|\theta(e^{-2\pi i\omega})|^2}{|\phi(e^{-2\pi i\omega})|^2}$
    where $\phi(z) = 1 - \sum_{k=1}^{p}\phi_k z^k$ and $\theta(z) = 1 + \sum_{k=1}^{q}\theta_k z^k$

Discrete Fourier Transform (DFT)
  $d(\omega_j) = n^{-1/2}\sum_{t=1}^{n} x_t e^{-2\pi i\omega_j t}$

Fourier/fundamental frequencies
  $\omega_j = j/n$

Inverse DFT
  $x_t = n^{-1/2}\sum_{j=0}^{n-1} d(\omega_j)e^{2\pi i\omega_j t}$

Periodogram
  $I(j/n) = |d(j/n)|^2$

Scaled periodogram
  $P(j/n) = \frac{4}{n}I(j/n) = \left(\frac{2}{n}\sum_{t=1}^{n} x_t\cos(2\pi t j/n)\right)^2 + \left(\frac{2}{n}\sum_{t=1}^{n} x_t\sin(2\pi t j/n)\right)^2$
22 Math
22.1 Gamma Function
Ordinary: $\Gamma(s) = \int_0^{\infty} t^{s-1}e^{-t}\, dt$
Upper incomplete: $\Gamma(s, x) = \int_x^{\infty} t^{s-1}e^{-t}\, dt$
Lower incomplete: $\gamma(s, x) = \int_0^{x} t^{s-1}e^{-t}\, dt$
  $\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha)$,  $\alpha > 0$
  $\Gamma(n) = (n - 1)!$,  $n \in \mathbb{N}$
  $\Gamma(1/2) = \sqrt{\pi}$

22.2 Beta Function

Ordinary: $B(x, y) = B(y, x) = \int_0^1 t^{x-1}(1 - t)^{y-1}\, dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)}$
Incomplete: $B(x; a, b) = \int_0^{x} t^{a-1}(1 - t)^{b-1}\, dt$
Regularized incomplete:
  $I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \stackrel{a, b \in \mathbb{N}}{=} \sum_{j=a}^{a+b-1}\frac{(a + b - 1)!}{j!\,(a + b - 1 - j)!}\, x^j(1 - x)^{a+b-1-j}$
  $I_0(a, b) = 0$,  $I_1(a, b) = 1$
  $I_x(a, b) = 1 - I_{1-x}(b, a)$
22.3 Series
Finite
  $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$
  $\sum_{k=1}^{n} (2k - 1) = n^2$
  $\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$
  $\sum_{k=1}^{n} k^3 = \left(\frac{n(n+1)}{2}\right)^2$
  $\sum_{k=0}^{n} c^k = \frac{c^{n+1} - 1}{c - 1}$,  $c \ne 1$

Binomial
  $\sum_{k=0}^{n}\binom{n}{k} = 2^n$
  $\sum_{k=0}^{n}\binom{r + k}{k} = \binom{r + n + 1}{n}$
  $\sum_{k=0}^{n}\binom{k}{m} = \binom{n + 1}{m + 1}$
  Vandermonde's identity: $\sum_{k=0}^{r}\binom{m}{k}\binom{n}{r - k} = \binom{m + n}{r}$
  Binomial theorem: $\sum_{k=0}^{n}\binom{n}{k}a^{n-k}b^k = (a + b)^n$

Infinite
  $\sum_{k=0}^{\infty} p^k = \frac{1}{1 - p}$,  $\sum_{k=1}^{\infty} p^k = \frac{p}{1 - p}$,  $|p| < 1$
  $\sum_{k=0}^{\infty} k p^{k-1} = \frac{d}{dp}\left(\sum_{k=0}^{\infty} p^k\right) = \frac{d}{dp}\left(\frac{1}{1 - p}\right) = \frac{1}{(1 - p)^2}$,  $|p| < 1$
  $\sum_{k=0}^{\infty}\binom{r + k - 1}{k}x^k = (1 - x)^{-r}$,  $r \in \mathbb{N}^+$
  $\sum_{k=0}^{\infty}\binom{\alpha}{k}p^k = (1 + p)^{\alpha}$,  $|p| < 1$, $\alpha \in \mathbb{C}$
22.4 Combinatorics
Sampling: choosing k out of n
              without replacement                                              with replacement
  ordered     $n^{\underline{k}} = \prod_{i=0}^{k-1}(n - i) = \frac{n!}{(n-k)!}$    $n^k$
  unordered   $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n-k)!}$    $\binom{n - 1 + k}{k} = \binom{n - 1 + k}{n - 1}$

Stirling numbers of the second kind
  $\left\{{n \atop k}\right\} = k\left\{{n - 1 \atop k}\right\} + \left\{{n - 1 \atop k - 1}\right\}$,  $1 \le k \le n$
  $\left\{{n \atop 0}\right\} = \begin{cases}1 & n = 0 \\ 0 & \text{else}\end{cases}$

Partitions
  $P_{n+k,k} = \sum_{i=1}^{n} P_{n,i}$;  $P_{n,k} = 0$ for $k > n$;  $P_{n,0} = 0$ for $n \ge 1$;  $P_{0,0} = 1$

Balls and urns: $f : B \to U$ with $|B| = n$, $|U| = m$; D = distinguishable, $\neg$D = indistinguishable.
  B, U                 f arbitrary                                  f injective                         f surjective                              f bijective
  B: D,  U: D          $m^n$                                        $m^{\underline{n}}$ if $m \ge n$, else 0   $m!\left\{{n \atop m}\right\}$            $n!$ if $m = n$, else 0
  B: $\neg$D, U: D     $\binom{m + n - 1}{n}$                       $\binom{m}{n}$                      $\binom{n - 1}{m - 1}$                    1 if $m = n$, else 0
  B: D,  U: $\neg$D    $\sum_{k=1}^{m}\left\{{n \atop k}\right\}$   1 if $m \ge n$, else 0              $\left\{{n \atop m}\right\}$              1 if $m = n$, else 0
  B: $\neg$D, U: $\neg$D  $\sum_{k=1}^{m} P_{n,k}$                  1 if $m \ge n$, else 0              $P_{n,m}$                                 1 if $m = n$, else 0
References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen - Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen - Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
[Figure: Univariate distribution relationships, courtesy of Leemis and McQueston [2].]