cs109 Final Cheat 3
TG Sido
April 4, 2017
1 Fundamentals
1.1 De Morgan's Laws
(⋃_{i=1}^{n} E_i)^c = ⋂_{i=1}^{n} E_i^c        (⋂_{i=1}^{n} E_i)^c = ⋃_{i=1}^{n} E_i^c
The number of integer solutions to x_1 + x_2 + ⋯ + x_r = n with x_i > 0, i = 1, . . . , r, is (n−1 choose r−1); with x_i ≥ 0 it is (n+r−1 choose r−1).
2 Conditional Probability
P(E∣F) = P(EF)/P(F)  ⇔  P(EF) = P(E∣F)P(F)
2.2 Bayes’ Theorem
The many shapes and forms of Bayes’ Theorem...
P(E) = P(E∣F)P(F) + P(E∣F^c)P(F^c)

P(F∣E) = P(EF)/P(E) = P(E∣F)P(F)/P(E)

P(F∣E) = P(E∣F)P(F) / [P(E∣F)P(F) + P(E∣F^c)P(F^c)]
Fully General Form:
If F1 , F2 , . . . , Fn comprise a set of mutually exclusive and exhaustive events, then
P(F_j∣E) = P(E∣F_j)P(F_j) / ∑_{i=1}^{n} P(E∣F_i)P(F_i)
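As a sanity check, a minimal Python sketch of the general form (the priors and likelihoods below are made-up numbers for illustration):

    priors = [0.5, 0.3, 0.2]        # P(F_i): hypothetical, mutually exclusive and exhaustive
    likelihoods = [0.9, 0.5, 0.1]   # P(E | F_i): hypothetical
    normalizer = sum(p * l for p, l in zip(priors, likelihoods))   # P(E) by total probability
    posteriors = [p * l / normalizer for p, l in zip(priors, likelihoods)]
    print(posteriors)               # P(F_j | E) for each j; the list sums to 1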
That’s odd.
The odds of H given observed evidence E:
P(H∣E)/P(H^c∣E) = P(H)P(E∣H) / (P(H^c)P(E∣H^c))
3 Independence
3.1 Definition
Two events E and F are independent if P(EF) = P(E)P(F). Otherwise they are dependent.
More generally, events E_1, E_2, . . . , E_n are independent if for every subset E_1′, E_2′, . . . , E_r′ (r ≤ n) it holds that

P(E_1′ E_2′ ⋯ E_r′) = P(E_1′)P(E_2′)⋯P(E_r′)
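A quick numeric illustration in Python, checking the definition by enumeration on one fair die roll (the events below are arbitrary choices that happen to be independent):

    outcomes = range(1, 7)                      # one fair die roll
    E = {x for x in outcomes if x % 2 == 0}     # E: roll is even
    F = {x for x in outcomes if x <= 2}         # F: roll is at most 2
    p = lambda A: len(A) / 6                    # each outcome has probability 1/6
    print(p(E & F), p(E) * p(F))                # both 1/6, so E and F are independent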
4 Random Distributions
4.1 Definitions and Properties
Probability Mass Function:
p(a) = P (X = a)
Probability Density Function:
P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx        P(−∞ < X < ∞) = ∫_{−∞}^{∞} f(x) dx = 1
Cumulative Distribution Function:

F(a) = ∑_{all x ≤ a} p(x)        F(a) = ∫_{−∞}^{a} f(x) dx

Density f is the derivative of the CDF F: f(a) = (d/da) F(a)
4.2 Joint distributions
Joint Probability Mass Function:
p_{X,Y}(a, b) = P(X = a, Y = b)
Joint CDF and density:

F_{X,Y}(a, b) = ∫_{−∞}^{a} ∫_{−∞}^{b} f_{X,Y}(x, y) dy dx        f_{X,Y}(a, b) = (∂²/∂a∂b) F_{X,Y}(a, b)
Marginal density functions:
f_X(a) = ∫_{−∞}^{∞} f_{X,Y}(a, y) dy        f_Y(b) = ∫_{−∞}^{∞} f_{X,Y}(x, b) dx
4.4 Convolution
Let X and Y be independent random variables. The convolution of F_X and F_Y is F_{X+Y}:

F_{X+Y}(a) = P(X + Y ≤ a) = ∫_{−∞}^{∞} F_X(a − y) f_Y(y) dy

f_{X+Y}(a) = ∫_{−∞}^{∞} f_X(a − y) f_Y(y) dy
In the discrete case, replace ∫_{−∞}^{∞} . . . dy with ∑_y, and f(y) with p(y).
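A small Python sketch of the discrete convolution, computing the PMF of the sum of two independent fair dice:

    from collections import defaultdict

    p_X = {i: 1/6 for i in range(1, 7)}          # PMF of one fair die
    p_Y = {i: 1/6 for i in range(1, 7)}

    p_sum = defaultdict(float)                   # p_{X+Y}(a) = sum_y p_X(a - y) p_Y(y)
    for a in range(2, 13):
        for y, py in p_Y.items():
            p_sum[a] += p_X.get(a - y, 0.0) * py

    print(p_sum[7])                              # 6/36, the most likely total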
4.5 Conditional Distributions
Conditional PMF of X given Y :
p_{X∣Y}(x∣y) = P(X = x ∣ Y = y) = p_{X,Y}(x, y)/p_Y(y)
Conditional PDF of X given Y :
f_{X∣Y}(x∣y) = f_{X,Y}(x, y)/f_Y(y)
Conditional CDF of X given Y :
F_{X∣Y}(a∣y) = P(X ≤ a ∣ Y = y) = ∑_{x ≤ a} p_{X∣Y}(x∣y)        (discrete)

F_{X∣Y}(a∣y) = ∫_{−∞}^{a} f_{X∣Y}(x∣y) dx        (continuous)
It is possible to mix continuous and discrete random variables in conditional distributions. For example, let X be a continuous random variable and N be a discrete random variable. Then the conditional PDF of X given N and the conditional PMF of N given X are

f_{X∣N}(x∣n) = p_{N∣X}(n∣x) f_X(x) / p_N(n)

p_{N∣X}(n∣x) = f_{X∣N}(x∣n) p_N(n) / f_X(x)
5 Expectation
5.1 Definitions
The expected value for a discrete random variable X is defined as
E[X] = ∑_{x∶p(x)>0} x p(x)
5.2 Properties
If I is an indicator variable for the event A, then
E[I] = P (A)
Let g(X) be a real-valued function of X.
E[g(X)] = ∑_i g(x_i) p(x_i)        E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx
Let g(X, Y ) be a real-valued function of two random variables.
E[g(X, Y)] = ∑_y ∑_x g(x, y) p_{X,Y}(x, y)        E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy
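A tiny Python check of the discrete one-variable formula above, for a fair die with g(x) = x²; it also shows that E[g(X)] generally differs from g(E[X]):

    pmf = {x: 1/6 for x in range(1, 7)}              # fair six-sided die
    e_g = sum(x**2 * px for x, px in pmf.items())    # E[X^2] via the formula above
    print(e_g)                                       # 91/6 ~ 15.17, not E[X]^2 = 12.25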
Linearity:
E[aX + b] = aE[X] + b
n-th Moment of X:

E[X^n] = ∑_{x∶p(x)>0} x^n p(x)
Bounding Expectation:
If random variable X ≥ a then E[X] ≥ a.
If P(a ≤ X ≤ b) = 1 then a ≤ E[X] ≤ b.
If random variables X ≥ Y then E[X] ≥ E[Y ].
6 Variance
6.1 Definition
If X is a random variable with mean µ then the variance of X, denoted Var(X), is:

Var(X) = E[(X − µ)²] = E[X²] − (E[X])²
6.2 Properties
Var(aX + b) = a2 Var(X)
If X1 , X2 , . . . , Xn are independent random variables, then
Var(∑_{i=1}^{n} X_i) = ∑_{i=1}^{n} Var(X_i)
6.3 Covariance
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ]
If X and Y are independent, Cov(X, Y) = 0; the converse does not hold in general.

Properties:
Cov(X, Y ) = Cov(Y, X)
Cov(X, X) = Var(X)
Cov(aX + b, Y ) = aCov(X, Y )
Cov(∑_{i=1}^{n} X_i, ∑_{j=1}^{m} Y_j) = ∑_{i=1}^{n} ∑_{j=1}^{m} Cov(X_i, Y_j)
6.4 Correlation
ρ(X, Y) = Cov(X, Y) / √(Var(X)Var(Y))
Note: −1 ≤ ρ(X, Y ) ≤ 1.
Correlation measures linearity between X and Y .
If ρ(X, Y ) = 0, X and Y are uncorrelated.
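A short Python sketch estimating covariance and correlation from simulated data (assumes NumPy is available; the linear relationship is made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)
    y = 2 * x + rng.normal(size=100_000)    # Y = 2X + noise, so Cov(X, Y) = 2

    print(np.cov(x, y)[0, 1])               # sample covariance, ~2
    print(np.corrcoef(x, y)[0, 1])          # sample correlation, ~2/sqrt(5) ~ 0.894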
7 Moment Generating Functions

7.1 Definition

M(t) = E[e^{tX}]
7.2 Properties
M^(n)(t) = (d^n/dt^n) M(t) = E[X^n e^{tX}]

M^(n)(0) = E[X^n]
M_X(t) = M_Y(t) iff X ∼ Y (X and Y have the same distribution)
If X_1, X_2, . . . , X_n are independent, then:

M_{X_1+⋯+X_n}(t) = ∏_{i=1}^{n} M_{X_i}(t)
8 Inequalities
8.1 Boole’s Inequality
Let E1 , E2 , . . . , En be events with indicator random variables Xi .
∑_{i=1}^{n} P(E_i) ≥ P(⋃_{i=1}^{n} E_i)
8.2 Markov's Inequality
For a nonnegative random variable X:

P(X ≥ a) ≤ E[X]/a   for all a > 0
8.3 Chebyshev's Inequality
If X has mean µ and variance σ²:

P(∣X − µ∣ ≥ k) ≤ σ²/k²   for all k > 0
One-sided inequality:
P(X ≥ E[X] + a) ≤ σ²/(σ² + a²)   for any a > 0

P(X ≤ E[X] − a) ≤ σ²/(σ² + a²)   for any a > 0
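A simulation sketch in Python comparing the Markov and Chebyshev bounds to actual tail probabilities, using an arbitrarily chosen Exp(1) variable (assumes NumPy):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=1_000_000)     # X ~ Exp(1): E[X] = 1, Var(X) = 1

    a = 3.0
    print((x >= a).mean(), 1.0 / a)                    # Markov: ~0.050 <= 0.333
    k = 3.0
    print((np.abs(x - 1.0) >= k).mean(), 1.0 / k**2)   # Chebyshev: ~0.018 <= 0.111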
9.2 Biased or Unbiased?
An estimator Θ̂ is unbiased when:

E[Θ̂] = Θ

For the sample mean X̄ of n IID samples, each with variance σ²:

Var(X̄) = σ²/n
For large n, the 100(1 − α)% CI for µ is:

(X̄ − z_{α/2} S/√n , X̄ + z_{α/2} S/√n)

Meaning: 100(1 − α)% of the time that a CI is computed from a sample, the true µ is in the interval.
Φ(z_{α/2}) = 1 − α/2

Ex: α = .05, α/2 = .025, Φ(z_{α/2}) = .975, z_{α/2} = 1.96
Confidence Level:
90% → 1.645
95% → 1.96
99% → 2.58
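A minimal Python sketch of the CI computation, using made-up sample statistics:

    import math
    from statistics import NormalDist

    x_bar, s, n = 5.2, 1.3, 100                  # hypothetical sample mean, std dev, size
    alpha = 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)      # z_{alpha/2} ~ 1.96

    half = z * s / math.sqrt(n)
    print((x_bar - half, x_bar + half))          # ~(4.95, 5.45): the 95% CI for mu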
11 Laws of Large Numbers

Consider IID random variables X_1, X_2, . . . with E[X_i] = µ. The weak law of large numbers states that, for any ε > 0:

P(∣(X_1 + X_2 + ⋯ + X_n)/n − µ∣ > ε) → 0 as n → ∞

12.1 Bernoulli

An experiment with a single trial that succeeds with probability p.
X ∼ Ber(p)
P (X = 0) = 1 − p
P (X = 1) = p
E[X] = p
Var(X) = p(1 − p)
M(t) = pe^t + 1 − p
12.2 Binomial
The number of successes in an experiment with n trials and p probability of success on each trial.
X ∼ Bin(n, p)
P(X = i) = p(i) = (n choose i) p^i (1 − p)^{n−i}   where i = 0, 1, . . . , n
E[X] = np
Var(X) = np(1 − p)
M(t) = (pe^t + 1 − p)^n
Note that the binomial distribution is a generalization of the Bernoulli distribution, since Ber(p) ∼ Bin(1, p).
12.3 Poisson
Approximates the binomial random variable when n is large and p is small enough to make np "moderate" (generally when n > 20 and p < 0.05); the Poisson approaches the binomial distribution as n → ∞ and p → 0.
X ∼ Poi(λ) where λ = np
P(X = i) = e^{−λ} λ^i / i!   where i = 0, 1, 2, . . .
E[X] = λ
Var(X) = λ
M(t) = e^{λ(e^t − 1)}
The approximation also works to a certain extent when the successes in the trials are not entirely independent, and when the probability of success in each trial varies slightly.
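A quick Python comparison of the exact binomial PMF with its Poisson approximation, for hypothetical n = 100 and p = 0.03:

    from math import comb, exp, factorial

    n, p = 100, 0.03                 # large n, small p; np = 3 is "moderate"
    lam = n * p

    for i in range(6):
        b = comb(n, i) * p**i * (1 - p)**(n - i)     # exact Bin(n, p) PMF
        q = exp(-lam) * lam**i / factorial(i)        # Poi(lambda = np) approximation
        print(i, round(b, 4), round(q, 4))           # the two columns nearly agree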
12.4 Geometric
The number of independent trials until the first success, where the probability of success is p.
X ∼ Geo(p)
P(X = n) = (1 − p)^{n−1} p   where n = 1, 2, . . .

E[X] = 1/p

Var(X) = (1 − p)/p²
12.6 Hypergeometric
The number of white balls drawn after drawing n balls (without replacement) from an urn containing N balls, with m white balls and N − m other ("black") balls.
X ∼ HypG(n, N, m)
P(X = i) = (m choose i)(N−m choose n−i) / (N choose n)   where i = 0, 1, . . . , n
E[X] = n(m/N )
Var(X) = nm(N − n)(N − m) / (N²(N − 1))
HypG(n, N, m) → Bin(n, m/N) as N → ∞ and m/N stays constant
12.7 Multinomial
The multinomial distribution further generalizes the binomial distribution: given an experiment with n independent trials, where each trial results in one of m outcomes with respective probabilities p_1, p_2, . . . , p_m such that ∑_{i=1}^{m} p_i = 1, if X_i denotes the number of trials with outcome i, then

P(X_1 = c_1, X_2 = c_2, . . . , X_m = c_m) = (n choose c_1, c_2, . . . , c_m) p_1^{c_1} p_2^{c_2} ⋯ p_m^{c_m}

where ∑_{i=1}^{m} c_i = n and (n choose c_1, c_2, . . . , c_m) = n!/(c_1! c_2! ⋯ c_m!).
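A small Python sketch evaluating this formula for a fair six-sided die rolled 12 times, asking for exactly two of each face (the counts c_i = 2 are an arbitrary example):

    from math import factorial

    n, m = 12, 6
    cs = [2] * m                     # c_i: two of each face
    ps = [1/6] * m                   # p_i: fair die

    coef = factorial(n)              # multinomial coefficient n!/(c_1! ... c_m!)
    for c in cs:
        coef //= factorial(c)

    prob = coef
    for c, p in zip(cs, ps):
        prob *= p**c
    print(prob)                      # ~0.0034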
13.1 Uniform
X ∼ Uni(α, β)
f(x) = 1/(β − α) for α ≤ x ≤ β, and 0 otherwise

E[X] = (α + β)/2

Var(X) = (β − α)²/12
13.2 Normal
Models values in common natural phenomena, especially quantities that result from the sum of multiple variables.
X ∼ N(µ, σ 2 )
f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}   where −∞ < x < ∞
E[X] = µ
Var(X) = σ 2
M(t) = e^{σ²t²/2 + µt}
If Y = aX + b is a linear transformation of X, then:

Y ∼ N(aµ + b, a²σ²)

F_Y(x) = F_X((x − b)/a)   (for a > 0)
The Standard (Unit) Normal Random Variable Z ∼ N (0, 1) has a cumulative distribution function (CDF)
commonly labeled Φ(z) = P (Z ≤ z) that has some useful properties.
Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x²/2} dx
Φ(−z) = 1 − Φ(z)
P(Z ≥ −z) = P(Z ≤ z)
Given X ∼ N (µ, σ 2 ) where σ > 0, we can then compute the CDF of X using the CDF of the standard normal
variable.
F_X(x) = Φ((x − µ)/σ)
By the de Moivre-Laplace Limit Theorem, the normal variable can approximate the binomial when Var(X) =
np(1 − p) ≥ 10. If we let Sn denote the number of successes (with probability p) in n independent trials, then
P(a ≤ (S_n − np)/√(np(1 − p)) ≤ b) → Φ(b) − Φ(a)   as n → ∞
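A Python sketch of this approximation with a continuity correction, for hypothetical n = 100 and p = 0.5:

    from math import sqrt
    from statistics import NormalDist

    n, p = 100, 0.5                              # np(1 - p) = 25 >= 10, so approx applies
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    Phi = NormalDist().cdf

    # P(45 <= S_n <= 55), widening each bound by 0.5 for the continuity correction:
    print(Phi((55.5 - mu) / sigma) - Phi((44.5 - mu) / sigma))   # ~0.729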
13.3 Exponential
Represents time until some event, with rate λ > 0.
X ∼ Exp(λ)
f(x) = λe^{−λx} if x ≥ 0, and 0 if x < 0
E[X] = 1/λ

Var(X) = 1/λ²
F(x) = 1 − e^{−λx}   where x ≥ 0
Exponentially distributed random variables are memoryless.
P(X > s + t ∣ X > s) = P(X > t)
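A simulation sketch of the memoryless property (the s and t values are arbitrary; assumes NumPy):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=1_000_000)   # X ~ Exp(1)

    s, t = 0.5, 1.0
    print((x > s + t).sum() / (x > s).sum())         # estimate of P(X > s+t | X > s)
    print((x > t).mean())                            # P(X > t) = e^{-1} ~ 0.368; they match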
13.4 Beta
X ∼ Beta(a, b)
f(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1} for 0 < x < 1, and 0 otherwise
B(a, b) = ∫_{0}^{1} x^{a−1} (1 − x)^{b−1} dx
E[X] = a/(a + b)

Var(X) = ab / ((a + b)²(a + b + 1))
If the unknown probability of heads has prior X ∼ Uni(0, 1) and N denotes the number of heads observed in m + n coin flips, then

X ∣ (N = n, m + n trials) ∼ Beta(n + 1, m + 1)
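A minimal Python sketch of this posterior update, with hypothetical counts of n = 7 heads and m = 3 tails:

    n, m = 7, 3                              # 7 heads, 3 tails in 10 flips
    a, b = n + 1, m + 1                      # posterior is Beta(8, 4)

    print(a / (a + b))                       # posterior mean E[X | data] = 2/3
    print((a - 1) / (a + b - 2))             # posterior mode (MAP estimate) = 0.7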
14 Useful Definitions
14.1 Taylor Series
e^x = ∑_{n=0}^{∞} x^n / n!
14.2 Integration By Parts
∫_{a}^{b} u(x) (dv/dx) dx = u(x)v(x) ∣_{a}^{b} − ∫_{a}^{b} v(x) (du/dx) dx