
Calculus, Probability, and Statistics Primers

Dave Goldsman

Georgia Institute of Technology, Atlanta, GA, USA

12/30/18

1 / 104
Outline
1 Calculus Primer
2 Probability Primer
Basics
Simulating Random Variables
Great Expectations
Functions of a Random Variable
Jointly Distributed Random Variables
Covariance and Correlation
Some Probability Distributions
Limit Theorems
3 Statistics Primer
Intro to Estimation
Unbiased Estimation
Maximum Likelihood Estimation
Distributional Results and Confidence Intervals
2 / 104
Calculus Primer

Calculus Primer

Goal: This section provides a brief review of various calculus tidbits


that we’ll be using later on.

First of all, let’s suppose that f (x) is a function that maps values of x
from a certain domain X to a certain range Y , which we can denote
by the shorthand f : X → Y .

Example: If f(x) = x², then the function takes x-values from the real line R to the nonnegative portion of the real line R⁺.

3 / 104
Calculus Primer

Definition: We say that f(x) is a continuous function if, for any x0 and x ∈ X, we have lim_{x→x0} f(x) = f(x0), where “lim” denotes a limit and f(x) is assumed to exist for all x ∈ X.

Example: The function f(x) = 3x² is continuous for all x. The function f(x) = ⌊x⌋ (round down to the nearest integer, e.g., ⌊3.4⌋ = 3) has a “jump” discontinuity at any integer x. □

Definition If f (x) is continuous, then it is differentiable (has a


derivative) if

f'(x) ≡ (d/dx) f(x) ≡ lim_{h→0} [f(x + h) − f(x)] / h
exists and is well-defined for any given x. Think of the derivative as
the slope of the function.

4 / 104
Calculus Primer

Example Some well-known derivatives are:

[x^k]' = k x^{k−1},

[e^x]' = e^x,

[sin(x)]' = cos(x),

[cos(x)]' = −sin(x),

[ℓn(x)]' = 1/x,

[arctan(x)]' = 1/(1 + x²). □
5 / 104
Calculus Primer

Theorem Some well-known properties of derivatives are:

[af(x) + b]' = af'(x),

[f(x) + g(x)]' = f'(x) + g'(x),

[f(x)g(x)]' = f'(x)g(x) + f(x)g'(x)   (product rule),

[f(x)/g(x)]' = [g(x)f'(x) − f(x)g'(x)] / g²(x)   (quotient rule)¹,

[f(g(x))]' = f'(g(x))g'(x)   (chain rule)².

¹ Ho dee Hi minus Hi dee Ho over Ho Ho.
² www.youtube.com/watch?v=gGAiW5dOnKo
6 / 104
Calculus Primer

Example: Suppose that f(x) = x² and g(x) = ℓn(x). Then

[f(x)g(x)]' = (d/dx)[x² ℓn(x)] = 2x ℓn(x) + x,

[f(x)/g(x)]' = (d/dx)[x² / ℓn(x)] = [2x ℓn(x) − x] / ℓn²(x),

[f(g(x))]' = 2g(x)g'(x) = 2ℓn(x)/x. □

7 / 104
Calculus Primer

Remark: The second derivative f''(x) ≡ (d/dx) f'(x) is the “slope of the slope.” If f(x) is “position,” then f'(x) can be regarded as “velocity,” and f''(x) as “acceleration.”

The minimum or maximum of f(x) can only occur when the slope of f(x) is zero, i.e., only when f'(x) = 0, say at x = x0. Exception: Check the endpoints of your interval of interest as well.

Then if f''(x0) < 0, you get a max; if f''(x0) > 0, you get a min; and if f''(x0) = 0, you may get a point of inflection.

Example: Find the value of x that minimizes f(x) = e^{2x} + e^{−x}. The minimum can only occur when f'(x) = 2e^{2x} − e^{−x} = 0. After a little algebra, we find that this occurs at x0 = −(1/3)ℓn(2) ≈ −0.231. It's also easy to show that f''(x) > 0 for all x, and so x0 yields a minimum. □
8 / 104
Calculus Primer

Finding Zeroes: Speaking of solving for a zero, how might you do it if a continuous function g(x) is a complicated nonlinear fellow?
Trial-and-error (not so great).
Bisection (divide-and-conquer).
Newton's method (or some variation).
Fixed-point method (we'll do this later).

9 / 104
Calculus Primer

Bisection: Suppose you can find x1 and x2 such that g(x1 ) < 0 and
g(x2 ) > 0. (We’ll follow similar logic if the inequalities are both
reversed.) By the Intermediate Value Theorem (which you may remember), there must be a zero in [x1, x2], that is, an x* ∈ [x1, x2] such that g(x*) = 0.

Thus, take x3 = (x1 + x2 )/2. If g(x3 ) < 0, then there must be a zero
in [x3 , x2 ]. Otherwise, if g(x3 ) > 0, then there must be a zero in
[x1 , x3 ]. In either case, you’ve reduced the length of the search
interval.

Continue in this same manner until the length of the search interval is
as small as desired.

Exercise: Try this out for g(x) = x² − 2, and come up with an approximation for √2.
10 / 104
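Here is a minimal Python sketch of the bisection exercise (Python is my own choice here; the slides themselves lean on Excel and FORTRAN). It assumes g(x1) < 0 < g(x2) and halves the search interval until it is shorter than a tolerance.

import math

def bisect(g, x1, x2, tol=1e-8):
    # Assumes g(x1) < 0 < g(x2); shrink [x1, x2] until it is short enough.
    while x2 - x1 > tol:
        x3 = (x1 + x2) / 2
        if g(x3) < 0:
            x1 = x3          # zero lies in [x3, x2]
        else:
            x2 = x3          # zero lies in [x1, x3]
    return (x1 + x2) / 2

print(bisect(lambda x: x**2 - 2, 1.0, 2.0))   # about 1.41421356
print(math.sqrt(2))                           # for comparison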
Calculus Primer

Newton’s Method: Suppose you can find a reasonable first guess


for the zero, say, xi , where we start off at iteration i = 0. If g(x) has a
nice, well-behaved derivative (which doesn’t happen to be too flat
near the zero of g(x)), then iterate your guess as follows:

x_{i+1} = x_i − g(x_i) / g'(x_i).

Keep going until things appear to converge.

This makes sense since, for x_i and x_{i+1} close to each other and to the zero x*, we have

g'(x_i) ≈ [g(x*) − g(x_i)] / (x* − x_i).

11 / 104
Calculus Primer

Exercise: Try Newton out for g(x) = x² − 2, noting that the iteration step is to set

x_{i+1} = x_i − (x_i² − 2)/(2x_i) = x_i/2 + 1/x_i.

Let's start with a bad guess of x1 = 1. Then

x2 = x1/2 + 1/x1 = 1/2 + 1 = 1.5,
x3 = x2/2 + 1/x2 ≈ 1.5/2 + 1/1.5 = 1.4167,
x4 = x3/2 + 1/x3 ≈ 1.4142.   Wow! □

12 / 104
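A short Python sketch of Newton's method for the same exercise (a hedged illustration, not part of the original slides); it reproduces the iterates 1.5, 1.4167, 1.4142, . . . shown above.

def newton(g, gprime, x, iterations=5):
    # Newton iteration: x_{i+1} = x_i - g(x_i) / g'(x_i).
    for _ in range(iterations):
        x = x - g(x) / gprime(x)
        print(x)
    return x

newton(lambda x: x**2 - 2, lambda x: 2 * x, 1.0)
# prints 1.5, 1.41666..., 1.41421568..., 1.41421356..., ...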
Calculus Primer

Integration

Definition: The function F(x) having derivative f(x) is called the antiderivative (or indefinite integral). It is denoted by F(x) = ∫ f(x) dx.

Fundamental Theorem of Calculus: If f(x) is continuous, then the area under the curve for x ∈ [a, b] is denoted and given by the definite integral³

∫_a^b f(x) dx ≡ F(x)|_a^b ≡ F(b) − F(a).

³ “I'm really an integral!”
13 / 104
Calculus Primer

Example: Some well-known indefinite integrals are:

∫ x^k dx = x^{k+1}/(k + 1) + C   for k ≠ −1,

∫ dx/x = ℓn|x| + C,

∫ e^x dx = e^x + C,

∫ cos(x) dx = sin(x) + C,

∫ dx/(1 + x²) = arctan(x) + C,

where C is an arbitrary constant. □

14 / 104
Calculus Primer

Example: It is easy to see that

∫ d(cabin)/cabin = ℓn|cabin| + C = houseboat. □

Theorem: Some well-known properties of definite integrals are:

∫_a^a f(x) dx = 0,

∫_a^b f(x) dx = −∫_b^a f(x) dx,

∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx.

15 / 104
Calculus Primer

Theorem: Some other properties of general integrals are:

∫ [f(x) + g(x)] dx = ∫ f(x) dx + ∫ g(x) dx,

∫ f(x)g'(x) dx = f(x)g(x) − ∫ g(x)f'(x) dx   (integration by parts)⁴,

∫ f(g(x))g'(x) dx = ∫ f(u) du   (substitution rule)⁵.

⁴ www.youtube.com/watch?v=OTzLVIc-O5E
⁵ www.youtube.com/watch?v=eswQl-hcvU0
16 / 104
Calculus Primer

Example: Using integration by parts with f(x) = x and g'(x) = e^{2x}, along with the chain rule, we have

∫_0^1 x e^{2x} dx = (x e^{2x}/2)|_0^1 − ∫_0^1 (e^{2x}/2) dx = e²/2 − (e^{2x}/4)|_0^1 = (e² + 1)/4. □

Definition: Derivatives of arbitrary order k can be written as f^(k)(x) or (d^k/dx^k) f(x). By convention, f^(0)(x) = f(x).

The Taylor series expansion of f(x) about a point a is given by

f(x) = Σ_{k=0}^∞ f^(k)(a)(x − a)^k / k!.

The Maclaurin series is simply the Taylor series expanded around a = 0.

17 / 104
Calculus Primer

Example: Here are some famous Maclaurin series:

sin(x) = Σ_{k=0}^∞ (−1)^k x^{2k+1} / (2k + 1)!,

cos(x) = Σ_{k=0}^∞ (−1)^k x^{2k} / (2k)!,

e^x = Σ_{k=0}^∞ x^k / k!.

18 / 104
Calculus Primer

Example: And while we're at it, here are some miscellaneous sums that you should know:

Σ_{k=1}^n k = n(n + 1)/2,

Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6,

Σ_{k=0}^∞ p^k = 1/(1 − p)   (for −1 < p < 1).

19 / 104
Calculus Primer

Theorem: Occasionally, we run into trouble when taking indeterminate ratios of the form 0/0 or ∞/∞. In such cases, L'Hôpital's Rule⁶ is useful: If the limits lim_{x→a} f(x) and lim_{x→a} g(x) both go to 0 or both go to ∞, then

lim_{x→a} f(x)/g(x) = lim_{x→a} f'(x)/g'(x).

Example: L'Hôpital shows that

lim_{x→0} sin(x)/x = lim_{x→0} cos(x)/1 = 1. □

⁶ This rule makes me sick.
20 / 104
Calculus Primer

Computer Exercise: Let’s do some easy integration via Riemann


sums. Simply approximate the area under the nice, continuous
function f (x) from a to b by adding up the areas of n adjacent
rectangles of width ∆x = (b − a)/n and height f (xi ), where
xi = a + i∆x is the right-hand endpoint of the ith rectangle. Thus,
∫_a^b f(x) dx ≈ Σ_{i=1}^n f(x_i) Δx = [(b − a)/n] Σ_{i=1}^n f(a + i(b − a)/n).

In fact, as n → ∞, this result becomes an equality.


Try it out on ∫_0^1 sin(πx/2) dx (which secretly equals 2/π) for different values of n, and see for yourself.

21 / 104
Calculus Primer

Riemann (cont’d): Since I’m such a nice guy, I’ve made things
easy for you. In this problem, I’ve thoughtfully taken a = 0 and
b = 1, so that ∆x = 1/n and xi = i/n, which simplifies the notation
a bit. Then
∫_a^b f(x) dx = ∫_0^1 f(x) dx ≈ Σ_{i=1}^n f(x_i) Δx = (1/n) Σ_{i=1}^n sin(πi/(2n)).

For n = 100, this calculates out to a value of 0.6416, which is pretty


close to the true answer of 2/π ≈ 0.6366. 2

22 / 104
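A Python sketch of this Riemann-sum exercise (Python rather than Excel is my own assumption); it reproduces the value 0.6416 for n = 100.

import math

def riemann(f, a, b, n):
    # Right-endpoint Riemann sum with n rectangles of width (b - a)/n.
    dx = (b - a) / n
    return dx * sum(f(a + i * dx) for i in range(1, n + 1))

approx = riemann(lambda x: math.sin(math.pi * x / 2), 0.0, 1.0, 100)
print(approx, 2 / math.pi)   # roughly 0.6416 vs. 0.6366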
Calculus Primer

Computer Exercise, Trapezoid version: Same numerical


integration via the Trapezoid Rule (which usually works a little better
than Riemann). Now we have
∫_a^b f(x) dx ≈ [f(x_0)/2 + Σ_{i=1}^{n−1} f(x_i) + f(x_n)/2] Δx
            = [(b − a)/n] [f(a)/2 + Σ_{i=1}^{n−1} f(a + i(b − a)/n) + f(b)/2].

Again try it out on ∫_0^1 sin(πx/2) dx.

23 / 104
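The same integral via the Trapezoid Rule, again as a hedged Python sketch.

import math

def trapezoid(f, a, b, n):
    # Trapezoid rule: endpoints get weight 1/2, interior points weight 1.
    dx = (b - a) / n
    total = (f(a) + f(b)) / 2 + sum(f(a + i * dx) for i in range(1, n))
    return dx * total

print(trapezoid(lambda x: math.sin(math.pi * x / 2), 0.0, 1.0, 100))
# about 0.6366, very close to 2/pi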
Calculus Primer

Computer Exercise, Monte Carlo version: You will soon learn


a Monte Carlo method to accomplish approximate integration. Just
take my word for it for now. Let U1 , U2 , . . . , Un denote a sequence of
Unif(0,1) random numbers, which can be obtained from Excel using
RAND(). It can be shown that
∫_a^b f(x) dx ≈ [(b − a)/n] Σ_{i=1}^n f(a + (b − a)U_i),

with the result becoming an equality as n → ∞.

Yet again try it out on ∫_0^1 sin(πx/2) dx.

24 / 104
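A Monte Carlo sketch of the same integral in Python, using random.random() in place of Excel's RAND() (an assumption; any Unif(0,1) source works).

import math, random

def mc_integral(f, a, b, n):
    # (b - a) times the average of f evaluated at n uniform points in [a, b].
    return (b - a) / n * sum(f(a + (b - a) * random.random()) for _ in range(n))

print(mc_integral(lambda x: math.sin(math.pi * x / 2), 0.0, 1.0, 100000))
# near 2/pi = 0.6366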
Probability Primer

Outline
1 Calculus Primer
2 Probability Primer
Basics
Simulating Random Variables
Great Expectations
Functions of a Random Variable
Jointly Distributed Random Variables
Covariance and Correlation
Some Probability Distributions
Limit Theorems
3 Statistics Primer
Intro to Estimation
Unbiased Estimation
Maximum Likelihood Estimation
Distributional Results and Confidence Intervals
25 / 104
Probability Primer
Basics

Basics

Will assume that you know about sample spaces, events, and the
definition of probability.

Definition: P(A|B) ≡ P(A ∩ B)/P(B) is the conditional probability of A given B.

Example: Toss a fair die. Let A = {1, 2, 3} and B = {3, 4, 5, 6}. Then

P(A|B) = P(A ∩ B)/P(B) = (1/6)/(4/6) = 1/4. □

26 / 104
Probability Primer
Basics

Definition: If P (A ∩ B) = P (A)P (B), then A and B are


independent events.

Theorem: If A and B are independent, then P (A|B) = P (A).

Example: Toss two dice. Let A = “Sum is 7” and


B = “First die is 4”. Then

P (A) = 1/6, P (B) = 1/6, and

P (A ∩ B) = P ((4, 3)) = 1/36 = P (A)P (B).


So A and B are independent. 2

27 / 104
Probability Primer
Basics

Definition: A random variable (RV) X is a function from the


sample space Ω to the real line, i.e., X : Ω → R.

Example: Let X be the sum of two dice rolls. Then X((4, 6)) = 10. In addition,

P(X = x) =
  1/36  if x = 2
  2/36  if x = 3
  ⋮
  1/36  if x = 12
  0     otherwise   □

28 / 104
Probability Primer
Basics

Definition: If the set of possible values of a RV X is finite or countably infinite, then X is a discrete RV. Its probability mass function (pmf) is f(x) ≡ P(X = x). Note that Σ_x f(x) = 1.

Example: Flip 2 coins. Let X be the number of heads. Then

f(x) =
  1/4  if x = 0 or 2
  1/2  if x = 1
  0    otherwise   □

Examples: Here are some well-known discrete RV’s that you may
know: Bernoulli(p), Binomial(n, p), Geometric(p), Negative
Binomial, Poisson(λ), etc.

29 / 104
Probability Primer
Basics

Definition: A continuous RV is one with probability zero at every individual point, and for which there exists a probability density function (pdf) f(x) such that P(X ∈ A) = ∫_A f(x) dx for every set A. Note that ∫_R f(x) dx = 1.

Example: Pick a random number between 3 and 7. Then

f(x) =
  1/4  if 3 ≤ x ≤ 7
  0    otherwise   □

Examples: Here are some well-known continuous RV’s:


Uniform(a, b), Exponential(λ), Normal(µ, σ 2 ), etc.

Notation: “∼” means “is distributed as.” For instance,


X ∼ Unif(0, 1) means that X has the uniform distribution on [0,1].

30 / 104
Probability Primer
Basics

Definition: For any RV X (discrete or continuous), the cumulative distribution function (cdf) is

F(x) ≡ P(X ≤ x) =
  Σ_{y≤x} f(y)       if X is discrete
  ∫_{−∞}^x f(y) dy   if X is continuous

Note that lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. In addition, if X is continuous, then (d/dx) F(x) = f(x).

Example: Flip 2 coins. Let X be the number of heads. Then

F(x) =
  0    if x < 0
  1/4  if 0 ≤ x < 1
  3/4  if 1 ≤ x < 2
  1    if x ≥ 2   □

Example: if X ∼ Exp(λ) (i.e., X is exponential with parameter λ),


then f (x) = λe−λx and F (x) = 1 − e−λx , x ≥ 0. 2
31 / 104
Probability Primer
Simulating Random Variables

Simulating Random Variables

We’ll make a brief aside here to show how to simulate some very
simple random variables.

Example (Discrete Uniform): Consider a D.U. on {1, 2, . . . , n}, i.e., X = i with probability 1/n for i = 1, 2, . . . , n. (Think of this as an n-sided die toss for you Dungeons and Dragons fans.)

If U ∼ Unif(0, 1), we can obtain a D.U. random variate simply by setting X = ⌈nU⌉, where ⌈·⌉ is the “ceiling” (or “round up”) function.

For example, if n = 10 and we sample a Unif(0,1) random variable U = 0.73, then X = ⌈7.3⌉ = 8. □

32 / 104
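A one-line Python version of the ceiling trick (a sketch; the slide's own worked numbers are n = 10 and U = 0.73).

import math, random

def discrete_uniform(n):
    # X = ceil(n * U) takes each value 1, ..., n with probability 1/n.
    # (random.random() lies in [0, 1); an exact 0 is vanishingly unlikely.)
    return math.ceil(n * random.random())

print(discrete_uniform(10))       # a random value in {1, ..., 10}
print(math.ceil(10 * 0.73))       # the slide's example: 8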
Probability Primer
Simulating Random Variables

Example (Another Discrete Random Variable): Suppose

P(X = x) =
  0.25  if x = −2
  0.10  if x = 3
  0.65  if x = 4.2
  0     otherwise

Can’t use a die toss to simulate this random variable. Instead, use
what’s called the inverse transform method.

x f (x) P (X ≤ x) Unif(0,1)’s
−2 0.25 0.25 [0.00, 0.25]
3 0.10 0.35 (0.25, 0.35]
4.2 0.65 1.00 (0.35, 1.00)

Sample U ∼ Unif(0, 1). Choose the corresponding x-value, i.e.,


X = F −1 (U ). For example, U = 0.46 means that X = 4.2. 2
33 / 104
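A Python sketch of the inverse transform method for this three-point distribution; the cutoffs 0.25 and 0.35 come from the cumulative column in the table above.

import random

def draw_x(u=None):
    # Invert the cdf: U in [0, .25] -> -2, (.25, .35] -> 3, (.35, 1) -> 4.2.
    if u is None:
        u = random.random()
    if u <= 0.25:
        return -2
    elif u <= 0.35:
        return 3
    else:
        return 4.2

print(draw_x(0.46))   # the slide's example: 4.2
print(draw_x())       # a random draw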
Probability Primer
Simulating Random Variables

Now we’ll use the inverse transform method to generate a continuous


random variable. We’ll talk about the following result a little later. . .

Theorem: If X is a continuous random variable with cdf F (x), then


the random variable F (X) ∼ Unif(0, 1).

This suggests a way to generate realizations of the RV X. Simply set


F (X) = U ∼ Unif(0, 1) and solve for X = F −1 (U ).

Example: Suppose X ∼ Exp(λ). Then F(x) = 1 − e^{−λx} for x > 0. Set F(X) = 1 − e^{−λX} = U. Solving for X,

X = (−1/λ) ℓn(1 − U) ∼ Exp(λ). □

34 / 104
Probability Primer
Simulating Random Variables

Example (Generating Uniforms): All of the above RV generation


examples relied on our ability to generate a Unif(0,1) RV. For now,
let’s assume that we can generate numbers that are “practically” iid
Unif(0,1).

If you don’t like programming, you can use Excel function RAND()
or something similar to generate Unif(0,1)’s.

Here's an algorithm to generate pseudo-random numbers (PRN's), i.e., a series R1, R2, . . . of deterministic numbers that appear to be iid Unif(0,1). Pick a seed integer X0, and calculate

X_i = 16807 X_{i−1} mod (2³¹ − 1),   i = 1, 2, . . . .

Then set R_i = X_i/(2³¹ − 1), i = 1, 2, . . . .

35 / 104
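The same multiplicative generator written as a Python sketch (a straightforward transcription of the recursion above, not the original FORTRAN; in practice you would use a modern built-in generator).

M = 2**31 - 1   # 2147483647

def prn_stream(seed, n):
    # X_i = 16807 * X_{i-1} mod (2^31 - 1);  R_i = X_i / (2^31 - 1).
    x = seed
    out = []
    for _ in range(n):
        x = (16807 * x) % M
        out.append(x / M)
    return out

print(prn_stream(12345, 5))   # five numbers that look like iid Unif(0,1)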
Probability Primer
Simulating Random Variables

Here’s an easy FORTRAN implementation of the above algorithm


(from Bratley, Fox, and Schrage).

      FUNCTION UNIF(IX)
C     Integer division truncates, e.g., 5/3 = 1.
      K1 = IX/127773
C     Update the seed.
      IX = 16807*(IX - K1*127773) - K1*2836
      IF(IX.LT.0) IX = IX + 2147483647
      UNIF = IX * 4.656612875E-10
      RETURN
      END

In the above function, we input a positive integer IX and the function


returns the PRN UNIF, as well as an updated IX that we can use
again. 2

36 / 104
Probability Primer
Simulating Random Variables

Some Exercises: In the following, I'll assume that you can use Excel (or whatever) to simulate independent Unif(0,1) RV's. (We'll review independence in a little while.)
1  Make a histogram of X_i = −ℓn(U_i), for i = 1, 2, . . . , 10000, where the U_i's are independent Unif(0,1) RV's. What kind of distribution does it look like?
2  Suppose X_i and Y_i are independent Unif(0,1) RV's, i = 1, 2, . . . , 10000. Let Z_i = √(−2ℓn(X_i)) sin(2πY_i), and make a histogram of the Z_i's based on the 10000 replications.
3  Suppose X_i and Y_i are independent Unif(0,1) RV's, i = 1, 2, . . . , 10000. Let Z_i = X_i/(X_i − Y_i), and make a histogram of the Z_i's based on the 10000 replications. This may be somewhat interesting. It's possible to derive the distribution analytically, but it takes a lot of work.
37 / 104
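A Python sketch of the three exercises, assuming matplotlib is available for the histograms (the slides suggest Excel; any histogram tool works).

import math, random
import matplotlib.pyplot as plt

def unif():
    # Unif(0,1) draw; 1 - random.random() lies in (0, 1], which avoids log(0) below.
    return 1.0 - random.random()

n = 10000
ex1 = [-math.log(unif()) for _ in range(n)]                     # Exercise 1
ex2 = [math.sqrt(-2 * math.log(unif())) * math.sin(2 * math.pi * unif())
       for _ in range(n)]                                       # Exercise 2
ex3 = [u / (u - v)                                              # Exercise 3
       for u, v in ((unif(), unif()) for _ in range(n)) if u != v]

for data, title in [(ex1, "Exercise 1"), (ex2, "Exercise 2"), (ex3, "Exercise 3")]:
    plt.figure()
    plt.hist(data, bins=100)   # Exercise 3 is heavy-tailed; consider restricting the range
    plt.title(title)
plt.show()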
Probability Primer
Great Expectations

Great Expectations

Definition: The expected value (or mean) of a RV X is

E[X] ≡
  Σ_x x f(x)      if X is discrete
  ∫_R x f(x) dx   if X is continuous
     = ∫_R x dF(x).

Example: Suppose that X ∼ Bernoulli(p). Then

X =
  1  with prob. p
  0  with prob. 1 − p (= q)

and we have E[X] = Σ_x x f(x) = p. □

38 / 104
Probability Primer
Great Expectations

Example: Suppose that X ∼ Uniform(a, b). Then

f(x) =
  1/(b − a)  if a < x < b
  0          otherwise

and we have E[X] = ∫_R x f(x) dx = (a + b)/2. □

Example: Suppose that X ∼ Exponential(λ). Then

f(x) =
  λe^{−λx}  if x > 0
  0         otherwise

and we have (after integration by parts and L'Hôpital's Rule)

E[X] = ∫_R x f(x) dx = ∫_0^∞ x λe^{−λx} dx = 1/λ. □

39 / 104
Probability Primer
Great Expectations

Def/Thm (the “Law of the Unconscious Statistician” or “LOTUS”): Suppose that h(X) is some function of the RV X. Then

E[h(X)] =
  Σ_x h(x) f(x)      if X is discrete
  ∫_R h(x) f(x) dx   if X is cts
     = ∫_R h(x) dF(x).

The function h(X) can be anything “nice”, e.g., h(X) = X 2 or 1/X


or sin(X) or `n(X).

40 / 104
Probability Primer
Great Expectations

Example: Suppose X is the following discrete RV:

x      2    3    4
f(x)   0.3  0.6  0.1

Then E[X³] = Σ_x x³ f(x) = 8(0.3) + 27(0.6) + 64(0.1) = 25. □

Example: Suppose X ∼ Unif(0, 2). Then

E[X^n] = ∫_R x^n f(x) dx = 2^n/(n + 1). □

41 / 104
Probability Primer
Great Expectations

Definitions: E[X^n] is the nth moment of X.

E[(X − E[X])^n] is the nth central moment of X.

Var(X) ≡ E[(X − E[X])²] is the variance of X.

The standard deviation of X is √Var(X).

Theorem: Var(X) = E[X²] − (E[X])² (sometimes easier to calculate this way).

42 / 104
Probability Primer
Great Expectations

Example: Suppose X ∼ Bern(p). Recall that E[X] = p. Then

E[X²] = Σ_x x² f(x) = p   and

Var(X) = E[X²] − (E[X])² = p(1 − p). □

Example: Suppose X ∼ Exp(λ). By LOTUS,

E[X^n] = ∫_0^∞ x^n λe^{−λx} dx = n!/λ^n.

Var(X) = E[X²] − (E[X])² = 2/λ² − (1/λ)² = 1/λ². □

43 / 104
Probability Primer
Great Expectations

Theorem: E[aX + b] = aE[X] + b and Var(aX + b) = a²Var(X).

Example: If X ∼ Exp(3), then

E[−2X + 7] = −2E[X] + 7 = −2/3 + 7.

Var(−2X + 7) = (−2)²Var(X) = 4/9. □

44 / 104
Probability Primer
Great Expectations

Definition: M_X(t) ≡ E[e^{tX}] is the moment generating function (mgf) of the RV X. (M_X(t) is a function of t, not of X!)

Example: X ∼ Bern(p). Then

M_X(t) = E[e^{tX}] = Σ_x e^{tx} f(x) = e^{t·1} p + e^{t·0} q = pe^t + q. □

Example: X ∼ Exp(λ). Then

M_X(t) = ∫_R e^{tx} f(x) dx = λ ∫_0^∞ e^{(t−λ)x} dx = λ/(λ − t)   if λ > t. □

Theorem: Under certain technical conditions,

E[X^k] = (d^k/dt^k) M_X(t) |_{t=0},   k = 1, 2, . . . .

Thus, you can generate the moments of X from the mgf.


45 / 104
Probability Primer
Great Expectations

Example: X ∼ Exp(λ). Then M_X(t) = λ/(λ − t) for λ > t. So

E[X] = (d/dt) M_X(t) |_{t=0} = λ/(λ − t)² |_{t=0} = 1/λ.

Further,

E[X²] = (d²/dt²) M_X(t) |_{t=0} = 2λ/(λ − t)³ |_{t=0} = 2/λ².

Thus,

Var(X) = E[X²] − (E[X])² = 2/λ² − 1/λ² = 1/λ². □
Moment generating functions have many other important uses, some
of which we’ll talk about in this course.

46 / 104
Probability Primer
Functions of a Random Variable

Functions of a Random Variable

Problem: Suppose we have a RV X with pmf/pdf f (x). Let


Y = h(X). Find g(y), the pmf/pdf of Y .

Examples (take my word for it for now):

If X ∼ Nor(0, 1), then Y = X² ∼ χ²(1).

If U ∼ Unif(0, 1), then Y = −(1/λ) ℓn(U) ∼ Exp(λ).

47 / 104
Probability Primer
Functions of a Random Variable

Discrete Example: Let X denote the number of H's from two coin tosses. We want the pmf for Y = X³ − X.

x             0    1    2
f(x)          1/4  1/2  1/4
y = x³ − x    0    0    6

This implies that g(0) = P(Y = 0) = P(X = 0 or 1) = 3/4 and g(6) = P(Y = 6) = 1/4. In other words,

g(y) =
  3/4  if y = 0
  1/4  if y = 6   □

48 / 104
Probability Primer
Functions of a Random Variable

Continuous Example: Suppose X has pdf f(x) = |x|, −1 ≤ x ≤ 1. Find the pdf of Y = X².

First of all, the cdf of Y is

G(y) = P(Y ≤ y)
     = P(X² ≤ y)
     = P(−√y ≤ X ≤ √y)
     = ∫_{−√y}^{√y} |x| dx = y,   0 < y < 1.

Thus, the pdf of Y is g(y) = G'(y) = 1, 0 < y < 1, indicating that Y ∼ Unif(0, 1). □

49 / 104
Probability Primer
Functions of a Random Variable

Inverse Transform Theorem: Suppose X is a continuous random variable having cdf F(x). Then, amazingly, F(X) ∼ Unif(0, 1).

Proof: Let Y = F(X). Then the cdf of Y is

P(Y ≤ y) = P(F(X) ≤ y)
         = P(X ≤ F⁻¹(y))
         = F(F⁻¹(y)) = y,

which is the cdf of the Unif(0,1). □

This result is of fundamental importance when it comes to generating


random variates during a simulation.

50 / 104
Probability Primer
Functions of a Random Variable

Example (how to generate exponential RV's): Suppose X ∼ Exp(λ), with cdf F(x) = 1 − e^{−λx} for x ≥ 0.

So the Inverse Transform Theorem implies that F(X) = 1 − e^{−λX} ∼ Unif(0, 1).

Let U ∼ Unif(0, 1) and set F(X) = U. Then we have

X = (−1/λ) ℓn(1 − U) ∼ Exp(λ).

For instance, if λ = 2 and U = 0.27, then X = 0.157 is an Exp(2) realization. □

51 / 104
Probability Primer
Functions of a Random Variable

Exercise: Suppose that X has the Weibull distribution with cdf

F(x) = 1 − e^{−(λx)^β},   x > 0.

If you set F(X) = U and solve for X, show that you get

X = (1/λ)[−ℓn(1 − U)]^{1/β}.

Now pick your favorite λ and β, and use this result to generate values of X. In fact, make a histogram of your X values. Are there any interesting values of λ and β you could've chosen?

52 / 104
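A Python sketch of the Weibull exercise via the inverse transform derived above (the λ and β values below are arbitrary picks, not from the slides).

import math, random

lam, beta = 1.0, 2.0   # arbitrary choices; try others

def weibull(lam, beta):
    # X = (1/lambda) * (-ln(1 - U))^(1/beta), from F(x) = 1 - exp(-(lambda*x)^beta).
    u = random.random()
    return (1 / lam) * (-math.log(1 - u)) ** (1 / beta)

sample = [weibull(lam, beta) for _ in range(10000)]
print(sum(sample) / len(sample))            # sample mean
print(math.gamma(1 + 1 / beta) / lam)       # theoretical mean for this parameterization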
Probability Primer
Functions of a Random Variable

Bonus Theorem: Here's another way to get the pdf of Y = h(X) for some nice continuous function h(·). The cdf of Y is

F_Y(y) = P(Y ≤ y) = P(h(X) ≤ y) = P(X ≤ h⁻¹(y)).

By the chain rule (and since a pdf must be ≥ 0), the pdf of Y is

f_Y(y) = (d/dy) F_Y(y) = f_X(h⁻¹(y)) |(d/dy) h⁻¹(y)|.

And now, here's how to prove LOTUS!

E[Y] = ∫_R y f_Y(y) dy = ∫_R y f_X(h⁻¹(y)) |(d/dy) h⁻¹(y)| dy
     “=” ∫_R y f_X(h⁻¹(y)) dh⁻¹(y) = ∫_R h(x) f_X(x) dx. □

53 / 104
Probability Primer
Jointly Distributed Random Variables

Jointly Distributed Random Variables

Consider two random variables interacting together — think height


and weight.

Definition: The joint cdf of X and Y is

F (x, y) ≡ P (X ≤ x, Y ≤ y), for all x, y.

Remark: The marginal cdf of X is FX (x) = F (x, ∞). (We use the
X subscript to remind us that it’s just the cdf of X all by itself.)
Similarly, the marginal cdf of Y is FY (y) = F (∞, y).

54 / 104
Probability Primer
Jointly Distributed Random Variables

Definition: If X and Y are discrete, then the joint pmf of X and Y is f(x, y) ≡ P(X = x, Y = y). Note that Σ_x Σ_y f(x, y) = 1.

Remark: The marginal pmf of X is

f_X(x) = P(X = x) = Σ_y f(x, y).

The marginal pmf of Y is

f_Y(y) = P(Y = y) = Σ_x f(x, y).

Example: The following table gives the joint pmf f(x, y), along with the accompanying marginals.

f(x, y)   X = 2   X = 3   X = 4   f_Y(y)
Y = 4     0.3     0.2     0.1     0.6
Y = 6     0.1     0.2     0.1     0.4
f_X(x)    0.4     0.4     0.2     1

55 / 104
Probability Primer
Jointly Distributed Random Variables

Definition: If X and Y are continuous, then the joint pdf of X and Y is f(x, y) ≡ ∂²F(x, y)/∂x∂y. Note that ∫_R ∫_R f(x, y) dx dy = 1.

Remark: The marginal pdf's of X and Y are

f_X(x) = ∫_R f(x, y) dy   and   f_Y(y) = ∫_R f(x, y) dx.

Example: Suppose the joint pdf is

f(x, y) = (21/4) x²y,   x² ≤ y ≤ 1.

Then the marginal pdf's are:

f_X(x) = ∫_R f(x, y) dy = ∫_{x²}^1 (21/4) x²y dy = (21/8) x²(1 − x⁴),   −1 ≤ x ≤ 1,

and

f_Y(y) = ∫_R f(x, y) dx = ∫_{−√y}^{√y} (21/4) x²y dx = (7/2) y^{5/2},   0 ≤ y ≤ 1. □
56 / 104
Probability Primer
Jointly Distributed Random Variables

Definition: X and Y are independent RV's if f(x, y) = f_X(x)f_Y(y) for all x, y.

Theorem: X and Y are independent if you can write their joint pdf as f(x, y) = a(x)b(y) for some functions a(x) and b(y), and x and y don't have funny limits (their domains do not depend on each other).

Examples: If f(x, y) = cxy for 0 ≤ x ≤ 2, 0 ≤ y ≤ 3, then X and Y are independent.

If f(x, y) = (21/4) x²y for x² ≤ y ≤ 1, then X and Y are not independent.

If f(x, y) = c/(x + y) for 1 ≤ x ≤ 2, 1 ≤ y ≤ 3, then X and Y are not independent. □

57 / 104
Probability Primer
Jointly Distributed Random Variables

Definition: The conditional pdf (or pmf) of Y given X = x is f(y|x) ≡ f(x, y)/f_X(x) (assuming f_X(x) > 0).

This is a legit pmf/pdf. For example, in the continuous case, ∫_R f(y|x) dy = 1, for any x.

Example: Suppose f(x, y) = (21/4) x²y for x² ≤ y ≤ 1. Then

f(y|x) = f(x, y)/f_X(x) = [(21/4) x²y] / [(21/8) x²(1 − x⁴)] = 2y/(1 − x⁴),   x² ≤ y ≤ 1. □

Theorem: If X and Y are independent, then f(y|x) = f_Y(y) for all x, y.

Proof: By the definitions of conditional pdf and independence,

f(y|x) = f(x, y)/f_X(x) = f_X(x)f_Y(y)/f_X(x) = f_Y(y). □
58 / 104
Probability Primer
Jointly Distributed Random Variables

Definition: The conditional expectation of Y given X = x is

E[Y |X = x] ≡
  Σ_y y f(y|x)      discrete
  ∫_R y f(y|x) dy   continuous

Example: The expected weight of a person who is 7 feet tall (E[Y |X = 7]) will probably be greater than that of a random person from the entire population (E[Y]).

Old Cts Example: f(x, y) = (21/4) x²y, if x² ≤ y ≤ 1. Then

E[Y |x] = ∫_R y f(y|x) dy = ∫_{x²}^1 [2y²/(1 − x⁴)] dy = (2/3) · (1 − x⁶)/(1 − x⁴). □

59 / 104
Probability Primer
Jointly Distributed Random Variables

Theorem (double expectations): E[E(Y |X)] = E[Y].

Proof (cts case): By the Unconscious Statistician,

E[E(Y |X)] = ∫_R E(Y |x) f_X(x) dx
           = ∫_R [∫_R y f(y|x) dy] f_X(x) dx
           = ∫_R ∫_R y f(y|x) f_X(x) dx dy
           = ∫_R y ∫_R f(x, y) dx dy
           = ∫_R y f_Y(y) dy = E[Y]. □

60 / 104
Probability Primer
Jointly Distributed Random Variables

Old Example: Suppose f(x, y) = (21/4) x²y, if x² ≤ y ≤ 1. By previous examples, we know f_X(x), f_Y(y), and E[Y |x]. Find E[Y].

Solution #1 (old, boring way):

E[Y] = ∫_R y f_Y(y) dy = ∫_0^1 (7/2) y^{7/2} dy = 7/9.

Solution #2 (new, exciting way):

E[Y] = E[E(Y |X)] = ∫_R E(Y |x) f_X(x) dx
     = ∫_{−1}^1 [(2/3) · (1 − x⁶)/(1 − x⁴)] · (21/8) x²(1 − x⁴) dx = 7/9.

Notice that both answers are the same (good)! □


61 / 104
Probability Primer
Jointly Distributed Random Variables

Example: A cutesy way to calculate the mean of the Geometric


distribution.

Let Y ∼ Geom(p); e.g., Y could be the number of coin flips until H appears, where P(H) = p. From Baby Probability class, we know that the pmf of Y is f_Y(y) = P(Y = y) = q^{y−1} p, for y = 1, 2, . . . .

Then the old-fashioned way to calculate the mean is:

E[Y] = Σ_y y f_Y(y) = Σ_{y=1}^∞ y q^{y−1} p = 1/p,

where the last step follows because I tell you so. □

But if you are not quite willing to believe me,. . .

62 / 104
Probability Primer
Jointly Distributed Random Variables

. . . Let’s use double expectation to do what’s called a “standard


one-step conditioning argument”. Define X = 1 if the first flip is H;
and X = 0 otherwise.

Based on the result X of the first step, we have

E[Y] = E[E(Y |X)] = Σ_x E(Y |x) f_X(x)
     = E(Y |X = 0)P(X = 0) + E(Y |X = 1)P(X = 1)
     = (1 + E[Y])(1 − p) + 1(p). (why?)

Solving, we get E[Y ] = 1/p again! 2

63 / 104
Probability Primer
Jointly Distributed Random Variables

Computing Probabilities by Conditioning

Let A be some event, and define the RV Y = 1 if A occurs, and Y = 0 otherwise. Then

E[Y] = Σ_y y f_Y(y) = P(Y = 1) = P(A).

Similarly, for any RV X, we have

E[Y |X = x] = Σ_y y f_Y(y|x) = P(Y = 1|X = x) = P(A|X = x).

64 / 104
Probability Primer
Jointly Distributed Random Variables

Thus,

P(A) = E[Y] = E[E(Y |X)] = ∫_R E[Y |X = x] dF_X(x) = ∫_R P(A|X = x) dF_X(x).

Example/Theorem: If X and Y are independent cts RV's, then

P(Y < X) = ∫_R P(Y < x) f_X(x) dx.

Proof: Follows from the above result if we let the event A = {Y < X}. □

65 / 104
Probability Primer
Jointly Distributed Random Variables

Example: If X ∼ Exp(µ) and Y ∼ Exp(λ) are independent RV's, then

P(Y < X) = ∫_R P(Y < x) f_X(x) dx
         = ∫_0^∞ (1 − e^{−λx}) µe^{−µx} dx
         = λ/(λ + µ). □

66 / 104
Probability Primer
Jointly Distributed Random Variables

Theorem (variance decomposition):

Var(Y) = E[Var(Y |X)] + Var[E(Y |X)].

Proof (from Ross): By the definition of variance and double expectation,

E[Var(Y |X)] = E[E(Y²|X) − {E(Y |X)}²] = E(Y²) − E[{E(Y |X)}²].

Similarly,

Var[E(Y |X)] = E[{E(Y |X)}²] − {E[E(Y |X)]}² = E[{E(Y |X)}²] − {E(Y)}².

Thus,

E[Var(Y |X)] + Var[E(Y |X)] = E(Y²) − {E(Y)}² = Var(Y). □
67 / 104
Probability Primer
Covariance and Correlation

“Definition” (two-dimensional LOTUS): Suppose that h(X, Y) is some function of the RV's X and Y. Then

E[h(X, Y)] =
  Σ_x Σ_y h(x, y) f(x, y)        if (X, Y) is discrete
  ∫_R ∫_R h(x, y) f(x, y) dx dy  if (X, Y) is continuous

Theorem: Whether or not X and Y are independent, we have


E[X + Y ] = E[X] + E[Y ].

Theorem: If X and Y are independent, then


Var(X + Y ) = Var(X) + Var(Y ).

(Stay tuned for dependent case.)

68 / 104
Probability Primer
Covariance and Correlation

Definition: X1, . . . , Xn form a random sample from f(x) if (i) X1, . . . , Xn are independent, and (ii) each Xi has the same pdf (or pmf) f(x).

Notation: X1, . . . , Xn ∼ iid f(x). (The term “iid” reads independent and identically distributed.)

Example: If X1, . . . , Xn are iid from f(x) and the sample mean X̄n ≡ Σ_{i=1}^n Xi/n, then E[X̄n] = E[Xi] and Var(X̄n) = Var(Xi)/n. Thus, the variance decreases as n increases. □

But not all RV’s are independent...

69 / 104
Probability Primer
Covariance and Correlation

Covariance and Correlation


Definition: The covariance between X and Y is

Cov(X, Y) ≡ E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y].

Note that Var(X) = Cov(X, X).

Theorem: If X and Y are independent RV's, then Cov(X, Y) = 0.

Remark: Cov(X, Y) = 0 doesn't mean X and Y are independent!

Example: Suppose X ∼ Unif(−1, 1) and Y = X². Then X and Y are clearly dependent. However,

Cov(X, Y) = E[X³] − E[X]E[X²] = E[X³] = ∫_{−1}^1 (x³/2) dx = 0. □

70 / 104
Probability Primer
Covariance and Correlation

Theorem: Cov(aX, bY) = ab Cov(X, Y).

Theorem: Whether or not X and Y are independent,

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)   and

Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y).

Definition: The correlation between X and Y is

ρ ≡ Cov(X, Y) / √(Var(X)Var(Y)).

Theorem: −1 ≤ ρ ≤ 1.

71 / 104
Probability Primer
Covariance and Correlation

Example: Consider the following joint pmf.

f (x, y) X=2 X=3 X=4 fY (y)


Y = 40 0.00 0.20 0.10 0.3
Y = 50 0.15 0.10 0.05 0.3
Y = 60 0.30 0.00 0.10 0.4
fX (x) 0.45 0.30 0.25 1

E[X] = 2.8, Var(X) = 0.66, E[Y] = 51, Var(Y) = 69,

E[XY] = Σ_x Σ_y x y f(x, y) = 140,

and

ρ = (E[XY] − E[X]E[Y]) / √(Var(X)Var(Y)) = −0.415. □
72 / 104
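A quick Python check of these numbers, with the joint pmf from the table typed in directly (a sketch, not part of the original slides).

import math

pmf = {(2, 40): 0.00, (3, 40): 0.20, (4, 40): 0.10,
       (2, 50): 0.15, (3, 50): 0.10, (4, 50): 0.05,
       (2, 60): 0.30, (3, 60): 0.00, (4, 60): 0.10}

EX  = sum(x * p for (x, y), p in pmf.items())
EY  = sum(y * p for (x, y), p in pmf.items())
EXY = sum(x * y * p for (x, y), p in pmf.items())
VarX = sum(x * x * p for (x, y), p in pmf.items()) - EX**2
VarY = sum(y * y * p for (x, y), p in pmf.items()) - EY**2

rho = (EXY - EX * EY) / math.sqrt(VarX * VarY)
print(EX, VarX, EY, VarY, EXY, rho)   # 2.8, 0.66, 51, 69, 140, about -0.415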
Probability Primer
Covariance and Correlation

Portfolio Example: Consider two assets, S1 and S2, with expected returns E[S1] = µ1 and E[S2] = µ2, and variabilities Var(S1) = σ1², Var(S2) = σ2², and Cov(S1, S2) = σ12.

Define a portfolio P = wS1 + (1 − w)S2, where w ∈ [0, 1]. Then

E[P] = wµ1 + (1 − w)µ2,

Var(P) = w²σ1² + (1 − w)²σ2² + 2w(1 − w)σ12.

Setting (d/dw)Var(P) = 0, we obtain the critical point that (hopefully) minimizes the variance of the portfolio,

w = (σ2² − σ12) / (σ1² + σ2² − 2σ12). □

73 / 104
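A small Python sketch of the portfolio calculation; the numerical inputs below are placeholders to swap out for your own µ's, σ²'s, and σ12.

def portfolio(mu1, mu2, var1, var2, cov12, w):
    # Mean and variance of P = w*S1 + (1 - w)*S2.
    mean = w * mu1 + (1 - w) * mu2
    var = w**2 * var1 + (1 - w)**2 * var2 + 2 * w * (1 - w) * cov12
    return mean, var

def min_variance_weight(var1, var2, cov12):
    # Critical point of Var(P): (sigma2^2 - sigma12) / (sigma1^2 + sigma2^2 - 2*sigma12).
    return (var2 - cov12) / (var1 + var2 - 2 * cov12)

w_star = min_variance_weight(0.1, 0.3, 0.05)               # placeholder numbers
print(w_star, portfolio(0.15, 0.08, 0.1, 0.3, 0.05, w_star))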
Probability Primer
Covariance and Correlation

Portfolio Exercise: Suppose E[S1 ] = 0.2, E[S2 ] = 0.1,


Var(S1 ) = 0.2, Var(S2 ) = 0.4, and Cov(S1 , S2 ) = −0.1.

What value of w maximizes the expected return of the portfolio?

What value of w minimizes the variance? (Note the negative


covariance I’ve introduced into the picture.)

Let’s talk trade-offs.

74 / 104
Probability Primer
Some Probability Distributions

Some Probability Distributions


First, some discrete distributions. . .

X ∼ Bernoulli(p).
(
p if x = 1
f (x) =
1 − p (= q) if x = 0

E[X] = p, Var(X) = pq, MX (t) = pet + q.

iid
Y ∼ Binomial(n, p). If X1 , X
P2 , . . . , Xn ∼ Bern(p) (i.e.,
n
Bernoulli(p) trials), then Y = i=1 Xi ∼ Bin(n, p).
!
n y n−y
f (y) = p q , y = 0, 1, . . . , n.
y

E[Y ] = np, Var(Y ) = npq, MY (t) = (pet + q)n .


75 / 104
Probability Primer
Some Probability Distributions

X ∼ Geometric(p) is the number of Bern(p) trials until a success occurs. For example, “FFFS” implies that X = 4.

f(x) = q^{x−1} p,   x = 1, 2, . . . .

E[X] = 1/p, Var(X) = q/p², M_X(t) = pe^t/(1 − qe^t).

Y ∼ NegBin(r, p) is the sum of r iid Geom(p) RV's, i.e., the time until the rth success occurs. For example, “FFFSSFS” implies that NegBin(3, p) = 7.

f(y) = (y−1 choose r−1) q^{y−r} p^r,   y = r, r + 1, . . . .

E[Y] = r/p, Var(Y) = qr/p².

76 / 104
Probability Primer
Some Probability Distributions

X ∼ Poisson(λ).

Definition: A counting process N(t) tallies the number of “arrivals” observed in [0, t]. A Poisson process is a counting process satisfying the following.
i. Arrivals occur one-at-a-time at rate λ (e.g., λ = 4 customers/hr).
ii. Independent increments, i.e., the numbers of arrivals in disjoint time intervals are independent.
iii. Stationary increments, i.e., the distribution of the number of arrivals in [s, s + t] only depends on t.

X ∼ Pois(λ) is the number of arrivals that a Poisson process experiences in one time unit, i.e., N(1).

f(x) = e^{−λ} λ^x / x!,   x = 0, 1, . . . .

E[X] = λ = Var(X), M_X(t) = e^{λ(e^t − 1)}.
77 / 104
Probability Primer
Some Probability Distributions

Now, some continuous distributions. . .

X ∼ Uniform(a, b). f(x) = 1/(b − a) for a ≤ x ≤ b, E[X] = (a + b)/2, Var(X) = (b − a)²/12, M_X(t) = (e^{tb} − e^{ta})/(t(b − a)).

X ∼ Exponential(λ). f(x) = λe^{−λx} for x ≥ 0, E[X] = 1/λ, Var(X) = 1/λ², M_X(t) = λ/(λ − t) for t < λ.

Theorem: The exponential distribution has the memoryless property (and is the only continuous distribution with this property), i.e., for s, t > 0, P(X > s + t | X > s) = P(X > t).

Example: Suppose X ∼ Exp(λ = 1/100). Then

P(X > 200 | X > 50) = P(X > 150) = e^{−λt} = e^{−150/100}. □

78 / 104
Probability Primer
Some Probability Distributions

X ∼ Gamma(α, λ).

f(x) = λ^α x^{α−1} e^{−λx} / Γ(α),   x ≥ 0,

where the gamma function is Γ(α) ≡ ∫_0^∞ t^{α−1} e^{−t} dt.

E[X] = α/λ, Var(X) = α/λ², M_X(t) = [λ/(λ − t)]^α for t < λ.

If X1, X2, . . . , Xn are iid Exp(λ), then Y ≡ Σ_{i=1}^n Xi ∼ Gamma(n, λ). The Gamma(n, λ) is also called the Erlang_n(λ). It has cdf

F_Y(y) = 1 − e^{−λy} Σ_{j=0}^{n−1} (λy)^j / j!,   y ≥ 0.

79 / 104
Probability Primer
Some Probability Distributions

X ∼ Triangular(a, b, c). Good for modeling things with limited data: a is the smallest possible value, b is the “most likely,” and c is the largest.

f(x) =
  2(x − a) / [(b − a)(c − a)]  if a < x ≤ b
  2(c − x) / [(c − b)(c − a)]  if b < x ≤ c
  0                           otherwise

E[X] = (a + b + c)/3.

X ∼ Beta(a, b). f(x) = [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1}(1 − x)^{b−1} for 0 ≤ x ≤ 1 and a, b > 0.

E[X] = a/(a + b)   and   Var(X) = ab / [(a + b)²(a + b + 1)].

80 / 104
Probability Primer
Some Probability Distributions

X ∼ Normal(µ, σ²). Most important distribution.

f(x) = [1/√(2πσ²)] exp{−(x − µ)²/(2σ²)},   x ∈ R.

E[X] = µ, Var(X) = σ², and M_X(t) = exp(µt + σ²t²/2).

Theorem: If X ∼ Nor(µ, σ²), then aX + b ∼ Nor(aµ + b, a²σ²).

Corollary: If X ∼ Nor(µ, σ²), then Z ≡ (X − µ)/σ ∼ Nor(0, 1), the standard normal distribution, with pdf φ(z) ≡ (1/√(2π)) e^{−z²/2} and cdf Φ(z), which is tabled. E.g., Φ(1.96) ≈ 0.975.

Theorem: If X1 and X2 are independent with Xi ∼ Nor(µi, σi²), i = 1, 2, then X1 + X2 ∼ Nor(µ1 + µ2, σ1² + σ2²).

Example: Suppose X ∼ Nor(3, 4), Y ∼ Nor(4, 6), and X and Y are independent. Then 2X − 3Y + 1 ∼ Nor(−5, 70). □
81 / 104
Probability Primer
Limit Theorems

Limit Theorems

Corollary (of a previous theorem): If X1 , . . . , Xn are iid Nor(µ, σ 2 ),


then the sample mean X̄n ∼ Nor(µ, σ 2 /n).

This is a special case of the Law of Large Numbers, which says that
X̄n approximates µ well as n becomes large.

Definition: The sequence of RV’s Y1 , Y2 , . . . with respective cdf’s


FY1 (y), FY2 (y), . . . converges in distribution to the RV Y having cdf
FY (y) if limn→∞ FYn (y) = FY (y) for all y belonging to the
d
continuity set of Y . Notation: Yn −→ Y .

d
Idea: If Yn −→ Y and n is large, then you ought to be able to
approximate the distribution of Yn by the limit distribution of Y .

82 / 104
Probability Primer
Limit Theorems

Central Limit Theorem: If X1, X2, . . . , Xn are iid from f(x) with mean µ and variance σ², then

Zn ≡ (Σ_{i=1}^n Xi − nµ)/(√n σ) = √n(X̄n − µ)/σ →d Nor(0, 1).

Thus, the cdf of Zn approaches Φ(z) as n increases.

The CLT is the most-important theorem in the universe.

The CLT usually works well if the pmf/pdf is fairly symmetric and n ≥ 15.

We will eventually look at more-general versions of the CLT described above.

83 / 104
Probability Primer
Limit Theorems

Example: If X1, X2, . . . , X100 are iid Exp(1) (so µ = σ² = 1), then

P(90 ≤ Σ_{i=1}^{100} Xi ≤ 110)
  = P((90 − 100)/√100 ≤ Z100 ≤ (110 − 100)/√100)
  ≈ P(−1 ≤ Nor(0, 1) ≤ 1) = 0.6827.

By the way, since Σ_{i=1}^{100} Xi ∼ Erlang_{k=100}(λ = 1), we can use the cdf (which may be tedious) or software such as Minitab to obtain the exact value of P(90 ≤ Σ_{i=1}^{100} Xi ≤ 110) = 0.6835.

Wow! The CLT and exact answers match nicely! □

84 / 104
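A Python sketch that checks this example by simulation, using the standard library's random.expovariate for the Exp(1)'s (my own choice; the slides mention Minitab).

import random

def estimate(reps=20000):
    # Estimate P(90 <= sum of 100 iid Exp(1) <= 110) by Monte Carlo.
    hits = 0
    for _ in range(reps):
        s = sum(random.expovariate(1.0) for _ in range(100))
        if 90 <= s <= 110:
            hits += 1
    return hits / reps

print(estimate())   # close to the exact Erlang value 0.6835 and the CLT value 0.6827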
Probability Primer
Limit Theorems

Exercise: Demonstrate that the CLT actually works.


1 Pick your favorite RV X1 . Simulate it and make a histogram.
2 Now suppose X1 and X2 are iid from your favorite distribution.
Make a histogram of X1 + X2 .
3 Now X1 + X2 + X3 .
4 . . . Now X1 + X2 + · · · + Xn for some reasonably large n.
5 Does the CLT work for the Cauchy distribution, i.e.,
X = tan(2πU ), where U ∼ Unif(0, 1)?

85 / 104
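A Python sketch of the CLT exercise: pick a distribution, sum n copies, and histogram the sums (matplotlib assumed for the plot). Swapping in the Cauchy generator from item 5 is a good way to watch the CLT fail.

import math, random
import matplotlib.pyplot as plt

def draw():
    # "Favorite RV": an Exp(1) here; replace with anything, e.g.,
    # math.tan(2 * math.pi * random.random()) for the Cauchy of item 5.
    return random.expovariate(1.0)

n, reps = 30, 10000
sums = [sum(draw() for _ in range(n)) for _ in range(reps)]
plt.hist(sums, bins=60)   # for Exp(1), should look roughly Nor(n*mu, n*sigma^2) = Nor(30, 30)
plt.show()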
Statistics Primer

Outline
1 Calculus Primer
2 Probability Primer
Basics
Simulating Random Variables
Great Expectations
Functions of a Random Variable
Jointly Distributed Random Variables
Covariance and Correlation
Some Probability Distributions
Limit Theorems
3 Statistics Primer
Intro to Estimation
Unbiased Estimation
Maximum Likelihood Estimation
Distributional Results and Confidence Intervals
86 / 104
Statistics Primer
Intro to Estimation

Intro to Estimation

Definition: A statistic is a function of the observations X1 , . . . , Xn ,


and not explicitly dependent on any unknown parameters.
Examples of statistics: X̄ = (1/n) Σ_{i=1}^n Xi,   S² = [1/(n − 1)] Σ_{i=1}^n (Xi − X̄)².

Statistics are random variables. If we take two different samples,


we’d expect to get two different values of a statistic.

A statistic is usually used to estimate some unknown parameter from


the underlying probability distribution of the Xi ’s.

Examples of parameters: µ, σ 2 .

87 / 104
Statistics Primer
Intro to Estimation

Let X1 , . . . , Xn be iid RV’s and let T (X) ≡ T (X1 , . . . , Xn ) be a


statistic based on the Xi ’s. Suppose we use T (X) to estimate some
unknown parameter θ. Then T (X) is called a point estimator for θ.

Examples: X̄ is usually a point estimator for the mean µ = E[Xi ],


and S 2 is often a point estimator for the variance σ 2 = Var(Xi ).

It would be nice if T (X) had certain properties:

* Its expected value should equal the parameter it’s trying to estimate.

* It should have low variance.

88 / 104
Statistics Primer
Unbiased Estimation

Unbiased Estimators
Definition: T (X) is unbiased for θ if E[T (X)] = θ.

Example/Theorem: Suppose X1 , . . . , Xn are iid anything with


mean µ. Then

E[X̄] = E[(1/n) Σ_{i=1}^n Xi] = (1/n) Σ_{i=1}^n E[Xi] = E[Xi] = µ.

So X̄ is always unbiased for µ. That’s why X̄ is called the sample


mean.

Baby Example: In particular, suppose X1 , . . . , Xn are iid Exp(λ).


Then X̄ is unbiased for µ = E[Xi ] = 1/λ.

But be careful. . . 1/X̄ is biased for λ in this exponential case, i.e., E[1/X̄] ≠ 1/E[X̄] = λ.
89 / 104
Statistics Primer
Unbiased Estimation

Example/Theorem: Suppose X1, . . . , Xn are iid anything with mean µ and variance σ². Then

E[S²] = E[Σ_{i=1}^n (Xi − X̄)²/(n − 1)] = Var(Xi) = σ².

Thus, S² is always unbiased for σ². This is why S² is called the sample variance.

Baby Example: Suppose X1 , . . . , Xn are iid Exp(λ). Then S 2 is


unbiased for Var(Xi ) = 1/λ2 .

90 / 104
Statistics Primer
Unbiased Estimation

Proof (of general result): First, some algebra gives

S² = Σ_{i=1}^n (Xi − X̄)²/(n − 1) = (Σ_{i=1}^n Xi² − nX̄²)/(n − 1).

Since E[X1] = E[X̄] and Var(X̄) = Var(X1)/n = σ²/n, we have

E[S²] = (Σ_{i=1}^n E[Xi²] − nE[X̄²])/(n − 1) = [n/(n − 1)] (E[X1²] − E[X̄²])
      = [n/(n − 1)] (Var(X1) + (E[X1])² − Var(X̄) − (E[X̄])²)
      = [n/(n − 1)] (σ² − σ²/n) = σ². □

Remark: S is biased for the standard deviation σ.

91 / 104
Statistics Primer
Unbiased Estimation
Big Example: Suppose that X1, . . . , Xn are iid Unif(0, θ), i.e., the pdf is f(x) = 1/θ, 0 < x < θ.

Consider two estimators: Y1 ≡ 2X̄ and Y2 ≡ [(n + 1)/n] max_{1≤i≤n} Xi.

Since E[Y1] = 2E[X̄] = 2E[Xi] = θ, we see that Y1 is unbiased for θ.

It's also the case that Y2 is unbiased, but it takes a little more work to show this. As a first step, let's get the cdf of M ≡ max_i Xi:

P(M ≤ y) = P(X1 ≤ y and X2 ≤ y and · · · and Xn ≤ y)
         = Π_{i=1}^n P(Xi ≤ y) = [P(X1 ≤ y)]^n   (Xi's are iid)
         = [∫_0^y f_{X1}(x) dx]^n = [∫_0^y (1/θ) dx]^n = (y/θ)^n.

92 / 104
Statistics Primer
Unbiased Estimation

This implies that the pdf of M is

f_M(y) ≡ (d/dy)(y/θ)^n = ny^{n−1}/θ^n,

and this implies that

E[M] = ∫_0^θ y f_M(y) dy = ∫_0^θ (ny^n/θ^n) dy = nθ/(n + 1).

Whew! So we see that Y2 = [(n + 1)/n] max_{1≤i≤n} Xi is unbiased for θ.

So both Y1 and Y2 are unbiased for θ, but which is better?

Let's now compare variances. After similar algebra, we have

Var(Y1) = θ²/(3n)   and   Var(Y2) = θ²/[n(n + 2)].

Thus, Y2 has much lower variance than Y1. □


93 / 104
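A simulation sketch in Python comparing the two estimators for one choice of θ and n (the values are arbitrary illustrations, not from the slides).

import random

theta, n, reps = 10.0, 10, 20000
y1_vals, y2_vals = [], []
for _ in range(reps):
    xs = [random.uniform(0, theta) for _ in range(n)]
    y1_vals.append(2 * sum(xs) / n)               # Y1 = 2 * Xbar
    y2_vals.append((n + 1) / n * max(xs))         # Y2 = (n+1)/n * max

def mean_var(v):
    m = sum(v) / len(v)
    return m, sum((x - m)**2 for x in v) / (len(v) - 1)

print(mean_var(y1_vals))   # mean near theta = 10, variance near theta^2/(3n) = 3.33
print(mean_var(y2_vals))   # mean near 10, variance near theta^2/(n(n+2)) = 0.83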
Statistics Primer
Unbiased Estimation

Mean Squared Error

Definition: The bias of T (X) as an estimator of θ is


Bias(T ) ≡ E[T ] − θ.

The mean squared error of T(X) is MSE(T) ≡ E[(T − θ)²].

Remark: After some algebra, we get an easier expression for MSE that combines the bias and variance of an estimator:

MSE(T) = Var(T) + (E[T] − θ)² = Var(T) + Bias²(T).

Lower MSE is better, even if there's a little bias.

94 / 104
Statistics Primer
Unbiased Estimation

Definition: The relative efficiency of T2 to T1 is MSE(T1)/MSE(T2). If this quantity is < 1, then we'd want T1.

Example: X1, . . . , Xn are iid Unif(0, θ).

Two estimators: Y1 = 2X̄ and Y2 = [(n + 1)/n] max_i Xi.

We showed before that E[Y1] = E[Y2] = θ (so both are unbiased).

Also, Var(Y1) = θ²/(3n) and Var(Y2) = θ²/[n(n + 2)].

Thus, MSE(Y1) = θ²/(3n) and MSE(Y2) = θ²/[n(n + 2)], so Y2 is better.

95 / 104
Statistics Primer
Maximum Likelihood Estimation

Maximum Likelihood Estimators

Definition: Consider an iid random sample X1, . . . , Xn, where each Xi has pdf/pmf f(x). Further, suppose that θ is some unknown parameter from Xi. The likelihood function is L(θ) ≡ Π_{i=1}^n f(x_i).

Definition: The maximum likelihood estimator (MLE) of θ is the value of θ that maximizes L(θ). The MLE is a function of the Xi's and is a RV.

Example: Suppose X1, . . . , Xn are iid Exp(λ). Find the MLE for λ.

L(λ) = Π_{i=1}^n f(x_i) = Π_{i=1}^n λe^{−λx_i} = λ^n exp(−λ Σ_{i=1}^n x_i).

Now maximize L(λ) with respect to λ.


96 / 104
Statistics Primer
Maximum Likelihood Estimation

Could take the derivative and plow through all of the horrible algebra. Too tedious. Need a trick. . . .

Useful Trick: Since the natural log function is one-to-one, it's easy to see that the λ that maximizes L(λ) also maximizes ℓn(L(λ))!

ℓn(L(λ)) = ℓn[λ^n exp(−λ Σ_{i=1}^n x_i)] = n ℓn(λ) − λ Σ_{i=1}^n x_i.

This makes our job less horrible.

(d/dλ) ℓn(L(λ)) = (d/dλ)[n ℓn(λ) − λ Σ_{i=1}^n x_i] = n/λ − Σ_{i=1}^n x_i ≡ 0.

This implies that the MLE is λ̂ = 1/X̄. □


97 / 104
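A Python sketch confirming the MLE numerically: compute λ̂ = 1/X̄ from a simulated Exp(λ) sample and compare with a crude grid maximization of the log-likelihood (the grid search is only an illustration, not how you would do this in practice).

import math, random

lam_true = 2.0
xs = [random.expovariate(lam_true) for _ in range(1000)]

lam_hat = 1 / (sum(xs) / len(xs))       # closed-form MLE: 1 / Xbar
print(lam_hat)

def loglik(lam):
    # ln L(lambda) = n*ln(lambda) - lambda * sum(x_i)
    return len(xs) * math.log(lam) - lam * sum(xs)

grid = [0.01 * k for k in range(1, 1001)]   # lambda values in (0, 10]
print(max(grid, key=loglik))                # should land near lam_hat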
Statistics Primer
Maximum Likelihood Estimation

Remarks: (1) λ̂ = 1/X̄ makes sense since E[X] = 1/λ.

(2) At the end, we put a little hat over λ to indicate that this is the MLE.

(3) At the end, we make all of the little xi's into big Xi's to indicate that this is a RV.

(4) Just to be careful, you probably ought to perform a


second-derivative test, but I won’t blame you if you don’t.

98 / 104
Statistics Primer
Maximum Likelihood Estimation

Invariance Property of MLE's

Theorem (Invariance Property): If θ̂ is the MLE of some parameter θ and h(·) is a one-to-one function, then h(θ̂) is the MLE of h(θ).

Example: Suppose X1, . . . , Xn are iid Exp(λ). We define the survival function as

F̄(x) = P(X > x) = 1 − F(x) = e^{−λx}.

In addition, we saw that the MLE for λ is λ̂ = 1/X̄.

Then the invariance property says that the MLE of F̄(x) is e^{−λ̂x} = e^{−x/X̄}.

This kind of thing is used all of the time in the actuarial sciences. □
99 / 104
Statistics Primer
Distributional Results and Confidence Intervals

Distributional Results and Confidence Intervals

There are a number of distributions (including the normal) that come up in statistical sampling problems. Here are a few:

Definitions: If Z1, Z2, . . . , Zk are iid Nor(0,1), then Y = Σ_{i=1}^k Zi² has the χ² distribution with k degrees of freedom (df). Notation: Y ∼ χ²(k). Note that E[Y] = k and Var(Y) = 2k.

If Z ∼ Nor(0, 1), Y ∼ χ²(k), and Z and Y are independent, then T = Z/√(Y/k) has the Student t distribution with k df. Notation: T ∼ t(k). Note that the t(1) is the Cauchy distribution.

If Y1 ∼ χ²(m), Y2 ∼ χ²(n), and Y1 and Y2 are independent, then F = (Y1/m)/(Y2/n) has the F distribution with m and n df. Notation: F ∼ F(m, n).

100 / 104
Statistics Primer
Distributional Results and Confidence Intervals

How (and why) would one use the above facts? Because they can be used to construct confidence intervals (CIs) for µ and σ² under a variety of assumptions.

A 100(1 − α)% two-sided CI for an unknown parameter θ is a random interval [L, U] such that P(L ≤ θ ≤ U) = 1 − α.

Here are some examples / theorems, all of which assume that the Xi's are iid normal. . .

Example: If σ² is known, then a 100(1 − α)% CI for µ is

X̄n − z_{α/2} √(σ²/n) ≤ µ ≤ X̄n + z_{α/2} √(σ²/n),

where z_γ is the 1 − γ quantile of the standard normal distribution, i.e., z_γ ≡ Φ⁻¹(1 − γ).
101 / 104
Statistics Primer
Distributional Results and Confidence Intervals

Example: If σ² is unknown, then a 100(1 − α)% CI for µ is

X̄n − t_{α/2,n−1} √(S²/n) ≤ µ ≤ X̄n + t_{α/2,n−1} √(S²/n),

where t_{γ,ν} is the 1 − γ quantile of the t(ν) distribution.

Example: A 100(1 − α)% CI for σ² is

(n − 1)S²/χ²_{α/2,n−1} ≤ σ² ≤ (n − 1)S²/χ²_{1−α/2,n−1},

where χ²_{γ,ν} is the 1 − γ quantile of the χ²(ν) distribution.

102 / 104
Statistics Primer
Distributional Results and Confidence Intervals

Exercise: Here are 20 residual flame times (in sec.) of treated


specimens of children’s nightwear. (Don’t worry — children were not
in the nightwear when the clothing was set on fire.)

9.85 9.93 9.75 9.77 9.67


9.87 9.67 9.94 9.85 9.75
9.83 9.92 9.74 9.99 9.88
9.95 9.95 9.93 9.92 9.89

Let’s get a 95% CI for the mean residual flame time.

103 / 104
Statistics Primer
Distributional Results and Confidence Intervals

After a little algebra, we get

X̄ = 9.8525   and   S = 0.0965.

Further, you can use the Excel function t.inv(0.975,19) to get

t_{α/2,n−1} = t_{0.025,19} = 2.093.

Then the half-length of the CI is

H = t_{α/2,n−1} √(S²/n) = (2.093)(0.0965)/√20 = 0.0451.

Thus, the CI is µ ∈ X̄ ± H, or 9.8074 ≤ µ ≤ 9.8976. □

104 / 104
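A Python sketch of this last computation, reusing the slide's t quantile 2.093 directly (if scipy is available, scipy.stats.t.ppf(0.975, 19) gives the same value).

import math

times = [9.85, 9.93, 9.75, 9.77, 9.67, 9.87, 9.67, 9.94, 9.85, 9.75,
         9.83, 9.92, 9.74, 9.99, 9.88, 9.95, 9.95, 9.93, 9.92, 9.89]

n = len(times)
xbar = sum(times) / n
s2 = sum((x - xbar)**2 for x in times) / (n - 1)
t_quantile = 2.093                        # t_{0.025, 19}, from the slide / Excel t.inv
half = t_quantile * math.sqrt(s2 / n)
print(xbar, math.sqrt(s2), half)          # 9.8525, about 0.0965, about 0.0451
print(xbar - half, xbar + half)           # about 9.8074 and 9.8976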
