
Calculus, Probability, and Statistics Primers

Dave Goldsman

Georgia Institute of Technology, Atlanta, GA, USA

12/30/18

1 / 104
Outline
1 Calculus Primer
2 Probability Primer
Basics
Simulating Random Variables
Great Expectations
Functions of a Random Variable
Jointly Distributed Random Variables
Covariance and Correlation
Some Probability Distributions
Limit Theorems
3 Statistics Primer
Intro to Estimation
Unbiased Estimation
Maximum Likelihood Estimation
Distributional Results and Confidence Intervals
2 / 104
Calculus Primer

Calculus Primer

Goal: This section provides a brief review of various calculus tidbits


that we’ll be using later on.

First of all, let’s suppose that f (x) is a function that maps values of x
from a certain domain X to a certain range Y , which we can denote
by the shorthand f : X → Y .

Example: If f(x) = x², then the function takes x-values from the real line R to the nonnegative portion of the real line R⁺.

3 / 104
Calculus Primer

Definition: We say that f(x) is a continuous function if, for any x0 and x ∈ X, we have lim_{x→x0} f(x) = f(x0), where “lim” denotes a limit and f(x) is assumed to exist for all x ∈ X.

Example: The function f(x) = 3x² is continuous for all x. The function f(x) = ⌊x⌋ (round down to the nearest integer, e.g., ⌊3.4⌋ = 3) has a “jump” discontinuity at any integer x. □

Definition If f (x) is continuous, then it is differentiable (has a


derivative) if

f'(x) ≡ (d/dx) f(x) ≡ lim_{h→0} [f(x + h) − f(x)] / h
exists and is well-defined for any given x. Think of the derivative as
the slope of the function.

4 / 104
Calculus Primer

Example Some well-known derivatives are:

[x^k]' = k x^{k−1},

[e^x]' = e^x,

[sin(x)]' = cos(x),

[cos(x)]' = −sin(x),

[ℓn(x)]' = 1/x,

[arctan(x)]' = 1/(1 + x²). □
5 / 104
Calculus Primer

Theorem Some well-known properties of derivatives are:

[af(x) + b]' = af'(x),

[f(x) + g(x)]' = f'(x) + g'(x),

[f(x)g(x)]' = f'(x)g(x) + f(x)g'(x)   (product rule),

[f(x)/g(x)]' = [g(x)f'(x) − f(x)g'(x)] / g²(x)   (quotient rule)¹,

[f(g(x))]' = f'(g(x))g'(x)   (chain rule)².

¹ Ho dee Hi minus Hi dee Ho over Ho Ho.
² www.youtube.com/watch?v=gGAiW5dOnKo
6 / 104
Calculus Primer

Example: Suppose that f(x) = x² and g(x) = ℓn(x). Then

[f(x)g(x)]' = (d/dx)[x² ℓn(x)] = 2x ℓn(x) + x,

[f(x)/g(x)]' = (d/dx)[x² / ℓn(x)] = [2x ℓn(x) − x] / ℓn²(x),

[f(g(x))]' = 2g(x)g'(x) = 2ℓn(x)/x. □

7 / 104
Calculus Primer

Remark: The second derivative f''(x) ≡ (d/dx) f'(x) is the “slope of the slope.” If f(x) is “position,” then f'(x) can be regarded as “velocity,” and f''(x) as “acceleration.”

The minimum or maximum of f(x) can only occur when the slope of f(x) is zero, i.e., only when f'(x) = 0, say at x = x0. Exception: Check the endpoints of your interval of interest as well.

Then if f''(x0) < 0, you get a max; if f''(x0) > 0, you get a min; and if f''(x0) = 0, you may get a point of inflection.

Example: Find the value of x that minimizes f(x) = e^{2x} + e^{−x}. The minimum can only occur when f'(x) = 2e^{2x} − e^{−x} = 0. After a little algebra, we find that this occurs at x0 = −(1/3)ℓn(2) ≈ −0.231. It's also easy to show that f''(x) > 0 for all x, and so x0 yields a minimum. □
8 / 104
Calculus Primer

Finding Zeroes: Speaking of solving for a zero, how might you do it if a continuous function g(x) is a complicated nonlinear fellow?
Trial-and-error (not so great).
Bisection (divide-and-conquer).
Newton's method (or some variation).
Fixed-point method (we'll do this later).

9 / 104
Calculus Primer

Bisection: Suppose you can find x1 and x2 such that g(x1 ) < 0 and
g(x2 ) > 0. (We’ll follow similar logic if the inequalities are both
reversed.) By the Intermediate Value Theorem (which you may remember), there must be a zero in [x1, x2], that is, an x* ∈ [x1, x2] such that g(x*) = 0.

Thus, take x3 = (x1 + x2 )/2. If g(x3 ) < 0, then there must be a zero
in [x3 , x2 ]. Otherwise, if g(x3 ) > 0, then there must be a zero in
[x1 , x3 ]. In either case, you’ve reduced the length of the search
interval.

Continue in this same manner until the length of the search interval is
as small as desired.

Exercise: Try this out for g(x) = x² − 2, and come up with an approximation for √2.
10 / 104
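Here is a minimal Python sketch of the bisection exercise (Python is my own choice here; the slides themselves lean on Excel and FORTRAN). It assumes g(x1) < 0 < g(x2) and halves the search interval until it is shorter than a tolerance.

import math

def bisect(g, x1, x2, tol=1e-8):
    # Assumes g(x1) < 0 < g(x2); shrink [x1, x2] until it is short enough.
    while x2 - x1 > tol:
        x3 = (x1 + x2) / 2
        if g(x3) < 0:
            x1 = x3          # zero lies in [x3, x2]
        else:
            x2 = x3          # zero lies in [x1, x3]
    return (x1 + x2) / 2

print(bisect(lambda x: x**2 - 2, 1.0, 2.0))   # about 1.41421356
print(math.sqrt(2))                           # for comparison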
Calculus Primer

Newton’s Method: Suppose you can find a reasonable first guess


for the zero, say, xi , where we start off at iteration i = 0. If g(x) has a
nice, well-behaved derivative (which doesn’t happen to be too flat
near the zero of g(x)), then iterate your guess as follows:

x_{i+1} = x_i − g(x_i) / g'(x_i).

Keep going until things appear to converge.

This makes sense since, for x_i and x_{i+1} close to each other and to the zero x*, we have

g'(x_i) ≈ [g(x*) − g(x_i)] / (x* − x_i).

11 / 104
Calculus Primer

Exercise: Try Newton out for g(x) = x² − 2, noting that the iteration step is to set

x_{i+1} = x_i − (x_i² − 2)/(2x_i) = x_i/2 + 1/x_i.

Let's start with a bad guess of x1 = 1. Then

x2 = x1/2 + 1/x1 = 1/2 + 1 = 1.5,
x3 = x2/2 + 1/x2 ≈ 1.5/2 + 1/1.5 = 1.4167,
x4 = x3/2 + 1/x3 ≈ 1.4142.   Wow! □

12 / 104
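A short Python sketch of Newton's method for the same exercise (a hedged illustration, not part of the original slides); it reproduces the iterates 1.5, 1.4167, 1.4142, . . . shown above.

def newton(g, gprime, x, iterations=5):
    # Newton iteration: x_{i+1} = x_i - g(x_i) / g'(x_i).
    for _ in range(iterations):
        x = x - g(x) / gprime(x)
        print(x)
    return x

newton(lambda x: x**2 - 2, lambda x: 2 * x, 1.0)
# prints 1.5, 1.41666..., 1.41421568..., 1.41421356..., ...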
Calculus Primer

Integration

Definition: The function F(x) having derivative f(x) is called the antiderivative (or indefinite integral). It is denoted by F(x) = ∫ f(x) dx.

Fundamental Theorem of Calculus: If f(x) is continuous, then the area under the curve for x ∈ [a, b] is denoted and given by the definite integral³

∫_a^b f(x) dx ≡ F(x)|_a^b ≡ F(b) − F(a).

³ “I'm really an integral!”
13 / 104
Calculus Primer

Example: Some well-known indefinite integrals are:

∫ x^k dx = x^{k+1}/(k + 1) + C   for k ≠ −1,

∫ dx/x = ℓn|x| + C,

∫ e^x dx = e^x + C,

∫ cos(x) dx = sin(x) + C,

∫ dx/(1 + x²) = arctan(x) + C,

where C is an arbitrary constant. □

14 / 104
Calculus Primer

Example: It is easy to see that

∫ d(cabin)/cabin = ℓn|cabin| + C = houseboat. □

Theorem: Some well-known properties of definite integrals are:

∫_a^a f(x) dx = 0,

∫_a^b f(x) dx = −∫_b^a f(x) dx,

∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx.

15 / 104
Calculus Primer

Theorem: Some other properties of general integrals are:

∫ [f(x) + g(x)] dx = ∫ f(x) dx + ∫ g(x) dx,

∫ f(x)g'(x) dx = f(x)g(x) − ∫ g(x)f'(x) dx   (integration by parts)⁴,

∫ f(g(x))g'(x) dx = ∫ f(u) du   (substitution rule)⁵.

⁴ www.youtube.com/watch?v=OTzLVIc-O5E
⁵ www.youtube.com/watch?v=eswQl-hcvU0
16 / 104
Calculus Primer

Example: Using integration by parts with f(x) = x and g'(x) = e^{2x}, along with the chain rule, we have

∫_0^1 x e^{2x} dx = (x e^{2x}/2)|_0^1 − ∫_0^1 (e^{2x}/2) dx = e²/2 − (e^{2x}/4)|_0^1 = (e² + 1)/4. □

Definition: Derivatives of arbitrary order k can be written as f^(k)(x) or (d^k/dx^k) f(x). By convention, f^(0)(x) = f(x).

The Taylor series expansion of f(x) about a point a is given by

f(x) = Σ_{k=0}^∞ f^(k)(a)(x − a)^k / k!.

The Maclaurin series is simply the Taylor series expanded around a = 0.

17 / 104
Calculus Primer

Example: Here are some famous Maclaurin series:

sin(x) = Σ_{k=0}^∞ (−1)^k x^{2k+1} / (2k + 1)!,

cos(x) = Σ_{k=0}^∞ (−1)^k x^{2k} / (2k)!,

e^x = Σ_{k=0}^∞ x^k / k!.

18 / 104
Calculus Primer

Example: And while we're at it, here are some miscellaneous sums that you should know:

Σ_{k=1}^n k = n(n + 1)/2,

Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6,

Σ_{k=0}^∞ p^k = 1/(1 − p)   (for −1 < p < 1).

19 / 104
Calculus Primer

Theorem: Occasionally, we run into trouble when taking indeterminate ratios of the form 0/0 or ∞/∞. In such cases, L'Hôpital's Rule⁶ is useful: If the limits lim_{x→a} f(x) and lim_{x→a} g(x) both go to 0 or both go to ∞, then

lim_{x→a} f(x)/g(x) = lim_{x→a} f'(x)/g'(x).

Example: L'Hôpital shows that

lim_{x→0} sin(x)/x = lim_{x→0} cos(x)/1 = 1. □

⁶ This rule makes me sick.
20 / 104
Calculus Primer

Computer Exercise: Let’s do some easy integration via Riemann


sums. Simply approximate the area under the nice, continuous
function f (x) from a to b by adding up the areas of n adjacent
rectangles of width ∆x = (b − a)/n and height f (xi ), where
xi = a + i∆x is the right-hand endpoint of the ith rectangle. Thus,
∫_a^b f(x) dx ≈ Σ_{i=1}^n f(x_i) Δx = [(b − a)/n] Σ_{i=1}^n f(a + i(b − a)/n).

In fact, as n → ∞, this result becomes an equality.


Try it out on ∫_0^1 sin(πx/2) dx (which secretly equals 2/π) for different values of n, and see for yourself.

21 / 104
Calculus Primer

Riemann (cont’d): Since I’m such a nice guy, I’ve made things
easy for you. In this problem, I’ve thoughtfully taken a = 0 and
b = 1, so that ∆x = 1/n and xi = i/n, which simplifies the notation
a bit. Then
∫_a^b f(x) dx = ∫_0^1 f(x) dx ≈ Σ_{i=1}^n f(x_i) Δx = (1/n) Σ_{i=1}^n sin(πi/(2n)).

For n = 100, this calculates out to a value of 0.6416, which is pretty


close to the true answer of 2/π ≈ 0.6366. 2

22 / 104
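A Python sketch of this Riemann-sum exercise (Python rather than Excel is my own assumption); it reproduces the value 0.6416 for n = 100.

import math

def riemann(f, a, b, n):
    # Right-endpoint Riemann sum with n rectangles of width (b - a)/n.
    dx = (b - a) / n
    return dx * sum(f(a + i * dx) for i in range(1, n + 1))

approx = riemann(lambda x: math.sin(math.pi * x / 2), 0.0, 1.0, 100)
print(approx, 2 / math.pi)   # roughly 0.6416 vs. 0.6366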
Calculus Primer

Computer Exercise, Trapezoid version: Same numerical


integration via the Trapezoid Rule (which usually works a little better
than Riemann). Now we have
∫_a^b f(x) dx ≈ [f(x_0)/2 + Σ_{i=1}^{n−1} f(x_i) + f(x_n)/2] Δx
            = [(b − a)/n] [f(a)/2 + Σ_{i=1}^{n−1} f(a + i(b − a)/n) + f(b)/2].

Again try it out on ∫_0^1 sin(πx/2) dx.

23 / 104
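The same integral via the Trapezoid Rule, again as a hedged Python sketch.

import math

def trapezoid(f, a, b, n):
    # Trapezoid rule: endpoints get weight 1/2, interior points weight 1.
    dx = (b - a) / n
    total = (f(a) + f(b)) / 2 + sum(f(a + i * dx) for i in range(1, n))
    return dx * total

print(trapezoid(lambda x: math.sin(math.pi * x / 2), 0.0, 1.0, 100))
# about 0.6366, very close to 2/pi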
Calculus Primer

Computer Exercise, Monte Carlo version: You will soon learn


a Monte Carlo method to accomplish approximate integration. Just
take my word for it for now. Let U1 , U2 , . . . , Un denote a sequence of
Unif(0,1) random numbers, which can be obtained from Excel using
RAND(). It can be shown that
∫_a^b f(x) dx ≈ [(b − a)/n] Σ_{i=1}^n f(a + (b − a)U_i),

with the result becoming an equality as n → ∞.

Yet again try it out on ∫_0^1 sin(πx/2) dx.

24 / 104
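A Monte Carlo sketch of the same integral in Python, using random.random() in place of Excel's RAND() (an assumption; any Unif(0,1) source works).

import math, random

def mc_integral(f, a, b, n):
    # (b - a) times the average of f evaluated at n uniform points in [a, b].
    return (b - a) / n * sum(f(a + (b - a) * random.random()) for _ in range(n))

print(mc_integral(lambda x: math.sin(math.pi * x / 2), 0.0, 1.0, 100000))
# near 2/pi = 0.6366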
Probability Primer

Outline
1 Calculus Primer
2 Probability Primer
Basics
Simulating Random Variables
Great Expectations
Functions of a Random Variable
Jointly Distributed Random Variables
Covariance and Correlation
Some Probability Distributions
Limit Theorems
3 Statistics Primer
Intro to Estimation
Unbiased Estimation
Maximum Likelihood Estimation
Distributional Results and Confidence Intervals
25 / 104
Probability Primer
Basics

Basics

Will assume that you know about sample spaces, events, and the
definition of probability.

Definition: P(A|B) ≡ P(A ∩ B)/P(B) is the conditional probability of A given B.

Example: Toss a fair die. Let A = {1, 2, 3} and B = {3, 4, 5, 6}. Then

P(A|B) = P(A ∩ B)/P(B) = (1/6)/(4/6) = 1/4. □

26 / 104
Probability Primer
Basics

Definition: If P (A ∩ B) = P (A)P (B), then A and B are


independent events.

Theorem: If A and B are independent, then P (A|B) = P (A).

Example: Toss two dice. Let A = “Sum is 7” and


B = “First die is 4”. Then

P (A) = 1/6, P (B) = 1/6, and

P (A ∩ B) = P ((4, 3)) = 1/36 = P (A)P (B).


So A and B are independent. 2

27 / 104
Probability Primer
Basics

Definition: A random variable (RV) X is a function from the


sample space Ω to the real line, i.e., X : Ω → R.

Example: Let X be the sum of two dice rolls. Then X((4, 6)) = 10. In addition,

P(X = x) =
  1/36  if x = 2
  2/36  if x = 3
  ⋮
  1/36  if x = 12
  0     otherwise   □

28 / 104
Probability Primer
Basics

Definition: If the set of possible values of a RV X is finite or countably infinite, then X is a discrete RV. Its probability mass function (pmf) is f(x) ≡ P(X = x). Note that Σ_x f(x) = 1.

Example: Flip 2 coins. Let X be the number of heads. Then

f(x) =
  1/4  if x = 0 or 2
  1/2  if x = 1
  0    otherwise   □

Examples: Here are some well-known discrete RV’s that you may
know: Bernoulli(p), Binomial(n, p), Geometric(p), Negative
Binomial, Poisson(λ), etc.

29 / 104
Probability Primer
Basics

Definition: A continuous RV is one with probability zero at every individual point, and for which there exists a probability density function (pdf) f(x) such that P(X ∈ A) = ∫_A f(x) dx for every set A. Note that ∫_R f(x) dx = 1.

Example: Pick a random number between 3 and 7. Then

f(x) =
  1/4  if 3 ≤ x ≤ 7
  0    otherwise   □

Examples: Here are some well-known continuous RV’s:


Uniform(a, b), Exponential(λ), Normal(µ, σ 2 ), etc.

Notation: “∼” means “is distributed as.” For instance,


X ∼ Unif(0, 1) means that X has the uniform distribution on [0,1].

30 / 104
Probability Primer
Basics

Definition: For any RV X (discrete or continuous), the cumulative distribution function (cdf) is

F(x) ≡ P(X ≤ x) =
  Σ_{y≤x} f(y)       if X is discrete
  ∫_{−∞}^x f(y) dy   if X is continuous

Note that lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. In addition, if X is continuous, then (d/dx) F(x) = f(x).

Example: Flip 2 coins. Let X be the number of heads. Then

F(x) =
  0    if x < 0
  1/4  if 0 ≤ x < 1
  3/4  if 1 ≤ x < 2
  1    if x ≥ 2   □

Example: if X ∼ Exp(λ) (i.e., X is exponential with parameter λ),


then f (x) = λe−λx and F (x) = 1 − e−λx , x ≥ 0. 2
31 / 104
Probability Primer
Simulating Random Variables

Simulating Random Variables

We’ll make a brief aside here to show how to simulate some very
simple random variables.

Example (Discrete Uniform): Consider a D.U. on {1, 2, . . . , n}, i.e., X = i with probability 1/n for i = 1, 2, . . . , n. (Think of this as an n-sided die toss for you Dungeons and Dragons fans.)

If U ∼ Unif(0, 1), we can obtain a D.U. random variate simply by setting X = ⌈nU⌉, where ⌈·⌉ is the “ceiling” (or “round up”) function.

For example, if n = 10 and we sample a Unif(0,1) random variable U = 0.73, then X = ⌈7.3⌉ = 8. □

32 / 104
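A one-line Python version of the ceiling trick (a sketch; the slide's own worked numbers are n = 10 and U = 0.73).

import math, random

def discrete_uniform(n):
    # X = ceil(n * U) takes each value 1, ..., n with probability 1/n.
    # (random.random() lies in [0, 1); an exact 0 is vanishingly unlikely.)
    return math.ceil(n * random.random())

print(discrete_uniform(10))       # a random value in {1, ..., 10}
print(math.ceil(10 * 0.73))       # the slide's example: 8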
Probability Primer
Simulating Random Variables

Example (Another Discrete Random Variable): Suppose

P(X = x) =
  0.25  if x = −2
  0.10  if x = 3
  0.65  if x = 4.2
  0     otherwise

Can’t use a die toss to simulate this random variable. Instead, use
what’s called the inverse transform method.

x f (x) P (X ≤ x) Unif(0,1)’s
−2 0.25 0.25 [0.00, 0.25]
3 0.10 0.35 (0.25, 0.35]
4.2 0.65 1.00 (0.35, 1.00)

Sample U ∼ Unif(0, 1). Choose the corresponding x-value, i.e.,


X = F −1 (U ). For example, U = 0.46 means that X = 4.2. 2
33 / 104
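A Python sketch of the inverse transform method for this three-point distribution; the cutoffs 0.25 and 0.35 come from the cumulative column in the table above.

import random

def draw_x(u=None):
    # Invert the cdf: U in [0, .25] -> -2, (.25, .35] -> 3, (.35, 1) -> 4.2.
    if u is None:
        u = random.random()
    if u <= 0.25:
        return -2
    elif u <= 0.35:
        return 3
    else:
        return 4.2

print(draw_x(0.46))   # the slide's example: 4.2
print(draw_x())       # a random draw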
Probability Primer
Simulating Random Variables

Now we’ll use the inverse transform method to generate a continuous


random variable. We’ll talk about the following result a little later. . .

Theorem: If X is a continuous random variable with cdf F (x), then


the random variable F (X) ∼ Unif(0, 1).

This suggests a way to generate realizations of the RV X. Simply set


F (X) = U ∼ Unif(0, 1) and solve for X = F −1 (U ).

Example: Suppose X ∼ Exp(λ). Then F(x) = 1 − e^{−λx} for x > 0. Set F(X) = 1 − e^{−λX} = U. Solving for X,

X = (−1/λ) ℓn(1 − U) ∼ Exp(λ). □

34 / 104
Probability Primer
Simulating Random Variables

Example (Generating Uniforms): All of the above RV generation


examples relied on our ability to generate a Unif(0,1) RV. For now,
let’s assume that we can generate numbers that are “practically” iid
Unif(0,1).

If you don’t like programming, you can use Excel function RAND()
or something similar to generate Unif(0,1)’s.

Here's an algorithm to generate pseudo-random numbers (PRN's), i.e., a series R1, R2, . . . of deterministic numbers that appear to be iid Unif(0,1). Pick a seed integer X0, and calculate

X_i = 16807 X_{i−1} mod (2³¹ − 1),   i = 1, 2, . . . .

Then set R_i = X_i/(2³¹ − 1), i = 1, 2, . . . .

35 / 104
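The same multiplicative generator written as a Python sketch (a straightforward transcription of the recursion above, not the original FORTRAN; in practice you would use a modern built-in generator).

M = 2**31 - 1   # 2147483647

def prn_stream(seed, n):
    # X_i = 16807 * X_{i-1} mod (2^31 - 1);  R_i = X_i / (2^31 - 1).
    x = seed
    out = []
    for _ in range(n):
        x = (16807 * x) % M
        out.append(x / M)
    return out

print(prn_stream(12345, 5))   # five numbers that look like iid Unif(0,1)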
Probability Primer
Simulating Random Variables

Here’s an easy FORTRAN implementation of the above algorithm


(from Bratley, Fox, and Schrage).

      FUNCTION UNIF(IX)
C     Integer division truncates, e.g., 5/3 = 1.
      K1 = IX/127773
C     Update the seed.
      IX = 16807*(IX - K1*127773) - K1*2836
      IF(IX.LT.0) IX = IX + 2147483647
      UNIF = IX * 4.656612875E-10
      RETURN
      END

In the above function, we input a positive integer IX and the function


returns the PRN UNIF, as well as an updated IX that we can use
again. 2

36 / 104
Probability Primer
Simulating Random Variables

Some Exercises: In the following, I'll assume that you can use Excel (or whatever) to simulate independent Unif(0,1) RV's. (We'll review independence in a little while.)
1  Make a histogram of X_i = −ℓn(U_i), for i = 1, 2, . . . , 10000, where the U_i's are independent Unif(0,1) RV's. What kind of distribution does it look like?
2  Suppose X_i and Y_i are independent Unif(0,1) RV's, i = 1, 2, . . . , 10000. Let Z_i = √(−2ℓn(X_i)) sin(2πY_i), and make a histogram of the Z_i's based on the 10000 replications.
3  Suppose X_i and Y_i are independent Unif(0,1) RV's, i = 1, 2, . . . , 10000. Let Z_i = X_i/(X_i − Y_i), and make a histogram of the Z_i's based on the 10000 replications. This may be somewhat interesting. It's possible to derive the distribution analytically, but it takes a lot of work.
37 / 104
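A Python sketch of the three exercises, assuming matplotlib is available for the histograms (the slides suggest Excel; any histogram tool works).

import math, random
import matplotlib.pyplot as plt

def unif():
    # Unif(0,1) draw; 1 - random.random() lies in (0, 1], which avoids log(0) below.
    return 1.0 - random.random()

n = 10000
ex1 = [-math.log(unif()) for _ in range(n)]                     # Exercise 1
ex2 = [math.sqrt(-2 * math.log(unif())) * math.sin(2 * math.pi * unif())
       for _ in range(n)]                                       # Exercise 2
ex3 = [u / (u - v)                                              # Exercise 3
       for u, v in ((unif(), unif()) for _ in range(n)) if u != v]

for data, title in [(ex1, "Exercise 1"), (ex2, "Exercise 2"), (ex3, "Exercise 3")]:
    plt.figure()
    plt.hist(data, bins=100)   # Exercise 3 is heavy-tailed; consider restricting the range
    plt.title(title)
plt.show()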
Probability Primer
Great Expectations

Great Expectations

Definition: The expected value (or mean) of a RV X is

E[X] ≡
  Σ_x x f(x)      if X is discrete
  ∫_R x f(x) dx   if X is continuous
     = ∫_R x dF(x).

Example: Suppose that X ∼ Bernoulli(p). Then

X =
  1  with prob. p
  0  with prob. 1 − p (= q)

and we have E[X] = Σ_x x f(x) = p. □

38 / 104
Probability Primer
Great Expectations

Example: Suppose that X ∼ Uniform(a, b). Then

f(x) =
  1/(b − a)  if a < x < b
  0          otherwise

and we have E[X] = ∫_R x f(x) dx = (a + b)/2. □

Example: Suppose that X ∼ Exponential(λ). Then

f(x) =
  λe^{−λx}  if x > 0
  0         otherwise

and we have (after integration by parts and L'Hôpital's Rule)

E[X] = ∫_R x f(x) dx = ∫_0^∞ x λe^{−λx} dx = 1/λ. □

39 / 104
Probability Primer
Great Expectations

Def/Thm (the “Law of the Unconscious Statistician” or “LOTUS”): Suppose that h(X) is some function of the RV X. Then

E[h(X)] =
  Σ_x h(x) f(x)      if X is discrete
  ∫_R h(x) f(x) dx   if X is cts
     = ∫_R h(x) dF(x).

The function h(X) can be anything “nice”, e.g., h(X) = X 2 or 1/X


or sin(X) or `n(X).

40 / 104
Probability Primer
Great Expectations

Example: Suppose X is the following discrete RV:

x      2    3    4
f(x)   0.3  0.6  0.1

Then E[X³] = Σ_x x³ f(x) = 8(0.3) + 27(0.6) + 64(0.1) = 25. □

Example: Suppose X ∼ Unif(0, 2). Then

E[X^n] = ∫_R x^n f(x) dx = 2^n/(n + 1). □

41 / 104
Probability Primer
Great Expectations

Definitions: E[X^n] is the nth moment of X.

E[(X − E[X])^n] is the nth central moment of X.

Var(X) ≡ E[(X − E[X])²] is the variance of X.

The standard deviation of X is √Var(X).

Theorem: Var(X) = E[X²] − (E[X])² (sometimes easier to calculate this way).

42 / 104
Probability Primer
Great Expectations

Example: Suppose X ∼ Bern(p). Recall that E[X] = p. Then

E[X²] = Σ_x x² f(x) = p   and

Var(X) = E[X²] − (E[X])² = p(1 − p). □

Example: Suppose X ∼ Exp(λ). By LOTUS,

E[X^n] = ∫_0^∞ x^n λe^{−λx} dx = n!/λ^n.

Var(X) = E[X²] − (E[X])² = 2/λ² − (1/λ)² = 1/λ². □

43 / 104
Probability Primer
Great Expectations

Theorem: E[aX + b] = aE[X] + b and Var(aX + b) = a²Var(X).

Example: If X ∼ Exp(3), then

E[−2X + 7] = −2E[X] + 7 = −2/3 + 7.

Var(−2X + 7) = (−2)²Var(X) = 4/9. □

44 / 104
Probability Primer
Great Expectations

Definition: M_X(t) ≡ E[e^{tX}] is the moment generating function (mgf) of the RV X. (M_X(t) is a function of t, not of X!)

Example: X ∼ Bern(p). Then

M_X(t) = E[e^{tX}] = Σ_x e^{tx} f(x) = e^{t·1} p + e^{t·0} q = pe^t + q. □

Example: X ∼ Exp(λ). Then

M_X(t) = ∫_R e^{tx} f(x) dx = λ ∫_0^∞ e^{(t−λ)x} dx = λ/(λ − t)   if λ > t. □

Theorem: Under certain technical conditions,

E[X^k] = (d^k/dt^k) M_X(t) |_{t=0},   k = 1, 2, . . . .

Thus, you can generate the moments of X from the mgf.


45 / 104
Probability Primer
Great Expectations

Example: X ∼ Exp(λ). Then M_X(t) = λ/(λ − t) for λ > t. So

E[X] = (d/dt) M_X(t) |_{t=0} = λ/(λ − t)² |_{t=0} = 1/λ.

Further,

E[X²] = (d²/dt²) M_X(t) |_{t=0} = 2λ/(λ − t)³ |_{t=0} = 2/λ².

Thus,

Var(X) = E[X²] − (E[X])² = 2/λ² − 1/λ² = 1/λ². □
Moment generating functions have many other important uses, some
of which we’ll talk about in this course.

46 / 104
Probability Primer
Functions of a Random Variable

Functions of a Random Variable

Problem: Suppose we have a RV X with pmf/pdf f (x). Let


Y = h(X). Find g(y), the pmf/pdf of Y .

Examples (take my word for it for now):

If X ∼ Nor(0, 1), then Y = X² ∼ χ²(1).

If U ∼ Unif(0, 1), then Y = −(1/λ) ℓn(U) ∼ Exp(λ).

47 / 104
Probability Primer
Functions of a Random Variable

Discrete Example: Let X denote the number of H's from two coin tosses. We want the pmf for Y = X³ − X.

x             0    1    2
f(x)          1/4  1/2  1/4
y = x³ − x    0    0    6

This implies that g(0) = P(Y = 0) = P(X = 0 or 1) = 3/4 and g(6) = P(Y = 6) = 1/4. In other words,

g(y) =
  3/4  if y = 0
  1/4  if y = 6   □

48 / 104
Probability Primer
Functions of a Random Variable

Continuous Example: Suppose X has pdf f(x) = |x|, −1 ≤ x ≤ 1. Find the pdf of Y = X².

First of all, the cdf of Y is

G(y) = P(Y ≤ y)
     = P(X² ≤ y)
     = P(−√y ≤ X ≤ √y)
     = ∫_{−√y}^{√y} |x| dx = y,   0 < y < 1.

Thus, the pdf of Y is g(y) = G'(y) = 1, 0 < y < 1, indicating that Y ∼ Unif(0, 1). □

49 / 104
Probability Primer
Functions of a Random Variable

Inverse Transform Theorem: Suppose X is a continuous random variable having cdf F(x). Then, amazingly, F(X) ∼ Unif(0, 1).

Proof: Let Y = F(X). Then the cdf of Y is

P(Y ≤ y) = P(F(X) ≤ y)
         = P(X ≤ F⁻¹(y))
         = F(F⁻¹(y)) = y,

which is the cdf of the Unif(0,1). □

This result is of fundamental importance when it comes to generating


random variates during a simulation.

50 / 104
Probability Primer
Functions of a Random Variable

Example (how to generate exponential RV's): Suppose X ∼ Exp(λ), with cdf F(x) = 1 − e^{−λx} for x ≥ 0.

So the Inverse Transform Theorem implies that F(X) = 1 − e^{−λX} ∼ Unif(0, 1).

Let U ∼ Unif(0, 1) and set F(X) = U. Then we have

X = (−1/λ) ℓn(1 − U) ∼ Exp(λ).

For instance, if λ = 2 and U = 0.27, then X = 0.157 is an Exp(2) realization. □

51 / 104
Probability Primer
Functions of a Random Variable

Exercise: Suppose that X has the Weibull distribution with cdf

F(x) = 1 − e^{−(λx)^β},   x > 0.

If you set F(X) = U and solve for X, show that you get

X = (1/λ)[−ℓn(1 − U)]^{1/β}.

Now pick your favorite λ and β, and use this result to generate values of X. In fact, make a histogram of your X values. Are there any interesting values of λ and β you could've chosen?

52 / 104
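A Python sketch of the Weibull exercise via the inverse transform derived above (the λ and β values below are arbitrary picks, not from the slides).

import math, random

lam, beta = 1.0, 2.0   # arbitrary choices; try others

def weibull(lam, beta):
    # X = (1/lambda) * (-ln(1 - U))^(1/beta), from F(x) = 1 - exp(-(lambda*x)^beta).
    u = random.random()
    return (1 / lam) * (-math.log(1 - u)) ** (1 / beta)

sample = [weibull(lam, beta) for _ in range(10000)]
print(sum(sample) / len(sample))            # sample mean
print(math.gamma(1 + 1 / beta) / lam)       # theoretical mean for this parameterization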
Probability Primer
Functions of a Random Variable

Bonus Theorem: Here's another way to get the pdf of Y = h(X) for some nice continuous function h(·). The cdf of Y is

F_Y(y) = P(Y ≤ y) = P(h(X) ≤ y) = P(X ≤ h⁻¹(y)).

By the chain rule (and since a pdf must be ≥ 0), the pdf of Y is

f_Y(y) = (d/dy) F_Y(y) = f_X(h⁻¹(y)) |(d/dy) h⁻¹(y)|.

And now, here's how to prove LOTUS!

E[Y] = ∫_R y f_Y(y) dy = ∫_R y f_X(h⁻¹(y)) |(d/dy) h⁻¹(y)| dy
     “=” ∫_R y f_X(h⁻¹(y)) dh⁻¹(y) = ∫_R h(x) f_X(x) dx. □

53 / 104
Probability Primer
Jointly Distributed Random Variables

Jointly Distributed Random Variables

Consider two random variables interacting together — think height


and weight.

Definition: The joint cdf of X and Y is

F (x, y) ≡ P (X ≤ x, Y ≤ y), for all x, y.

Remark: The marginal cdf of X is FX (x) = F (x, ∞). (We use the
X subscript to remind us that it’s just the cdf of X all by itself.)
Similarly, the marginal cdf of Y is FY (y) = F (∞, y).

54 / 104
Probability Primer
Jointly Distributed Random Variables

Definition: If X and Y are discrete, then the joint pmf of X and Y is f(x, y) ≡ P(X = x, Y = y). Note that Σ_x Σ_y f(x, y) = 1.

Remark: The marginal pmf of X is

f_X(x) = P(X = x) = Σ_y f(x, y).

The marginal pmf of Y is

f_Y(y) = P(Y = y) = Σ_x f(x, y).

Example: The following table gives the joint pmf f(x, y), along with the accompanying marginals.

f(x, y)   X = 2   X = 3   X = 4   f_Y(y)
Y = 4     0.3     0.2     0.1     0.6
Y = 6     0.1     0.2     0.1     0.4
f_X(x)    0.4     0.4     0.2     1

55 / 104
Probability Primer
Jointly Distributed Random Variables

Definition: If X and Y are continuous, then the joint pdf of X and Y is f(x, y) ≡ ∂²F(x, y)/∂x∂y. Note that ∫_R ∫_R f(x, y) dx dy = 1.

Remark: The marginal pdf's of X and Y are

f_X(x) = ∫_R f(x, y) dy   and   f_Y(y) = ∫_R f(x, y) dx.

Example: Suppose the joint pdf is

f(x, y) = (21/4) x²y,   x² ≤ y ≤ 1.

Then the marginal pdf's are:

f_X(x) = ∫_R f(x, y) dy = ∫_{x²}^1 (21/4) x²y dy = (21/8) x²(1 − x⁴),   −1 ≤ x ≤ 1,

and

f_Y(y) = ∫_R f(x, y) dx = ∫_{−√y}^{√y} (21/4) x²y dx = (7/2) y^{5/2},   0 ≤ y ≤ 1. □
56 / 104
Probability Primer
Jointly Distributed Random Variables

Definition: X and Y are independent RV's if f(x, y) = f_X(x)f_Y(y) for all x, y.

Theorem: X and Y are independent if you can write their joint pdf as f(x, y) = a(x)b(y) for some functions a(x) and b(y), and x and y don't have funny limits (their domains do not depend on each other).

Examples: If f(x, y) = cxy for 0 ≤ x ≤ 2, 0 ≤ y ≤ 3, then X and Y are independent.

If f(x, y) = (21/4) x²y for x² ≤ y ≤ 1, then X and Y are not independent.

If f(x, y) = c/(x + y) for 1 ≤ x ≤ 2, 1 ≤ y ≤ 3, then X and Y are not independent. □

57 / 104
Probability Primer
Jointly Distributed Random Variables

Definition: The conditional pdf (or pmf) of Y given X = x is f(y|x) ≡ f(x, y)/f_X(x) (assuming f_X(x) > 0).

This is a legit pmf/pdf. For example, in the continuous case, ∫_R f(y|x) dy = 1, for any x.

Example: Suppose f(x, y) = (21/4) x²y for x² ≤ y ≤ 1. Then

f(y|x) = f(x, y)/f_X(x) = [(21/4) x²y] / [(21/8) x²(1 − x⁴)] = 2y/(1 − x⁴),   x² ≤ y ≤ 1. □

Theorem: If X and Y are independent, then f(y|x) = f_Y(y) for all x, y.

Proof: By the definitions of conditional pdf and independence,

f(y|x) = f(x, y)/f_X(x) = f_X(x)f_Y(y)/f_X(x) = f_Y(y). □
58 / 104
Probability Primer
Jointly Distributed Random Variables

Definition: The conditional expectation of Y given X = x is

E[Y |X = x] ≡
  Σ_y y f(y|x)      discrete
  ∫_R y f(y|x) dy   continuous

Example: The expected weight of a person who is 7 feet tall (E[Y |X = 7]) will probably be greater than that of a random person from the entire population (E[Y]).

Old Cts Example: f(x, y) = (21/4) x²y, if x² ≤ y ≤ 1. Then

E[Y |x] = ∫_R y f(y|x) dy = ∫_{x²}^1 [2y²/(1 − x⁴)] dy = (2/3) · (1 − x⁶)/(1 − x⁴). □

59 / 104
Probability Primer
Jointly Distributed Random Variables

Theorem (double expectations): E[E(Y |X)] = E[Y].

Proof (cts case): By the Unconscious Statistician,

E[E(Y |X)] = ∫_R E(Y |x) f_X(x) dx
           = ∫_R [∫_R y f(y|x) dy] f_X(x) dx
           = ∫_R ∫_R y f(y|x) f_X(x) dx dy
           = ∫_R y ∫_R f(x, y) dx dy
           = ∫_R y f_Y(y) dy = E[Y]. □

60 / 104
Probability Primer
Jointly Distributed Random Variables

Old Example: Suppose f(x, y) = (21/4) x²y, if x² ≤ y ≤ 1. By previous examples, we know f_X(x), f_Y(y), and E[Y |x]. Find E[Y].

Solution #1 (old, boring way):

E[Y] = ∫_R y f_Y(y) dy = ∫_0^1 (7/2) y^{7/2} dy = 7/9.

Solution #2 (new, exciting way):

E[Y] = E[E(Y |X)] = ∫_R E(Y |x) f_X(x) dx
     = ∫_{−1}^1 [(2/3) · (1 − x⁶)/(1 − x⁴)] · (21/8) x²(1 − x⁴) dx = 7/9.

Notice that both answers are the same (good)! □


61 / 104
Probability Primer
Jointly Distributed Random Variables

Example: A cutesy way to calculate the mean of the Geometric


distribution.

Let Y ∼ Geom(p); e.g., Y could be the number of coin flips until H appears, where P(H) = p. From Baby Probability class, we know that the pmf of Y is f_Y(y) = P(Y = y) = q^{y−1} p, for y = 1, 2, . . . .

Then the old-fashioned way to calculate the mean is:

E[Y] = Σ_y y f_Y(y) = Σ_{y=1}^∞ y q^{y−1} p = 1/p,

where the last step follows because I tell you so. □

But if you are not quite willing to believe me,. . .

62 / 104
Probability Primer
Jointly Distributed Random Variables

. . . Let’s use double expectation to do what’s called a “standard


one-step conditioning argument”. Define X = 1 if the first flip is H;
and X = 0 otherwise.

Based on the result X of the first step, we have

E[Y] = E[E(Y |X)] = Σ_x E(Y |x) f_X(x)
     = E(Y |X = 0)P(X = 0) + E(Y |X = 1)P(X = 1)
     = (1 + E[Y])(1 − p) + 1(p). (why?)

Solving, we get E[Y ] = 1/p again! 2

63 / 104
Probability Primer
Jointly Distributed Random Variables

Computing Probabilities by Conditioning

Let A be some event, and define the RV Y = 1 if A occurs, and Y = 0 otherwise. Then

E[Y] = Σ_y y f_Y(y) = P(Y = 1) = P(A).

Similarly, for any RV X, we have

E[Y |X = x] = Σ_y y f_Y(y|x) = P(Y = 1|X = x) = P(A|X = x).

64 / 104
Probability Primer
Jointly Distributed Random Variables

Thus,

P(A) = E[Y] = E[E(Y |X)] = ∫_R E[Y |X = x] dF_X(x) = ∫_R P(A|X = x) dF_X(x).

Example/Theorem: If X and Y are independent cts RV's, then

P(Y < X) = ∫_R P(Y < x) f_X(x) dx.

Proof: Follows from the above result if we let the event A = {Y < X}. □

65 / 104
Probability Primer
Jointly Distributed Random Variables

Example: If X ∼ Exp(µ) and Y ∼ Exp(λ) are independent RV's, then

P(Y < X) = ∫_R P(Y < x) f_X(x) dx
         = ∫_0^∞ (1 − e^{−λx}) µe^{−µx} dx
         = λ/(λ + µ). □

66 / 104
Probability Primer
Jointly Distributed Random Variables

Theorem (variance decomposition):

Var(Y) = E[Var(Y |X)] + Var[E(Y |X)].

Proof (from Ross): By the definition of variance and double expectation,

E[Var(Y |X)] = E[E(Y²|X) − {E(Y |X)}²] = E(Y²) − E[{E(Y |X)}²].

Similarly,

Var[E(Y |X)] = E[{E(Y |X)}²] − {E[E(Y |X)]}² = E[{E(Y |X)}²] − {E(Y)}².

Thus,

E[Var(Y |X)] + Var[E(Y |X)] = E(Y²) − {E(Y)}² = Var(Y). □
67 / 104
Probability Primer
Covariance and Correlation

“Definition” (two-dimensional LOTUS): Suppose that h(X, Y) is some function of the RV's X and Y. Then

E[h(X, Y)] =
  Σ_x Σ_y h(x, y) f(x, y)        if (X, Y) is discrete
  ∫_R ∫_R h(x, y) f(x, y) dx dy  if (X, Y) is continuous

Theorem: Whether or not X and Y are independent, we have


E[X + Y ] = E[X] + E[Y ].

Theorem: If X and Y are independent, then


Var(X + Y ) = Var(X) + Var(Y ).

(Stay tuned for dependent case.)

68 / 104
Probability Primer
Covariance and Correlation

Definition: X1, . . . , Xn form a random sample from f(x) if (i) X1, . . . , Xn are independent, and (ii) each Xi has the same pdf (or pmf) f(x).

Notation: X1, . . . , Xn ∼ iid f(x). (The term “iid” reads independent and identically distributed.)

Example: If X1, . . . , Xn are iid from f(x) and the sample mean X̄n ≡ Σ_{i=1}^n Xi/n, then E[X̄n] = E[Xi] and Var(X̄n) = Var(Xi)/n. Thus, the variance decreases as n increases. □

But not all RV’s are independent...

69 / 104
Probability Primer
Covariance and Correlation

Covariance and Correlation


Definition: The covariance between X and Y is

Cov(X, Y) ≡ E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y].

Note that Var(X) = Cov(X, X).

Theorem: If X and Y are independent RV's, then Cov(X, Y) = 0.

Remark: Cov(X, Y) = 0 doesn't mean X and Y are independent!

Example: Suppose X ∼ Unif(−1, 1) and Y = X². Then X and Y are clearly dependent. However,

Cov(X, Y) = E[X³] − E[X]E[X²] = E[X³] = ∫_{−1}^1 (x³/2) dx = 0. □

70 / 104
Probability Primer
Covariance and Correlation

Theorem: Cov(aX, bY) = ab Cov(X, Y).

Theorem: Whether or not X and Y are independent,

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)   and

Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y).

Definition: The correlation between X and Y is

ρ ≡ Cov(X, Y) / √(Var(X)Var(Y)).

Theorem: −1 ≤ ρ ≤ 1.

71 / 104
Probability Primer
Covariance and Correlation

Example: Consider the following joint pmf.

f (x, y) X=2 X=3 X=4 fY (y)


Y = 40 0.00 0.20 0.10 0.3
Y = 50 0.15 0.10 0.05 0.3
Y = 60 0.30 0.00 0.10 0.4
fX (x) 0.45 0.30 0.25 1

E[X] = 2.8, Var(X) = 0.66, E[Y] = 51, Var(Y) = 69,

E[XY] = Σ_x Σ_y x y f(x, y) = 140,

and

ρ = (E[XY] − E[X]E[Y]) / √(Var(X)Var(Y)) = −0.415. □
72 / 104
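A quick Python check of these numbers, with the joint pmf from the table typed in directly (a sketch, not part of the original slides).

import math

pmf = {(2, 40): 0.00, (3, 40): 0.20, (4, 40): 0.10,
       (2, 50): 0.15, (3, 50): 0.10, (4, 50): 0.05,
       (2, 60): 0.30, (3, 60): 0.00, (4, 60): 0.10}

EX  = sum(x * p for (x, y), p in pmf.items())
EY  = sum(y * p for (x, y), p in pmf.items())
EXY = sum(x * y * p for (x, y), p in pmf.items())
VarX = sum(x * x * p for (x, y), p in pmf.items()) - EX**2
VarY = sum(y * y * p for (x, y), p in pmf.items()) - EY**2

rho = (EXY - EX * EY) / math.sqrt(VarX * VarY)
print(EX, VarX, EY, VarY, EXY, rho)   # 2.8, 0.66, 51, 69, 140, about -0.415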
Probability Primer
Covariance and Correlation

Portfolio Example: Consider two assets, S1 and S2, with expected returns E[S1] = µ1 and E[S2] = µ2, and variabilities Var(S1) = σ1², Var(S2) = σ2², and Cov(S1, S2) = σ12.

Define a portfolio P = wS1 + (1 − w)S2, where w ∈ [0, 1]. Then

E[P] = wµ1 + (1 − w)µ2,

Var(P) = w²σ1² + (1 − w)²σ2² + 2w(1 − w)σ12.

Setting (d/dw)Var(P) = 0, we obtain the critical point that (hopefully) minimizes the variance of the portfolio,

w = (σ2² − σ12) / (σ1² + σ2² − 2σ12). □

73 / 104
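A small Python sketch of the portfolio calculation; the numerical inputs below are placeholders to swap out for your own µ's, σ²'s, and σ12.

def portfolio(mu1, mu2, var1, var2, cov12, w):
    # Mean and variance of P = w*S1 + (1 - w)*S2.
    mean = w * mu1 + (1 - w) * mu2
    var = w**2 * var1 + (1 - w)**2 * var2 + 2 * w * (1 - w) * cov12
    return mean, var

def min_variance_weight(var1, var2, cov12):
    # Critical point of Var(P): (sigma2^2 - sigma12) / (sigma1^2 + sigma2^2 - 2*sigma12).
    return (var2 - cov12) / (var1 + var2 - 2 * cov12)

w_star = min_variance_weight(0.1, 0.3, 0.05)               # placeholder numbers
print(w_star, portfolio(0.15, 0.08, 0.1, 0.3, 0.05, w_star))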
Probability Primer
Covariance and Correlation

Portfolio Exercise: Suppose E[S1 ] = 0.2, E[S2 ] = 0.1,


Var(S1 ) = 0.2, Var(S2 ) = 0.4, and Cov(S1 , S2 ) = −0.1.

What value of w maximizes the expected return of the portfolio?

What value of w minimizes the variance? (Note the negative


covariance I’ve introduced into the picture.)

Let’s talk trade-offs.

74 / 104
Probability Primer
Some Probability Distributions

Some Probability Distributions


First, some discrete distributions. . .

X ∼ Bernoulli(p).
(
p if x = 1
f (x) =
1 − p (= q) if x = 0

E[X] = p, Var(X) = pq, MX (t) = pet + q.

iid
Y ∼ Binomial(n, p). If X1 , X
P2 , . . . , Xn ∼ Bern(p) (i.e.,
n
Bernoulli(p) trials), then Y = i=1 Xi ∼ Bin(n, p).
!
n y n−y
f (y) = p q , y = 0, 1, . . . , n.
y

E[Y ] = np, Var(Y ) = npq, MY (t) = (pet + q)n .


75 / 104
Probability Primer
Some Probability Distributions

X ∼ Geometric(p) is the number of Bern(p) trials until a success occurs. For example, “FFFS” implies that X = 4.

f(x) = q^{x−1} p,   x = 1, 2, . . . .

E[X] = 1/p, Var(X) = q/p², M_X(t) = pe^t/(1 − qe^t).

Y ∼ NegBin(r, p) is the sum of r iid Geom(p) RV's, i.e., the time until the rth success occurs. For example, “FFFSSFS” implies that NegBin(3, p) = 7.

f(y) = (y−1 choose r−1) q^{y−r} p^r,   y = r, r + 1, . . . .

E[Y] = r/p, Var(Y) = qr/p².

76 / 104
Probability Primer
Some Probability Distributions

X ∼ Poisson(λ).

Definition: A counting process N(t) tallies the number of “arrivals” observed in [0, t]. A Poisson process is a counting process satisfying the following.
i. Arrivals occur one-at-a-time at rate λ (e.g., λ = 4 customers/hr).
ii. Independent increments, i.e., the numbers of arrivals in disjoint time intervals are independent.
iii. Stationary increments, i.e., the distribution of the number of arrivals in [s, s + t] only depends on t.

X ∼ Pois(λ) is the number of arrivals that a Poisson process experiences in one time unit, i.e., N(1).

f(x) = e^{−λ} λ^x / x!,   x = 0, 1, . . . .

E[X] = λ = Var(X), M_X(t) = e^{λ(e^t − 1)}.
77 / 104
Probability Primer
Some Probability Distributions

Now, some continuous distributions. . .

X ∼ Uniform(a, b). f(x) = 1/(b − a) for a ≤ x ≤ b, E[X] = (a + b)/2, Var(X) = (b − a)²/12, M_X(t) = (e^{tb} − e^{ta})/(t(b − a)).

X ∼ Exponential(λ). f(x) = λe^{−λx} for x ≥ 0, E[X] = 1/λ, Var(X) = 1/λ², M_X(t) = λ/(λ − t) for t < λ.

Theorem: The exponential distribution has the memoryless property (and is the only continuous distribution with this property), i.e., for s, t > 0, P(X > s + t | X > s) = P(X > t).

Example: Suppose X ∼ Exp(λ = 1/100). Then

P(X > 200 | X > 50) = P(X > 150) = e^{−λt} = e^{−150/100}. □

78 / 104
Probability Primer
Some Probability Distributions

X ∼ Gamma(α, λ).

f(x) = λ^α x^{α−1} e^{−λx} / Γ(α),   x ≥ 0,

where the gamma function is Γ(α) ≡ ∫_0^∞ t^{α−1} e^{−t} dt.

E[X] = α/λ, Var(X) = α/λ², M_X(t) = [λ/(λ − t)]^α for t < λ.

If X1, X2, . . . , Xn are iid Exp(λ), then Y ≡ Σ_{i=1}^n Xi ∼ Gamma(n, λ). The Gamma(n, λ) is also called the Erlang_n(λ). It has cdf

F_Y(y) = 1 − e^{−λy} Σ_{j=0}^{n−1} (λy)^j / j!,   y ≥ 0.

79 / 104
Probability Primer
Some Probability Distributions

X ∼ Triangular(a, b, c). Good for modeling things with limited data: a is the smallest possible value, b is the “most likely,” and c is the largest.

f(x) =
  2(x − a) / [(b − a)(c − a)]  if a < x ≤ b
  2(c − x) / [(c − b)(c − a)]  if b < x ≤ c
  0                           otherwise

E[X] = (a + b + c)/3.

X ∼ Beta(a, b). f(x) = [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1}(1 − x)^{b−1} for 0 ≤ x ≤ 1 and a, b > 0.

E[X] = a/(a + b)   and   Var(X) = ab / [(a + b)²(a + b + 1)].

80 / 104
Probability Primer
Some Probability Distributions

X ∼ Normal(µ, σ²). Most important distribution.

f(x) = [1/√(2πσ²)] exp{−(x − µ)²/(2σ²)},   x ∈ R.

E[X] = µ, Var(X) = σ², and M_X(t) = exp(µt + σ²t²/2).

Theorem: If X ∼ Nor(µ, σ²), then aX + b ∼ Nor(aµ + b, a²σ²).

Corollary: If X ∼ Nor(µ, σ²), then Z ≡ (X − µ)/σ ∼ Nor(0, 1), the standard normal distribution, with pdf φ(z) ≡ (1/√(2π)) e^{−z²/2} and cdf Φ(z), which is tabled. E.g., Φ(1.96) ≈ 0.975.

Theorem: If X1 and X2 are independent with Xi ∼ Nor(µi, σi²), i = 1, 2, then X1 + X2 ∼ Nor(µ1 + µ2, σ1² + σ2²).

Example: Suppose X ∼ Nor(3, 4), Y ∼ Nor(4, 6), and X and Y are independent. Then 2X − 3Y + 1 ∼ Nor(−5, 70). □
81 / 104
Probability Primer
Limit Theorems

Limit Theorems

Corollary (of a previous theorem): If X1 , . . . , Xn are iid Nor(µ, σ 2 ),


then the sample mean X̄n ∼ Nor(µ, σ 2 /n).

This is a special case of the Law of Large Numbers, which says that
X̄n approximates µ well as n becomes large.

Definition: The sequence of RV’s Y1 , Y2 , . . . with respective cdf’s


FY1 (y), FY2 (y), . . . converges in distribution to the RV Y having cdf
FY (y) if limn→∞ FYn (y) = FY (y) for all y belonging to the
d
continuity set of Y . Notation: Yn −→ Y .

d
Idea: If Yn −→ Y and n is large, then you ought to be able to
approximate the distribution of Yn by the limit distribution of Y .

82 / 104
Probability Primer
Limit Theorems

Central Limit Theorem: If X1, X2, . . . , Xn are iid from f(x) with mean µ and variance σ², then

Zn ≡ (Σ_{i=1}^n Xi − nµ)/(√n σ) = √n(X̄n − µ)/σ →d Nor(0, 1).

Thus, the cdf of Zn approaches Φ(z) as n increases.

The CLT is the most-important theorem in the universe.

The CLT usually works well if the pmf/pdf is fairly symmetric and n ≥ 15.

We will eventually look at more-general versions of the CLT described above.

83 / 104
Probability Primer
Limit Theorems

Example: If X1, X2, . . . , X100 are iid Exp(1) (so µ = σ² = 1), then

P(90 ≤ Σ_{i=1}^{100} Xi ≤ 110)
  = P((90 − 100)/√100 ≤ Z100 ≤ (110 − 100)/√100)
  ≈ P(−1 ≤ Nor(0, 1) ≤ 1) = 0.6827.

By the way, since Σ_{i=1}^{100} Xi ∼ Erlang_{k=100}(λ = 1), we can use the cdf (which may be tedious) or software such as Minitab to obtain the exact value of P(90 ≤ Σ_{i=1}^{100} Xi ≤ 110) = 0.6835.

Wow! The CLT and exact answers match nicely! □

84 / 104
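A Python sketch that checks this example by simulation, using the standard library's random.expovariate for the Exp(1)'s (my own choice; the slides mention Minitab).

import random

def estimate(reps=20000):
    # Estimate P(90 <= sum of 100 iid Exp(1) <= 110) by Monte Carlo.
    hits = 0
    for _ in range(reps):
        s = sum(random.expovariate(1.0) for _ in range(100))
        if 90 <= s <= 110:
            hits += 1
    return hits / reps

print(estimate())   # close to the exact Erlang value 0.6835 and the CLT value 0.6827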
Probability Primer
Limit Theorems

Exercise: Demonstrate that the CLT actually works.


1 Pick your favorite RV X1 . Simulate it and make a histogram.
2 Now suppose X1 and X2 are iid from your favorite distribution.
Make a histogram of X1 + X2 .
3 Now X1 + X2 + X3 .
4 . . . Now X1 + X2 + · · · + Xn for some reasonably large n.
5 Does the CLT work for the Cauchy distribution, i.e.,
X = tan(2πU ), where U ∼ Unif(0, 1)?

85 / 104
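A Python sketch of the CLT exercise: pick a distribution, sum n copies, and histogram the sums (matplotlib assumed for the plot). Swapping in the Cauchy generator from item 5 is a good way to watch the CLT fail.

import math, random
import matplotlib.pyplot as plt

def draw():
    # "Favorite RV": an Exp(1) here; replace with anything, e.g.,
    # math.tan(2 * math.pi * random.random()) for the Cauchy of item 5.
    return random.expovariate(1.0)

n, reps = 30, 10000
sums = [sum(draw() for _ in range(n)) for _ in range(reps)]
plt.hist(sums, bins=60)   # for Exp(1), should look roughly Nor(n*mu, n*sigma^2) = Nor(30, 30)
plt.show()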
Statistics Primer

Outline
1 Calculus Primer
2 Probability Primer
Basics
Simulating Random Variables
Great Expectations
Functions of a Random Variable
Jointly Distributed Random Variables
Covariance and Correlation
Some Probability Distributions
Limit Theorems
3 Statistics Primer
Intro to Estimation
Unbiased Estimation
Maximum Likelihood Estimation
Distributional Results and Confidence Intervals
86 / 104
Statistics Primer
Intro to Estimation

Intro to Estimation

Definition: A statistic is a function of the observations X1 , . . . , Xn ,


and not explicitly dependent on any unknown parameters.
Examples of statistics: X̄ = (1/n) Σ_{i=1}^n Xi,   S² = [1/(n − 1)] Σ_{i=1}^n (Xi − X̄)².

Statistics are random variables. If we take two different samples,


we’d expect to get two different values of a statistic.

A statistic is usually used to estimate some unknown parameter from


the underlying probability distribution of the Xi ’s.

Examples of parameters: µ, σ 2 .

87 / 104
Statistics Primer
Intro to Estimation

Let X1 , . . . , Xn be iid RV’s and let T (X) ≡ T (X1 , . . . , Xn ) be a


statistic based on the Xi ’s. Suppose we use T (X) to estimate some
unknown parameter θ. Then T (X) is called a point estimator for θ.

Examples: X̄ is usually a point estimator for the mean µ = E[Xi ],


and S 2 is often a point estimator for the variance σ 2 = Var(Xi ).

It would be nice if T (X) had certain properties:

* Its expected value should equal the parameter it’s trying to estimate.

* It should have low variance.

88 / 104
Statistics Primer
Unbiased Estimation

Unbiased Estimators
Definition: T (X) is unbiased for θ if E[T (X)] = θ.

Example/Theorem: Suppose X1 , . . . , Xn are iid anything with


mean µ. Then

E[X̄] = E[(1/n) Σ_{i=1}^n Xi] = (1/n) Σ_{i=1}^n E[Xi] = E[Xi] = µ.

So X̄ is always unbiased for µ. That’s why X̄ is called the sample


mean.

Baby Example: In particular, suppose X1 , . . . , Xn are iid Exp(λ).


Then X̄ is unbiased for µ = E[Xi ] = 1/λ.

But be careful. . . 1/X̄ is biased for λ in this exponential case, i.e., E[1/X̄] ≠ 1/E[X̄] = λ.
89 / 104
Statistics Primer
Unbiased Estimation

Example/Theorem: Suppose X1, . . . , Xn are iid anything with mean µ and variance σ². Then

E[S²] = E[Σ_{i=1}^n (Xi − X̄)²/(n − 1)] = Var(Xi) = σ².

Thus, S² is always unbiased for σ². This is why S² is called the sample variance.

Baby Example: Suppose X1 , . . . , Xn are iid Exp(λ). Then S 2 is


unbiased for Var(Xi ) = 1/λ2 .

90 / 104
Statistics Primer
Unbiased Estimation

Proof (of general result): First, some algebra gives

S² = Σ_{i=1}^n (Xi − X̄)²/(n − 1) = (Σ_{i=1}^n Xi² − nX̄²)/(n − 1).

Since E[X1] = E[X̄] and Var(X̄) = Var(X1)/n = σ²/n, we have

E[S²] = (Σ_{i=1}^n E[Xi²] − nE[X̄²])/(n − 1) = [n/(n − 1)] (E[X1²] − E[X̄²])
      = [n/(n − 1)] (Var(X1) + (E[X1])² − Var(X̄) − (E[X̄])²)
      = [n/(n − 1)] (σ² − σ²/n) = σ². □

Remark: S is biased for the standard deviation σ.

91 / 104
Statistics Primer
Unbiased Estimation
Big Example: Suppose that X1, . . . , Xn are iid Unif(0, θ), i.e., the pdf is f(x) = 1/θ, 0 < x < θ.

Consider two estimators: Y1 ≡ 2X̄ and Y2 ≡ [(n + 1)/n] max_{1≤i≤n} Xi.

Since E[Y1] = 2E[X̄] = 2E[Xi] = θ, we see that Y1 is unbiased for θ.

It's also the case that Y2 is unbiased, but it takes a little more work to show this. As a first step, let's get the cdf of M ≡ max_i Xi:

P(M ≤ y) = P(X1 ≤ y and X2 ≤ y and · · · and Xn ≤ y)
         = Π_{i=1}^n P(Xi ≤ y) = [P(X1 ≤ y)]^n   (Xi's are iid)
         = [∫_0^y f_{X1}(x) dx]^n = [∫_0^y (1/θ) dx]^n = (y/θ)^n.

92 / 104
Statistics Primer
Unbiased Estimation

This implies that the pdf of M is

f_M(y) ≡ (d/dy)(y/θ)^n = ny^{n−1}/θ^n,

and this implies that

E[M] = ∫_0^θ y f_M(y) dy = ∫_0^θ (ny^n/θ^n) dy = nθ/(n + 1).

Whew! So we see that Y2 = [(n + 1)/n] max_{1≤i≤n} Xi is unbiased for θ.

So both Y1 and Y2 are unbiased for θ, but which is better?

Let's now compare variances. After similar algebra, we have

Var(Y1) = θ²/(3n)   and   Var(Y2) = θ²/[n(n + 2)].

Thus, Y2 has much lower variance than Y1. □


93 / 104
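A simulation sketch in Python comparing the two estimators for one choice of θ and n (the values are arbitrary illustrations, not from the slides).

import random

theta, n, reps = 10.0, 10, 20000
y1_vals, y2_vals = [], []
for _ in range(reps):
    xs = [random.uniform(0, theta) for _ in range(n)]
    y1_vals.append(2 * sum(xs) / n)               # Y1 = 2 * Xbar
    y2_vals.append((n + 1) / n * max(xs))         # Y2 = (n+1)/n * max

def mean_var(v):
    m = sum(v) / len(v)
    return m, sum((x - m)**2 for x in v) / (len(v) - 1)

print(mean_var(y1_vals))   # mean near theta = 10, variance near theta^2/(3n) = 3.33
print(mean_var(y2_vals))   # mean near 10, variance near theta^2/(n(n+2)) = 0.83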
Statistics Primer
Unbiased Estimation

Mean Squared Error

Definition: The bias of T (X) as an estimator of θ is


Bias(T ) ≡ E[T ] − θ.

The mean squared error of T(X) is MSE(T) ≡ E[(T − θ)²].

Remark: After some algebra, we get an easier expression for MSE that combines the bias and variance of an estimator:

MSE(T) = Var(T) + (E[T] − θ)² = Var(T) + Bias²(T).

Lower MSE is better, even if there's a little bias.

94 / 104
Statistics Primer
Unbiased Estimation

Definition: The relative efficiency of T2 to T1 is MSE(T1)/MSE(T2). If this quantity is < 1, then we'd want T1.

Example: X1, . . . , Xn are iid Unif(0, θ).

Two estimators: Y1 = 2X̄ and Y2 = [(n + 1)/n] max_i Xi.

We showed before that E[Y1] = E[Y2] = θ (so both are unbiased).

Also, Var(Y1) = θ²/(3n) and Var(Y2) = θ²/[n(n + 2)].

Thus, MSE(Y1) = θ²/(3n) and MSE(Y2) = θ²/[n(n + 2)], so Y2 is better.

95 / 104
Statistics Primer
Maximum Likelihood Estimation

Maximum Likelihood Estimators

Definition: Consider an iid random sample X1, . . . , Xn, where each Xi has pdf/pmf f(x). Further, suppose that θ is some unknown parameter from Xi. The likelihood function is L(θ) ≡ Π_{i=1}^n f(x_i).

Definition: The maximum likelihood estimator (MLE) of θ is the value of θ that maximizes L(θ). The MLE is a function of the Xi's and is a RV.

Example: Suppose X1, . . . , Xn are iid Exp(λ). Find the MLE for λ.

L(λ) = Π_{i=1}^n f(x_i) = Π_{i=1}^n λe^{−λx_i} = λ^n exp(−λ Σ_{i=1}^n x_i).

Now maximize L(λ) with respect to λ.


96 / 104
Statistics Primer
Maximum Likelihood Estimation

Could take the derivative and plow through all of the horrible algebra. Too tedious. Need a trick. . . .

Useful Trick: Since the natural log function is one-to-one, it's easy to see that the λ that maximizes L(λ) also maximizes ℓn(L(λ))!

ℓn(L(λ)) = ℓn[λ^n exp(−λ Σ_{i=1}^n x_i)] = n ℓn(λ) − λ Σ_{i=1}^n x_i.

This makes our job less horrible.

(d/dλ) ℓn(L(λ)) = (d/dλ)[n ℓn(λ) − λ Σ_{i=1}^n x_i] = n/λ − Σ_{i=1}^n x_i ≡ 0.

This implies that the MLE is λ̂ = 1/X̄. □


97 / 104
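A Python sketch confirming the MLE numerically: compute λ̂ = 1/X̄ from a simulated Exp(λ) sample and compare with a crude grid maximization of the log-likelihood (the grid search is only an illustration, not how you would do this in practice).

import math, random

lam_true = 2.0
xs = [random.expovariate(lam_true) for _ in range(1000)]

lam_hat = 1 / (sum(xs) / len(xs))       # closed-form MLE: 1 / Xbar
print(lam_hat)

def loglik(lam):
    # ln L(lambda) = n*ln(lambda) - lambda * sum(x_i)
    return len(xs) * math.log(lam) - lam * sum(xs)

grid = [0.01 * k for k in range(1, 1001)]   # lambda values in (0, 10]
print(max(grid, key=loglik))                # should land near lam_hat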
Statistics Primer
Maximum Likelihood Estimation

Remarks: (1) λ̂ = 1/X̄ makes sense since E[X] = 1/λ.

(2) At the end, we put a little hat over λ to indicate that this is the MLE.

(3) At the end, we make all of the little xi's into big Xi's to indicate that this is a RV.

(4) Just to be careful, you probably ought to perform a


second-derivative test, but I won’t blame you if you don’t.

98 / 104
Statistics Primer
Maximum Likelihood Estimation

Invariance Property of MLE's

Theorem (Invariance Property): If θ̂ is the MLE of some parameter θ and h(·) is a one-to-one function, then h(θ̂) is the MLE of h(θ).

Example: Suppose X1, . . . , Xn are iid Exp(λ). We define the survival function as

F̄(x) = P(X > x) = 1 − F(x) = e^{−λx}.

In addition, we saw that the MLE for λ is λ̂ = 1/X̄.

Then the invariance property says that the MLE of F̄(x) is e^{−λ̂x} = e^{−x/X̄}.

This kind of thing is used all of the time in the actuarial sciences. □
99 / 104
Statistics Primer
Distributional Results and Confidence Intervals

Distributional Results and Confidence Intervals

There are a number of distributions (including the normal) that come up in statistical sampling problems. Here are a few:

Definitions: If Z1, Z2, . . . , Zk are iid Nor(0,1), then Y = Σ_{i=1}^k Zi² has the χ² distribution with k degrees of freedom (df). Notation: Y ∼ χ²(k). Note that E[Y] = k and Var(Y) = 2k.

If Z ∼ Nor(0, 1), Y ∼ χ²(k), and Z and Y are independent, then T = Z/√(Y/k) has the Student t distribution with k df. Notation: T ∼ t(k). Note that the t(1) is the Cauchy distribution.

If Y1 ∼ χ²(m), Y2 ∼ χ²(n), and Y1 and Y2 are independent, then F = (Y1/m)/(Y2/n) has the F distribution with m and n df. Notation: F ∼ F(m, n).

100 / 104
Statistics Primer
Distributional Results and Confidence Intervals

How (and why) would one use the above facts? Because they can be used to construct confidence intervals (CIs) for µ and σ² under a variety of assumptions.

A 100(1 − α)% two-sided CI for an unknown parameter θ is a random interval [L, U] such that P(L ≤ θ ≤ U) = 1 − α.

Here are some examples / theorems, all of which assume that the Xi's are iid normal. . .

Example: If σ² is known, then a 100(1 − α)% CI for µ is

X̄n − z_{α/2} √(σ²/n) ≤ µ ≤ X̄n + z_{α/2} √(σ²/n),

where z_γ is the 1 − γ quantile of the standard normal distribution, i.e., z_γ ≡ Φ⁻¹(1 − γ).
101 / 104
Statistics Primer
Distributional Results and Confidence Intervals

Example: If σ² is unknown, then a 100(1 − α)% CI for µ is

X̄n − t_{α/2,n−1} √(S²/n) ≤ µ ≤ X̄n + t_{α/2,n−1} √(S²/n),

where t_{γ,ν} is the 1 − γ quantile of the t(ν) distribution.

Example: A 100(1 − α)% CI for σ² is

(n − 1)S²/χ²_{α/2,n−1} ≤ σ² ≤ (n − 1)S²/χ²_{1−α/2,n−1},

where χ²_{γ,ν} is the 1 − γ quantile of the χ²(ν) distribution.

102 / 104
Statistics Primer
Distributional Results and Confidence Intervals

Exercise: Here are 20 residual flame times (in sec.) of treated


specimens of children’s nightwear. (Don’t worry — children were not
in the nightwear when the clothing was set on fire.)

9.85 9.93 9.75 9.77 9.67


9.87 9.67 9.94 9.85 9.75
9.83 9.92 9.74 9.99 9.88
9.95 9.95 9.93 9.92 9.89

Let’s get a 95% CI for the mean residual flame time.

103 / 104
Statistics Primer
Distributional Results and Confidence Intervals

After a little algebra, we get

X̄ = 9.8525   and   S = 0.0965.

Further, you can use the Excel function t.inv(0.975,19) to get

t_{α/2,n−1} = t_{0.025,19} = 2.093.

Then the half-length of the CI is

H = t_{α/2,n−1} √(S²/n) = (2.093)(0.0965)/√20 = 0.0451.

Thus, the CI is µ ∈ X̄ ± H, or 9.8074 ≤ µ ≤ 9.8976. □

104 / 104
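A Python sketch of this last computation, reusing the slide's t quantile 2.093 directly (if scipy is available, scipy.stats.t.ppf(0.975, 19) gives the same value).

import math

times = [9.85, 9.93, 9.75, 9.77, 9.67, 9.87, 9.67, 9.94, 9.85, 9.75,
         9.83, 9.92, 9.74, 9.99, 9.88, 9.95, 9.95, 9.93, 9.92, 9.89]

n = len(times)
xbar = sum(times) / n
s2 = sum((x - xbar)**2 for x in times) / (n - 1)
t_quantile = 2.093                        # t_{0.025, 19}, from the slide / Excel t.inv
half = t_quantile * math.sqrt(s2 / n)
print(xbar, math.sqrt(s2), half)          # 9.8525, about 0.0965, about 0.0451
print(xbar - half, xbar + half)           # about 9.8074 and 9.8976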
