Maximum Likelihood Estimation
It seems reasonable that a good estimate of the unknown parameter θ would be the
value of θ that maximizes the probability, errrr... that is, the likelihood... of getting
the data we observed. (So, do you see from where the name "maximum likelihood"
comes?) So, that is, in a nutshell, the idea behind the method of maximum likelihood
estimation. But how would we implement the method in practice? Well, suppose we
have a random sample X1, X2,..., Xn for which the probability density (or mass)
function of each Xi is f(xi; θ). Then, the joint probability mass (or density) function
of X1, X2,..., Xn, which we'll (not so arbitrarily) call L(θ) is:
$$L(\theta) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = f(x_1;\theta) \cdot f(x_2;\theta) \cdots f(x_n;\theta) = \prod_{i=1}^{n} f(x_i;\theta)$$
The first equality is of course just the definition of the joint probability mass function.
The second equality comes from the fact that we have a random sample, which
implies by definition that the Xi are independent. And, the last equality just uses the
shorthand mathematical notation of a product of indexed terms. Now, in light of the
basic idea of maximum likelihood estimation, one reasonable way to proceed is to
treat the "likelihood function" L(θ) as a function of θ, and find the value of θ that
maximizes it.
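If it helps to see the idea in code before we do any calculus, here is a minimal Python sketch. The 0/1 data and the Bernoulli-type mass function are just illustrative choices of mine (they are not part of the course example); the point is simply that L(θ) is the product of the individual f(xi; θ) values, and we can hunt for the θ that makes the observed data most likely:

import numpy as np

# Hypothetical 0/1 data, for illustration only.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

def f(xi, theta):
    # A Bernoulli-type mass function: f(xi; theta) = theta^xi * (1 - theta)^(1 - xi).
    return theta**xi * (1 - theta)**(1 - xi)

def likelihood(theta, data):
    # L(theta) = f(x1; theta) * f(x2; theta) * ... * f(xn; theta).
    return np.prod([f(xi, theta) for xi in data])

# Evaluate L(theta) over a grid of candidate values and report the maximizer.
grid = np.linspace(0.01, 0.99, 99)
L = np.array([likelihood(t, x) for t in grid])
print("theta maximizing L over the grid:", grid[np.argmax(L)])
print("proportion of 1's in the data:   ", x.mean())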
Is this still sounding like too much abstract gibberish? Let's take a look at an example
to see if we can make it a bit more concrete.
Example
Suppose we have a random sample X1, X2, ..., Xn, where each Xi equals 1 if the i-th sampled individual is a "success" and 0 otherwise, and where p denotes the unknown probability of success, so that each Xi is a Bernoulli random variable with probability mass function f(xi; p) = p^xi (1 − p)^(1 − xi). Because the Xi are independent, the likelihood function is:

$$L(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{\,n-\sum x_i}$$
Now, in order to implement the method of maximum likelihood, we need to
find the p that maximizes the likelihood L(p). We need to put on our calculus
hats now, since in order to maximize the function, we are going to need to
differentiate the likelihood function with respect to p. In doing so, we'll use a
"trick" that often makes the differentiation a bit easier. Note that the natural
logarithm is an increasing function of x; that is, if x1 < x2, then ln(x1) < ln(x2). That means that the value of p that
maximizes the natural logarithm of the likelihood function ln(L(p)) is also the
value of p that maximizes the likelihood function L(p). So, the "trick" is to take
the derivative of ln(L(p)) (with respect to p) rather than taking the derivative
of L(p). Again, doing so often makes the differentiation much easier. (By the
way, throughout the remainder of this course, I will use either ln(L(p)) or
log(L(p)) to denote the natural logarithm of the likelihood function.)
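You can also convince yourself of the "trick" numerically. In the short sketch below (with made-up 0/1 data again), L(p) and its natural logarithm peak at exactly the same value of p; as a bonus, working on the log scale avoids the numerical underflow that a raw product of many small numbers can suffer:

import numpy as np

# Hypothetical 0/1 data, for illustration only.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
n, s = len(x), x.sum()

grid = np.linspace(0.01, 0.99, 99)
L = grid**s * (1 - grid)**(n - s)                      # likelihood L(p)
logL = s * np.log(grid) + (n - s) * np.log(1 - grid)   # log-likelihood ln L(p)

# Because ln is increasing, both curves are maximized at the same p.
print(grid[np.argmax(L)], grid[np.argmax(logL)])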
In this case, the log of the likelihood function is:

$$\log L(p) = \left(\sum x_i\right)\log p + \left(n - \sum x_i\right)\log(1-p)$$

Taking the derivative with respect to p and setting it equal to 0, we get:

$$\frac{\partial \log L(p)}{\partial p} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = 0$$

Now, multiplying through by p(1 − p), we get:

$$\left(\sum x_i\right)(1-p) - \left(n - \sum x_i\right)p = 0$$
Upon distributing, we see that two of the resulting terms cancel each other out:

$$\sum x_i - p\sum x_i - np + p\sum x_i = 0$$

leaving us with:

$$\sum x_i - np = 0$$
Now, all we have to do is solve for p. In doing so, you'll want to make sure
that you always put a hat ("^") on the parameter, in this case p, to indicate it
is an estimate:
$$\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}$$
or, alternatively, an estimator:
$$\hat{p} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Oh, and we should technically verify that we indeed did obtain a maximum.
We can do that by verifying that the second derivative of the log likelihood
with respect to p is negative. It is, but you might want to do the work to
convince yourself!
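If you would rather let the computer do some of that checking, here is a small sketch (made-up data again; the use of scipy's bounded scalar minimizer is my choice, not something from the course) that maximizes the log-likelihood numerically, compares the answer with the closed form p̂ = Σxi / n, and evaluates the second derivative of the log-likelihood at p̂ to confirm it is negative:

import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical 0/1 data, for illustration only.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
n, s = len(x), x.sum()

def neg_log_likelihood(p):
    # Negative of ln L(p) = (sum xi) ln p + (n - sum xi) ln(1 - p).
    return -(s * np.log(p) + (n - s) * np.log(1 - p))

# Numerically maximize ln L(p) on (0, 1) by minimizing its negative.
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
p_hat = s / n
print("numerical maximizer:", result.x)
print("closed form p-hat:  ", p_hat)

# Second derivative of ln L(p) at p-hat: -s/p^2 - (n - s)/(1 - p)^2, which is negative.
print("second derivative at p-hat:", -s / p_hat**2 - (n - s) / (1 - p_hat)**2)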
Now, with that example behind us, let us take a look at formal definitions of the terms
(1) likelihood function, (2) maximum likelihood estimators, and (3) maximum
likelihood estimates.
Let X1, X2, ..., Xn be a random sample from a distribution that depends on one or more unknown parameters θ1, θ2, ..., θm, with probability density (or mass) function f(xi; θ1, θ2, ..., θm). Then:

(1) When regarded as a function of θ1, θ2, ..., θm, the joint probability density (or mass) function of X1, X2, ..., Xn:

$$L(\theta_1, \theta_2, \ldots, \theta_m) = \prod_{i=1}^{n} f(x_i; \theta_1, \theta_2, \ldots, \theta_m)$$

is called the likelihood function.

(2) If the m-tuple [u1(x1, x2, ..., xn), u2(x1, x2, ..., xn), ..., um(x1, x2, ..., xn)] maximizes the likelihood function, then:

$$\hat{\theta}_i = u_i(X_1, X_2, \ldots, X_n)$$

is the maximum likelihood estimator of θi, for i = 1, 2, ..., m.

(3) The corresponding observed values, ui(x1, x2, ..., xn), are called the maximum likelihood estimates of θi, for i = 1, 2, ..., m.
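In code terms, definition (1) just says: hold the data fixed, multiply the individual density (or mass) values together, and regard the result as a function of the parameters. The little Python sketch below does exactly that; the helper name make_likelihood, the normal density used to illustrate it, and the data are all my own choices rather than course notation:

import math

def make_likelihood(f, data):
    # Return L(theta1, ..., thetam) = product over the data of f(xi; theta1, ..., thetam).
    def L(*thetas):
        value = 1.0
        for xi in data:
            value *= f(xi, *thetas)
        return value
    return L

def normal_pdf(xi, mu, sigma2):
    # The normal density with mean mu and variance sigma2 (as in the next example).
    return math.exp(-(xi - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Hypothetical data, for illustration only.
L = make_likelihood(normal_pdf, [4.1, 5.8, 5.2, 6.9])
print(L(5.5, 1.0))   # a plausible (mu, sigma^2) gives a relatively large likelihood
print(L(10.0, 1.0))  # an implausible mu makes the same data far less likely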
Example
Suppose the weights of American female college students are normally distributed with unknown mean μ and standard deviation σ, and that a random sample of n = 10 such weights has been obtained. Based on the definitions given above, identify the likelihood function and the maximum likelihood estimator of μ, the mean weight of all American female college students. Using the given sample, find a maximum likelihood estimate of μ as well.
Solution. The probability density function of Xi is:

$$f(x_i; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right]$$
for −∞ < x < ∞. The parameter space is Ω = {(μ, σ): −∞ < μ < ∞ and 0 < σ
< ∞}. Therefore, (you might want to convince yourself that) the likelihood
function is:
$$L(\mu, \sigma) = \sigma^{-n}(2\pi)^{-n/2}\exp\!\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right]$$
for −∞ < μ < ∞ and 0 < σ < ∞. It can be shown (we'll do so in the next
example!), upon maximizing the likelihood function with respect to μ, that the
maximum likelihood estimator of μ is:
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$$
Based on the given sample, a maximum likelihood estimate of μ is:
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{10}(115 + \cdots + 180) = 142.2$$
pounds. Note that the only difference between the formulas for the maximum
likelihood estimator and the maximum likelihood estimate is that:
the estimator is defined using capital letters (to denote that its value is
random), and
the estimate is defined using lowercase letters (to denote that its value
is fixed and based on an obtained sample), as the short sketch below illustrates.
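Seen from a programming point of view, the estimator is the rule (a function you could apply to any random sample), while the estimate is the single number you get once a particular sample is plugged in. Here is a tiny sketch, using a made-up sample of 10 weights rather than the course's data set:

import numpy as np

def mu_hat(sample):
    # The maximum likelihood estimator of mu: the rule "average the observations."
    sample = np.asarray(sample, dtype=float)
    return sample.sum() / len(sample)

# A hypothetical observed sample of 10 weights (in pounds), for illustration only.
observed = [120, 135, 128, 150, 142, 138, 160, 131, 145, 155]

# Applying the rule to this particular sample yields the maximum likelihood estimate.
print("maximum likelihood estimate of mu:", mu_hat(observed))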
Okay, so now we have the formal definitions out of the way. The first example on this
page involved a joint probability mass function that depends on only one parameter,
namely p, the proportion of successes. Now, let's take a look at an example that
involves a joint probability density function that depends on two parameters.
Example
Let X1, X2,..., Xn be a random sample from a normal distribution with unknown
mean μ and variance σ2. Find maximum likelihood estimators of mean μ and
variance σ2.
Solution. In finding the estimators, the first thing we'll do is write the
probability density function as a function of θ1 = μ and θ2 = σ2:
$$f(x_i; \theta_1, \theta_2) = \frac{1}{\sqrt{\theta_2}\sqrt{2\pi}}\exp\!\left[-\frac{(x_i - \theta_1)^2}{2\theta_2}\right]$$
for −∞ < θ1 < ∞ and 0 < θ2 < ∞. We do this so as not to cause confusion
when taking the derivative of the likelihood with respect to σ2. Now, that
makes the likelihood function:
$$L(\theta_1, \theta_2) = \prod_{i=1}^{n} f(x_i; \theta_1, \theta_2) = \theta_2^{-n/2}(2\pi)^{-n/2}\exp\!\left[-\frac{1}{2\theta_2}\sum_{i=1}^{n}(x_i - \theta_1)^2\right]$$

and therefore the log of the likelihood function:

$$\log L(\theta_1, \theta_2) = -\frac{n}{2}\log\theta_2 - \frac{n}{2}\log(2\pi) - \frac{\sum(x_i - \theta_1)^2}{2\theta_2}$$
Now, upon taking the partial derivative of the log likelihood with respect
to θ1, and setting to 0, we see that a few things cancel each other out,
leaving us with:
$$\sum x_i - n\theta_1 = 0$$
Now, solving for θ1, and putting on its hat, we have shown that the maximum
likelihood estimate of θ1 is:
$$\hat{\theta}_1 = \hat{\mu} = \frac{\sum x_i}{n} = \bar{x}$$
Now for θ2. Taking the partial derivative of the log likelihood with respect to θ2, and setting it to 0, we get:

$$\frac{\partial \log L(\theta_1, \theta_2)}{\partial \theta_2} = -\frac{n}{2\theta_2} + \frac{\sum(x_i - \theta_1)^2}{2\theta_2^2} = 0$$

Multiplying through by 2θ2², we get:
$$-n\theta_2 + \sum(x_i - \theta_1)^2 = 0$$
And, solving for θ2, and putting on its hat, we have shown that the maximum likelihood estimate of θ2 is:

$$\hat{\theta}_2 = \hat{\sigma}^2 = \frac{\sum(x_i - \bar{x})^2}{n}$$
(I'll again leave it to you to verify, in each case, that the second partial
derivative of the log likelihood is negative, and therefore that we did indeed
find maxima.) In summary, we have shown that the maximum likelihood
estimators of μ and variance σ2 for the normal model are:
$$\hat{\mu} = \frac{\sum X_i}{n} = \bar{X} \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{\sum(X_i - \bar{X})^2}{n}$$
respectively.
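If you want to check the calculus without doing it by hand, here is a sketch that uses sympy (my choice of tool, with a tiny made-up data set) to confirm that both partial derivatives of the log-likelihood vanish at (x̄, Σ(xi − x̄)²/n) and that both second partial derivatives are negative there:

import sympy as sp

theta1, theta2 = sp.symbols("theta1 theta2", positive=True)

# A tiny hypothetical data set, for illustration only.
data = [sp.Integer(v) for v in (4, 6, 5, 7)]
n = len(data)

logL = (-sp.Rational(n, 2) * sp.log(theta2)
        - sp.Rational(n, 2) * sp.log(2 * sp.pi)
        - sum((xi - theta1) ** 2 for xi in data) / (2 * theta2))

# The claimed maximizers: theta1 = xbar and theta2 = sum((xi - xbar)^2) / n.
xbar = sum(data) / n
vhat = sum((xi - xbar) ** 2 for xi in data) / n
at_hat = {theta1: xbar, theta2: vhat}

print(sp.simplify(sp.diff(logL, theta1).subs(at_hat)))     # 0
print(sp.simplify(sp.diff(logL, theta2).subs(at_hat)))     # 0
print(sp.simplify(sp.diff(logL, theta1, 2).subs(at_hat)))  # negative
print(sp.simplify(sp.diff(logL, theta2, 2).subs(at_hat)))  # negative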
Note that the maximum likelihood estimator of σ2 for the normal model is not the
sample variance S2. They are, in fact, competing estimators. So how do we know
which estimator we should use for σ2? Well, one way is to choose the estimator that
is "unbiased." Let's go learn about unbiased estimators now.