ML Estimation
The likelihood and log-likelihood functions are the basis for deriving estimators of parameters, given data. While the shapes of these two functions differ, they attain their maximum at the same value. In fact, the value of p corresponding to this maximum point is defined as the Maximum Likelihood Estimate (MLE) and is denoted $\hat{p}$. This is the value that is "most likely," relative to the other values. This is a simple, compelling concept, and it has a host of good statistical properties.
Thus, in general, we seek $\hat{\theta}$, the value that maximizes the log-likelihood function. In the binomial model, $\log(\mathcal{L})$ is a function of only one parameter, so it is easy to plot and visualize. The maximum likelihood estimate of p (the unknown parameter in the model) is the value that maximizes the log-likelihood, given the data. We denote this as $\hat{p}$. In the binomial model there is an analytical form (termed "closed form") of the MLE, thus numerical maximization of the log-likelihood is not required. In this simple case, $\hat{p} = y/n$.
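Written out, the binomial log-likelihood and its closed-form maximizer are:

$$\log \mathcal{L}(p \mid n, y) = \log\binom{n}{y} + y\,\log(p) + (n - y)\,\log(1 - p), \qquad \hat{p} = \frac{y}{n}.$$

For the coin-flipping example used below ($n = 11$, $y = 7$), this gives $\hat{p} = 7/11 \approx 0.636$.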
The log-likelihood links the data, the unknown model parameters, and the assumptions, and allows rigorous statistical inferences.
Real-world problems have more than one variable or parameter (e.g., p in the example). Computers can find the maximum of a multi-dimensional log-likelihood function, so the biologist need not be terribly concerned with these details.
The actual numerical value of the log-likelihood at its maximum point is of substantial importance. In the binomial coin-flipping example with n = 11 and y = 7, the maximized value is $\max \log(\mathcal{L}) = -1.411$ (see graph).
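As a quick check, this value can be reproduced directly from the binomial log-likelihood (a minimal Python sketch):

```python
from math import comb, log

n, y = 11, 7          # coin-flipping example: 7 "successes" in 11 trials
p_hat = y / n         # closed-form MLE of p

# Binomial log-likelihood evaluated at its maximum point
log_lik_max = log(comb(n, y)) + y * log(p_hat) + (n - y) * log(1 - p_hat)
print(round(log_lik_max, 3))   # approximately -1.411
```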
The log-likelihood function is of fundamental importance in the theory of inference and in all of statistics. It is the basis for the methods explored in FW-663. Students should make every effort to become comfortable with this function in the simple cases; extending the concepts to more complex cases will then come easily.
In particular, the log-likelihood function provides:
1. The basis for deriving estimators or estimates of model parameters (e.g., survival probabilities). These are termed maximum likelihood estimates, "MLEs."
2. Estimates of the precision (or repeatability). This is usually the conditional (on the model) sampling variance-covariance matrix (to be discussed).
Numbers 1-3 (above) require a model to be “given." Number 4, statistical hypothesis testing,
has become less useful in many respects in the past two decades and we do not stress this
approach as much as others might. Likelihood theory is also important in Bayesian statistics.
3. MLEs are asymptotically unbiased (MLEs are often biased, but the bias $\to 0$ as $n \to \infty$).
One-to-one transformations of MLEs are also MLEs. For example, mean life span $\bar{L}$ is defined as $-1/\log_e(S)$. Thus, an estimator of mean life span is $\hat{\bar{L}} = -1/\log_e(\hat{S})$, and $\hat{\bar{L}}$ is also an MLE.
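For instance, if the estimated survival were $\hat{S} = 0.9$ (a hypothetical value), the invariance property gives the MLE of mean life span directly:

$$\hat{\bar{L}} = \frac{-1}{\log_e(0.9)} \approx 9.5$$

(in the time units on which $S$ is defined, e.g., years for an annual survival probability).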
The log-likelihood functions we will see have a single mode or maximum point and no local
optima. These conditions make the use of numerical methods appealing and efficient.
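For illustration, the binomial log-likelihood for the n = 11, y = 7 example can be maximized numerically with a standard bounded optimizer (a minimal sketch using SciPy; the closed form makes this unnecessary here, but the same approach carries over to models without closed forms):

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, y = 11, 7  # coin-flipping example

def neg_log_lik(p):
    """Negative binomial log-likelihood (constant term omitted)."""
    return -(y * np.log(p) + (n - y) * np.log(1 - p))

# A unimodal log-likelihood makes bounded 1-D optimization fast and reliable.
result = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)   # approximately 0.636 = 7/11, the MLE
```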
Consider first the binomial model with a single unknown parameter, p. Using calculus, one could take the first partial derivative of the log-likelihood function with respect to p, set it to zero, and solve for p. This solution gives $\hat{p}$, the MLE. This value of $\hat{p}$ is the one that maximizes the log-likelihood function; it is the value of the parameter that is most likely, given the data.
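Carried out for a general binomial sample of y successes in n trials, the calculation is short:

$$\frac{\partial \log \mathcal{L}(p)}{\partial p} = \frac{y}{p} - \frac{n-y}{1-p} = 0 \quad\Longrightarrow\quad y(1-p) = (n-y)\,p \quad\Longrightarrow\quad \hat{p} = \frac{y}{n}.$$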
Now, what if 8 people had a single ticket, one had 4 tickets, but the last had 80 tickets.
Surely, the person with 80 tickets is most likely to win (but not with certainty). In this
simple example you have a feeling about the “strength of evidence" about the likely
winner. In the first case, one person has an edge, but not much more. In the second
case, the person with 80 tickets is relatively very likely to win.
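This intuition can be quantified for the case just described (8 single tickets, plus 4, plus 80, so 92 tickets in all):

$$P(\text{80-ticket holder wins}) = \frac{80}{92} \approx 0.87, \qquad P(\text{4-ticket holder wins}) = \frac{4}{92} \approx 0.04.$$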
The shape of the log-likelihood function is important in a conceptual way related to the raffle ticket example. If the log-likelihood function is relatively flat, one can make the interpretation that several (perhaps many) values of p are nearly equally likely; they are relatively alike. This is quantified as the sampling variance or standard error.
If the log-likelihood function is fairly flat, this implies considerable uncertainty and this is
reflected in large sampling variances and standard errors, and wide confidence intervals. On the
other hand, if the log-likelihood function is fairly peaked near its maximum point, this indicates
some values of p are relatively very likely compared to others (like the person with 80 raffle
tickets). A considerable degree of certainty is implied, and this is reflected in small sampling variances and standard errors, and narrow confidence intervals. So, both the value of the log-likelihood function at its maximum point and the shape of the function near this maximum point are important.
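A small numerical sketch makes this concrete. Comparing the n = 11, y = 7 example with a hypothetical sample ten times as large ($n = 110$, $y = 70$, so the same $\hat{p}$), the log-likelihood drops away from its maximum much more steeply and the standard error shrinks:

```python
import numpy as np

def log_lik(p, n, y):
    """Binomial log-likelihood (constant term omitted)."""
    return y * np.log(p) + (n - y) * np.log(1 - p)

for n, y in [(11, 7), (110, 70)]:          # same MLE, ten times the data
    p_hat = y / n
    drop = log_lik(p_hat, n, y) - log_lik(p_hat - 0.1, n, y)
    se = np.sqrt(p_hat * (1 - p_hat) / n)  # binomial SE of p-hat
    print(f"n={n:3d}  drop in log-likelihood at p_hat - 0.1: {drop:5.2f}   SE: {se:.3f}")
```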
The shape of the likelihood function near the maximum point can be measured by the analytical
second partial derivatives and these can be closely approximated numerically by a computer.
Such numerical derivatives are important in complicated problems where the log-likelihood
exists in 20-60 dimensions (i.e., has 20-60 unknown parameters).
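In one dimension this is easy to sketch: a central finite difference approximates the second derivative of the log-likelihood at $\hat{p}$, and the negative inverse of that curvature approximates the sampling variance (shown here for the n = 11, y = 7 example; real problems use full numerical Hessians):

```python
import numpy as np

n, y = 11, 7
p_hat = y / n

def log_lik(p):
    return y * np.log(p) + (n - y) * np.log(1 - p)

# Central finite-difference approximation to the second derivative at the MLE
h = 1e-4
d2 = (log_lik(p_hat + h) - 2 * log_lik(p_hat) + log_lik(p_hat - h)) / h**2

var_hat = -1.0 / d2        # inverse of the negative curvature
print(np.sqrt(var_hat))    # approx 0.145, matching sqrt(p_hat * (1 - p_hat) / n)
```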
[Figure: the likelihood (top panel) and log-likelihood (bottom panel) as functions of theta, plotted over theta = 0.75 to 1.]
The standard, analytical method of finding the MLEs is to take the first partial derivatives of the
log-likelihood with respect to each parameter in the model. For example:
$$\frac{\partial \ln(\mathcal{L}(p))}{\partial p} = \frac{11}{p} - \frac{5}{1-p} \qquad (n = 16)$$

Set to zero:

$$\frac{\partial \ln(\mathcal{L}(p))}{\partial p} = 0, \qquad \frac{11}{p} - \frac{5}{1-p} = 0$$
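Solving gives the MLE for this example:

$$\frac{11}{p} = \frac{5}{1-p} \;\Longrightarrow\; 11(1-p) = 5p \;\Longrightarrow\; \hat{p} = \frac{11}{16} = 0.6875.$$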
More generally, with K unknown parameters $\theta_1, \theta_2, \ldots, \theta_K$, the MLEs are found by solving the system of equations

$$\frac{\partial \log(\mathcal{L})}{\partial \theta_1} = 0, \qquad \frac{\partial \log(\mathcal{L})}{\partial \theta_2} = 0, \qquad \ldots, \qquad \frac{\partial \log(\mathcal{L})}{\partial \theta_K} = 0.$$
The MLEs are almost always unique; in particular this is true of multinomial-based models.
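In practice, when K > 1 such systems are rarely solved by hand; a general-purpose numerical optimizer is used. A minimal sketch, assuming a hypothetical pair of independent binomial samples ($y_1 = 7$ of $n_1 = 11$ and $y_2 = 11$ of $n_2 = 16$) with separate parameters $p_1$ and $p_2$:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: two independent binomial samples, (n, y) pairs
data = [(11, 7), (16, 11)]

def neg_log_lik(theta):
    """Negative log-likelihood for two independent binomial parameters."""
    return -sum(y * np.log(p) + (n - y) * np.log(1 - p)
                for (n, y), p in zip(data, theta))

result = minimize(neg_log_lik, x0=[0.5, 0.5], bounds=[(1e-6, 1 - 1e-6)] * 2)
print(result.x)   # approximately [0.636, 0.688] = [7/11, 11/16]
```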
Sampling variances and covariances of the MLEs are computed from the log-likelihood,
based on curvature at the maximum. Actual formulae involve second mixed-partial derivatives of
the log-likelihood, hence quantities like
$$\frac{\partial^2 \log(\mathcal{L})}{\partial \theta_1\,\partial \theta_1} \qquad\text{and}\qquad \frac{\partial^2 \log(\mathcal{L})}{\partial \theta_1\,\partial \theta_2},$$

evaluated at the MLEs. More generally, the quantities involved are the negative second partial derivatives

$$-\frac{\partial^2 \log(\mathcal{L})}{\partial \theta_i\,\partial \theta_i} \qquad\text{and}\qquad -\frac{\partial^2 \ln(\mathcal{L})}{\partial \theta_i\,\partial \theta_j},$$

evaluated at the MLEs.
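For the single-parameter binomial model this curvature can be computed in closed form, and it reproduces the familiar variance estimator:

$$-\frac{\partial^2 \log \mathcal{L}(p)}{\partial p^2}\bigg|_{p=\hat{p}} = \frac{y}{\hat{p}^2} + \frac{n-y}{(1-\hat{p})^2} = \frac{n}{\hat{p}(1-\hat{p})}, \qquad \widehat{\operatorname{var}}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n}.$$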
The use of log-likelihood functions (rather than likelihood functions) is deeply rooted in the nature of likelihood theory. Note also that likelihood ratio test (LRT) theory leads to tests which basically always involve taking $-2 \times$ (log-likelihood evaluated at the MLEs).
A related quantity is the deviance, defined as

$$\text{Deviance} = -2\left[\log \mathcal{L}(\hat{\theta}) - \log \mathcal{L}(\hat{\theta}_{\text{sat}})\right],$$

evaluated at the MLEs for some model. Here, the first term is the log-likelihood, evaluated at its maximum point, for the model in question, and the second term is the log-likelihood, evaluated at its maximum point, for the saturated model. The meaning of a saturated model will become clear in the following material; basically, in the multinomial models, it is a model with as many parameters as cells. This final term in the deviance can often be dropped, as it is often a constant across models.
The deviance for the saturated model $\equiv 0$. Deviance, like information, is additive. The deviance is approximately $\chi^2$ distributed with df = (number of cells) $- K$ and is thus useful in examining the goodness-of-fit of a model. There are situations where use of the deviance in this way will not give correct results. MARK outputs the deviance as a measure of model fit, and this is often very useful.
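A minimal sketch of the deviance calculation, assuming hypothetical two-group binomial data ($y_1 = 7$ of $n_1 = 11$, $y_2 = 11$ of $n_2 = 16$); the saturated model fits a separate p to each group (as many parameters as cells), while the reduced model fits a single common p:

```python
import numpy as np

data = [(11, 7), (16, 11)]   # hypothetical (n, y) pairs for two groups

def log_lik(p, n, y):
    return y * np.log(p) + (n - y) * np.log(1 - p)

# Saturated model: one p per group, each at its own MLE y/n
ll_sat = sum(log_lik(y / n, n, y) for n, y in data)

# Reduced model: a single common p, MLE = pooled proportion
p_common = sum(y for _, y in data) / sum(n for n, _ in data)
ll_reduced = sum(log_lik(p_common, n, y) for n, y in data)

deviance = -2 * (ll_reduced - ll_sat)   # ~ chi-square, df = 2 cells - 1 parameter
print(deviance)                          # small value here: the two groups are similar
```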
[Figure: the deviance as a function of theta, plotted over theta = 0.75 to 1.]