
Note Set 3: Models, Parameters, and Likelihood

Padhraic Smyth,
Department of Computer Science
University of California, Irvine
January 2024

1 Introduction

This is a brief set of notes covering basic concepts related to likelihood and maximum likelihood. The goal
of this set of notes is to connect the types of probability models we have discussed in Notes 1 and 2 to
observed data. Essentially this involves two steps:

1. Construct a generative or forward model M with parameters θ of how data D can be generated. We
can think of this generative model as a stochastic simulator for the data, with parameters θ. We will
assume for now that M, the structure or functional form of the model, is known, but that the parameters θ are unknown¹. An example would be that the model M is a Gaussian (Normal) probability
density function with unknown parameters θ = {µ, σ²}.

2. Given the generative model for the data we then “work backwards” to make inferences about θ given
observed data D. This is the essence of probabilistic learning (and much of statistics): going from
observed data to inferences about unknown parameters that we are interested in, via a probabilistic
model. In this set of Notes we will focus on so-called point estimates of parameters θ, denoted by θ̂.
The idea is that this is our best guess, if forced to select a single number, of some true (but unknown)
θ.

2 Likelihood

We define likelihood as the probability of observed data D given a model M where the model has parameters
θ, i.e.,
L(θ) = P (D|θ, M )

• Likelihood is always defined relative to some model M . However, for our initial discussions at least,
we will often drop the explicit reference to M in discussions below and just implicitly assume that
there is some model M being conditioned on.
¹ Later in class we will discuss the situation where there are multiple candidate models M1, . . . , MK under consideration.


• We will refer to data sets as D. For 1-dimensional observations this will be a set of values {x1 , . . . , xn }.
For d-dimensional vector observations x we have D = {x1 , . . . , xn }, where xij is the jth component
of the ith observation, 1 ≤ j ≤ d, 1 ≤ i ≤ n. We can also think of D as a data matrix with n rows
indexed by i (each row is a data vector xi ), and with d columns (variables) indexed by j.

• Likelihood is viewed as a function of θ conditioned on a fixed observed data set D. We are interested
in how the likelihood changes as θ changes, where θ is usually real-valued. If a parameter θ1 has
higher likelihood L(θ1 ) than the likelihood of another parameter θ2 , then P (D|θ1 ) > P (D|θ2 ), i.e.,
the observed data is more likely given θ1 than θ2 .

• This leads naturally to the concept of maximum likelihood (discussed below), i.e., finding the θ value
that corresponds to the maximum of L(θ) (assuming a unique maximum exists).

• In defining the likelihood we can drop (ignore) any terms in p(D|θ) that don’t involve θ, such as
normalizing constants. What is usually important is the shape of the likelihood function, or the relative
value of the likelihood, rather than the actual value of the likelihood.

• The likelihood function will typically be quite “wide” when we have relatively little data, and will
“narrow” in shape as we get more data. (This is generally a good description of what happens for
simple models, but is not necessarily true for more complex ones).

• The likelihood function can be defined on vectors of parameters θ, rather than just a single scalar
parameter θ. For a parameter vector defined as θ = (θ1 , . . . , θp ), L(θ) is a scalar function of p
arguments. As with a multi-dimensional probability density function, we can think of the multi-
dimensional likelihood function as a “surface” (non-negative) defined over p dimensions.

• As an example, for a Gaussian density model p(x) for a one-dimensional continuous random variable X, the parameters are θ = {θ1, θ2} = {µ, σ²}, i.e., the unknown mean and variance. The likelihood L(θ) = L(µ, σ²) is a scalar function over the two-dimensional (µ, σ²) space. Note that we could define θ2 here as either σ or σ²; either is fine, but it turns out that σ² will make the maximum likelihood analysis somewhat easier to work with. It can also sometimes be convenient to work with reparametrizations such as log σ or 1/σ², depending on the context, rather than σ or σ² directly. (A short code sketch after this list illustrates such a two-dimensional likelihood surface.)

• The likelihood function can equally well be defined when the probability model is a distribution
P (D|θ) (e.g., for discrete random variables) or a probability density function p(D|θ) (for continuous
random variables), or for a combination of the two (e.g., p(D1 |D2 , θ1 )P (D2 |θ2 )) where D1 models
the variables that are real-valued using parameters θ1 , and D2 models the variables that are discrete-
valued with parameters θ2 .
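
To make the notion of a likelihood surface concrete, here is a minimal Python sketch (the simulated data, grid ranges, and random seed are arbitrary illustrative choices, not values from these notes) that evaluates the Gaussian log-likelihood on a grid of (µ, σ²) values and reports the grid point with the highest value:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=20)   # simulated data with true mu = 5, sigma^2 = 1

# Grid of candidate (mu, sigma^2) parameter values
mus = np.linspace(3.0, 7.0, 200)
sigma2s = np.linspace(0.2, 3.0, 200)
MU, S2 = np.meshgrid(mus, sigma2s)

# Gaussian log-likelihood: l(mu, sigma^2) = -(n/2) log(2 pi sigma^2) - sum_i (x_i - mu)^2 / (2 sigma^2)
n = x.size
sum_sq = ((x[:, None, None] - MU[None, :, :]) ** 2).sum(axis=0)
loglik = -0.5 * n * np.log(2 * np.pi * S2) - sum_sq / (2 * S2)

# The grid maximizer should land near the sample mean and the (biased) sample variance
i, j = np.unravel_index(np.argmax(loglik), loglik.shape)
print("grid maximum at mu =", MU[i, j], ", sigma^2 =", S2[i, j])
print("sample mean =", x.mean(), ", sample variance =", x.var())
```

Viewed this way, L(µ, σ²) is simply a non-negative surface over the two-dimensional parameter space, as described in the bullets above.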

Example 1: Binomial Likelihood: Consider tossing a coin with probability θ of heads and 1 − θ
of tails. This is the Bernoulli model. Now say we observe a sequence of tosses of the same coin.
This set of outcomes represents our data D, where D = {x1, . . . , xn} and xi ∈ {0, 1} represents the
outcome of the ith toss (e.g., with 1 corresponding to heads and 0 to tails).
In defining a likelihood, we need to specify a probability model for multiple samples {x1, . . . , xn}
rather than just for a single sample xi . The standard assumption for coin-tossing (and many other
phenomena that don’t exhibit any “memory” in how individual data points are generated) is to assume
that each observation xi is conditionally independent of the other observations given the parameter
θ, i.e.,
\[ L(\theta) = P(D \mid \theta) = P(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) \]

where P (xi |θ) = θ for xi = 1 and P (xi |θ) = 1 − θ for xi = 0. This particular “coin-tossing”
model, combining a Bernoulli with conditional independence of the xi ’s is referred to as a Binomial
likelihood.
The conditional independence assumption on the xi ’s in the likelihood definition is sometimes
(loosely) also referred to as the IID assumption (independent and identically distributed). The
notion of exchangeability in statistics is essentially the same idea. Note this assumption allows for
a tremendous simplification in our model: instead of dealing with the joint P(x1, . . . , xn|θ) we can
work with the individual terms P(xi|θ). Of course we have to be careful that this is a reasonable
assumption. It is certainly a reasonable assumption in the case where the xi ’s are coin tosses, or
perhaps (and closer to the real-world) the case where Xi represents the ith Web surfer to arrive at an
ecommerce Web site and xi is a binary value indicating whether the Web surfer makes a purchase
or not. But in other applications the xi ’s may have some dependence on each other, e.g., if the xi ’s
represented the value of the stock market on different days or words in text. If such dependence was
thought to exist then it should be modeled (see example below).

Continuing on with our binomial likelihood example, we can write


\[ L(\theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = \theta^{r}\,(1-\theta)^{n-r} \]

where r is the number of “heads” observed and n − r is the number of tails. Note that we did not
include the usual combinatorial (binomial) term \(\binom{n}{r}\) in front of the expression above, which counts
the number of different ways that r heads could occur in n trials, since this term does not involve θ.

Figure 1 shows two examples of the binomial likelihood function for different data sets. In
Figure 1(a) we have r = 3 and n = 10. The likelihood function is relatively wide and is maximized
at 3/10 = 0.3, which makes sense intuitively. In Figure 1(b) we have r = 30 and n = 100:
here the likelihood is much narrower, as we might expect, and as a result the range of plausible values
for θ is much narrower after seeing 100 observations compared with just 10.


Figure 1: Binomial likelihood for (a) r = 3, n = 10, and (b) r = 30, n = 100.
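
As a rough illustration of how these curves arise (a sketch; the grid resolution is an arbitrary choice), the binomial likelihood L(θ) = θ^r (1 − θ)^(n−r) can be evaluated on a grid of θ values, recovering the maxima shown in Figure 1:

```python
import numpy as np

def binomial_likelihood(theta, r, n):
    """Binomial likelihood, ignoring the combinatorial term that does not involve theta."""
    return theta**r * (1.0 - theta)**(n - r)

thetas = np.linspace(0.0, 1.0, 1001)
for r, n in [(3, 10), (30, 100)]:
    L = binomial_likelihood(thetas, r, n)
    theta_ml = thetas[np.argmax(L)]
    print(f"r={r}, n={n}: L(theta) is maximized at theta = {theta_ml:.2f} (expected {r/n:.2f})")
```

Note how small the raw likelihood values become for n = 100; this is one practical reason for working with the log-likelihood, as discussed below.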

An interesting side-note with the example above is that conditional independence plays a key role in
our definition of likelihood in the binomial model. In fact the xi ’s are not marginally independent, but only
conditionally independent. Why? If θ is unknown (remember that θ is the probability of heads) then the xi ’s
carry information about each other. As an example, say θ = 0.999 but we don’t know this. So we will tend
to see a lot of heads showing up and very rarely a tail showing up. Having seen such a sequence of xi ’s with
many more heads than tails, this data is informative about the next coin toss. Of course, if someone were to
tell us the true value of θ then the previous xi values have no information at all in terms of predicting the
next x value, since we have all the information we need in θ.

Example 2: Likelihood with Memory: In the previous binomial example, if instead of modeling
coin tosses we were modeling the occurrence of rain on day i in Irvine (xi indicates whether it rains
or not on day i), then we would want to consider abandoning the IID assumption and introducing
some dependence among the xi values (since we will tend to get “runs” of wet days and dry days).
For example, we could make a Markov assumption (Note Set 2) and assume that xi+1 on day i + 1 is
conditionally independent of x values on days i − 1, i − 2, . . . , 1, given the value of xi . Accordingly
the likelihood would be defined as:
\[ L(\theta) = P(x_1, \ldots, x_n \mid \theta) = P(x_1 \mid \theta_1) \cdot \prod_{i=1}^{n-1} P(x_{i+1} \mid x_i, \theta_2) \]

where θ1 = P(x1 = 1) and θ2 is a parameter vector representing a 2 × 2 Markov transition matrix of
parameters (the conditional probabilities of rain or not-rain, conditioned on rain or not-rain the day
before).
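
The following sketch (purely illustrative; the parameter values and the example sequence are made up for demonstration) shows how this Markov likelihood can be evaluated for a binary rain/no-rain sequence:

```python
import numpy as np

# Illustrative (assumed) parameter values
theta1 = 0.3                            # P(x_1 = 1): probability of rain on day 1
theta2 = np.array([[0.8, 0.2],          # row k gives P(x_{i+1} | x_i = k) as [P(0 | k), P(1 | k)]
                   [0.4, 0.6]])

x = [0, 0, 1, 1, 1, 0, 0, 1]            # an observed binary sequence (1 = rain on day i)

# log L(theta) = log P(x_1 | theta_1) + sum_{i=1}^{n-1} log P(x_{i+1} | x_i, theta_2)
loglik = np.log(theta1 if x[0] == 1 else 1.0 - theta1)
for prev, curr in zip(x[:-1], x[1:]):
    loglik += np.log(theta2[prev, curr])

print("Markov log-likelihood:", loglik)
```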

Example 3: Gaussian Likelihood: Consider a data set D = {x1 , . . . , xn } where the xi ’s are real-
valued scalars and are samples from a random variable X. Assume we wish to model the xi values
with a Gaussian density function. The Gaussian has two parameters µ and σ 2 . Treating these two
parameters as unknown, and referring to them as θ1 = µ and θ2 = σ 2 we can write the likelihood as:
\[ p(D \mid \theta) = p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta) \]

where here we make the assumption that the xi ’s are conditionally independent given θ (for a real
problem we would want to convince ourselves that this is reasonable to do).

The individual terms in our likelihood are by definition Gaussian density functions, each
evaluated at xi :
\[ p(x_i \mid \theta) = p(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right). \]
Taking the product of these terms, and then taking the log (to the base e for convenience) we arrive
at the log-likelihood
\[ \log L(\theta) = l(\theta) = -\frac{n}{2}\log(2\pi\theta_2) - \frac{1}{2\theta_2}\sum_{i=1}^{n} (x_i - \theta_1)^2. \]

Imagine that θ2 = σ 2 is fixed (assume for example that it is known). Then l(θ1 ) (viewed as a function
of θ1 only) is proportional to a 2nd order polynomial involving xi ’s and θ1 , i.e.,
\[ l(\theta_1) \propto -\sum_{i=1}^{n} (x_i - \theta_1)^2 \]

from which we see that l(θ1) is larger when the sum of squared deviations ∑(xi − θ1)² is smaller, i.e., l(θ1) will be larger for
values of θ1 = µ that are closer to the xi's on average (since this is a sum of squared errors between the
observed set of xi values and a single scalar θ1 = µ).

Figure 2 shows some examples of the Gaussian log-likelihood function l(µ) (treating µ as
unknown, but assuming that σ² is known) plotted for different sample sizes, where the
data were simulated from a known Gaussian density function with µ = 5 and σ² = 1. Again, as n
increases we see that the likelihood narrows in around the true value of µ = 5.

Figure 2: Log-likelihood l(µ) as a function of the parameter θ = µ for four different sample sizes (1, 3, 20, and 2000 simulated data points), with data simulated
from a Gaussian with true µ = 5 and σ = 1 (simulated data shown as dots horizontally at the top of each
plot).
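
A sketch of the computation behind Figure 2 (illustrative; it assumes σ² = 1 is known and uses an arbitrary random seed): the log-likelihood l(µ) is evaluated on a grid of µ values for increasing sample sizes, and its maximizer concentrates around the true value µ = 5 as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
mus = np.linspace(0.0, 10.0, 1001)
sigma2 = 1.0                                    # variance assumed known

for n in [1, 3, 20, 2000]:
    x = rng.normal(loc=5.0, scale=np.sqrt(sigma2), size=n)
    # l(mu) = -(n/2) log(2 pi sigma^2) - (1 / (2 sigma^2)) sum_i (x_i - mu)^2
    loglik = (-0.5 * n * np.log(2 * np.pi * sigma2)
              - ((x[:, None] - mus[None, :]) ** 2).sum(axis=0) / (2 * sigma2))
    print(f"n = {n:5d}: log-likelihood maximized at mu = {mus[np.argmax(loglik)]:.2f}")
```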

3 The Principle of Maximum Likelihood

The principle of maximum likelihood follows naturally from what we have discussed above, namely that if
we had to summarize our data by selecting only a single parameter value θ̂, and if we only have the observed
data and the likelihood available and no other information, then it is reasonable to argue that the value of θ
that we should select is the one that maximizes the likelihood L(θ). Or, more formally:
\[ \hat{\theta}_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} P(D \mid \theta). \]
The subscript “ML” denotes “maximum likelihood” since we will later discuss other types of estimates for
which we will use other subscripts. The “hat” notation, θ̂, denotes an estimate of some unknown (true)
quantity θ.

Example 4: Maximum Likelihood Estimate for the Binomial Model: From earlier, the binomial
likelihood can be written as:
\[ L(\theta) = P(x_1, \ldots, x_n \mid \theta) = \theta^{r}\,(1-\theta)^{n-r} \]

where r is the number of successes in n trials. We can easily find the maximum likelihood estimate
of θ as follows. First let's work with the log-likelihood, since the log-likelihood is a little easier to work with.ᵃ

\[ \log L(\theta) = l(\theta) = r \log\theta + (n - r)\log(1 - \theta). \]
A necessary condition to maximize l(θ) is that dl(θ)/dθ = 0, i.e., this condition must be satisfied at
θ = θ̂ML. Thus, we calculate the derivative with respect to θ and set it to 0, i.e.,

\[ \frac{d}{d\theta} l(\theta) = \frac{r}{\theta} - \frac{n-r}{1-\theta} = 0, \quad \text{at } \theta = \hat{\theta}_{ML}, \]
and after some rearrangement of terms we get

\[ \hat{\theta}_{ML} = \frac{r}{n}, \]
i.e., the standard intuitive frequency-based estimate for the probability of success given r successes
in n trials. At this point it seems like we may not have gained very much with our likelihood-based
framework since we arrived back at the “obvious” answer! However, the power of the likelihood
(and related) approaches is that we can generalize to much more complex problems where there is
no obvious “intuitive” estimator for a parameter θ. And if we think about it we should have expected
to get this estimate for θ̂M L a priori. Had we gotten any other estimate we might have good cause for
concern that our likelihood-based procedures did not match our intuition.
ᵃ Note that the value of θ that maximizes the log-likelihood is the same as the value of θ that maximizes the likelihood,
since log is a monotonically increasing function.
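
As a quick sanity check on the closed-form result (a sketch; it assumes SciPy is available and uses a bounded scalar optimizer), θ̂ML = r/n can be compared against a generic numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

r, n = 3, 10

def neg_loglik(theta):
    # negative binomial log-likelihood: -(r log theta + (n - r) log(1 - theta))
    return -(r * np.log(theta) + (n - r) * np.log(1.0 - theta))

result = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", result.x, " closed form r/n:", r / n)
```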

Example 5: Maximum Likelihood Estimate for the Gaussian IID Model:


Consider the case where σ is known and µ is unknown. From Example 3 earlier we saw that for the
Gaussian IID model we can write:
\[ l(\theta) \propto -\sum_{i=1}^{n} (x_i - \theta)^2 \]

where θ = µ is the unknown mean parameter. To maximize this as a function of θ we can use simple
calculus, i.e., differentiate the right-hand side above with respect to θ, set it to 0, and solve for θ. (Left
as an exercise for the reader.)
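
If you work through the exercise, one way to check your answer numerically (a sketch; the simulated data and grid are arbitrary choices) is to evaluate l(θ) ∝ −∑(xi − θ)² on a fine grid of θ values and compare the grid maximizer with the sample mean of the data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=1.0, size=50)     # simulated data

thetas = np.linspace(0.0, 10.0, 10001)
loglik = -((x[:, None] - thetas[None, :]) ** 2).sum(axis=0)   # l(theta), up to constants

print("grid maximizer:", thetas[np.argmax(loglik)])
print("sample mean:   ", x.mean())
```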

Example 6: Maximum Likelihood Estimation with Two Noisy Data Sources:


There are many problems in scientific data analysis where we need to combine multiple different data sets
to make predictions about a single quantity of interest. The following example discusses such a problem and
also illustrates a situation where the maximum likelihood approach leads to an estimate that is not obvious,
i.e., the equation defining θ̂M L could not easily be guessed, at least not until we have an idea what the correct
approach is.

Consider the following scenario. We are working with an astronomer monitoring a distant object in the
sky with two different CCD cameras connected to 2 different telescopes in different parts of the world.
Assume in this simplified example that each camera produces noisy estimates of the object's true brightness:
we assume that there is a true constant brightness µ for the object but our cameras only get noisy
measurements x1, x2, . . . (our astronomer can get multiple xi measurements from each camera over multiple
nights).

Say that camera 1 produces measurements that have a Gaussian distribution with mean µ and variance
σ1², and that camera 2 produces measurements with mean µ and variance σ2². We are assuming that the true
mean of the measurements from each individual camera is the same as the true brightness, but the variances
are different, e.g., if σ1² is much smaller than σ2² this could be because camera 1 is connected to a much more
accurate (newer, stronger) telescope. We will also assume (for simplicity) that the two variances are known
(but that µ is unknown), which is not unreasonable, since astronomers are often very good at coming up
with techniques to calibrate the noise in their instruments.

The question is how to estimate µ given data D consisting of n1 measurements from camera 1 and n2
measurements from camera 2. A naive estimate of µ is simply the average over all of the measurements,
i.e.,
\[ \hat{\mu}_{naive} = \frac{1}{n_1 + n_2} \sum_{i} x_i \]

where the sum ranges over all of the measurements. But in constructing this simple estimate we are
ignoring the fact that one camera is more accurate than the other, i.e., σ1² ≠ σ2². The more different these two
variances are, the more important it may be to account for measurements from the two data sets differently.
In the extreme case, for example, we might have only 1 measurement in D1 from camera 1 and (say) 1000
measurements in D2 from camera 2, but say that camera 1 has 10 times less variance than camera 2. In
this case how should we combine the data to arrive at an estimate of µ? Intuitively we can imagine that
some form of weighting scheme is probably appropriate, where we downweight measurements from the
more noisy camera and upweight measurements from the more accurate one. But it's not obvious what these
weights should be.
This is the type of situation where formal probabilistic modeling (such as likelihood-based methods) can
be very useful. So let's see what the maximum likelihood estimator for µ is in this situation.

\[
\begin{aligned}
L(\mu) &= p(D \mid \mu) \\
&= p(D_1, D_2 \mid \mu) \\
&= p(D_1 \mid \mu)\, p(D_2 \mid \mu) \\
&= \prod_{i=1}^{n_1} f(x_i; \mu, \sigma_1^2) \cdot \prod_{j=1}^{n_2} f(x_j; \mu, \sigma_2^2)
\end{aligned}
\]

where the first product is over the n1 data points in data set D1 and the second product is over the n2 data
points in data set D2. The notation f(xi; µ, σ1²) denotes a Gaussian (Normal) density function evaluated at
xi with mean µ and variance σ1². We have also assumed IID measurements, which may be reasonable, for
example, if the measurements were taken relatively far apart in time (e.g., on different nights). Taking logs
and dropping terms that don’t involve µ, we get
\[ l(\mu) = -\frac{1}{2\sigma_1^2}\sum_{i=1}^{n_1} (x_i - \mu)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{n_2} (x_j - \mu)^2. \]

Taking the derivative with respect to µ yields


\[ \frac{d}{d\mu} l(\mu) = \frac{1}{\sigma_1^2}\sum_{i=1}^{n_1} (x_i - \mu) + \frac{1}{\sigma_2^2}\sum_{j=1}^{n_2} (x_j - \mu). \]

Setting this expression to 0 and rearranging terms, we get

\[ \hat{\mu}_{ML}\left(\frac{n_1}{\sigma_1^2} + \frac{n_2}{\sigma_2^2}\right) = \frac{1}{\sigma_1^2}\sum_{i=1}^{n_1} x_i + \frac{1}{\sigma_2^2}\sum_{j=1}^{n_2} x_j. \]

Multiplying through by σ1²,

\[ \hat{\mu}_{ML}\left(n_1 + n_2\,\frac{\sigma_1^2}{\sigma_2^2}\right) = \sum_{i=1}^{n_1} x_i + \frac{\sigma_1^2}{\sigma_2^2}\sum_{j=1}^{n_2} x_j, \]

yielding:

\[ \hat{\mu}_{ML} = \left(n_1 + n_2\,\frac{\sigma_1^2}{\sigma_2^2}\right)^{-1} \left(\sum_{i=1}^{n_1} x_i + \frac{\sigma_1^2}{\sigma_2^2}\sum_{j=1}^{n_2} x_j\right). \]

We see that the relative weighting of the two data sets is controlled by the ratio r = σ1²/σ2². If r = 1 (same
variance in both cameras) we get the standard "unweighted" solution, i.e., the maximum likelihood estimate
of µ corresponds to the empirical average of all of the data points (as we would expect). If σ1² < σ2² (so the
ratio r < 1) then the data points from camera 2 (with higher variance and more noise) are essentially being
downweighted by the factor r = σ1²/σ2². Conversely, if σ1² > σ2² and the measurements from camera 2 are less
noisy, then camera 2's measurements are upweighted by the factor r > 1.
We might have guessed at a similar solution in an ad hoc manner, but the likelihood-based approach
provides a clear and principled way to derive estimators, and it can be particularly useful in problems
much more complex than this example. For example, imagine K cameras, with different (possibly
non-Gaussian) noise models for each and with various dependencies among the cameras. The noise
characteristics for some cameras could be unknown but nonetheless may be known to be inter-dependent in some
manner, e.g., two cameras have unknown variances but we know that the first camera has twice the variance
of the other. Maximum likelihood gives us a principled way to address such problems.
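
A brief numerical sketch of this example (the variances, sample sizes, true brightness, and seed are made-up illustrative values) compares the naive average with the variance-weighted maximum likelihood estimate; the formula used below is algebraically equivalent to the expression derived above, just written with each sum divided directly by its variance:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = 10.0                               # true (unknown) brightness
sigma1, sigma2 = 0.5, 2.0                    # camera 1 is much less noisy than camera 2
n1, n2 = 5, 1000

x1 = rng.normal(mu_true, sigma1, size=n1)    # camera 1 measurements
x2 = rng.normal(mu_true, sigma2, size=n2)    # camera 2 measurements

# Naive estimate: plain average of all measurements, ignoring the different variances
mu_naive = np.concatenate([x1, x2]).mean()

# Maximum likelihood estimate with known variances:
# mu_ml = (sum_i x_i / sigma1^2 + sum_j x_j / sigma2^2) / (n1 / sigma1^2 + n2 / sigma2^2)
num = x1.sum() / sigma1**2 + x2.sum() / sigma2**2
den = n1 / sigma1**2 + n2 / sigma2**2
mu_ml = num / den

print("naive estimate:", mu_naive, " ML estimate:", mu_ml, " true mu:", mu_true)
```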

4 Maximum Likelihood for Graphical Models

4.1 Basic Concepts: Two Random Variables

Consider two discrete random variables A and B each taking M values, with possible values a and b.
Assume we already know the marginal probabilities p(a) for variable A and we wish to learn the condi-
tional probabilities P (b|a). We will treat these unknown conditional probabilities as parameters θ. We
can separate θ into M different sets of conditional probability parameters, one for each value of A, i.e.,
θ = (θ1 , . . . , θM ). Each set of parameters θk contains M conditional probabilities that sum to 1, each
conditioned on a particular value A = k, i.e., θk = (θk,1 , . . . , θk,M ) where θk,l = P (B = l|A = k).
Sidenote on notation: we will use notation below such as P (B = l|A = k, θk,l ), which you can think of
as saying “if we know A = k and we know the value of θk,l , then our conditional probability for B = l given
A = k is itself the parameter θk,l .” This notation, where we put parameters like θk,l on the conditioning side
of a conditional probability, is not very elegant, but it is convenient and useful in general (and will be particularly
useful when we discuss Bayesian learning later on).
Now say we have an observed data set D = {(ai , bi )}, 1 ≤ i ≤ N , i.e., a set of N observations, with ai
and bi denoting the value of A and the value of B respectively for each pair of observations. For example i
might refer to an individual and A and B might be two discrete variables or attributes that we can measure
for any individual.
If we assume the observations are IID conditioned on the unknown parameters θ, we can write the
log-likelihood as
\[
\begin{aligned}
\log L(\theta) &= \sum_{i=1}^{N} \log P(a_i, b_i \mid \theta) \\
&= \sum_{i=1}^{N} \Big( \log P(b_i \mid a_i, \theta) + \log P(a_i) \Big)
\end{aligned}
\]

We can drop the terms log P (ai ) from the likelihood since they are assumed here to be known and do
not depend on θ. We can also simplify the log-likelihood expression by writing it out as a sum over the
parameters for each of the different conditional probability vectors θk for each value of A:
\[
\begin{aligned}
\log L(\theta) &= \sum_{i=1}^{N} \log P(b_i \mid a_i, \theta) \\
&= \sum_{k=1}^{M} \left( \sum_{i=1}^{N_k} \log P(b_i \mid a_i = k, \theta_k) \right) \\
&= \sum_{k=1}^{M} \log L(\theta_k)
\end{aligned}
\]

where Nk is the number of times that a = k occurs in the data (here we have grouped the likelihood terms
in correspondence with the M values of A). Since each of the terms log L(θk ) involves different sets of
parameters θk , we can maximize each one separately, i.e., estimate the maximum likelihood parameters
(conditional probabilities) for each of the M different values of A. Thus, we have
\[
\begin{aligned}
\log L(\theta_k) &= \sum_{i=1}^{N_k} \log P(b_i \mid a_i = k, \theta_k) \\
&= \sum_{l=1}^{M} \left( \sum_{i\,:\,b_i = l} \log P(b_i = l \mid a_i = k, \theta_k) \right) \quad \text{(by grouping terms with } b_i = l\text{)} \\
&= \sum_{l=1}^{M} r_{k,l} \log P(b_i = l \mid a_i = k, \theta_k) \\
&= \sum_{l=1}^{M} r_{k,l} \log \theta_{k,l}
\end{aligned}
\]

where rk,l is a count of the number of times that a = k and b = l in the data D, and where ∑_{l=1}^{M} rk,l = Nk.
This is the same form as the multinomial problem (See lectures and/or homework 2). If we maximize this
for each θk,l , the solution is
\[ \hat{\theta}_{k,l}^{ML} = \frac{r_{k,l}}{N_k}, \qquad 1 \le l, k \le M, \]
i.e., the maximum likelihood estimate of each conditional probability θk,l = P (B = l|A = k) is the number
of times rk,l that A = k and B = l occur in the data, divided by the number of times Nk that A = k occurs,
i.e., a standard frequency-based estimate for a conditional probability.

If we now have a more complicated graphical model, e.g., A → B → C, we can again factorize
the likelihood into terms that only involve local conditional probability tables, with a local table for each
variable conditioned on its parents. The maximum likelihood estimates of these conditional probabilities
are the “local” frequency based estimates of how often both the parent-child combination of values occurs
divided by the number of times the parent value occurs (see subsection below for details).
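
A small sketch of this frequency-based estimate (the synthetic data-generating mechanism below is invented purely to produce example counts, and it assumes every value of A appears at least once in the data):

```python
import numpy as np

M = 3
rng = np.random.default_rng(4)
a = rng.integers(0, M, size=500)                  # observed values of A
b = (a + rng.integers(0, 2, size=500)) % M        # observed values of B (depend on A)

# Counts r_{k,l} = number of times (A = k, B = l), and N_k = number of times A = k
counts = np.zeros((M, M))
for ai, bi in zip(a, b):
    counts[ai, bi] += 1
N_k = counts.sum(axis=1, keepdims=True)

# Maximum likelihood CPT: theta_hat[k, l] = r_{k,l} / N_k = estimated P(B = l | A = k)
theta_hat = counts / N_k
print(theta_hat)
print("rows sum to 1:", np.allclose(theta_hat.sum(axis=1), 1.0))
```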

4.2 More General Graphical Models (Optional Reading)

We can generalize the ideas above to any arbitrary directed graphical model. Assume we have a set of d
random variables where we know the structure of an associated graphical model, i.e., for each variable Xj
we know the parent set pa(Xj ) in the graph. We also have available an N × d data matrix D consisting of
independent random samples from the joint distribution P (x1 , . . . , xd ), where xij is the observed value for
variable Xj in the ith random sample. For simplicity assume that each variable Xj is discrete and takes M
values. Given the structure of the graphical model we would like to use the data D to estimate the CPTs for
the model.

The parameters θ can in general be defined as the set θ = {θj } where the index j = 1, . . . , d, i.e., j
ranges over the variables X1 , . . . , Xd (note that this is a little different from the notation used earlier in this
section). Each θj is the set of relevant parameters for variable Xj , or more specifically, the set of parameters
defining the CPT P (xj |pa(Xj )). (Note again that when we say “parameters” here we mean conditional
probabilities: we refer to them as parameters since they are unknown and we wish to estimate them from
data).

It is straightforward to show that the overall likelihood can be decomposed into separate local likelihood
terms, one per variable Xj , as follows:
\[
\begin{aligned}
L(\theta) = P(D \mid \theta) &= \prod_{i=1}^{N} P(x_i \mid \theta) \\
&= \prod_{i=1}^{N} \prod_{j=1}^{d} P(x_{ij} \mid \mathrm{pa}(X_j)_i, \theta_j) \\
&= \prod_{j=1}^{d} \prod_{i=1}^{N} P(x_{ij} \mid \mathrm{pa}(X_j)_i, \theta_j) \\
&= \prod_{j=1}^{d} L(\theta_j)
\end{aligned}
\]

where \(L(\theta_j) = \prod_{i=1}^{N} P(x_{ij} \mid \mathrm{pa}(X_j)_i, \theta_j)\) is the part of the likelihood only involving parameters θj for
variable Xj. (Here pa(Xj)i indicates the value(s) of the parents of Xj for the ith data point xi.) Thus, the
total likelihood decomposes into local likelihoods per node (or per variable).

We can write this in log-likelihood form as:


\[ \log L(\theta) = \sum_{j=1}^{d} \log L(\theta_j) \]

where

\[ \log L(\theta_j) = \sum_{i=1}^{N} \log P(x_{ij} \mid \mathrm{pa}(X_j)_i, \theta_j). \]

We can maximize the full log-likelihood by independently maximizing each local log-likelihood log L(θj )
as long as the θj parameters for each variable Xj are not constrained or related (if they are then we would
need to do a joint maximization over the different terms). Thus, we have reduced the problem of finding the
maximum likelihood parameters for a directed graphical model to d separate problems, where each problem
corresponds to finding the maximum likelihood parameters for the conditional probability tables for child
node Xj given its parents in the graphical model.
The parameters θj can be defined as a set θj = {θj,k}, where θj,k = {θj,k,l} and each θj,k,l =
P(xj = l|pa(Xj) = k), i.e., these parameters are the conditional probabilities of Xj taking different values
l, 1 ≤ l ≤ M , conditioned on a particular set of values k for the parent nodes. In the earlier subsection k
ranged over M values (the values of variable A): but in the general case a node might have multiple parents,
so k in general will range over all possible combinations of values of the parents.
Each local log-likelihood log L(θj ) can be written as
\[
\begin{aligned}
\log L(\theta_j) &= \sum_{i=1}^{N} \log P(x_{ij} \mid \mathrm{pa}(X_j)_i, \theta_j) \\
&= \sum_{k} \sum_{l} r_{k,l} \log P(x_j = l \mid \mathrm{pa}(X_j) = k, \theta_{j,k})
\end{aligned}
\]

where the two sums are over all possible values l and k of the child and parent variable(s) respectively, and
where rk,l is the number of times that those particular combinations of parent-child values occur in the data.
The sum over k has M^|pa(Xj)| different terms, where |pa(Xj)| is the number of parents of variable Xj (for
the special case in the model where all variables take the same number of values M ). The innermost sum l
is over the M possible values that each variable xj can take, conditioned on some setting k of the values of
the parent variables pa(Xj ).
It follows from the equation above that log L(θj ) can be further broken down as sums of local likelihood
terms
\[ \log L(\theta_j) = \sum_{k} \log L(\theta_{j,k}) \]
with a log-likelihood term log L(θj,k ) for each set of parameters θj,k , where each term log L(θj,k ) can be
maximized separately from all the other terms. From a maximum likelihood perspective, each of these terms
log L(θj,k ) corresponds (in general) to a different conditional distribution θj,k = {θj,k,l } with probabilities
that sum to 1 (over l), and the maximum likelihood estimate for each such distribution (corresponding to a
“variable and parent value” combination) is the standard multinomial estimate from earlier, i.e.,
\[ \hat{\theta}_{j,k,l} = \frac{r_{j,k,l}}{N_k} \]
where Nk is the number of times the specific parent values corresponding to k occur in the data and rj,k,l is
the number of times that variable Xj takes value l and the parents pa(Xj) = k, with j = 1, . . . , d, l =
1, . . . , M, and k = 1, . . . , M^|pa(Xj)|.
In this manner our maximum likelihood problem for the graphical model reduces to M^|pa(Xj)| different
maximum likelihood estimations of conditional distributions, repeated for each variable Xj in the model.
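
Finally, a sketch of the same local, frequency-based estimation applied to a chain A → B → C (the chain structure and synthetic data are invented for demonstration; each variable is binary here):

```python
import numpy as np
from collections import defaultdict

M, N = 2, 1000
rng = np.random.default_rng(5)
A = rng.integers(0, M, size=N)
B = (A + rng.integers(0, 2, size=N)) % M
C = (B + rng.integers(0, 2, size=N)) % M
D = np.stack([A, B, C], axis=1)          # N x d data matrix for the chain A -> B -> C

parents = {0: None, 1: 0, 2: 1}          # parent index of each variable (None = no parent)

# Local ML estimates: for each child j and parent configuration k,
# theta_hat_{j,k,l} = r_{j,k,l} / N_k  (count of child value l with parents = k, divided by N_k)
cpts = {}
for j, pa in parents.items():
    counts = defaultdict(lambda: np.zeros(M))
    for row in D:
        k = () if pa is None else (int(row[pa]),)   # parent configuration (empty if no parent)
        counts[k][row[j]] += 1
    cpts[j] = {k: c / c.sum() for k, c in counts.items()}

print(cpts[1])   # estimated CPT P(B | A)
```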
