
Note Set 3: Models, Parameters, and Likelihood

Padhraic Smyth,
Department of Computer Science
University of California, Irvine
January 2024

1 Introduction

This is a brief set of notes covering basic concepts related to likelihood and maximum likelihood. The goal
of this set of notes is to connect the types of probability models we have discussed in Notes 1 and 2 to
observed data. Essentially this involves two steps:

1. Construct a generative or forward model M with parameters θ of how data D can be generated. We
can think of this generative model as a stochastic simulator for the data, with parameters θ. We will
assume for now that M, the structure or functional form of the model, is known, but that the parameters θ are unknown¹. An example would be that the model M is a Gaussian (Normal) probability
density function with unknown parameters θ = {µ, σ²}.

2. Given the generative model for the data we then “work backwards” to make inferences about θ given
observed data D. This is the essence of probabilistic learning (and much of statistics): going from
observed data to inferences about unknown parameters that we are interested in, via a probabilistic
model. In this set of Notes we will focus on so-called point estimates of parameters θ, denoted by θ̂.
The idea is that this is our best guess, if forced to select a single number, of some true (but unknown)
θ.

2 Likelihood

We define likelihood as the probability of observed data D given a model M where the model has parameters
θ, i.e.,
L(θ) = P (D|θ, M )

• Likelihood is always defined relative to some model M . However, for our initial discussions at least,
we will often drop the explicit reference to M in discussions below and just implicitly assume that
there is some model M being conditioned on.
¹ Later in class we will discuss the situation where there are multiple candidate models M1, . . . , MK under consideration.


• We will refer to data sets as D. For 1-dimensional observations this will be a set of values {x1 , . . . , xn }.
For d-dimensional vector observations x we have D = {x1 , . . . , xn }, where xij is the jth component
of the ith observation, 1 ≤ j ≤ d, 1 ≤ i ≤ n. We can also think of D as a data matrix with n rows
indexed by i (each row is a data vector xi ), and with d columns (variables) indexed by j.

• Likelihood is viewed as a function of θ conditioned on a fixed observed data set D. We are interested
in how the likelihood changes as θ changes, where θ is usually real-valued. If a parameter θ1 has
higher likelihood L(θ1 ) than the likelihood of another parameter θ2 , then P (D|θ1 ) > P (D|θ2 ), i.e.,
the observed data is more likely given θ1 than θ2 .

• This leads naturally to the concept of maximum likelihood (discussed below), i.e., finding the θ value
that corresponds to the maximum of L(θ) (assuming a unique maximum exists).

• In defining the likelihood we can drop (ignore) any terms in p(D|θ) that don’t involve θ, such as
normalizing constants. What is usually important is the shape of the likelihood function, or the relative
value of the likelihood, rather than the actual value of the likelihood.

• The likelihood function will typically be quite “wide” when we have relatively little data, and will
“narrow” in shape as we get more data. (This is generally a good description of what happens for
simple models, but is not necessarily true for more complex ones).

• The likelihood function can be defined on vectors of parameters θ, rather than just a single scalar
parameter θ. For a parameter vector defined as θ = (θ1 , . . . , θp ), L(θ) is a scalar function of p
arguments. As with a multi-dimensional probability density function, we can think of the multi-
dimensional likelihood function as a “surface” (non-negative) defined over p dimensions.

• As an example, for a Gaussian density model p(x) for a one-dimensional continuous random variable X, the parameters are θ = {θ1, θ2} = {µ, σ²}, i.e., the unknown mean and variance. The likelihood L(θ) = L(µ, σ²) is a scalar function over the two-dimensional (µ, σ²) space. Note that we could define θ2 here as either σ or σ²; either is fine, but it turns out that σ² will make the maximum likelihood analysis somewhat easier to work with. It can also sometimes be convenient to work with reparametrizations such as log σ or 1/σ², depending on the context, rather than σ or σ² directly. (A short code sketch after this list illustrates such a two-dimensional likelihood surface.)

• The likelihood function can equally well be defined when the probability model is a distribution
P (D|θ) (e.g., for discrete random variables) or a probability density function p(D|θ) (for continuous
random variables), or for a combination of the two (e.g., p(D1 |D2 , θ1 )P (D2 |θ2 )) where D1 models
the variables that are real-valued using parameters θ1 , and D2 models the variables that are discrete-
valued with parameters θ2 .
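
To make the notion of a likelihood surface concrete, here is a minimal Python sketch (the simulated data, grid ranges, and random seed are arbitrary illustrative choices, not values from these notes) that evaluates the Gaussian log-likelihood on a grid of (µ, σ²) values and reports the grid point with the highest value:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=20)   # simulated data with true mu = 5, sigma^2 = 1

# Grid of candidate (mu, sigma^2) parameter values
mus = np.linspace(3.0, 7.0, 200)
sigma2s = np.linspace(0.2, 3.0, 200)
MU, S2 = np.meshgrid(mus, sigma2s)

# Gaussian log-likelihood: l(mu, sigma^2) = -(n/2) log(2 pi sigma^2) - sum_i (x_i - mu)^2 / (2 sigma^2)
n = x.size
sum_sq = ((x[:, None, None] - MU[None, :, :]) ** 2).sum(axis=0)
loglik = -0.5 * n * np.log(2 * np.pi * S2) - sum_sq / (2 * S2)

# The grid maximizer should land near the sample mean and the (biased) sample variance
i, j = np.unravel_index(np.argmax(loglik), loglik.shape)
print("grid maximum at mu =", MU[i, j], ", sigma^2 =", S2[i, j])
print("sample mean =", x.mean(), ", sample variance =", x.var())
```

Viewed this way, L(µ, σ²) is simply a non-negative surface over the two-dimensional parameter space, as described in the bullets above.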

Example 1: Binomial Likelihood: Consider tossing a coin with probability θ of heads and 1 − θ
of tails. This is the Bernoulli model. Now say we observe a sequence of tosses of the same coin.
This set of outcomes represents our data D, where D = {x1, . . . , xn} and xi ∈ {0, 1} represents the
outcome of the ith toss (e.g., with 1 corresponding to heads and 0 to tails).
In defining a likelihood, we need to specify a probability model for multiple samples {x1, . . . , xn}
rather than just for a single sample xi . The standard assumption for coin-tossing (and many other
phenomena that don’t exhibit any “memory” in how individual data points are generated) is to assume
that each observation xi is conditionally independent of the other observations given the parameter
θ, i.e.,
\[ L(\theta) = P(D \mid \theta) = P(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) \]

where P (xi |θ) = θ for xi = 1 and P (xi |θ) = 1 − θ for xi = 0. This particular “coin-tossing”
model, combining a Bernoulli with conditional independence of the xi ’s is referred to as a Binomial
likelihood.
The conditional independence assumption on the xi ’s in the likelihood definition is sometimes
(loosely) also referred to as the IID assumption (independent and identically distributed). The
notion of exchangeability in statistics is essentially the same idea. Note this assumption allows for
a tremendous simplification in our model: instead of dealing with the joint P(x1, . . . , xn|θ) we can
work with the individual terms P(xi|θ). Of course we have to be careful that this is a reasonable
assumption. It is certainly a reasonable assumption in the case where the xi ’s are coin tosses, or
perhaps (and closer to the real-world) the case where Xi represents the ith Web surfer to arrive at an
ecommerce Web site and xi is a binary value indicating whether the Web surfer makes a purchase
or not. But in other applications the xi ’s may have some dependence on each other, e.g., if the xi ’s
represented the value of the stock market on different days or words in text. If such dependence was
thought to exist then it should be modeled (see example below).

Continuing on with our binomial likelihood example, we can write


\[ L(\theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = \theta^{r}\,(1-\theta)^{n-r} \]

where r is the number of “heads” observed and n − r is the number of tails. Note that we did not
include the usual combinatorial (binomial) term \(\binom{n}{r}\) in front of the expression above, which counts
the number of different ways that r heads could occur in n trials, since this term does not involve θ.

Figure 1 shows two examples of the binomial likelihood function for different data sets. In
Figure 1(a) we have r = 3 and n = 10. The likelihood function is relatively wide and is maximized
at 3/10 = 0.3, which makes sense intuitively. In Figure 1(b) we have r = 30 and n = 100:
here the likelihood is much narrower, as we might expect, and as a result the range of plausible values
for θ is much narrower after seeing 100 observations compared with just 10.


Figure 1: Binomial likelihood for (a) r = 3, n = 10, and (b) r = 30, n = 100.
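
As a rough illustration of how these curves arise (a sketch; the grid resolution is an arbitrary choice), the binomial likelihood L(θ) = θ^r (1 − θ)^(n−r) can be evaluated on a grid of θ values, recovering the maxima shown in Figure 1:

```python
import numpy as np

def binomial_likelihood(theta, r, n):
    """Binomial likelihood, ignoring the combinatorial term that does not involve theta."""
    return theta**r * (1.0 - theta)**(n - r)

thetas = np.linspace(0.0, 1.0, 1001)
for r, n in [(3, 10), (30, 100)]:
    L = binomial_likelihood(thetas, r, n)
    theta_ml = thetas[np.argmax(L)]
    print(f"r={r}, n={n}: L(theta) is maximized at theta = {theta_ml:.2f} (expected {r/n:.2f})")
```

Note how small the raw likelihood values become for n = 100; this is one practical reason for working with the log-likelihood, as discussed below.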

An interesting side-note with the example above is that conditional independence plays a key role in
our definition of likelihood in the binomial model. In fact the xi ’s are not marginally independent, but only
conditionally independent. Why? If θ is unknown (remember that θ is the probability of heads) then the xi ’s
carry information about each other. As an example, say θ = 0.999 but we don’t know this. So we will tend
to see a lot of heads showing up and very rarely a tail showing up. Having seen such a sequence of xi ’s with
many more heads than tails, this data is informative about the next coin toss. Of course, if someone were to
tell us the true value of θ then the previous xi values have no information at all in terms of predicting the
next x value, since we have all the information we need in θ.

Example 2: Likelihood with Memory: In the previous binomial example, if instead of modeling
coin tosses we were modeling the occurrence of rain on day i in Irvine (xi indicates whether it rains
or not on day i), then we would want to consider abandoning the IID assumption and introducing
some dependence among the xi values (since we will tend to get “runs” of wet days and dry days).
For example, we could make a Markov assumption (Note Set 2) and assume that xi+1 on day i + 1 is
conditionally independent of x values on days i − 1, i − 2, . . . , 1, given the value of xi . Accordingly
the likelihood would be defined as:
\[ L(\theta) = P(x_1, \ldots, x_n \mid \theta) = P(x_1 \mid \theta_1) \cdot \prod_{i=1}^{n-1} P(x_{i+1} \mid x_i, \theta_2) \]

where θ1 = P(x1 = 1) and θ2 is a parameter vector representing a 2 × 2 Markov transition matrix of
parameters (the conditional probabilities of rain or not-rain, conditioned on rain or not-rain the day
before).
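
The following sketch (purely illustrative; the parameter values and the example sequence are made up for demonstration) shows how this Markov likelihood can be evaluated for a binary rain/no-rain sequence:

```python
import numpy as np

# Illustrative (assumed) parameter values
theta1 = 0.3                            # P(x_1 = 1): probability of rain on day 1
theta2 = np.array([[0.8, 0.2],          # row k gives P(x_{i+1} | x_i = k) as [P(0 | k), P(1 | k)]
                   [0.4, 0.6]])

x = [0, 0, 1, 1, 1, 0, 0, 1]            # an observed binary sequence (1 = rain on day i)

# log L(theta) = log P(x_1 | theta_1) + sum_{i=1}^{n-1} log P(x_{i+1} | x_i, theta_2)
loglik = np.log(theta1 if x[0] == 1 else 1.0 - theta1)
for prev, curr in zip(x[:-1], x[1:]):
    loglik += np.log(theta2[prev, curr])

print("Markov log-likelihood:", loglik)
```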

Example 3: Gaussian Likelihood: Consider a data set D = {x1 , . . . , xn } where the xi ’s are real-
valued scalars and are samples from a random variable X. Assume we wish to model the xi values
with a Gaussian density function. The Gaussian has two parameters µ and σ 2 . Treating these two
parameters as unknown, and referring to them as θ1 = µ and θ2 = σ 2 we can write the likelihood as:
\[ p(D \mid \theta) = p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta) \]

where here we make the assumption that the xi ’s are conditionally independent given θ (for a real
problem we would want to convince ourselves that this is reasonable to do).

The individual terms in our likelihood are by definition Gaussian density functions, each
evaluated at xi :
\[ p(x_i \mid \theta) = p(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right). \]
Taking the product of these terms, and then taking the log (to the base e for convenience) we arrive
at the log-likelihood
\[ \log L(\theta) = l(\theta) = -\frac{n}{2}\log(2\pi\theta_2) - \frac{1}{2\theta_2}\sum_{i=1}^{n} (x_i - \theta_1)^2. \]

Imagine that θ2 = σ 2 is fixed (assume for example that it is known). Then l(θ1 ) (viewed as a function
of θ1 only) is proportional to a 2nd order polynomial involving xi ’s and θ1 , i.e.,
\[ l(\theta_1) \propto -\sum_{i=1}^{n} (x_i - \theta_1)^2 \]

from which we see that l(θ1) is larger when the sum of squared deviations ∑(xi − θ1)² is smaller, i.e., l(θ1) will be larger for
values of θ1 = µ that are closer to the xi's on average (since this is a sum of squared errors between the
observed set of xi values and a single scalar θ1 = µ).

Figure 2 shows some examples of the Gaussian log-likelihood function l(µ) (treating µ as
unknown, but assuming that σ² is known) plotted for different sample sizes, where the
data were simulated from a known Gaussian density function with µ = 5 and σ² = 1. Again, as n
increases we see that the likelihood narrows in around the true value of µ = 5.

Figure 2: Log-likelihood l(µ) as a function of the parameter θ = µ for four different sample sizes (1, 3, 20, and 2000 simulated data points), with data simulated
from a Gaussian with true µ = 5 and σ = 1 (simulated data shown as dots horizontally at the top of each
plot).
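
A sketch of the computation behind Figure 2 (illustrative; it assumes σ² = 1 is known and uses an arbitrary random seed): the log-likelihood l(µ) is evaluated on a grid of µ values for increasing sample sizes, and its maximizer concentrates around the true value µ = 5 as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
mus = np.linspace(0.0, 10.0, 1001)
sigma2 = 1.0                                    # variance assumed known

for n in [1, 3, 20, 2000]:
    x = rng.normal(loc=5.0, scale=np.sqrt(sigma2), size=n)
    # l(mu) = -(n/2) log(2 pi sigma^2) - (1 / (2 sigma^2)) sum_i (x_i - mu)^2
    loglik = (-0.5 * n * np.log(2 * np.pi * sigma2)
              - ((x[:, None] - mus[None, :]) ** 2).sum(axis=0) / (2 * sigma2))
    print(f"n = {n:5d}: log-likelihood maximized at mu = {mus[np.argmax(loglik)]:.2f}")
```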

3 The Principle of Maximum Likelihood

The principle of maximum likelihood follows naturally from what we have discussed above, namely that if
we had to summarize our data by selecting only a single parameter value θ̂, and if we only have the observed
data and the likelihood available and no other information, then it is reasonable to argue that the value of θ
that we should select is the one that maximizes the likelihood L(θ). Or, more formally:
\[ \hat{\theta}_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} P(D \mid \theta). \]
The subscript “ML” denotes “maximum likelihood” since we will later discuss other types of estimates for
which we will use other subscripts. The “hat” notation, θ̂, denotes an estimate of some unknown (true)
quantity θ.

Example 4: Maximum Likelihood Estimate for the Binomial Model: From earlier, the binomial
likelihood can be written as:
\[ L(\theta) = P(x_1, \ldots, x_n \mid \theta) = \theta^{r}\,(1-\theta)^{n-r} \]

where r is the number of successes in n trials. We can easily find the maximum likelihood estimate
of θ as follows. First let's work with the log-likelihood, since the log-likelihood is a little easier to work with.ᵃ

\[ \log L(\theta) = l(\theta) = r \log\theta + (n - r)\log(1 - \theta). \]
A necessary condition to maximize l(θ) is that dl(θ)/dθ = 0, i.e., this condition must be satisfied at
θ = θ̂ML. Thus, we calculate the derivative with respect to θ and set it to 0, i.e.,

\[ \frac{d}{d\theta} l(\theta) = \frac{r}{\theta} - \frac{n-r}{1-\theta} = 0, \quad \text{at } \theta = \hat{\theta}_{ML}, \]
and after some rearrangement of terms we get

\[ \hat{\theta}_{ML} = \frac{r}{n}, \]
i.e., the standard intuitive frequency-based estimate for the probability of success given r successes
in n trials. At this point it seems like we may not have gained very much with our likelihood-based
framework since we arrived back at the “obvious” answer! However, the power of the likelihood
(and related) approaches is that we can generalize to much more complex problems where there is
no obvious “intuitive” estimator for a parameter θ. And if we think about it we should have expected
to get this estimate for θ̂M L a priori. Had we gotten any other estimate we might have good cause for
concern that our likelihood-based procedures did not match our intuition.
ᵃ Note that the value of θ that maximizes the log-likelihood is the same as the value of θ that maximizes the likelihood,
since log is a monotonically increasing function.
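
As a quick sanity check on the closed-form result (a sketch; it assumes SciPy is available and uses a bounded scalar optimizer), θ̂ML = r/n can be compared against a generic numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

r, n = 3, 10

def neg_loglik(theta):
    # negative binomial log-likelihood: -(r log theta + (n - r) log(1 - theta))
    return -(r * np.log(theta) + (n - r) * np.log(1.0 - theta))

result = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", result.x, " closed form r/n:", r / n)
```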

Example 5: Maximum Likelihood Estimate for the Gaussian IID Model:


Consider the case where σ is known and µ is unknown. From Example 3 earlier we saw that for the
Gaussian IID model we can write:
\[ l(\theta) \propto -\sum_{i=1}^{n} (x_i - \theta)^2 \]

where θ = µ is the unknown mean parameter. To maximize this as a function of θ we can use simple
calculus, i.e., differentiate the right-hand side above with respect to θ, set it to 0, and solve for θ. (Left
as an exercise for the reader.)
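
If you work through the exercise, one way to check your answer numerically (a sketch; the simulated data and grid are arbitrary choices) is to evaluate l(θ) ∝ −∑(xi − θ)² on a fine grid of θ values and compare the grid maximizer with the sample mean of the data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=1.0, size=50)     # simulated data

thetas = np.linspace(0.0, 10.0, 10001)
loglik = -((x[:, None] - thetas[None, :]) ** 2).sum(axis=0)   # l(theta), up to constants

print("grid maximizer:", thetas[np.argmax(loglik)])
print("sample mean:   ", x.mean())
```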

Example 6: Maximum Likelihood Estimation with Two Noisy Data Sources:


There are many problems in scientific data analysis where we need to combine multiple different data sets
to make predictions about a single quantity of interest. The following example discusses such a problem and
also illustrates a situation where the maximum likelihood approach leads to an estimate that is not obvious,
i.e., the equation defining θ̂M L could not easily be guessed, at least not until we have an idea what the correct
approach is.

Consider the following scenario. We are working with an astronomer monitoring a distant object in the
sky with two different CCD cameras connected to 2 different telescopes in different parts of the world.
Assume in this simplified example that each camera produces noisy estimates of the object's true brightness:
we assume that there is a true constant brightness µ for the object but our cameras only get noisy
measurements x1, x2, . . . (our astronomer can get multiple xi measurements from each camera over multiple
nights).

Say that camera 1 produces measurements that have a Gaussian distribution with mean µ and variance
σ1², and that camera 2 produces measurements with mean µ and variance σ2². We are assuming that the true
mean of the measurements from each individual camera is the same as the true brightness, but the variances
are different, e.g., if σ1² is much smaller than σ2² this could be because camera 1 is connected to a much more
accurate (newer, stronger) telescope. We will also assume (for simplicity) that the two variances are known
(but that µ is unknown), which is not unreasonable, since astronomers are often very good at coming up
with techniques to calibrate the noise in their instruments.

The question is how to estimate µ given data D consisting of n1 measurements from camera 1 and n2
measurements from camera 2. A naive estimate of µ is simply the average over all of the measurements,
i.e.,
\[ \hat{\mu}_{naive} = \frac{1}{n_1 + n_2} \sum_{i} x_i \]

where the sum ranges over all of the measurements. But in constructing this simple estimate we are
ignoring the fact that one camera is more accurate than the other, i.e., σ1² ≠ σ2². The more different these two
variances are, the more important it may be to account for measurements from the two data sets differently.
In the extreme case, for example, we might have only 1 measurement in D1 from camera 1 and (say) 1000
measurements in D2 from camera 2, but say that camera 1 has 10 times less variance than camera 2. In
this case how should we combine the data to arrive at an estimate of µ? Intuitively we can imagine that
some form of weighting scheme is probably appropriate, where we downweight measurements from the
more noisy camera and upweight measurements from the more accurate one. But it's not obvious what these
weights should be.
This is the type of situation where formal probabilistic modeling (such as likelihood-based methods) can
be very useful. So let's see what the maximum likelihood estimator for µ is in this situation.

\[
\begin{aligned}
L(\mu) &= p(D \mid \mu) \\
&= p(D_1, D_2 \mid \mu) \\
&= p(D_1 \mid \mu)\, p(D_2 \mid \mu) \\
&= \prod_{i=1}^{n_1} f(x_i; \mu, \sigma_1^2) \cdot \prod_{j=1}^{n_2} f(x_j; \mu, \sigma_2^2)
\end{aligned}
\]

where the first product is over the n1 data points in data set D1 and the second product is over the n2 data
points in data set D2. The notation f(xi; µ, σ1²) denotes a Gaussian (Normal) density function evaluated at
xi with mean µ and variance σ1². We have also assumed IID measurements, which may be reasonable, for
example, if the measurements were taken relatively far apart in time (e.g., on different nights). Taking logs
and dropping terms that don’t involve µ, we get
\[ l(\mu) = -\frac{1}{2\sigma_1^2}\sum_{i=1}^{n_1} (x_i - \mu)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{n_2} (x_j - \mu)^2. \]

Taking the derivative with respect to µ yields


\[ \frac{d}{d\mu} l(\mu) = \frac{1}{\sigma_1^2}\sum_{i=1}^{n_1} (x_i - \mu) + \frac{1}{\sigma_2^2}\sum_{j=1}^{n_2} (x_j - \mu). \]

Setting this expression to 0 and rearranging terms, we get

\[ \hat{\mu}_{ML}\left(\frac{n_1}{\sigma_1^2} + \frac{n_2}{\sigma_2^2}\right) = \frac{1}{\sigma_1^2}\sum_{i=1}^{n_1} x_i + \frac{1}{\sigma_2^2}\sum_{j=1}^{n_2} x_j. \]

Multiplying through by σ1²,

\[ \hat{\mu}_{ML}\left(n_1 + n_2\,\frac{\sigma_1^2}{\sigma_2^2}\right) = \sum_{i=1}^{n_1} x_i + \frac{\sigma_1^2}{\sigma_2^2}\sum_{j=1}^{n_2} x_j, \]

yielding:

\[ \hat{\mu}_{ML} = \left(n_1 + n_2\,\frac{\sigma_1^2}{\sigma_2^2}\right)^{-1} \left(\sum_{i=1}^{n_1} x_i + \frac{\sigma_1^2}{\sigma_2^2}\sum_{j=1}^{n_2} x_j\right). \]

We see that the relative weighting of the two data sets is controlled by the ratio r = σ1²/σ2². If r = 1 (same
variance in both cameras) we get the standard "unweighted" solution, i.e., the maximum likelihood estimate
of µ corresponds to the empirical average of all of the data points (as we would expect). If σ1² < σ2² (so the
ratio r < 1) then the data points from camera 2 (with higher variance and more noise) are essentially being
downweighted by the factor r = σ1²/σ2². Conversely, if σ1² > σ2² and the measurements from camera 2 are less
noisy, then camera 2's measurements are upweighted by the factor r > 1.
We might have guessed at a similar solution in an ad hoc manner, but the likelihood-based approach
provides a clear and principled way to derive estimators, and it can be particularly useful in problems
much more complex than this example. For example, imagine K cameras, with different (possibly
non-Gaussian) noise models for each and with various dependencies among the cameras. The noise
characteristics for some cameras could be unknown but nonetheless may be known to be inter-dependent in some
manner, e.g., two cameras have unknown variances but we know that the first camera has twice the variance
of the other. Maximum likelihood gives us a principled way to address such problems.
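
A brief numerical sketch of this example (the variances, sample sizes, true brightness, and seed are made-up illustrative values) compares the naive average with the variance-weighted maximum likelihood estimate; the formula used below is algebraically equivalent to the expression derived above, just written with each sum divided directly by its variance:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = 10.0                               # true (unknown) brightness
sigma1, sigma2 = 0.5, 2.0                    # camera 1 is much less noisy than camera 2
n1, n2 = 5, 1000

x1 = rng.normal(mu_true, sigma1, size=n1)    # camera 1 measurements
x2 = rng.normal(mu_true, sigma2, size=n2)    # camera 2 measurements

# Naive estimate: plain average of all measurements, ignoring the different variances
mu_naive = np.concatenate([x1, x2]).mean()

# Maximum likelihood estimate with known variances:
# mu_ml = (sum_i x_i / sigma1^2 + sum_j x_j / sigma2^2) / (n1 / sigma1^2 + n2 / sigma2^2)
num = x1.sum() / sigma1**2 + x2.sum() / sigma2**2
den = n1 / sigma1**2 + n2 / sigma2**2
mu_ml = num / den

print("naive estimate:", mu_naive, " ML estimate:", mu_ml, " true mu:", mu_true)
```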

4 Maximum Likelihood for Graphical Models

4.1 Basic Concepts: Two Random Variables

Consider two discrete random variables A and B each taking M values, with possible values a and b.
Assume we already know the marginal probabilities p(a) for variable A and we wish to learn the condi-
tional probabilities P (b|a). We will treat these unknown conditional probabilities as parameters θ. We
can separate θ into M different sets of conditional probability parameters, one for each value of A, i.e.,
θ = (θ1 , . . . , θM ). Each set of parameters θk contains M conditional probabilities that sum to 1, each
conditioned on a particular value A = k, i.e., θk = (θk,1 , . . . , θk,M ) where θk,l = P (B = l|A = k).
Sidenote on notation: we will use notation below such as P (B = l|A = k, θk,l ), which you can think of
as saying “if we know A = k and we know the value of θk,l , then our conditional probability for B = l given
A = k is itself the parameter θk,l .” This notation, where we put parameters like θk,l on the conditioning side
of a conditional probability, is not very elegant, but it is convenient and useful in general (and will be particularly
useful when we discuss Bayesian learning later on).
Now say we have an observed data set D = {(ai , bi )}, 1 ≤ i ≤ N , i.e., a set of N observations, with ai
and bi denoting the value of A and the value of B respectively for each pair of observations. For example i
might refer to an individual and A and B might be two discrete variables or attributes that we can measure
for any individual.
If we assume the observations are IID conditioned on the unknown parameters θ, we can write the
log-likelihood as
\[
\begin{aligned}
\log L(\theta) &= \sum_{i=1}^{N} \log P(a_i, b_i \mid \theta) \\
&= \sum_{i=1}^{N} \Big( \log P(b_i \mid a_i, \theta) + \log P(a_i) \Big)
\end{aligned}
\]

We can drop the terms log P (ai ) from the likelihood since they are assumed here to be known and do
not depend on θ. We can also simplify the log-likelihood expression by writing it out as a sum over the
parameters for each of the different conditional probability vectors θk for each value of A:
\[
\begin{aligned}
\log L(\theta) &= \sum_{i=1}^{N} \log P(b_i \mid a_i, \theta) \\
&= \sum_{k=1}^{M} \left( \sum_{i=1}^{N_k} \log P(b_i \mid a_i = k, \theta_k) \right) \\
&= \sum_{k=1}^{M} \log L(\theta_k)
\end{aligned}
\]

where Nk is the number of times that a = k occurs in the data (here we have grouped the likelihood terms
in correspondence with the M values of A). Since each of the terms log L(θk ) involves different sets of
parameters θk , we can maximize each one separately, i.e., estimate the maximum likelihood parameters
(conditional probabilities) for each of the M different values of A. Thus, we have
\[
\begin{aligned}
\log L(\theta_k) &= \sum_{i=1}^{N_k} \log P(b_i \mid a_i = k, \theta_k) \\
&= \sum_{l=1}^{M} \left( \sum_{i\,:\,b_i = l} \log P(b_i = l \mid a_i = k, \theta_k) \right) \quad \text{(by grouping terms with } b_i = l\text{)} \\
&= \sum_{l=1}^{M} r_{k,l} \log P(b_i = l \mid a_i = k, \theta_k) \\
&= \sum_{l=1}^{M} r_{k,l} \log \theta_{k,l}
\end{aligned}
\]

where rk,l is a count of the number of times that a = k and b = l in the data D, and where ∑_{l=1}^{M} rk,l = Nk.
This is the same form as the multinomial problem (See lectures and/or homework 2). If we maximize this
for each θk,l , the solution is
\[ \hat{\theta}_{k,l}^{ML} = \frac{r_{k,l}}{N_k}, \qquad 1 \le l, k \le M, \]
i.e., the maximum likelihood estimate of each conditional probability θk,l = P (B = l|A = k) is the number
of times rk,l that A = k and B = l occur in the data, divided by the number of times Nk that A = k occurs,
i.e., a standard frequency-based estimate for a conditional probability.

If we now have a more complicated graphical model, e.g., A → B → C, we can again factorize
the likelihood into terms that only involve local conditional probability tables, with a local table for each
variable conditioned on its parents. The maximum likelihood estimates of these conditional probabilities
are the “local” frequency based estimates of how often both the parent-child combination of values occurs
divided by the number of times the parent value occurs (see subsection below for details).
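
A small sketch of this frequency-based estimate (the synthetic data-generating mechanism below is invented purely to produce example counts, and it assumes every value of A appears at least once in the data):

```python
import numpy as np

M = 3
rng = np.random.default_rng(4)
a = rng.integers(0, M, size=500)                  # observed values of A
b = (a + rng.integers(0, 2, size=500)) % M        # observed values of B (depend on A)

# Counts r_{k,l} = number of times (A = k, B = l), and N_k = number of times A = k
counts = np.zeros((M, M))
for ai, bi in zip(a, b):
    counts[ai, bi] += 1
N_k = counts.sum(axis=1, keepdims=True)

# Maximum likelihood CPT: theta_hat[k, l] = r_{k,l} / N_k = estimated P(B = l | A = k)
theta_hat = counts / N_k
print(theta_hat)
print("rows sum to 1:", np.allclose(theta_hat.sum(axis=1), 1.0))
```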

4.2 More General Graphical Models (Optional Reading)

We can generalize the ideas above to any arbitrary directed graphical model. Assume we have a set of d
random variables where we know the structure of an associated graphical model, i.e., for each variable Xj
we know the parent set pa(Xj ) in the graph. We also have available an N × d data matrix D consisting of
independent random samples from the joint distribution P (x1 , . . . , xd ), where xij is the observed value for
variable Xj in the ith random sample. For simplicity assume that each variable Xj is discrete and takes M
values. Given the structure of the graphical model we would like to use the data D to estimate the CPTs for
the model.

The parameters θ can in general be defined as the set θ = {θj } where the index j = 1, . . . , d, i.e., j
ranges over the variables X1 , . . . , Xd (note that this is a little different from the notation used earlier in this
section). Each θj is the set of relevant parameters for variable Xj , or more specifically, the set of parameters
defining the CPT P (xj |pa(Xj )). (Note again that when we say “parameters” here we mean conditional
probabilities: we refer to them as parameters since they are unknown and we wish to estimate them from
data).

It is straightforward to show that the overall likelihood can be decomposed into separate local likelihood
terms, one per variable Xj , as follows:
\[
\begin{aligned}
L(\theta) = P(D \mid \theta) &= \prod_{i=1}^{N} P(x_i \mid \theta) \\
&= \prod_{i=1}^{N} \prod_{j=1}^{d} P(x_{ij} \mid \mathrm{pa}(X_j)_i, \theta_j) \\
&= \prod_{j=1}^{d} \prod_{i=1}^{N} P(x_{ij} \mid \mathrm{pa}(X_j)_i, \theta_j) \\
&= \prod_{j=1}^{d} L(\theta_j)
\end{aligned}
\]

where \(L(\theta_j) = \prod_{i=1}^{N} P(x_{ij} \mid \mathrm{pa}(X_j)_i, \theta_j)\) is the part of the likelihood only involving parameters θj for
variable Xj. (Here pa(Xj)i indicates the value(s) of the parents of Xj for the ith data point xi.) Thus, the
total likelihood decomposes into local likelihoods per node (or per variable).

We can write this in log-likelihood form as:


\[ \log L(\theta) = \sum_{j=1}^{d} \log L(\theta_j) \]

where

\[ \log L(\theta_j) = \sum_{i=1}^{N} \log P(x_{ij} \mid \mathrm{pa}(X_j)_i, \theta_j). \]

We can maximize the full log-likelihood by independently maximizing each local log-likelihood log L(θj )
as long as the θj parameters for each variable Xj are not constrained or related (if they are then we would
need to do a joint maximization over the different terms). Thus, we have reduced the problem of finding the
maximum likelihood parameters for a directed graphical model to d separate problems, where each problem
corresponds to finding the maximum likelihood parameters for the conditional probability tables for child
node Xj given its parents in the graphical model.
The parameters θj can be defined as a set θj = {θj,k}, where θj,k = {θj,k,l} and each θj,k,l =
P(xj = l|pa(Xj) = k), i.e., these parameters are the conditional probabilities of Xj taking different values
l, 1 ≤ l ≤ M , conditioned on a particular set of values k for the parent nodes. In the earlier subsection k
ranged over M values (the values of variable A): but in the general case a node might have multiple parents,
so k in general will range over all possible combinations of values of the parents.
Each local log-likelihood log L(θj ) can be written as
\[
\begin{aligned}
\log L(\theta_j) &= \sum_{i=1}^{N} \log P(x_{ij} \mid \mathrm{pa}(X_j)_i, \theta_j) \\
&= \sum_{k} \sum_{l} r_{k,l} \log P(x_j = l \mid \mathrm{pa}(X_j) = k, \theta_{j,k})
\end{aligned}
\]

where the two sums are over all possible values l and k of the child and parent variable(s) respectively, and
where rk,l is the number of times that those particular combinations of parent-child values occur in the data.
The sum over k has M^|pa(Xj)| different terms, where |pa(Xj)| is the number of parents of variable Xj (for
the special case in the model where all variables take the same number of values M ). The innermost sum l
is over the M possible values that each variable xj can take, conditioned on some setting k of the values of
the parent variables pa(Xj ).
It follows from the equation above that log L(θj ) can be further broken down as sums of local likelihood
terms
\[ \log L(\theta_j) = \sum_{k} \log L(\theta_{j,k}) \]
with a log-likelihood term log L(θj,k ) for each set of parameters θj,k , where each term log L(θj,k ) can be
maximized separately from all the other terms. From a maximum likelihood perspective, each of these terms
log L(θj,k ) corresponds (in general) to a different conditional distribution θj,k = {θj,k,l } with probabilities
that sum to 1 (over l), and the maximum likelihood estimate for each such distribution (corresponding to a
“variable and parent value” combination) is the standard multinomial estimate from earlier, i.e.,
\[ \hat{\theta}_{j,k,l} = \frac{r_{j,k,l}}{N_k} \]
where Nk is the number of times the specific parent values corresponding to k occur in the data and rj,k,l is
the number of times that variable Xj takes value l and the parents pa(Xj) = k, with j = 1, . . . , d, l =
1, . . . , M, and k = 1, . . . , M^|pa(Xj)|.
In this manner our maximum likelihood problem for the graphical model reduces to M^|pa(Xj)| different
maximum likelihood estimations of conditional distributions, repeated for each variable Xj in the model.
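
Finally, a sketch of the same local, frequency-based estimation applied to a chain A → B → C (the chain structure and synthetic data are invented for demonstration; each variable is binary here):

```python
import numpy as np
from collections import defaultdict

M, N = 2, 1000
rng = np.random.default_rng(5)
A = rng.integers(0, M, size=N)
B = (A + rng.integers(0, 2, size=N)) % M
C = (B + rng.integers(0, 2, size=N)) % M
D = np.stack([A, B, C], axis=1)          # N x d data matrix for the chain A -> B -> C

parents = {0: None, 1: 0, 2: 1}          # parent index of each variable (None = no parent)

# Local ML estimates: for each child j and parent configuration k,
# theta_hat_{j,k,l} = r_{j,k,l} / N_k  (count of child value l with parents = k, divided by N_k)
cpts = {}
for j, pa in parents.items():
    counts = defaultdict(lambda: np.zeros(M))
    for row in D:
        k = () if pa is None else (int(row[pa]),)   # parent configuration (empty if no parent)
        counts[k][row[j]] += 1
    cpts[j] = {k: c / c.sum() for k, c in counts.items()}

print(cpts[1])   # estimated CPT P(B | A)
```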
