Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression
Machine Learning
Copyright © 2015 Tom M. Mitchell. All rights reserved.
*DRAFT OF September 23, 2017*
This is a rough draft chapter intended for inclusion in the upcoming second edition of the textbook Machine Learning, T.M. Mitchell, McGraw Hill. You are welcome to use this for educational purposes, but do not duplicate or repost it on the internet. For online copies of this and other materials related to this book, visit the web site www.cs.cmu.edu/~tom/mlbook.html.
Please send suggestions for improvements, or suggested exercises, to
[email protected].
$$P(Y = y_i \mid X = x_k) = \frac{P(X = x_k \mid Y = y_i)\, P(Y = y_i)}{\sum_j P(X = x_k \mid Y = y_j)\, P(Y = y_j)}$$
where yi denotes the ith possible value for Y, xk denotes the kth possible vector value for X, and where the summation in the denominator is over all legal values of the random variable Y.
One way to learn P(Y |X) is to use the training data to estimate P(X|Y ) and
P(Y ). We can then use these estimates, together with Bayes rule above, to deter-
mine P(Y |X = xk ) for any new instance xk .
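To make this two-step recipe concrete, here is a minimal Python sketch (with made-up counts for a single discrete attribute): it estimates P(X|Y) and P(Y) by relative frequency and then applies Bayes rule to compute P(Y|X = xk) for a new value.

```python
from collections import Counter

# Toy training data: (x, y) pairs with a discrete x and boolean y.
# These particular values are made up purely for illustration.
data = [("sunny", 1), ("sunny", 1), ("rain", 0), ("rain", 1), ("sunny", 0), ("rain", 0)]

# Estimate P(Y) and P(X|Y) from relative frequencies.
y_counts = Counter(y for _, y in data)
p_y = {y: c / len(data) for y, c in y_counts.items()}
xy_counts = Counter((x, y) for x, y in data)
p_x_given_y = {(x, y): c / y_counts[y] for (x, y), c in xy_counts.items()}

def posterior(x_new):
    """Bayes rule: P(Y=y | X=x) = P(X=x|Y=y) P(Y=y) / sum_j P(X=x|Y=y_j) P(Y=y_j)."""
    joint = {y: p_x_given_y.get((x_new, y), 0.0) * p_y[y] for y in p_y}
    total = sum(joint.values())
    return {y: v / total for y, v in joint.items()}

print(posterior("sunny"))  # e.g. {1: 2/3, 0: 1/3} for this toy data
```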
$$\theta_{ij} \equiv P(X = x_i \mid Y = y_j)$$
where the index i takes on 2^n possible values (one for each of the possible vector values of X), and j takes on 2 possible values. Therefore, we will need to estimate approximately 2^(n+1) parameters. To calculate the exact number of required parameters, note that for any fixed j, the sum over i of θij must be one. Therefore, for any particular value yj, and the 2^n possible values of xi, we need compute only 2^n − 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2^n − 1) such θij parameters. Unfortunately, this corresponds to two
distinct parameters for each of the distinct instances in the instance space for X. Worse yet, to obtain reliable estimates of each of these parameters, we will need to observe each of these distinct instances multiple times!¹ This is clearly unrealistic in most practical learning domains. For example, if X is a vector containing 30 boolean features, then we will need to estimate more than two billion parameters.

¹Why? See Chapter 5 of edition 1 of Machine Learning.
$$\begin{aligned}
P(X \mid Y) &= P(X_1, X_2 \mid Y) \\
&= P(X_1 \mid X_2, Y)\, P(X_2 \mid Y) \\
&= P(X_1 \mid Y)\, P(X_2 \mid Y)
\end{aligned}$$
where the second line follows from a general property of probabilities (the chain rule), and the third line follows directly from our definition of conditional independence above. More generally, when X contains n attributes which satisfy the conditional independence assumption, we have
$$P(X_1 \ldots X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y) \tag{1}$$
Notice that when Y and the Xi are boolean variables, we need only 2n parameters to define P(Xi = xik | Y = yj) for the necessary i, j, k. This is a dramatic reduction compared to the 2(2^n − 1) parameters needed to characterize P(X|Y) if we make no conditional independence assumption.
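For the earlier example of n = 30 boolean features, the arithmetic behind this reduction is:

$$2n = 60 \qquad \text{versus} \qquad 2(2^{30} - 1) = 2{,}147{,}483{,}646 \approx 2.1 \times 10^{9}$$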
Let us now derive the Naive Bayes algorithm, assuming in general that Y is
any discrete-valued variable, and the attributes X1 . . . Xn are any discrete or real-
valued attributes. Our goal is to train a classifier that will output the probability
distribution over possible values of Y , for each new instance X that we ask it to
classify. The expression for the probability that Y will take on its kth possible
value, according to Bayes rule, is
$$P(Y = y_k \mid X_1 \ldots X_n) = \frac{P(Y = y_k)\, P(X_1 \ldots X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1 \ldots X_n \mid Y = y_j)}$$
where the sum is taken over all possible values y j of Y . Now, assuming the Xi are
conditionally independent given Y , we can use equation (1) to rewrite this as
$$P(Y = y_k \mid X_1 \ldots X_n) = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)} \tag{2}$$
Equation (2) is the fundamental equation for the Naive Bayes classifier. Given a new instance X^new = ⟨X1 . . . Xn⟩, this equation shows how to calculate the probability that Y will take on any given value, given the observed attribute values of X^new and given the distributions P(Y) and P(Xi|Y) estimated from the training data. If we are interested only in the most probable value of Y, then we have the Naive Bayes classification rule:
$$Y \leftarrow \arg\max_{y_k} \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$$

which simplifies to the following (because the denominator does not depend on yk):

$$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k) \tag{3}$$
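As an illustration of rule (3), here is a minimal Python sketch that works in log space to avoid underflow when multiplying many per-attribute probabilities; the probability tables are hypothetical placeholders standing in for estimates obtained from training data.

```python
import math

# Hypothetical estimated distributions for a 2-class problem with 3 boolean attributes.
# priors[y] = P(Y = y); likelihood[y][i] = P(X_i = 1 | Y = y).
priors = {0: 0.6, 1: 0.4}
likelihood = {0: [0.2, 0.7, 0.9], 1: [0.8, 0.3, 0.5]}

def naive_bayes_classify(x):
    """Return arg max_y P(Y=y) * prod_i P(X_i = x_i | Y=y), computed in log space."""
    best_y, best_score = None, -math.inf
    for y, prior in priors.items():
        score = math.log(prior)
        for i, x_i in enumerate(x):
            p = likelihood[y][i] if x_i == 1 else 1.0 - likelihood[y][i]
            score += math.log(p)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(naive_bayes_classify([1, 0, 1]))  # -> 1 for these made-up tables
```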
for each attribute Xi and each possible value yk of Y. Note there are 2nK of these parameters, all of which must be estimated independently. Of course we must also estimate the priors on Y:

$$\pi_k = P(Y = y_k) \tag{12}$$
The above model summarizes a Gaussian Naive Bayes classifier, which as-
sumes that the data X is generated by a mixture of class-conditional (i.e., depen-
dent on the value of the class variable Y ) Gaussians. Furthermore, the Naive Bayes
assumption introduces the additional constraint that the attribute values Xi are in-
dependent of one another within each of these mixture components. In particular
problem settings where we have additional information, we might introduce addi-
tional assumptions to further restrict the number of parameters or the complexity
of estimating them. For example, if we have reason to believe that noise in the
observed Xi comes from a common source, then we might further assume that all
of the σik are identical, regardless of the attribute i or class k (see the homework
exercise on this issue).
Again, we can use either maximum likelihood estimates (MLE) or maximum
a posteriori (MAP) estimates for these parameters. The maximum likelihood esti-
mator for µik is
$$\hat{\mu}_{ik} = \frac{1}{\sum_j \delta(Y^j = y_k)} \sum_j X_i^j\, \delta(Y^j = y_k) \tag{13}$$
where the superscript j refers to the jth training example, and where δ(Y = yk ) is
1 if Y = yk and 0 otherwise. Note the role of δ here is to select only those training
examples for which Y = yk .
The maximum likelihood estimator for σ²_ik is

$$\hat{\sigma}^2_{ik} = \frac{1}{\sum_j \delta(Y^j = y_k)} \sum_j \left(X_i^j - \hat{\mu}_{ik}\right)^2 \delta(Y^j = y_k) \tag{14}$$
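The estimators (12)-(14) translate directly into code. A minimal sketch, assuming NumPy arrays X of shape (m, n) and labels y of length m (the example data at the end is made up):

```python
import numpy as np

def gnb_mle(X, y):
    """Maximum likelihood estimates for a Gaussian Naive Bayes model.

    Returns, for each class k: prior pi_k, per-attribute means mu[k, i],
    and per-attribute variances var[k, i], as in equations (12)-(14).
    """
    classes = np.unique(y)
    m, n = X.shape
    pi = np.zeros(len(classes))
    mu = np.zeros((len(classes), n))
    var = np.zeros((len(classes), n))
    for k, c in enumerate(classes):
        mask = (y == c)                                  # delta(Y^j = y_k)
        pi[k] = mask.mean()                              # estimate of pi_k
        mu[k] = X[mask].mean(axis=0)                     # equation (13)
        var[k] = ((X[mask] - mu[k]) ** 2).mean(axis=0)   # equation (14), MLE (biased) variance
    return classes, pi, mu, var

# Example usage with made-up data:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(gnb_mle(X, y)[1:])
```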
3 Logistic Regression
Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X) in the case where Y is discrete-valued, and X = ⟨X1 . . . Xn⟩ is any vector containing discrete or continuous variables. In this section we will primarily consider the case where Y is a boolean variable, in order to simplify notation. In the final subsection we extend our treatment to the case where Y takes on any finite number of discrete values.
Logistic Regression assumes a parametric form for the distribution P(Y |X),
then directly estimates its parameters from the training data. The parametric
model assumed by Logistic Regression in the case where Y is boolean is:
$$P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{16}$$
and
$$P(Y = 0 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{17}$$
Notice that equation (17) follows directly from equation (16), because the sum of
these two probabilities must equal 1.
One highly convenient property of this form for P(Y |X) is that it leads to a
simple linear expression for classification. To classify any given X we generally
want to assign the value yk that maximizes P(Y = yk |X). Put another way, we
assign the label Y = 0 if the following condition holds:
$$1 < \frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)}$$
Figure 1: Form of the logistic function. In Logistic Regression, P(Y |X) is as-
sumed to follow this form.
and taking the natural log of both sides we have a linear classification rule that
assigns label Y = 0 if X satisfies
$$0 < w_0 + \sum_{i=1}^{n} w_i X_i \tag{18}$$
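A minimal sketch of equations (16)-(18): computing P(Y = 1|X) and applying the equivalent linear decision rule. The weight values here are arbitrary placeholders, not learned parameters.

```python
import math

# Hypothetical parameters: w0 is the intercept, w[i] multiplies X_i.
w0 = -1.0
w = [0.5, -2.0, 0.25]

def p_y1_given_x(x):
    """Equation (16): P(Y = 1 | X) = 1 / (1 + exp(w0 + sum_i w_i X_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def classify(x):
    """Equation (18): assign Y = 0 exactly when 0 < w0 + sum_i w_i X_i."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 0 if z > 0 else 1

x = [1.0, 0.2, 3.0]
print(p_y1_given_x(x), classify(x))  # consistent: P(Y=1|X) > 0.5 exactly when classify(x) == 1
```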
Note here we are assuming the standard deviations σi vary from attribute to at-
tribute, but do not depend on Y .
We now derive the parametric form of P(Y |X) that follows from this set of
GNB assumptions. In general, Bayes rule allows us to write
$$P(Y = 1 \mid X) = \frac{P(Y = 1)\, P(X \mid Y = 1)}{P(Y = 1)\, P(X \mid Y = 1) + P(Y = 0)\, P(X \mid Y = 0)}$$
Dividing both the numerator and denominator by the numerator yields:
$$P(Y = 1 \mid X) = \frac{1}{1 + \frac{P(Y = 0)\, P(X \mid Y = 0)}{P(Y = 1)\, P(X \mid Y = 1)}}$$
or equivalently

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(\ln \frac{P(Y = 0)\, P(X \mid Y = 0)}{P(Y = 1)\, P(X \mid Y = 1)}\right)}$$

Applying the Naive Bayes conditional independence assumption, and writing π for P(Y = 1) (so that P(Y = 0) = 1 − π), this becomes

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(\ln \frac{1 - \pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}\right)} \tag{19}$$

Note the final step expresses P(Y = 0) and P(Y = 1) in terms of the binomial parameter π.
Now consider just the summation in the denominator of equation (19). Given
our assumption that P(Xi |Y = yk ) is Gaussian, we can expand this term as follows:
$$\begin{aligned}
\sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}
&= \sum_i \ln \frac{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(\frac{-(X_i - \mu_{i0})^2}{2\sigma_i^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(\frac{-(X_i - \mu_{i1})^2}{2\sigma_i^2}\right)} \\[4pt]
&= \sum_i \ln \exp\left(\frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2}\right) \\[4pt]
&= \sum_i \frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2} \\[4pt]
&= \sum_i \frac{(X_i^2 - 2X_i\mu_{i1} + \mu_{i1}^2) - (X_i^2 - 2X_i\mu_{i0} + \mu_{i0}^2)}{2\sigma_i^2} \\[4pt]
&= \sum_i \frac{2X_i(\mu_{i0} - \mu_{i1}) + \mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \\[4pt]
&= \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)
\end{aligned} \tag{20}$$
Note this expression is a linear weighted sum of the Xi ’s. Substituting expression
(20) back into equation (19), we have
$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(\ln \frac{1 - \pi}{\pi} + \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)\right)} \tag{21}$$
Or equivalently,
$$P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{22}$$
where the weights w1 . . . wn are given by
$$w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}$$
and where
$$w_0 = \ln \frac{1 - \pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}$$
Also we have
$$P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{23}$$
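As a numerical sanity check on this derivation, the sketch below (with arbitrary made-up GNB parameters) computes P(Y = 1|X) two ways: directly from Bayes rule with the class-conditional Gaussians, and via the logistic form (22) with the weights defined above. The two agree up to floating point error.

```python
import math

# Arbitrary GNB parameters for illustration: class prior pi = P(Y=1),
# class-conditional means mu0[i], mu1[i], and per-attribute variances var[i] shared across classes.
pi = 0.3
mu0, mu1 = [0.0, 1.0], [2.0, -1.0]
var = [1.5, 0.5]
x = [1.2, 0.4]

def gauss(xi, mean, v):
    return math.exp(-(xi - mean) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

# Direct Bayes-rule computation of P(Y=1 | X).
num = pi * math.prod(gauss(xi, m, v) for xi, m, v in zip(x, mu1, var))
den = num + (1 - pi) * math.prod(gauss(xi, m, v) for xi, m, v in zip(x, mu0, var))
p_direct = num / den

# Logistic form (22) with the GNB-derived weights.
w = [(m0 - m1) / v for m0, m1, v in zip(mu0, mu1, var)]
w0 = math.log((1 - pi) / pi) + sum((m1**2 - m0**2) / (2 * v) for m0, m1, v in zip(mu0, mu1, var))
p_logistic = 1.0 / (1.0 + math.exp(w0 + sum(wi * xi for wi, xi in zip(w, x))))

print(p_direct, p_logistic)  # identical up to floating point error
```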
One reasonable approach to training Logistic Regression is to choose parameter values that maximize the conditional data likelihood:

$$W \leftarrow \arg\max_W \prod_l P(Y^l \mid X^l, W)$$

where W = ⟨w0, w1 . . . wn⟩ is the vector of parameters to be estimated, Y^l denotes the observed value of Y in the lth training example, and X^l denotes the observed value of X in the lth training example. The expression to the right of the arg max is the conditional data likelihood. Here we include W in the conditional, to emphasize that the expression is a function of the W we are attempting to maximize.
Equivalently, we can work with the log of the conditional likelihood. This conditional data log likelihood, which we will denote l(W), can be written as
$$l(W) = \sum_l Y^l \ln P(Y^l = 1 \mid X^l, W) + (1 - Y^l) \ln P(Y^l = 0 \mid X^l, W)$$
Note here we are utilizing the fact that Y can take only values 0 or 1, so only one of the two terms in the expression will be non-zero for any given Y^l.
To keep our derivation consistent with common usage, we will in this section
flip the assignment of the boolean variable Y so that we assign
$$P(Y = 0 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{24}$$
and
$$P(Y = 1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{25}$$
In this case, we can reexpress the log of the conditional likelihood as

$$l(W) = \sum_l Y^l \left( w_0 + \sum_{i=1}^n w_i X_i^l \right) - \ln\left( 1 + \exp\left( w_0 + \sum_{i=1}^n w_i X_i^l \right) \right)$$

where X_i^l denotes the value of Xi for the lth training example. Note the superscript l is not related to the log likelihood function l(W).
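A minimal sketch of this re-expressed log likelihood under the convention of equations (24) and (25); the small dataset and weights are made up purely for illustration:

```python
import math

def conditional_log_likelihood(w0, w, X, Y):
    """l(W) = sum_l [ Y^l * z_l - ln(1 + exp(z_l)) ], where z_l = w0 + sum_i w_i X_i^l."""
    total = 0.0
    for x, y in zip(X, Y):
        z = w0 + sum(wi * xi for wi, xi in zip(w, x))
        total += y * z - math.log(1.0 + math.exp(z))
    return total

# Made-up training data and weights.
X = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
Y = [1, 0, 1]
print(conditional_log_likelihood(-0.5, [1.0, -0.5], X, Y))
```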
Unfortunately, there is no closed form solution to maximizing l(W) with respect to W. Therefore, one common approach is gradient ascent, which works with the gradient of l(W): the vector of its partial derivatives with respect to the weights. The ith component of the gradient has the form
$$\frac{\partial l(W)}{\partial w_i} = \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right)$$
where P̂(Y^l | X^l, W) is the Logistic Regression prediction using equations (24) and (25) and the weights W. To accommodate weight w0, we assume an imaginary X0 = 1 for all l. This expression for the derivative has an intuitive interpretation: the term inside the parentheses is simply the prediction error; that is, the difference
between the observed Y^l and its predicted probability! Note if Y^l = 1 then we wish for P̂(Y^l = 1|X^l, W) to be 1, whereas if Y^l = 0 then we prefer that P̂(Y^l = 1|X^l, W) be 0 (which makes P̂(Y^l = 0|X^l, W) equal to 1). This error term is multiplied by the value of X_i^l, which accounts for the magnitude of the wi X_i^l term in making this prediction.
Given this formula for the derivative of each wi, we can use standard gradient ascent to optimize the weights W. Beginning with initial weights of zero, we repeatedly update the weights in the direction of the gradient, on each iteration changing every weight wi according to

$$w_i \leftarrow w_i + \eta \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right)$$

where η is a small constant (e.g., 0.01) which determines the step size.
the conditional log likelihood l(W ) is a concave function in W , this gradient ascent
procedure will converge to a global maximum. Gradient ascent is described in
greater detail, for example, in Chapter 4 of Mitchell (1997). In many cases where
computational efficiency is important it is common to use a variant of gradient
ascent called conjugate gradient ascent, which often converges more quickly.
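Putting the pieces together, the following is a minimal gradient ascent sketch for the boolean case under the convention of equations (24) and (25); the step size, iteration count, and toy data are arbitrary choices for illustration, not recommendations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, Y, eta=0.01, iterations=5000):
    """Gradient ascent on l(W): w_i <- w_i + eta * sum_l X_i^l (Y^l - P_hat(Y^l=1|X^l, W)).

    An imaginary X_0 = 1 is prepended to every example to accommodate w_0.
    """
    X = [[1.0] + list(x) for x in X]          # add the constant feature X_0 = 1
    w = [0.0] * len(X[0])                      # begin with all weights zero
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for x, y in zip(X, Y):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p1 = sigmoid(z)                    # P(Y=1|X) = exp(z)/(1+exp(z)), as in (25)
            for i, xi in enumerate(x):
                grad[i] += xi * (y - p1)       # prediction-error form of the derivative
        w = [wi + eta * gi for wi, gi in zip(w, grad)]
    return w

# Toy usage: Y tends to be 1 when the single feature is large.
X = [[0.1], [0.4], [0.9], [1.5], [2.0], [2.4]]
Y = [0, 0, 0, 1, 1, 1]
print(train_logistic_regression(X, Y))
```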
Overfitting can be reduced by adding a regularization term that penalizes large weights, choosing the W that maximizes the penalized conditional log likelihood

$$W \leftarrow \arg\max_W \sum_l \ln P(Y^l \mid X^l, W) - \frac{\lambda}{2} \|W\|^2$$

This can be viewed as a MAP estimate of W under a prior P(W); that is, as maximizing

$$\sum_l \ln P(Y^l \mid X^l, W) + \ln P(W)$$

and if P(W) is a zero mean Gaussian distribution, then ln P(W) yields a term proportional to ||W||².
Given this penalized log likelihood function, it is easy to rederive the gradient ascent rule. The derivative of the penalized log likelihood is the same as before, with one additional term contributed by the penalty:

$$\frac{\partial}{\partial w_i} \left( l(W) - \frac{\lambda}{2}\|W\|^2 \right) = \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right) - \lambda w_i$$
The treatment above extends naturally to the case where Y takes on K > 2 discrete values y1 . . . yK. In this case Logistic Regression assumes, for each j < K,

$$P(Y = y_j \mid X) = \frac{\exp\left( w_{j0} + \sum_{i=1}^n w_{ji} X_i \right)}{1 + \sum_{k=1}^{K-1} \exp\left( w_{k0} + \sum_{i=1}^n w_{ki} X_i \right)}$$

and for the final value yK,

$$P(Y = y_K \mid X) = \frac{1}{1 + \sum_{k=1}^{K-1} \exp\left( w_{k0} + \sum_{i=1}^n w_{ki} X_i \right)}$$

Here wji denotes the weight associated with the jth class Y = yj and with input Xi. It is easy to see that our earlier expressions for the case where Y is boolean (equations (16) and (17)) are a special case of the above expressions. Note also that the form of the expression for P(Y = yK|X) assures that ∑_{k=1}^{K} P(Y = yk|X) = 1.
The primary difference between these expressions and those for boolean Y is
that when Y takes on K possible values, we construct K −1 different linear expres-
sions to capture the distributions for the different values of Y . The distribution for
the final, Kth, value of Y is simply one minus the probabilities of the first K − 1
values.
In this case, the gradient ascent rule with regularization becomes:

$$w_{ji} \leftarrow w_{ji} + \eta \sum_l X_i^l \left( \delta(Y^l = y_j) - \hat{P}(Y^l = y_j \mid X^l, W) \right) - \eta \lambda\, w_{ji} \tag{29}$$
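A sketch of update (29), using the K − 1 weight-vector parameterization described above (NumPy arrays; the data, step size, and λ are made up for illustration):

```python
import numpy as np

def class_probs(W, x):
    """P(Y = y_j | X) for all K classes, given (K-1) weight rows W and x with X_0 = 1 prepended."""
    scores = np.exp(W @ x)                     # exp(w_j0 + sum_i w_ji x_i) for j = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)   # last entry is the Kth class

def regularized_update(W, X, Y, eta=0.1, lam=0.01):
    """One application of update (29) to every weight w_ji (for the first K-1 classes)."""
    K_minus_1, _ = W.shape
    grad = np.zeros_like(W)
    for x, y in zip(X, Y):
        x = np.append(1.0, x)                  # imaginary X_0 = 1 to accommodate w_j0
        p = class_probs(W, x)
        for j in range(K_minus_1):
            delta = 1.0 if y == j else 0.0     # delta(Y^l = y_j)
            grad[j] += x * (delta - p[j])
    return W + eta * grad - eta * lam * W      # gradient step plus the -eta*lambda*w_ji penalty

# Made-up example: 3 classes, 2 attributes, 4 training examples (labels 0, 1, 2).
W = np.zeros((2, 3))                           # K-1 = 2 weight rows, each [w_j0, w_j1, w_j2]
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [0.5, 0.5]])
Y = np.array([0, 1, 2, 0])
for _ in range(100):
    W = regularized_update(W, X, Y)
print(W)
```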
• When the GNB modeling assumptions do not hold, Logistic Regression and
GNB typically learn different classifier functions. In this case, the asymp-
totic (as the number of training examples approaches infinity) classification
accuracy for Logistic Regression is often better than the asymptotic accu-
racy of GNB. Although Logistic Regression is consistent with the Naive
Bayes assumption that the input features Xi are conditionally independent
given Y , it is not rigidly tied to this assumption as is Naive Bayes. Given
data that disobeys this assumption, the conditional likelihood maximization
algorithm for Logistic Regression will adjust its parameters to maximize the
fit to (the conditional likelihood of) the data, even if the resulting parameters
are inconsistent with the Naive Bayes parameter estimates.
• GNB and Logistic Regression converge toward their asymptotic accuracies
at different rates. As Ng & Jordan (2002) show, GNB parameter estimates
converge toward their asymptotic values in a number of examples on the order of log n, where n is the dimension of X. In contrast, Logistic Regression parameter estimates converge more slowly, requiring a number of examples on the order of n. The authors also show
that in several data sets Logistic Regression outperforms GNB when many
training examples are available, but GNB outperforms Logistic Regression
when training data is scarce.
• We can use Bayes rule as the basis for designing learning algorithms (func-
tion approximators), as follows: Given that we wish to learn some target
function f : X → Y , or equivalently, P(Y |X), we use the training data to
learn estimates of P(X|Y ) and P(Y ). New X examples can then be classi-
fied using these estimated probability distributions, plus Bayes rule.
6 Further Reading
Wasserman (2004) describes a Reweighted Least Squares method for Logistic
Regression. Ng and Jordan (2002) provide a theoretical and experimental com-
parison of the Naive Bayes classifier and Logistic Regression.
EXERCISES
1. At the beginning of the chapter we remarked that “A hundred training ex-
amples will usually suffice to obtain an estimate of P(Y ) that is within a
few percent of the correct value.” Describe conditions under which the 95%
confidence interval for our estimate of P(Y ) will be ±0.02.
2. Consider learning a function X → Y where Y is boolean, where X = ⟨X1, X2⟩, and where X1 is a boolean variable and X2 a continuous variable. State the parameters that must be estimated to define a Naive Bayes classifier in this case. Give the formula for computing P(Y|X), in terms of these parameters and the feature values X1 and X2.
3. In section 3 we showed that when Y is Boolean and X = ⟨X1 . . . Xn⟩ is a vector of continuous variables, then the assumptions of the Gaussian Naive Bayes classifier imply that P(Y|X) is given by the logistic function with appropriate parameters W. In particular:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}$$

and

$$P(Y = 0 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}$$

Consider instead the case where Y is Boolean and X = ⟨X1 . . . Xn⟩ is a vector of Boolean variables. Prove for this case also that P(Y|X) follows this same form (and hence that Logistic Regression is also the discriminative counterpart to a Naive Bayes generative classifier over Boolean features).
Hints:
• Simple notation will help. Since the Xi are Boolean variables, you need only one parameter to define P(Xi|Y = yk). Define θi1 ≡ P(Xi = 1|Y = 1), in which case P(Xi = 0|Y = 1) = (1 − θi1). Similarly, use θi0 to denote P(Xi = 1|Y = 0).

• Notice with the above notation you can represent P(Xi|Y = 1) as follows

$$P(X_i \mid Y = 1) = \theta_{i1}^{X_i} (1 - \theta_{i1})^{(1 - X_i)}$$

Note when Xi = 1 the second term is equal to 1 because its exponent is zero. Similarly, when Xi = 0 the first term is equal to 1 because its exponent is zero.
4. (based on a suggestion from Sandra Zilles). This question asks you to con-
sider the relationship between the MAP hypothesis and the Bayes optimal
hypothesis. Consider a hypothesis space H defined over the set of instances
X, and containing just two hypotheses, h1 and h2 with equal prior probabil-
ities P(h1) = P(h2) = 0.5. Suppose we are given an arbitrary set of training
7 Acknowledgements
I very much appreciate receiving helpful comments on earlier drafts of this chapter
from the following: Nathaniel Fairfield, Rainer Gemulla, Vineet Kumar, Andrew
McCallum, Anand Prahlad, Wei Wang, Geoff Webb, and Sandra Zilles.
REFERENCES
Mitchell, T (1997). Machine Learning, McGraw Hill.
Ng, A.Y. & Jordan, M.I. (2002). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. Neural Information Processing Systems.
Wasserman, L. (2004). All of Statistics, Springer-Verlag.