Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression
Machine Learning
Copyright © 2015 Tom M. Mitchell. All rights reserved.
*DRAFT OF September 23, 2017*
This is a rough draft chapter intended for inclusion in the upcoming second edition of the textbook Machine Learning, T.M. Mitchell, McGraw Hill. You are welcome to use this for educational purposes, but do not duplicate or repost it on the internet. For online copies of this and other materials related to this book, visit the web site www.cs.cmu.edu/~tom/mlbook.html.
Please send suggestions for improvements, or suggested exercises, to
[email protected].
$$P(Y = y_i \mid X = x_k) = \frac{P(X = x_k \mid Y = y_i)\, P(Y = y_i)}{\sum_j P(X = x_k \mid Y = y_j)\, P(Y = y_j)}$$
where yi denotes the ith possible value for Y, xk denotes the kth possible vector value for X, and where the summation in the denominator is over all legal values of the random variable Y.
One way to learn P(Y |X) is to use the training data to estimate P(X|Y ) and
P(Y ). We can then use these estimates, together with Bayes rule above, to deter-
mine P(Y |X = xk ) for any new instance xk .
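To make this two-step recipe concrete, here is a minimal Python sketch (with made-up counts for a single discrete attribute): it estimates P(X|Y) and P(Y) by relative frequency and then applies Bayes rule to compute P(Y|X = xk) for a new value.

```python
from collections import Counter

# Toy training data: (x, y) pairs with a discrete x and boolean y.
# These particular values are made up purely for illustration.
data = [("sunny", 1), ("sunny", 1), ("rain", 0), ("rain", 1), ("sunny", 0), ("rain", 0)]

# Estimate P(Y) and P(X|Y) from relative frequencies.
y_counts = Counter(y for _, y in data)
p_y = {y: c / len(data) for y, c in y_counts.items()}
xy_counts = Counter((x, y) for x, y in data)
p_x_given_y = {(x, y): c / y_counts[y] for (x, y), c in xy_counts.items()}

def posterior(x_new):
    """Bayes rule: P(Y=y | X=x) = P(X=x|Y=y) P(Y=y) / sum_j P(X=x|Y=y_j) P(Y=y_j)."""
    joint = {y: p_x_given_y.get((x_new, y), 0.0) * p_y[y] for y in p_y}
    total = sum(joint.values())
    return {y: v / total for y, v in joint.items()}

print(posterior("sunny"))  # e.g. {1: 2/3, 0: 1/3} for this toy data
```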
$$\theta_{ij} \equiv P(X = x_i \mid Y = y_j)$$
where the index i takes on 2^n possible values (one for each of the possible vector values of X), and j takes on 2 possible values. Therefore, we will need to estimate approximately 2^(n+1) parameters. To calculate the exact number of required parameters, note that for any fixed j, the sum over i of θij must be one. Therefore, for any particular value yj, and the 2^n possible values of xi, we need compute only 2^n − 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2^n − 1) such θij parameters. Unfortunately, this corresponds to two
distinct parameters for each of the distinct instances in the instance space for X. Worse yet, to obtain reliable estimates of each of these parameters, we will need to observe each of these distinct instances multiple times!¹ This is clearly unrealistic in most practical learning domains. For example, if X is a vector containing 30 boolean features, then we will need to estimate more than two billion parameters.

¹Why? See Chapter 5 of edition 1 of Machine Learning.
$$\begin{aligned}
P(X \mid Y) &= P(X_1, X_2 \mid Y) \\
&= P(X_1 \mid X_2, Y)\, P(X_2 \mid Y) \\
&= P(X_1 \mid Y)\, P(X_2 \mid Y)
\end{aligned}$$
where the second line follows from a general property of probabilities (the chain rule), and the third line follows directly from our definition of conditional independence above. More generally, when X contains n attributes which satisfy the conditional independence assumption, we have
$$P(X_1 \ldots X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y) \tag{1}$$
Notice that when Y and the Xi are boolean variables, we need only 2n parameters to define P(Xi = xik | Y = yj) for the necessary i, j, k. This is a dramatic reduction compared to the 2(2^n − 1) parameters needed to characterize P(X|Y) if we make no conditional independence assumption.
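For the earlier example of n = 30 boolean features, the arithmetic behind this reduction is:

$$2n = 60 \qquad \text{versus} \qquad 2(2^{30} - 1) = 2{,}147{,}483{,}646 \approx 2.1 \times 10^{9}$$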
Let us now derive the Naive Bayes algorithm, assuming in general that Y is
any discrete-valued variable, and the attributes X1 . . . Xn are any discrete or real-
valued attributes. Our goal is to train a classifier that will output the probability
distribution over possible values of Y , for each new instance X that we ask it to
classify. The expression for the probability that Y will take on its kth possible
value, according to Bayes rule, is
$$P(Y = y_k \mid X_1 \ldots X_n) = \frac{P(Y = y_k)\, P(X_1 \ldots X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1 \ldots X_n \mid Y = y_j)}$$
where the sum is taken over all possible values y j of Y . Now, assuming the Xi are
conditionally independent given Y , we can use equation (1) to rewrite this as
$$P(Y = y_k \mid X_1 \ldots X_n) = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)} \tag{2}$$
Equation (2) is the fundamental equation for the Naive Bayes classifier. Given a new instance X^new = ⟨X1 . . . Xn⟩, this equation shows how to calculate the probability that Y will take on any given value, given the observed attribute values of X^new and given the distributions P(Y) and P(Xi|Y) estimated from the training data. If we are interested only in the most probable value of Y, then we have the Naive Bayes classification rule:
$$Y \leftarrow \arg\max_{y_k} \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$$

which simplifies to the following (because the denominator does not depend on yk):

$$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k) \tag{3}$$
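As an illustration of rule (3), here is a minimal Python sketch that works in log space to avoid underflow when multiplying many per-attribute probabilities; the probability tables are hypothetical placeholders standing in for estimates obtained from training data.

```python
import math

# Hypothetical estimated distributions for a 2-class problem with 3 boolean attributes.
# priors[y] = P(Y = y); likelihood[y][i] = P(X_i = 1 | Y = y).
priors = {0: 0.6, 1: 0.4}
likelihood = {0: [0.2, 0.7, 0.9], 1: [0.8, 0.3, 0.5]}

def naive_bayes_classify(x):
    """Return arg max_y P(Y=y) * prod_i P(X_i = x_i | Y=y), computed in log space."""
    best_y, best_score = None, -math.inf
    for y, prior in priors.items():
        score = math.log(prior)
        for i, x_i in enumerate(x):
            p = likelihood[y][i] if x_i == 1 else 1.0 - likelihood[y][i]
            score += math.log(p)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(naive_bayes_classify([1, 0, 1]))  # -> 1 for these made-up tables
```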
for each attribute Xi and each possible value yk of Y. Note there are 2nK of these parameters, all of which must be estimated independently. Of course we must also estimate the priors on Y:

$$\pi_k = P(Y = y_k) \tag{12}$$
The above model summarizes a Gaussian Naive Bayes classifier, which as-
sumes that the data X is generated by a mixture of class-conditional (i.e., depen-
dent on the value of the class variable Y ) Gaussians. Furthermore, the Naive Bayes
assumption introduces the additional constraint that the attribute values Xi are in-
dependent of one another within each of these mixture components. In particular
problem settings where we have additional information, we might introduce addi-
tional assumptions to further restrict the number of parameters or the complexity
of estimating them. For example, if we have reason to believe that noise in the
observed Xi comes from a common source, then we might further assume that all
of the σik are identical, regardless of the attribute i or class k (see the homework
exercise on this issue).
Again, we can use either maximum likelihood estimates (MLE) or maximum
a posteriori (MAP) estimates for these parameters. The maximum likelihood esti-
mator for µik is
$$\hat{\mu}_{ik} = \frac{1}{\sum_j \delta(Y^j = y_k)} \sum_j X_i^j\, \delta(Y^j = y_k) \tag{13}$$
where the superscript j refers to the jth training example, and where δ(Y = yk ) is
1 if Y = yk and 0 otherwise. Note the role of δ here is to select only those training
examples for which Y = yk .
The maximum likelihood estimator for σ²_ik is

$$\hat{\sigma}^2_{ik} = \frac{1}{\sum_j \delta(Y^j = y_k)} \sum_j \left(X_i^j - \hat{\mu}_{ik}\right)^2 \delta(Y^j = y_k) \tag{14}$$
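The estimators (12)-(14) translate directly into code. A minimal sketch, assuming NumPy arrays X of shape (m, n) and labels y of length m (the example data at the end is made up):

```python
import numpy as np

def gnb_mle(X, y):
    """Maximum likelihood estimates for a Gaussian Naive Bayes model.

    Returns, for each class k: prior pi_k, per-attribute means mu[k, i],
    and per-attribute variances var[k, i], as in equations (12)-(14).
    """
    classes = np.unique(y)
    m, n = X.shape
    pi = np.zeros(len(classes))
    mu = np.zeros((len(classes), n))
    var = np.zeros((len(classes), n))
    for k, c in enumerate(classes):
        mask = (y == c)                                  # delta(Y^j = y_k)
        pi[k] = mask.mean()                              # estimate of pi_k
        mu[k] = X[mask].mean(axis=0)                     # equation (13)
        var[k] = ((X[mask] - mu[k]) ** 2).mean(axis=0)   # equation (14), MLE (biased) variance
    return classes, pi, mu, var

# Example usage with made-up data:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(gnb_mle(X, y)[1:])
```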
3 Logistic Regression
Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X) in the case where Y is discrete-valued, and X = ⟨X1 . . . Xn⟩ is any vector containing discrete or continuous variables. In this section we will primarily consider the case where Y is a boolean variable, in order to simplify notation. In the final subsection we extend our treatment to the case where Y takes on any finite number of discrete values.
Logistic Regression assumes a parametric form for the distribution P(Y |X),
then directly estimates its parameters from the training data. The parametric
model assumed by Logistic Regression in the case where Y is boolean is:
$$P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{16}$$
and
$$P(Y = 0 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{17}$$
Notice that equation (17) follows directly from equation (16), because the sum of
these two probabilities must equal 1.
One highly convenient property of this form for P(Y |X) is that it leads to a
simple linear expression for classification. To classify any given X we generally
want to assign the value yk that maximizes P(Y = yk |X). Put another way, we
assign the label Y = 0 if the following condition holds:
$$1 < \frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)}$$
Figure 1: Form of the logistic function. In Logistic Regression, P(Y |X) is as-
sumed to follow this form.
and taking the natural log of both sides we have a linear classification rule that
assigns label Y = 0 if X satisfies
$$0 < w_0 + \sum_{i=1}^{n} w_i X_i \tag{18}$$
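A minimal sketch of equations (16)-(18): computing P(Y = 1|X) and applying the equivalent linear decision rule. The weight values here are arbitrary placeholders, not learned parameters.

```python
import math

# Hypothetical parameters: w0 is the intercept, w[i] multiplies X_i.
w0 = -1.0
w = [0.5, -2.0, 0.25]

def p_y1_given_x(x):
    """Equation (16): P(Y = 1 | X) = 1 / (1 + exp(w0 + sum_i w_i X_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def classify(x):
    """Equation (18): assign Y = 0 exactly when 0 < w0 + sum_i w_i X_i."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 0 if z > 0 else 1

x = [1.0, 0.2, 3.0]
print(p_y1_given_x(x), classify(x))  # consistent: P(Y=1|X) > 0.5 exactly when classify(x) == 1
```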
Note here we are assuming the standard deviations σi vary from attribute to at-
tribute, but do not depend on Y .
We now derive the parametric form of P(Y |X) that follows from this set of
GNB assumptions. In general, Bayes rule allows us to write
$$P(Y = 1 \mid X) = \frac{P(Y = 1)\, P(X \mid Y = 1)}{P(Y = 1)\, P(X \mid Y = 1) + P(Y = 0)\, P(X \mid Y = 0)}$$
Dividing both the numerator and denominator by the numerator yields:
$$P(Y = 1 \mid X) = \frac{1}{1 + \frac{P(Y = 0)\, P(X \mid Y = 0)}{P(Y = 1)\, P(X \mid Y = 1)}}$$
or equivalently

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(\ln \frac{P(Y = 0)\, P(X \mid Y = 0)}{P(Y = 1)\, P(X \mid Y = 1)}\right)}$$

Applying the Naive Bayes conditional independence assumption, and writing π for P(Y = 1) (so that P(Y = 0) = 1 − π), this becomes

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(\ln \frac{1 - \pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}\right)} \tag{19}$$

Note the final step expresses P(Y = 0) and P(Y = 1) in terms of the binomial parameter π.
Now consider just the summation in the denominator of equation (19). Given
our assumption that P(Xi |Y = yk ) is Gaussian, we can expand this term as follows:
$$\begin{aligned}
\sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}
&= \sum_i \ln \frac{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(\frac{-(X_i - \mu_{i0})^2}{2\sigma_i^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(\frac{-(X_i - \mu_{i1})^2}{2\sigma_i^2}\right)} \\[4pt]
&= \sum_i \ln \exp\left(\frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2}\right) \\[4pt]
&= \sum_i \frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2} \\[4pt]
&= \sum_i \frac{(X_i^2 - 2X_i\mu_{i1} + \mu_{i1}^2) - (X_i^2 - 2X_i\mu_{i0} + \mu_{i0}^2)}{2\sigma_i^2} \\[4pt]
&= \sum_i \frac{2X_i(\mu_{i0} - \mu_{i1}) + \mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \\[4pt]
&= \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)
\end{aligned} \tag{20}$$
Note this expression is a linear weighted sum of the Xi ’s. Substituting expression
(20) back into equation (19), we have
$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(\ln \frac{1 - \pi}{\pi} + \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)\right)} \tag{21}$$
Or equivalently,
$$P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{22}$$
where the weights w1 . . . wn are given by
$$w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}$$
and where
$$w_0 = \ln \frac{1 - \pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}$$
Also we have
$$P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{23}$$
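As a numerical sanity check on this derivation, the sketch below (with arbitrary made-up GNB parameters) computes P(Y = 1|X) two ways: directly from Bayes rule with the class-conditional Gaussians, and via the logistic form (22) with the weights defined above. The two agree up to floating point error.

```python
import math

# Arbitrary GNB parameters for illustration: class prior pi = P(Y=1),
# class-conditional means mu0[i], mu1[i], and per-attribute variances var[i] shared across classes.
pi = 0.3
mu0, mu1 = [0.0, 1.0], [2.0, -1.0]
var = [1.5, 0.5]
x = [1.2, 0.4]

def gauss(xi, mean, v):
    return math.exp(-(xi - mean) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

# Direct Bayes-rule computation of P(Y=1 | X).
num = pi * math.prod(gauss(xi, m, v) for xi, m, v in zip(x, mu1, var))
den = num + (1 - pi) * math.prod(gauss(xi, m, v) for xi, m, v in zip(x, mu0, var))
p_direct = num / den

# Logistic form (22) with the GNB-derived weights.
w = [(m0 - m1) / v for m0, m1, v in zip(mu0, mu1, var)]
w0 = math.log((1 - pi) / pi) + sum((m1**2 - m0**2) / (2 * v) for m0, m1, v in zip(mu0, mu1, var))
p_logistic = 1.0 / (1.0 + math.exp(w0 + sum(wi * xi for wi, xi in zip(w, x))))

print(p_direct, p_logistic)  # identical up to floating point error
```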
One reasonable approach to training Logistic Regression is to choose parameter values that maximize the conditional data likelihood:

$$W \leftarrow \arg\max_W \prod_l P(Y^l \mid X^l, W)$$

where W = ⟨w0, w1 . . . wn⟩ is the vector of parameters to be estimated, Y^l denotes the observed value of Y in the lth training example, and X^l denotes the observed value of X in the lth training example. The expression to the right of the arg max is the conditional data likelihood. Here we include W in the conditional, to emphasize that the expression is a function of the W we are attempting to maximize.
Equivalently, we can work with the log of the conditional likelihood. This conditional data log likelihood, which we will denote l(W), can be written as
$$l(W) = \sum_l Y^l \ln P(Y^l = 1 \mid X^l, W) + (1 - Y^l) \ln P(Y^l = 0 \mid X^l, W)$$
Note here we are utilizing the fact that Y can take only values 0 or 1, so only one of the two terms in the expression will be non-zero for any given Y^l.
To keep our derivation consistent with common usage, we will in this section
flip the assignment of the boolean variable Y so that we assign
$$P(Y = 0 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{24}$$
and
$$P(Y = 1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)} \tag{25}$$
In this case, we can reexpress the log of the conditional likelihood as

$$l(W) = \sum_l Y^l \left( w_0 + \sum_{i=1}^n w_i X_i^l \right) - \ln\left( 1 + \exp\left( w_0 + \sum_{i=1}^n w_i X_i^l \right) \right)$$

where X_i^l denotes the value of Xi for the lth training example. Note the superscript l is not related to the log likelihood function l(W).
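A minimal sketch of this re-expressed log likelihood under the convention of equations (24) and (25); the small dataset and weights are made up purely for illustration:

```python
import math

def conditional_log_likelihood(w0, w, X, Y):
    """l(W) = sum_l [ Y^l * z_l - ln(1 + exp(z_l)) ], where z_l = w0 + sum_i w_i X_i^l."""
    total = 0.0
    for x, y in zip(X, Y):
        z = w0 + sum(wi * xi for wi, xi in zip(w, x))
        total += y * z - math.log(1.0 + math.exp(z))
    return total

# Made-up training data and weights.
X = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
Y = [1, 0, 1]
print(conditional_log_likelihood(-0.5, [1.0, -0.5], X, Y))
```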
Unfortunately, there is no closed form solution to maximizing l(W) with respect to W. Therefore, one common approach is gradient ascent, which works with the gradient of l(W): the vector of its partial derivatives with respect to the weights. The ith component of the gradient has the form
$$\frac{\partial l(W)}{\partial w_i} = \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right)$$
where P̂(Y^l | X^l, W) is the Logistic Regression prediction using equations (24) and (25) and the weights W. To accommodate weight w0, we assume an imaginary X0 = 1 for all l. This expression for the derivative has an intuitive interpretation: the term inside the parentheses is simply the prediction error; that is, the difference
between the observed Y^l and its predicted probability! Note if Y^l = 1 then we wish for P̂(Y^l = 1|X^l, W) to be 1, whereas if Y^l = 0 then we prefer that P̂(Y^l = 1|X^l, W) be 0 (which makes P̂(Y^l = 0|X^l, W) equal to 1). This error term is multiplied by the value of X_i^l, which accounts for the magnitude of the wi X_i^l term in making this prediction.
Given this formula for the derivative of each wi, we can use standard gradient ascent to optimize the weights W. Beginning with initial weights of zero, we repeatedly update the weights in the direction of the gradient, on each iteration changing every weight wi according to

$$w_i \leftarrow w_i + \eta \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right)$$

where η is a small constant (e.g., 0.01) which determines the step size.
the conditional log likelihood l(W ) is a concave function in W , this gradient ascent
procedure will converge to a global maximum. Gradient ascent is described in
greater detail, for example, in Chapter 4 of Mitchell (1997). In many cases where
computational efficiency is important it is common to use a variant of gradient
ascent called conjugate gradient ascent, which often converges more quickly.
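Putting the pieces together, the following is a minimal gradient ascent sketch for the boolean case under the convention of equations (24) and (25); the step size, iteration count, and toy data are arbitrary choices for illustration, not recommendations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, Y, eta=0.01, iterations=5000):
    """Gradient ascent on l(W): w_i <- w_i + eta * sum_l X_i^l (Y^l - P_hat(Y^l=1|X^l, W)).

    An imaginary X_0 = 1 is prepended to every example to accommodate w_0.
    """
    X = [[1.0] + list(x) for x in X]          # add the constant feature X_0 = 1
    w = [0.0] * len(X[0])                      # begin with all weights zero
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for x, y in zip(X, Y):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p1 = sigmoid(z)                    # P(Y=1|X) = exp(z)/(1+exp(z)), as in (25)
            for i, xi in enumerate(x):
                grad[i] += xi * (y - p1)       # prediction-error form of the derivative
        w = [wi + eta * gi for wi, gi in zip(w, grad)]
    return w

# Toy usage: Y tends to be 1 when the single feature is large.
X = [[0.1], [0.4], [0.9], [1.5], [2.0], [2.4]]
Y = [0, 0, 0, 1, 1, 1]
print(train_logistic_regression(X, Y))
```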
Overfitting can be reduced by adding a regularization term that penalizes large weights, choosing the W that maximizes the penalized conditional log likelihood

$$W \leftarrow \arg\max_W \sum_l \ln P(Y^l \mid X^l, W) - \frac{\lambda}{2} \|W\|^2$$

This can be viewed as a MAP estimate of W under a prior P(W); that is, as maximizing

$$\sum_l \ln P(Y^l \mid X^l, W) + \ln P(W)$$

and if P(W) is a zero mean Gaussian distribution, then ln P(W) yields a term proportional to ||W||².
Given this penalized log likelihood function, it is easy to rederive the gradient ascent rule. The derivative of the penalized log likelihood is the same as before, with one additional term contributed by the penalty:

$$\frac{\partial}{\partial w_i} \left( l(W) - \frac{\lambda}{2}\|W\|^2 \right) = \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right) - \lambda w_i$$
The treatment above extends naturally to the case where Y takes on K > 2 discrete values y1 . . . yK. In this case Logistic Regression assumes, for each j < K,

$$P(Y = y_j \mid X) = \frac{\exp\left( w_{j0} + \sum_{i=1}^n w_{ji} X_i \right)}{1 + \sum_{k=1}^{K-1} \exp\left( w_{k0} + \sum_{i=1}^n w_{ki} X_i \right)}$$

and for the final value yK,

$$P(Y = y_K \mid X) = \frac{1}{1 + \sum_{k=1}^{K-1} \exp\left( w_{k0} + \sum_{i=1}^n w_{ki} X_i \right)}$$

Here wji denotes the weight associated with the jth class Y = yj and with input Xi. It is easy to see that our earlier expressions for the case where Y is boolean (equations (16) and (17)) are a special case of the above expressions. Note also that the form of the expression for P(Y = yK|X) assures that ∑_{k=1}^{K} P(Y = yk|X) = 1.
The primary difference between these expressions and those for boolean Y is
that when Y takes on K possible values, we construct K −1 different linear expres-
sions to capture the distributions for the different values of Y . The distribution for
the final, Kth, value of Y is simply one minus the probabilities of the first K − 1
values.
In this case, the gradient ascent rule with regularization becomes:

$$w_{ji} \leftarrow w_{ji} + \eta \sum_l X_i^l \left( \delta(Y^l = y_j) - \hat{P}(Y^l = y_j \mid X^l, W) \right) - \eta \lambda\, w_{ji} \tag{29}$$
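A sketch of update (29), using the K − 1 weight-vector parameterization described above (NumPy arrays; the data, step size, and λ are made up for illustration):

```python
import numpy as np

def class_probs(W, x):
    """P(Y = y_j | X) for all K classes, given (K-1) weight rows W and x with X_0 = 1 prepended."""
    scores = np.exp(W @ x)                     # exp(w_j0 + sum_i w_ji x_i) for j = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)   # last entry is the Kth class

def regularized_update(W, X, Y, eta=0.1, lam=0.01):
    """One application of update (29) to every weight w_ji (for the first K-1 classes)."""
    K_minus_1, _ = W.shape
    grad = np.zeros_like(W)
    for x, y in zip(X, Y):
        x = np.append(1.0, x)                  # imaginary X_0 = 1 to accommodate w_j0
        p = class_probs(W, x)
        for j in range(K_minus_1):
            delta = 1.0 if y == j else 0.0     # delta(Y^l = y_j)
            grad[j] += x * (delta - p[j])
    return W + eta * grad - eta * lam * W      # gradient step plus the -eta*lambda*w_ji penalty

# Made-up example: 3 classes, 2 attributes, 4 training examples (labels 0, 1, 2).
W = np.zeros((2, 3))                           # K-1 = 2 weight rows, each [w_j0, w_j1, w_j2]
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [0.5, 0.5]])
Y = np.array([0, 1, 2, 0])
for _ in range(100):
    W = regularized_update(W, X, Y)
print(W)
```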
• When the GNB modeling assumptions do not hold, Logistic Regression and
GNB typically learn different classifier functions. In this case, the asymp-
totic (as the number of training examples approaches infinity) classification
accuracy for Logistic Regression is often better than the asymptotic accu-
racy of GNB. Although Logistic Regression is consistent with the Naive
Bayes assumption that the input features Xi are conditionally independent
given Y , it is not rigidly tied to this assumption as is Naive Bayes. Given
data that disobeys this assumption, the conditional likelihood maximization
algorithm for Logistic Regression will adjust its parameters to maximize the
fit to (the conditional likelihood of) the data, even if the resulting parameters
are inconsistent with the Naive Bayes parameter estimates.
• GNB and Logistic Regression converge toward their asymptotic accuracies
at different rates. As Ng & Jordan (2002) show, GNB parameter estimates
converge toward their asymptotic values in a number of examples on the order of log n, where n is the dimension of X. In contrast, Logistic Regression parameter estimates converge more slowly, requiring a number of examples on the order of n. The authors also show
that in several data sets Logistic Regression outperforms GNB when many
training examples are available, but GNB outperforms Logistic Regression
when training data is scarce.
• We can use Bayes rule as the basis for designing learning algorithms (func-
tion approximators), as follows: Given that we wish to learn some target
function f : X → Y , or equivalently, P(Y |X), we use the training data to
learn estimates of P(X|Y ) and P(Y ). New X examples can then be classi-
fied using these estimated probability distributions, plus Bayes rule.
6 Further Reading
Wasserman (2004) describes a Reweighted Least Squares method for Logistic
Regression. Ng and Jordan (2002) provide a theoretical and experimental com-
parison of the Naive Bayes classifier and Logistic Regression.
EXERCISES
1. At the beginning of the chapter we remarked that “A hundred training ex-
amples will usually suffice to obtain an estimate of P(Y ) that is within a
few percent of the correct value.” Describe conditions under which the 95%
confidence interval for our estimate of P(Y ) will be ±0.02.
2. Consider learning a function X → Y where Y is boolean, where X = ⟨X1, X2⟩, and where X1 is a boolean variable and X2 a continuous variable. State the parameters that must be estimated to define a Naive Bayes classifier in this case. Give the formula for computing P(Y|X), in terms of these parameters and the feature values X1 and X2.
3. In section 3 we showed that when Y is Boolean and X = ⟨X1 . . . Xn⟩ is a vector of continuous variables, then the assumptions of the Gaussian Naive Bayes classifier imply that P(Y|X) is given by the logistic function with appropriate parameters W. In particular:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}$$

and

$$P(Y = 0 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}$$

Consider instead the case where Y is Boolean and X = ⟨X1 . . . Xn⟩ is a vector of Boolean variables. Prove for this case also that P(Y|X) follows this same form (and hence that Logistic Regression is also the discriminative counterpart to a Naive Bayes generative classifier over Boolean features).
Hints:
• Simple notation will help. Since the Xi are Boolean variables, you need only one parameter to define P(Xi|Y = yk). Define θi1 ≡ P(Xi = 1|Y = 1), in which case P(Xi = 0|Y = 1) = (1 − θi1). Similarly, use θi0 to denote P(Xi = 1|Y = 0).

• Notice with the above notation you can represent P(Xi|Y = 1) as follows

$$P(X_i \mid Y = 1) = \theta_{i1}^{X_i} (1 - \theta_{i1})^{(1 - X_i)}$$

Note when Xi = 1 the second term is equal to 1 because its exponent is zero. Similarly, when Xi = 0 the first term is equal to 1 because its exponent is zero.
4. (based on a suggestion from Sandra Zilles). This question asks you to con-
sider the relationship between the MAP hypothesis and the Bayes optimal
hypothesis. Consider a hypothesis space H defined over the set of instances
X, and containing just two hypotheses, h1 and h2 with equal prior probabil-
ities P(h1) = P(h2) = 0.5. Suppose we are given an arbitrary set of training
7 Acknowledgements
I very much appreciate receiving helpful comments on earlier drafts of this chapter
from the following: Nathaniel Fairfield, Rainer Gemulla, Vineet Kumar, Andrew
McCallum, Anand Prahlad, Wei Wang, Geoff Webb, and Sandra Zilles.
REFERENCES
Mitchell, T (1997). Machine Learning, McGraw Hill.
Ng, A.Y. & Jordan, M.I. (2002). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. Neural Information Processing Systems.
Wasserman, L. (2004). All of Statistics, Springer-Verlag.