ML - Unit-3 Chapter - 6 (Bayes Theorem) - Notes

Machine learning

Topic-1

INTRODUCTION

Topic-2

BAYES THEOREM

• Bayes theorem addresses the problem of determining the best hypothesis from some space H, given the observed training data D.
• By the "best" hypothesis we mean the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.
• Bayes theorem provides a direct method for calculating such probabilities.
• Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.

Notations:
• P(h) denotes the initial probability that hypothesis h holds, before we have observed the training data.
• P(h) is often called the prior probability of h and may reflect any background knowledge that h is a correct hypothesis.
• If we have no such prior knowledge, then we might simply assign the same prior probability to each candidate hypothesis.
• P(D) denotes the prior probability that training data D will be observed (i.e., the probability of D given no knowledge about which hypothesis holds).
• P(D|h) denotes the probability of observing data D given some world in which hypothesis h holds.
• P(x|y) denotes the probability of x given y.
• P(h|D) denotes the probability that h holds given the observed training data D.
• P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D.
Topic-3

BAYES THEOREM AND CONCEPT LEARNING

• Assume the sequence of instances <x1…xm> is fixed, so the training data D can be written as the sequence of target values D = <d1…dm>.
• A brute-force Bayesian learner that computes the posterior probability of every hypothesis may prove impractical for large hypothesis spaces.
• Example: the concept learning algorithm FIND-S searches the hypothesis space H from specific to general hypotheses, outputting a maximally specific consistent hypothesis (i.e., a maximally specific member of the version space); under suitable probability distributions P(h) and P(D|h), this output is a MAP hypothesis.
• So far we have considered the case where P(D|h) takes on values of only 0 and 1, i.e., the assumption of noise-free training data.
• We can also model learning from noisy training data, by allowing P(D|h) to take on values other than 0 and 1, and by introducing into P(D|h) additional assumptions about the probability distributions that govern the noise.
• Next we consider the problem of learning a continuous-valued target function - a problem faced by many learning approaches such as neural network learning, linear regression, and polynomial curve fitting.
Consider the following problem setting.

Topic-4

MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES

• Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X (i.e., each h in H is a function of the form h : X → R, where R represents the set of real numbers).
• The problem faced by L is to learn an unknown target function f : X → R drawn from H.
• A set of m training examples is provided, where the target value of each example is corrupted by random noise drawn according to a Normal probability distribution.
• Each training example is a pair of the form (xi, di) where di = f(xi) + ei.
  f(xi) is the noise-free value of the target function; ei is a random variable representing the noise.
• It is assumed that the values of the ei are drawn independently and that they are distributed according to a Normal distribution with zero mean.
• The task of the learner is to output the maximum likelihood hypothesis hML or, equivalently, a MAP hypothesis, assuming all hypotheses are equally probable a priori.
• We consider a simple example of this problem, i.e., learning a linear function (see the sketch below).
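A minimal sketch in Python (not from the notes) of the linear-function example under the stated assumptions: the target values are corrupted by zero-mean Normal noise, so the least-squares fit computed below is the maximum likelihood hypothesis hML. The target function f(x) = 2x + 1 and the noise level are hypothetical choices for illustration only.

# A minimal sketch: learning a linear function from noisy targets.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical noise-free target function, chosen only for illustration.
    return 2.0 * x + 1.0

m = 30                                        # number of training examples
x = rng.uniform(0.0, 10.0, size=m)            # instances x_i
e = rng.normal(loc=0.0, scale=1.5, size=m)    # noise e_i drawn from Normal(0, sigma)
d = f(x) + e                                  # observed targets d_i = f(x_i) + e_i

# Least-squares fit of a degree-1 polynomial, i.e., the linear hypothesis h
# minimizing sum_i (d_i - h(x_i))^2; under Normal noise this is hML.
slope, intercept = np.polyfit(x, d, deg=1)
print(f"hML(x) is approximately {slope:.2f}*x + {intercept:.2f}")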

We review two basic concepts from probability theory: probability densities and Normal distributions.

Probability densities
• First, to discuss probabilities over continuous variables such as e, we must introduce probability densities.
• Reason: we wish the total probability over all possible values of the random variable to sum to one. In the case of continuous variables we cannot achieve this with finite probabilities.
• Solution: we use a probability density for continuous variables such as e and require that the integral of this probability density over all possible values be one.

Probability density function:
• p (lower case) refers to the probability density function.
• P (upper case) denotes a finite probability, or probability mass.

Normal distributions
• Second, the random noise variable e is generated by a Normal probability distribution, so we use the Normal distribution to describe p(di|h).
• A Normal distribution is a smooth, bell-shaped distribution that can be completely characterized by its mean (µ) and its standard deviation (σ).

Why is a hypothesis that minimizes the sum of squared errors in this setting also a maximum likelihood hypothesis?
• We show this by deriving the maximum likelihood hypothesis, using p (lower case) to refer to the probability density.
• [Figure: a linear target function learned from noisy training examples. The dashed line corresponds to the hypothesis hML with least-squared training error, hence the maximum likelihood hypothesis.]
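In compressed form, the standard derivation (assuming the di are independent given h and each ei is Normal with zero mean and variance σ²) runs:

h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}}

Taking the logarithm and discarding terms that do not depend on h,

h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{(d_i - h(x_i))^2}{2\sigma^2} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2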
• hML might not be the MAP hypothesis, but if one assumes uniform prior probabilities over the hypotheses then it is.

Why is it reasonable to choose the Normal distribution to characterize noise?
• It allows for a mathematically straightforward analysis.
• The smooth, bell-shaped distribution is a good approximation to many types of noise in physical systems.
• Minimizing the sum of squared errors is a common approach in many neural network, curve-fitting, and other approaches to approximating real-valued functions.

Limitations of the relationship between hML and the least-squared error hypothesis:
• The above analysis considers noise only in the target value of the training example and does not consider noise in the attributes describing the instances themselves.

So far we have determined that hML is the hypothesis that minimizes the sum of squared errors over the training examples. In the next topic we consider a setting that is common in neural network learning - learning to predict probabilities.
Topic-5

MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING PROBABILITIES

How can we learn f' (the probability function f'(x) = P(f(x) = 1)) using a neural network?
1. Brute force: first collect the observed frequencies of 1's and 0's for each possible value of x, and then train the neural network to output the target frequency for each x.
2. Instead, train a neural network directly from the observed training examples of f, and derive hML for f'. Here we apply an analysis similar to that used for the maximum likelihood, least-squared error hypotheses.

The two weight update rules converge toward hML in two different settings.
• The rule that minimizes the sum of squared errors seeks hML under the assumption that the training data can be modeled by Normally distributed noise added to the target function value.
• The rule that minimizes cross entropy seeks hML under the assumption that the observed boolean value is a probabilistic function of the input instance.
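For reference, in the standard presentation of this setting, where each observed boolean value di is a probabilistic function of xi, the maximum likelihood hypothesis is

h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[ d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)) \right]

and the negation of the summed quantity is the cross entropy that the corresponding weight update rule minimizes.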

Topic-6

MINIMUM DESCRIPTION LENGTH PRINCIPLE
Topic-7

BAYES OPTIMAL CLASSIFIER


For example, when learning boolean concepts using version spaces, the Bayes optimal classification of a new instance is obtained by taking a weighted vote among all members of the version space, with each candidate hypothesis weighted by its posterior probability.
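For reference, the standard form of the Bayes optimal classification of a new instance, over possible classifications vj in V, is

\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)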

Topic-8

GIBBS ALGORITHM

• Given a new instance to classify, the Gibbs algorithm applies a hypothesis drawn at random according to the current posterior probability distribution over H.
• Surprisingly, under certain conditions the expected misclassification error of the Gibbs algorithm is at most twice the expected error of the Bayes optimal classifier.
• Under this condition, classifying a new instance according to a hypothesis drawn at random from the current version space (under a uniform distribution) incurs an expected error at most twice that of the Bayes optimal classifier.

Topic-9

NAIVE BAYES CLASSIFIER

Estimating probabilities
• So far we have estimated probabilities by the fraction of times the event is observed to occur over the total number of opportunities, i.e., by the ratio nc/n.
• This observed fraction provides a good estimate of the probability in many cases, but it provides poor estimates when nc is very small.
• This raises two difficulties:
  1. nc/n produces a biased underestimate of the probability.
  2. When this estimate happens to be zero, it dominates the Bayes classifier.
     Reason: the quantity calculated in Equation (3) requires multiplying all the other probability terms by this zero value.
• To address these difficulties we adopt a Bayesian approach to estimating the probability (see below); in the absence of other information we assume a uniform prior for p.
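For reference, in standard treatments the equation referenced above corresponds to the naive Bayes classification rule, and the smoothed probability estimate is the m-estimate, where n is the number of training examples with target value vj, nc the number of those examples with attribute value ai, p the prior estimate of the probability, and m the equivalent sample size:

v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)

\text{m-estimate of } P(a_i \mid v_j): \quad \frac{n_c + m\,p}{n + m}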


Topic-10

AN EXAMPLE: LEARNING TO CLASSIFY TEXT

• We consider the practical problem of learning to classify natural language text documents, which illustrates the practical importance of Bayesian learning methods.
• Consider learning problems in which the instances are text documents.
• For example, we might learn the target concept "electronic news articles that I find interesting," or "pages on the World Wide Web that discuss ML topics."
• In both cases, if a computer could learn the target concept accurately, it could automatically filter the large volume of online text documents to present only the most relevant documents to the user.
• We use a general algorithm for learning to classify text, based on the naive Bayes classifier.

Problem setting
• Consider an instance space X consisting of all possible text documents (i.e., all possible strings of words and punctuation of all possible lengths).
• We are given training examples of some unknown target function f(x), which can take on any value from some finite set V.
• The task is to learn from these training examples to predict the target value for subsequent text documents.
• We consider the target function that classifies documents as interesting or uninteresting to a particular person, using the target values like and dislike to indicate these two classes.

Two main design issues
1. How to represent an arbitrary text document in terms of attribute values.
2. How to estimate the probabilities required by the naive Bayes classifier.

Representing documents
• Given a text document, i.e., a paragraph, we define an attribute for each word position in the document and define the value of that attribute to be the English word found in that position.
• Thus, the current paragraph would be described by 111 attribute values, corresponding to the 111 word positions.
• The value of the first attribute is the word "our," the value of the second attribute is the word "approach," and so on.

Applying the naive Bayes classifier
• Assume we are given a set of 700 training documents that a friend has classified as dislike and another 300 she has classified as like.
• Q. We are now given a new document and asked to classify it.
• The naive Bayes classification vNB is the classification that maximizes the probability of observing the words that were actually found in the document, subject to the usual naive Bayes independence assumption.

Independence assumption
• States that the word probabilities for one text position are independent of the words that occur in other positions, given the document classification vj.
• This assumption is clearly incorrect.
  Example: the probability of observing the word "learning" in some position may be greater if the preceding word is "machine."
• Despite the inaccuracy of this independence assumption, we have little choice but to make it: without it, the number of probability terms that must be computed is prohibitive.
• In practice the naive Bayes learner performs remarkably well in many text classification problems despite the incorrectness of this independence assumption.

Estimating the probabilities
• Unfortunately, there are 50,000 distinct words in the English vocabulary, 2 possible target values, and 111 text positions in the current example, so we must estimate 2 × 111 × 50,000 ≈ 10 million such terms from the training data.
• Fortunately, we can make an additional reasonable assumption that reduces the number of probabilities that must be estimated.
• Assume the probability of encountering a specific word wk (e.g., "chocolate") is independent of the specific word position being considered (e.g., a23 versus a95).
• With this assumption, we estimate one position-independent term P(wk|vj) for each word and each target value, rather than one term per word position.
• The resulting algorithm uses the naive Bayes classifier together with the assumption that the probability of word occurrence is independent of position within the text (see the sketch below).

Classifying usenet news articles
• A minor variant of this algorithm was applied to the problem of classifying usenet news articles.
• The target classification: find the name of the usenet newsgroup in which the article appeared.
• 20 electronic newsgroups were considered.
• 1,000 articles were collected from each newsgroup, forming a data set of 20,000 documents.
• The naive Bayes algorithm was then applied with one exception: only a subset of the words occurring in the documents were included as the value of the Vocabulary variable in the algorithm.
• The 100 most frequent words were removed (words such as "the" and "of"), and any word occurring fewer than 3 times was also removed. The resulting vocabulary contained approximately 38,500 words.
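A minimal sketch in Python (not the textbook procedure verbatim) of the approach just described: position-independent word probabilities estimated with an m-estimate, here with m equal to the vocabulary size and p = 1/|Vocabulary|. The toy documents and the like/dislike labels are hypothetical.

# Naive Bayes text classification with position-independent word probabilities.
import math
from collections import Counter

def train(docs_by_class):
    # docs_by_class: dict mapping class label v -> list of documents (word lists).
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    priors, word_probs = {}, {}
    for v, docs in docs_by_class.items():
        priors[v] = len(docs) / total_docs                 # estimate of P(v)
        counts = Counter(w for doc in docs for w in doc)   # n_k for each word
        n = sum(counts.values())                           # total word positions in class v
        # m-estimate with m = |vocab| and p = 1/|vocab|
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return priors, word_probs, vocab

def classify(doc, priors, word_probs, vocab):
    # Return the class v maximizing log P(v) + sum over positions of log P(w|v).
    best_v, best_score = None, float("-inf")
    for v in priors:
        score = math.log(priors[v])
        for w in doc:
            if w in vocab:                                 # skip words never seen in training
                score += math.log(word_probs[v][w])
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# Hypothetical toy usage with the like/dislike labels from the notes' example.
training = {
    "like":    [["great", "approach", "to", "learning"], ["interesting", "article"]],
    "dislike": [["boring", "article"], ["not", "interesting", "to", "me"]],
}
priors, word_probs, vocab = train(training)
print(classify(["interesting", "learning", "article"], priors, word_probs, vocab))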
NEWSWEEDER
• Another variant of the naive Bayes algorithm, applied to learning this kind of target concept, is the NEWSWEEDER system - a program for reading netnews that allows the user to rate articles as he/she reads them.
• NEWSWEEDER then uses these rated articles as training examples to learn to predict which subsequent articles will be of interest to the user, so that it can bring these to the user's attention.
• NEWSWEEDER used its learned profile of user interests to suggest the most highly rated new articles each day.

Topic-11

BAYESIAN BELIEF NETWORKS (BBN)

Representation
• A BBN describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities.
• The probability distribution over the joint space of variable values is called the joint probability distribution.

Conditional independence
• Let us understand the idea of conditional independence.
• Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z, i.e., P(X|Y, Z) = P(X|Z).

Inference
• We might wish to use a Bayesian network to infer the value of some target variable (e.g., ForestFire) given the observed values of the other variables.
• What we really wish to infer is the probability distribution for the target variable, which specifies the probability that it will take on each of its possible values given the observed values of the other variables.
• This inference step can be straightforward if values for all of the other variables in the network are known exactly.
• In the more general case we may wish to infer the probability distribution for some variable (e.g., ForestFire) given observed values for only a subset of the other variables (e.g., Thunder and BusTourGroup may be the only observed values available).
• In general, a Bayesian network can be used to compute the probability distribution for any subset of network variables given the values or distributions for any subset of the remaining variables.
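For reference, in the standard formulation a Bayesian network over variables Y1, …, Yn represents the joint probability distribution as

P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P\big(y_i \mid Parents(Y_i)\big)

where Parents(Yi) denotes the immediate predecessors of Yi in the network and the values P(yi | Parents(Yi)) are the entries of the conditional probability table associated with Yi.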

Learning Bayesian belief networks
Q. Can we devise effective algorithms for learning Bayesian belief networks from training data?
• When some variables are unobservable, this problem is similar to learning the weights for the hidden units in an ANN, where the input and output node values are given but the hidden unit values are left unspecified by the training examples.

Solution
• A gradient ascent procedure learns the entries in the conditional probability tables.
• The gradient ascent procedure searches through a space of hypotheses that corresponds to the set of all possible entries for the conditional probability tables.
• The objective function maximized during gradient ascent is the probability P(D|h) of the observed training data D given the hypothesis h.
• This corresponds to searching for the maximum likelihood hypothesis hML for the table entries.
• The gradient ascent rule given by Russell et al. maximizes P(D|h) by following the gradient of ln P(D|h) with respect to the parameters that define the conditional probability tables of the Bayesian network.

Learning the structure of Bayesian networks
• Learning Bayesian networks when the network structure is not known in advance is also difficult.
• Cooper and Herskovits present a Bayesian scoring metric for choosing among alternative networks.
• They also present a heuristic search algorithm called K2 for learning network structure when the data is fully observable.
• For learning the structure of Bayesian networks, K2 performs a greedy search.
• Example application: finding potential anesthesia problems in a hospital operating room.
• Constraint-based approaches to learning Bayesian network structure have also been developed.
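For reference, in the standard presentation of this gradient ascent (following Russell et al.), each conditional probability table entry wijk = P(Yi = yij | Ui = uik), where Ui denotes the parents of Yi and η is a small learning-rate constant, is updated as

w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}}

after which the wijk are renormalized so that they remain valid probabilities (non-negative and summing to one over j).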
Topic-12

THE EM ALGORITHM
(Expectation-Maximization)

• Basis for the widely used Baum-Welch forward-backward algorithm for learning partially observable Markov models.
• The EM algorithm was described by Dempster et al. (1977).
• Problem setting: the data involve a mixture of k different Normal distributions, and we cannot observe which instances were generated by which distribution, i.e., the problem involves hidden variables.
• Example: estimating the means of a mixture of two Normal distributions (see below).
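In the standard two-means version of this example (a mixture of two Normal distributions with a known, shared variance σ² and unknown means h = <µ1, µ2>, where zij indicates whether instance xi was generated by the j-th Normal), EM repeats two steps until convergence:

Step 1 (Estimation): compute the expected value of each hidden variable, assuming the current hypothesis holds:

E[z_{ij}] = \frac{e^{-(x_i - \mu_j)^2 / 2\sigma^2}}{\sum_{n=1}^{2} e^{-(x_i - \mu_n)^2 / 2\sigma^2}}

Step 2 (Maximization): compute a new maximum likelihood hypothesis, assuming the expected values just computed:

\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}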
