ML - Unit-3 Chapter - 6 (Bayes Theorem) - Notes

Machine learning

Topic-1

INTRODUCTION

Topic-2

BAYES THEOREM

• Bayes theorem addresses the problem of determining the best hypothesis from some space H, given the observed training data D.
• By the "best" hypothesis we mean the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.
• Bayes theorem provides a direct method for calculating such probabilities.
• Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.

Notations:
• P(h) denotes the initial probability that hypothesis h holds, before we have observed the training data.
• P(h) is often called the prior probability of h and may reflect any background knowledge that h is a correct hypothesis.
• If we have no such prior knowledge, then we might simply assign the same prior probability to each candidate hypothesis.
• P(D) denotes the prior probability that training data D will be observed (i.e., the probability of D given no knowledge about which hypothesis holds).
• P(D|h) denotes the probability of observing data D given some world in which hypothesis h holds.
• P(x|y) denotes the probability of x given y.
• P(h|D) denotes the probability that h holds given the observed training data D.
• P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D.
Topic-3

BAYES THEOREM AND CONCEPT LEARNING

• Assume the sequence of instances <x1…xm> is fixed, so the training data D can be written as the sequence of target values D = <d1…dm>.
• A brute-force Bayesian learner that computes the posterior probability of every hypothesis may prove impractical for large hypothesis spaces.
• Example: the concept learning algorithm FIND-S searches the hypothesis space H from specific to general hypotheses, outputting a maximally specific consistent hypothesis (i.e., a maximally specific member of the version space); under suitable probability distributions P(h) and P(D|h), this output is a MAP hypothesis.
• So far we have considered the case where P(D|h) takes on values of only 0 and 1, i.e., the assumption of noise-free training data.
• We can also model learning from noisy training data, by allowing P(D|h) to take on values other than 0 and 1, and by introducing into P(D|h) additional assumptions about the probability distributions that govern the noise.
• Next we consider the problem of learning a continuous-valued target function - a problem faced by many learning approaches such as neural network learning, linear regression, and polynomial curve fitting.
Consider the following problem setting.

Topic-4

MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES

• Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X (i.e., each h in H is a function of the form h : X → R, where R represents the set of real numbers).
• The problem faced by L is to learn an unknown target function f : X → R drawn from H.
• A set of m training examples is provided, where the target value of each example is corrupted by random noise drawn according to a Normal probability distribution.
• Each training example is a pair of the form (xi, di) where di = f(xi) + ei.
  f(xi) is the noise-free value of the target function; ei is a random variable representing the noise.
• It is assumed that the values of the ei are drawn independently and that they are distributed according to a Normal distribution with zero mean.
• The task of the learner is to output the maximum likelihood hypothesis hML or, equivalently, a MAP hypothesis, assuming all hypotheses are equally probable a priori.
• We consider a simple example of this problem, i.e., learning a linear function (see the sketch below).
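A minimal sketch in Python (not from the notes) of the linear-function example under the stated assumptions: the target values are corrupted by zero-mean Normal noise, so the least-squares fit computed below is the maximum likelihood hypothesis hML. The target function f(x) = 2x + 1 and the noise level are hypothetical choices for illustration only.

# A minimal sketch: learning a linear function from noisy targets.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical noise-free target function, chosen only for illustration.
    return 2.0 * x + 1.0

m = 30                                        # number of training examples
x = rng.uniform(0.0, 10.0, size=m)            # instances x_i
e = rng.normal(loc=0.0, scale=1.5, size=m)    # noise e_i drawn from Normal(0, sigma)
d = f(x) + e                                  # observed targets d_i = f(x_i) + e_i

# Least-squares fit of a degree-1 polynomial, i.e., the linear hypothesis h
# minimizing sum_i (d_i - h(x_i))^2; under Normal noise this is hML.
slope, intercept = np.polyfit(x, d, deg=1)
print(f"hML(x) is approximately {slope:.2f}*x + {intercept:.2f}")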

We review two basic concepts from probability theory: probability densities and Normal distributions.

Probability densities
• First, to discuss probabilities over continuous variables such as e, we must introduce probability densities.
• Reason: we wish the total probability over all possible values of the random variable to sum to one. In the case of continuous variables we cannot achieve this with finite probabilities.
• Solution: we use a probability density for continuous variables such as e and require that the integral of this probability density over all possible values be one.

Probability density function:
• p (lower case) refers to the probability density function.
• P (upper case) denotes a finite probability, or probability mass.

Normal distributions
• Second, the random noise variable e is generated by a Normal probability distribution, so we use the Normal distribution to describe p(di|h).
• A Normal distribution is a smooth, bell-shaped distribution that can be completely characterized by its mean (µ) and its standard deviation (σ).

Why is a hypothesis that minimizes the sum of squared errors in this setting also a maximum likelihood hypothesis?
• We show this by deriving the maximum likelihood hypothesis, using p (lower case) to refer to the probability density.
• [Figure: a linear target function learned from noisy training examples. The dashed line corresponds to the hypothesis hML with least-squared training error, hence the maximum likelihood hypothesis.]
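In compressed form, the standard derivation (assuming the di are independent given h and each ei is Normal with zero mean and variance σ²) runs:

h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}}

Taking the logarithm and discarding terms that do not depend on h,

h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{(d_i - h(x_i))^2}{2\sigma^2} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2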
• hML might not be the MAP hypothesis, but if one assumes uniform prior probabilities over the hypotheses then it is.

Why is it reasonable to choose the Normal distribution to characterize noise?
• It allows for a mathematically straightforward analysis.
• The smooth, bell-shaped distribution is a good approximation to many types of noise in physical systems.
• Minimizing the sum of squared errors is a common approach in many neural network, curve-fitting, and other approaches to approximating real-valued functions.

Limitations of the relationship between hML and the least-squared error hypothesis:
• The above analysis considers noise only in the target value of the training example and does not consider noise in the attributes describing the instances themselves.

So far we have determined that hML is the hypothesis that minimizes the sum of squared errors over the training examples. In the next topic we consider a setting that is common in neural network learning - learning to predict probabilities.
Topic-5

MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING PROBABILITIES

How can we learn f' (the probability function f'(x) = P(f(x) = 1)) using a neural network?
1. Brute force: first collect the observed frequencies of 1's and 0's for each possible value of x, and then train the neural network to output the target frequency for each x.
2. Instead, train a neural network directly from the observed training examples of f, and derive hML for f'. Here we apply an analysis similar to that used for the maximum likelihood, least-squared error hypotheses.

The two weight update rules converge toward hML in two different settings.
• The rule that minimizes the sum of squared errors seeks hML under the assumption that the training data can be modeled by Normally distributed noise added to the target function value.
• The rule that minimizes cross entropy seeks hML under the assumption that the observed boolean value is a probabilistic function of the input instance.
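For reference, in the standard presentation of this setting, where each observed boolean value di is a probabilistic function of xi, the maximum likelihood hypothesis is

h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[ d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)) \right]

and the negation of the summed quantity is the cross entropy that the corresponding weight update rule minimizes.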

Topic-6

MINIMUM DESCRIPTION LENGTH PRINCIPLE
Topic-7

BAYES OPTIMAL CLASSIFIER


For example, when learning boolean concepts using version spaces, the Bayes optimal classification of a new instance is obtained by taking a weighted vote among all members of the version space, with each candidate hypothesis weighted by its posterior probability.
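For reference, the standard form of the Bayes optimal classification of a new instance, over possible classifications vj in V, is

\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)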

Topic-8

GIBBS ALGORITHM

• Given a new instance to classify, the Gibbs algorithm applies a hypothesis drawn at random according to the current posterior probability distribution over H.
• Surprisingly, under certain conditions the expected misclassification error of the Gibbs algorithm is at most twice the expected error of the Bayes optimal classifier.
• Under this condition, classifying a new instance according to a hypothesis drawn at random from the current version space (under a uniform distribution) incurs an expected error at most twice that of the Bayes optimal classifier.

Topic-9

NAIVE BAYES CLASSIFIER

Estimating probabilities
• So far we have estimated probabilities by the fraction of times the event is observed to occur over the total number of opportunities, i.e., by the ratio nc/n.
• This observed fraction provides a good estimate of the probability in many cases, but it provides poor estimates when nc is very small.
• This raises two difficulties:
  1. nc/n produces a biased underestimate of the probability.
  2. When this estimate happens to be zero, it dominates the Bayes classifier.
     Reason: the quantity calculated in Equation (3) requires multiplying all the other probability terms by this zero value.
• To address these difficulties we adopt a Bayesian approach to estimating the probability (see below); in the absence of other information we assume a uniform prior for p.
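For reference, in standard treatments the equation referenced above corresponds to the naive Bayes classification rule, and the smoothed probability estimate is the m-estimate, where n is the number of training examples with target value vj, nc the number of those examples with attribute value ai, p the prior estimate of the probability, and m the equivalent sample size:

v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)

\text{m-estimate of } P(a_i \mid v_j): \quad \frac{n_c + m\,p}{n + m}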


Topic-10

AN EXAMPLE: LEARNING TO CLASSIFY TEXT

• We consider the practical problem of learning to classify natural language text documents, which illustrates the practical importance of Bayesian learning methods.
• Consider learning problems in which the instances are text documents.
• For example, we might learn the target concept "electronic news articles that I find interesting," or "pages on the World Wide Web that discuss ML topics."
• In both cases, if a computer could learn the target concept accurately, it could automatically filter the large volume of online text documents to present only the most relevant documents to the user.
• We use a general algorithm for learning to classify text, based on the naive Bayes classifier.

Problem setting
• Consider an instance space X consisting of all possible text documents (i.e., all possible strings of words and punctuation of all possible lengths).
• We are given training examples of some unknown target function f(x), which can take on any value from some finite set V.
• The task is to learn from these training examples to predict the target value for subsequent text documents.
• We consider the target function that classifies documents as interesting or uninteresting to a particular person, using the target values like and dislike to indicate these two classes.

Two main design issues
1. How to represent an arbitrary text document in terms of attribute values.
2. How to estimate the probabilities required by the naive Bayes classifier.

Representing documents
• Given a text document, i.e., a paragraph, we define an attribute for each word position in the document and define the value of that attribute to be the English word found in that position.
• Thus, the current paragraph would be described by 111 attribute values, corresponding to the 111 word positions.
• The value of the first attribute is the word "our," the value of the second attribute is the word "approach," and so on.

Applying the naive Bayes classifier
• Assume we are given a set of 700 training documents that a friend has classified as dislike and another 300 she has classified as like.
• Q. We are now given a new document and asked to classify it.
• The naive Bayes classification vNB is the classification that maximizes the probability of observing the words that were actually found in the document, subject to the usual naive Bayes independence assumption.

Independence assumption
• States that the word probabilities for one text position are independent of the words that occur in other positions, given the document classification vj.
• This assumption is clearly incorrect.
  Example: the probability of observing the word "learning" in some position may be greater if the preceding word is "machine."
• Despite the inaccuracy of this independence assumption, we have little choice but to make it: without it, the number of probability terms that must be computed is prohibitive.
• In practice the naive Bayes learner performs remarkably well in many text classification problems despite the incorrectness of this independence assumption.

Estimating the probabilities
• Unfortunately, there are 50,000 distinct words in the English vocabulary, 2 possible target values, and 111 text positions in the current example, so we must estimate 2 × 111 × 50,000 ≈ 10 million such terms from the training data.
• Fortunately, we can make an additional reasonable assumption that reduces the number of probabilities that must be estimated.
• Assume the probability of encountering a specific word wk (e.g., "chocolate") is independent of the specific word position being considered (e.g., a23 versus a95).
• With this assumption, we estimate one position-independent term P(wk|vj) for each word and each target value, rather than one term per word position.
• The resulting algorithm uses the naive Bayes classifier together with the assumption that the probability of word occurrence is independent of position within the text (see the sketch below).

Classifying usenet news articles
• A minor variant of this algorithm was applied to the problem of classifying usenet news articles.
• The target classification: find the name of the usenet newsgroup in which the article appeared.
• 20 electronic newsgroups were considered.
• 1,000 articles were collected from each newsgroup, forming a data set of 20,000 documents.
• The naive Bayes algorithm was then applied with one exception: only a subset of the words occurring in the documents were included as the value of the Vocabulary variable in the algorithm.
• The 100 most frequent words were removed (words such as "the" and "of"), and any word occurring fewer than 3 times was also removed. The resulting vocabulary contained approximately 38,500 words.
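A minimal sketch in Python (not the textbook procedure verbatim) of the approach just described: position-independent word probabilities estimated with an m-estimate, here with m equal to the vocabulary size and p = 1/|Vocabulary|. The toy documents and the like/dislike labels are hypothetical.

# Naive Bayes text classification with position-independent word probabilities.
import math
from collections import Counter

def train(docs_by_class):
    # docs_by_class: dict mapping class label v -> list of documents (word lists).
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    priors, word_probs = {}, {}
    for v, docs in docs_by_class.items():
        priors[v] = len(docs) / total_docs                 # estimate of P(v)
        counts = Counter(w for doc in docs for w in doc)   # n_k for each word
        n = sum(counts.values())                           # total word positions in class v
        # m-estimate with m = |vocab| and p = 1/|vocab|
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return priors, word_probs, vocab

def classify(doc, priors, word_probs, vocab):
    # Return the class v maximizing log P(v) + sum over positions of log P(w|v).
    best_v, best_score = None, float("-inf")
    for v in priors:
        score = math.log(priors[v])
        for w in doc:
            if w in vocab:                                 # skip words never seen in training
                score += math.log(word_probs[v][w])
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# Hypothetical toy usage with the like/dislike labels from the notes' example.
training = {
    "like":    [["great", "approach", "to", "learning"], ["interesting", "article"]],
    "dislike": [["boring", "article"], ["not", "interesting", "to", "me"]],
}
priors, word_probs, vocab = train(training)
print(classify(["interesting", "learning", "article"], priors, word_probs, vocab))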
NEWSWEEDER
• Another variant of the naive Bayes algorithm, applied to learning this kind of target concept, is the NEWSWEEDER system - a program for reading netnews that allows the user to rate articles as he/she reads them.
• NEWSWEEDER then uses these rated articles as training examples to learn to predict which subsequent articles will be of interest to the user, so that it can bring these to the user's attention.
• NEWSWEEDER used its learned profile of user interests to suggest the most highly rated new articles each day.

Topic-11

BAYESIAN BELIEF NETWORKS (BBN)

Representation
• A BBN describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities.
• The probability distribution over the joint space of variable values is called the joint probability distribution.

Conditional independence
• Let us understand the idea of conditional independence.
• Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z, i.e., P(X|Y, Z) = P(X|Z).

Inference
• We might wish to use a Bayesian network to infer the value of some target variable (e.g., ForestFire) given the observed values of the other variables.
• What we really wish to infer is the probability distribution for the target variable, which specifies the probability that it will take on each of its possible values given the observed values of the other variables.
• This inference step can be straightforward if values for all of the other variables in the network are known exactly.
• In the more general case we may wish to infer the probability distribution for some variable (e.g., ForestFire) given observed values for only a subset of the other variables (e.g., Thunder and BusTourGroup may be the only observed values available).
• In general, a Bayesian network can be used to compute the probability distribution for any subset of network variables given the values or distributions for any subset of the remaining variables.
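For reference, in the standard formulation a Bayesian network over variables Y1, …, Yn represents the joint probability distribution as

P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P\big(y_i \mid Parents(Y_i)\big)

where Parents(Yi) denotes the immediate predecessors of Yi in the network and the values P(yi | Parents(Yi)) are the entries of the conditional probability table associated with Yi.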

Learning Bayesian belief networks
Q. Can we devise effective algorithms for learning Bayesian belief networks from training data?
• When some variables are unobservable, this problem is similar to learning the weights for the hidden units in an ANN, where the input and output node values are given but the hidden unit values are left unspecified by the training examples.

Solution
• A gradient ascent procedure learns the entries in the conditional probability tables.
• The gradient ascent procedure searches through a space of hypotheses that corresponds to the set of all possible entries for the conditional probability tables.
• The objective function maximized during gradient ascent is the probability P(D|h) of the observed training data D given the hypothesis h.
• This corresponds to searching for the maximum likelihood hypothesis hML for the table entries.
• The gradient ascent rule given by Russell et al. maximizes P(D|h) by following the gradient of ln P(D|h) with respect to the parameters that define the conditional probability tables of the Bayesian network.

Learning the structure of Bayesian networks
• Learning Bayesian networks when the network structure is not known in advance is also difficult.
• Cooper and Herskovits present a Bayesian scoring metric for choosing among alternative networks.
• They also present a heuristic search algorithm called K2 for learning network structure when the data is fully observable.
• For learning the structure of Bayesian networks, K2 performs a greedy search.
• Example application: finding potential anesthesia problems in a hospital operating room.
• Constraint-based approaches to learning Bayesian network structure have also been developed.
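For reference, in the standard presentation of this gradient ascent (following Russell et al.), each conditional probability table entry wijk = P(Yi = yij | Ui = uik), where Ui denotes the parents of Yi and η is a small learning-rate constant, is updated as

w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}}

after which the wijk are renormalized so that they remain valid probabilities (non-negative and summing to one over j).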
Topic-12

THE EM ALGORITHM
(Expectation-Maximization)

• Basis for the widely used Baum-Welch forward-backward algorithm for learning partially observable Markov models.
• The EM algorithm was described by Dempster et al. (1977).
• Problem setting: the data involve a mixture of k different Normal distributions, and we cannot observe which instances were generated by which distribution, i.e., the problem involves hidden variables.
• Example: estimating the means of a mixture of two Normal distributions (see below).
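In the standard two-means version of this example (a mixture of two Normal distributions with a known, shared variance σ² and unknown means h = <µ1, µ2>, where zij indicates whether instance xi was generated by the j-th Normal), EM repeats two steps until convergence:

Step 1 (Estimation): compute the expected value of each hidden variable, assuming the current hypothesis holds:

E[z_{ij}] = \frac{e^{-(x_i - \mu_j)^2 / 2\sigma^2}}{\sum_{n=1}^{2} e^{-(x_i - \mu_n)^2 / 2\sigma^2}}

Step 2 (Maximization): compute a new maximum likelihood hypothesis, assuming the expected values just computed:

\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}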
