
MODULE - 4

1. Define Bayes theorem. What are the relevance and features of Bayes theorem? Explain the
practical difficulties of applying Bayes theorem.
SOL
Bayes theorem provides a way to calculate the probability of a hypothesis based on its
prior probability, the probabilities of observing various data given the hypothesis, and
the observed data itself.
Notations:
• P(h) prior probability of h, reflects any background knowledge about the chance that
h is correct
• P(D) prior probability of D, probability that D will be observed
• P(D|h) probability of observing D given a world in which h holds
• P(h|D) posterior probability of h, reflects confidence that h holds after D has been observed.
Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to
calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D)
and P(D|h):

P(h|D) = P(D|h) P(h) / P(D)

• P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
• P(h|D) decreases as P(D) increases, because the more probable it is that D will be
observed independent of h, the less evidence D provides in support of h.
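A minimal numeric sketch in Python, using invented probabilities purely for illustration, showing how these three quantities combine to give the posterior:

# Hypothetical numbers chosen only to illustrate Bayes theorem.
p_h = 0.3          # P(h): prior probability of hypothesis h
p_d_given_h = 0.8  # P(D|h): probability of observing data D if h holds
p_d = 0.5          # P(D): prior probability of observing D

# Bayes theorem: P(h|D) = P(D|h) * P(h) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)  # 0.48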

Features of Bayesian Learning Methods


• Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to
learning than algorithms that completely eliminate a hypothesis if it is found to be
inconsistent with any single example
• Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting (1) a prior probability for each candidate hypothesis, and (2) a probability
distribution over observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions
• New instances can be classified by combining the predictions of multiple hypotheses,
weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical methods
can be measured.

Practical difficulty in applying Bayesian methods


1. One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities. When these probabilities are not known in
advance they are often estimated based on background knowledge, previously available
data, and assumptions about the form of the underlying distributions.
2. A second practical difficulty is the significant computational cost required to
determine the Bayes optimal hypothesis in the general case. In certain specialized
situations, this computational cost can be significantly reduced

2. Explain Bayes' theorem and its application in machine learning.


SOL
BAYES THEOREM
Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior
probability, the probabilities of observing various data given the hypothesis, and the observed
data itself. Notations,
• P(h) prior probability of h, reflects any background knowledge about the chance that h is
correct
• P(D) prior probability of D, probability that D will be observed
• P(D|h) probability of observing D given a world in which h holds
• P(h|D) posterior probability of h, reflects confidence that h holds after D has been observed
Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to
calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D)
and P(D|h):

P(h|D) = P(D|h) P(h) / P(D)

• P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
• P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed
independent of h, the less evidence D provides in support of h.
Bayes’ Theorem provides a way to update the probability of a hypothesis, event, or condition A
being true after taking into account new evidence or information B. It calculates the revised
probability of A given B by relating the probability of B given A, the initial probability of A, and
the probability of B.
Bayesian inference is very important and has found application in various activities, including
medicine, science, philosophy, engineering, sports, law, etc. Some of the most common
applications of Bayes' theorem in real life are:


• Spam Filtering
• Weather Forecasting
• DNA Testing
• Financial Forecasting
• Fault Diagnosis in Engineering
• Drug Testing

3. How does Bayes' theorem apply to concept learning? OR Analyze bayes theorem with
appropriate examples
SOL
Bayes theorem provides a principled way to calculate the posterior probability of each
hypothesis given the training data, and we can use it as the basis for a straightforward learning
algorithm that calculates the probability of each possible hypothesis and outputs the most
probable.
Brute-Force Bayes Concept Learning: Consider the concept learning problem below.
• Assume the learner considers some finite hypothesis space H defined over the instance space
X, in which the task is to learn some target concept c : X → {0,1}.
• Learner is given some sequence of training examples ((x1, d1) . . . (xm, dm)) where xi is some
instance from X and where di is the target value of xi (i.e., di = c(xi)).
• The sequence of target values are written as D = (d1 . . . dm).
BRUTE-FORCE MAP LEARNING algorithm:
1. For each hypothesis h in H, calculate the posterior probability

P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability

hMAP = argmax_{h∈H} P(h|D)
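A minimal Python sketch of this brute-force algorithm over a small hypothetical hypothesis space; the priors and likelihoods below are assumed values for illustration, not taken from the text:

# Hypothetical hypothesis space: each hypothesis has an assumed prior P(h)
# and likelihood P(D|h) for the observed training data D.
hypotheses = {
    "h1": {"prior": 0.2, "likelihood": 0.9},
    "h2": {"prior": 0.5, "likelihood": 0.4},
    "h3": {"prior": 0.3, "likelihood": 0.6},
}

# Step 1: posterior P(h|D) = P(D|h) * P(h) / P(D); P(D) is the normalising constant.
p_d = sum(v["likelihood"] * v["prior"] for v in hypotheses.values())
posteriors = {h: v["likelihood"] * v["prior"] / p_d for h, v in hypotheses.items()}

# Step 2: output the hypothesis with the highest posterior probability.
h_map = max(posteriors, key=posteriors.get)
print(posteriors, h_map)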

4. Define Maximum A Posteriori (MAP) and Maximum Likelihood (ML) hypotheses. Derive the
relation for hMAP and hML using Bayes theorem.
SOL
Maximum a Posteriori (MAP) Hypothesis
• In many learning scenarios, the learner considers some set of candidate hypotheses H and is
interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such
maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
• Using Bayes theorem to calculate the posterior probability of each candidate hypothesis,
hMAP is a MAP hypothesis provided

hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h)

• In the final step P(D) can be dropped, because it is a constant independent of h.

Maximum Likelihood (ML) Hypothesis
• In some cases, it is assumed that every hypothesis in H is equally probable a priori (P(hi) =
P(hj) for all hi and hj in H).
• In this case the equation for hMAP can be simplified, and we need only consider the term
P(D|h) to find the most probable hypothesis:

hML = argmax_{h∈H} P(D|h)

P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes
P(D|h) is called a maximum likelihood (ML) hypothesis.
5. Define Maximum Likelihood (ML) and Least Squares (LS) error hypothesis.
SOL
Consider the problem of learning a continuous-valued target function such as neural
network learning, linear regression, and polynomial curve fitting.
A straightforward Bayesian analysis will show that under certain assumptions any
learning algorithm that minimizes the squared error between the output hypothesis
predictions and the training data will output a maximum likelihood (ML) hypothesis
• Learner L considers an instance space X and a hypothesis space H consisting of some
class of real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R] and training
examples of the form <xi ,di>
• The problem faced by L is to learn an unknown target function f : X → R
• A set of m training examples is provided, where the target value of each example is
corrupted by random noise drawn according to a Normal probability distribution with
zero mean (di = f(xi) + ei)
• Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
– Here f(xi) is the noise-free value of the target function and ei is a random variable
representing the noise.
– It is assumed that the values of the ei are drawn independently and that they are
distributed according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis or a MAP
hypothesis assuming all hypotheses are equally probable a priori.

Using the definition of hML we have

hML = argmax_{h∈H} p(D|h)

Assuming the training examples are mutually independent given h, we can write p(D|h) as
the product of the various p(di|h):

hML = argmax_{h∈H} ∏_{i=1}^{m} p(di|h)

Given that the noise ei obeys a Normal distribution with zero mean and unknown variance
σ², each di must also obey a Normal distribution around the true target value f(xi).
Because we are writing the expression for p(D|h), we assume h is the correct
description of f. Hence, µ = f(xi) = h(xi), and

hML = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) exp(−(di − h(xi))² / (2σ²))

We maximize the less complicated logarithm, which is justified because ln p is a monotonic
function of p:

hML = argmax_{h∈H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]

The first term in this expression is a constant independent of h, and can therefore be
discarded, yielding

hML = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))² / (2σ²)

Maximizing this negative quantity is equivalent to minimizing the corresponding
positive quantity:

hML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))² / (2σ²)

Finally, we discard constants that are independent of h:

hML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²

Thus, the above equation shows that the maximum likelihood hypothesis hML is the one that
minimizes the sum of the squared errors between the observed training values di and
the hypothesis predictions h(xi).
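A short Python sketch, under the assumptions above (a linear target corrupted by zero-mean Gaussian noise, both invented for illustration), showing that the least-squares fit plays the role of the maximum likelihood hypothesis:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
f = 2.0 * x + 1.0                           # noise-free target f(x) (assumed for illustration)
d = f + rng.normal(0.0, 1.0, size=x.shape)  # d_i = f(x_i) + e_i, with e_i ~ N(0, sigma^2)

# Least-squares fit of a linear hypothesis h(x) = w1*x + w0:
w1, w0 = np.polyfit(x, d, deg=1)            # minimizes sum_i (d_i - h(x_i))^2

# Under the Gaussian-noise assumption, this least-squares hypothesis is h_ML.
print(w1, w0)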

6. Discuss the role of ML hypothesis in predicting probabilities.


SOL
• Consider the setting in which we wish to learn a nondeterministic (probabilistic)
function f : X → {0, 1}, which has two discrete output values.
• We want a function approximator whose output is the probability that f(x) = 1. In
other words, learn the target function f' : X → [0, 1] such that f'(x) = P(f(x) = 1).

How can we learn f' using a neural network?


• One brute-force way would be to first collect the observed frequencies of 1's and 0's
for each possible value of x and then train the neural network to output the target
frequency for each x.

What criterion should we optimize in order to find a maximum likelihood hypothesis for
f' in this setting?
• First obtain an expression for P(D|h)
• Assume the training data D is of the form D = {(x1, d1) . . . (xm, dm)}, where di is the
observed 0 or 1 value for f (xi).
• Treating both xi and di as random variables, and assuming that each training example is drawn
independently, we can write P(D|h) as

P(D|h) = ∏_{i=1}^{m} P(xi, di | h) = ∏_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di) P(xi)

Maximizing this (equivalently, its logarithm) leads to the cross-entropy quantity

G(h, D) = Σ_{i=1}^{m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]

This quantity, Equation (7), describes what must be maximized in order to obtain the
maximum likelihood hypothesis in our current problem setting.

7. Explain the Minimum Description Length (MDL) principle. Obtain the equation for hMDL.
SOL
MINIMUM DESCRIPTION LENGTH PRINCIPLE
• A Bayesian perspective on Occam’s razor
• Motivated by interpreting the definition of hMAP in the light of basic concepts from
information theory:

hMAP = argmax_{h∈H} P(D|h) P(h)

which can be equivalently expressed in terms of maximizing the log2:

hMAP = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]

or alternatively, minimizing the negative of this quantity:

hMAP = argmin_{h∈H} [ −log2 P(D|h) − log2 P(h) ]    ... (1)
This equation (1) can be interpreted as a statement that short hypotheses are preferred,
assuming a particular representation scheme for encoding hypotheses and data
• −log2 P(h): the description length of h under the optimal encoding for the hypothesis space H,
L_CH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H.
• −log2 P(D|h): the description length of the training data D given hypothesis h, under its
optimal encoding, L_CD|h(D|h) = −log2 P(D|h), where CD|h is the optimal code for describing
data D assuming that both the sender and receiver know the hypothesis h.
• Rewriting Equation (1) shows that hMAP is the hypothesis h that minimizes the sum given by
the description length of the hypothesis plus the description length of the data given the
hypothesis:

hMAP = argmin_{h∈H} [ L_CH(h) + L_CD|h(D|h) ]

where CH and CD|h are the optimal encodings for H and for D given h.
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that
minimizes the sum of these two description lengths.

Minimum Description Length principle:

hMDL = argmin_{h∈H} [ L_C1(h) + L_C2(D|h) ]

where codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis,
respectively.
The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses CH, and
if we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP.
8. Describe the Naive Bayes Classifier algorithm and its working mechanism. OR Build the
algorithm flow of the naive Bayes classifier and give an example use case where Bernoulli
naive Bayes can be applied.
SOL
The naive Bayes classifier applies to learning tasks where each instance x is described
by a conjunction of attribute values and where the target function f (x) can take on any
value from some finite set V.
• A set of training examples of the target function is provided, and a new instance is
presented, described by the tuple of attribute values (a1, a2, ..., am).
• The learner is asked to predict the target value, or classification, for this new instance.
The Bayesian approach to classifying the new instance is to assign the most probable
target value, vMAP, given the attribute values (a1, a2, ..., am) that describe the instance:

vMAP = argmax_{vj∈V} P(vj | a1, a2, ..., am)

Using Bayes theorem, we can rewrite this expression as

vMAP = argmax_{vj∈V} P(a1, a2, ..., am | vj) P(vj) / P(a1, a2, ..., am)
     = argmax_{vj∈V} P(a1, a2, ..., am | vj) P(vj)    ... (1)

• The naive Bayes classifier is based on the assumption that the attribute values are
conditionally independent given the target value. That is, given the target value of the
instance, the probability of observing the conjunction (a1, a2, ..., am) is just the product
of the probabilities for the individual attributes:

P(a1, a2, ..., am | vj) = ∏_i P(ai | vj)

Substituting this into Equation (1) gives the naive Bayes classifier:

vNB = argmax_{vj∈V} P(vj) ∏_i P(ai | vj)

where vNB denotes the target value output by the naive Bayes classifier.
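As one example use case for Bernoulli naive Bayes, consider spam detection with binary word-presence attributes. The sketch below uses scikit-learn's BernoulliNB on a tiny invented dataset; the feature columns and labels are assumptions made for illustration:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns are binary indicators, e.g. contains "free", contains "winner",
# contains "meeting" (hypothetical features); rows are emails.
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array(["spam", "spam", "ham", "ham"])

clf = BernoulliNB(alpha=1.0)   # alpha=1.0 applies Laplace smoothing
clf.fit(X, y)
print(clf.predict(np.array([[1, 0, 1]])))  # classify a new email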
9. How does the Naive Bayes Classifier handle continuous attributes?

Sol: The Naive Bayes classifier handles continuous attributes by assuming that the data are
normally distributed. It estimates the probability density function (PDF) for each attribute using
the Gaussian distribution. Specifically, for a continuous attribute, the probability of a given value
is calculated using the Gaussian (or normal) distribution formula:

P(x | c) = (1 / (σ_c √(2π))) · exp( −(x − µ_c)² / (2σ_c²) )

where µ_c and σ_c are the mean and standard deviation of the attribute values observed for
class c.
This approach, known as Gaussian Naive Bayes, enables the classifier to work with continuous
data by estimating the likelihood of each attribute given the class label based on the normal
distribution assumption.
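A minimal Python sketch of this Gaussian likelihood computation for a continuous attribute, with per-class attribute values that are invented for illustration:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian likelihood of x given mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical temperature readings observed for two classes in a training set.
temps_yes = np.array([68.0, 70.0, 72.0, 75.0])
temps_no = np.array([85.0, 88.0, 90.0])

x_new = 74.0
for label, values in [("yes", temps_yes), ("no", temps_no)]:
    mu, sigma = values.mean(), values.std(ddof=1)   # per-class mean and std
    print(label, gaussian_pdf(x_new, mu, sigma))    # P(x_new | class)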

10. Explain Gradient search to maximize likelihood in Neural Nets.


SOL

• We derive a weight-training rule for neural network learning that seeks to maximize G(h, D)
using gradient ascent.
• The gradient of G(h, D) is given by the vector of partial derivatives of G(h, D) with respect
to the various network weights that define the hypothesis h represented by the learned
network.
• In this case, the partial derivative of G(h, D) with respect to weight wjk from input k to
unit j is

∂G(h, D)/∂wjk = Σ_{i=1}^{m} [ (di − h(xi)) / (h(xi)(1 − h(xi))) ] · ∂h(xi)/∂wjk    ... (1)

• Suppose our neural network is constructed from a single layer of sigmoid units. Then,
using the derivative of the sigmoid squashing function,

∂h(xi)/∂wjk = h(xi)(1 − h(xi)) · xijk

where xijk is the kth input to unit j for the ith training example.
• Finally, substituting this expression into Equation (1), we obtain a simple expression
for the derivatives that constitute the gradient:

∂G(h, D)/∂wjk = Σ_{i=1}^{m} (di − h(xi)) · xijk

Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent
rather than gradient descent search. On each iteration of the search the weight vector is
adjusted in the direction of the gradient, using the weight update rule

wjk ← wjk + Δwjk,   where   Δwjk = η Σ_{i=1}^{m} (di − h(xi)) · xijk

and η is a small positive constant that determines the step size of the gradient
ascent search.
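A minimal NumPy sketch of this gradient ascent rule for a single sigmoid unit, using the update Δwjk = η Σi (di − h(xi)) xijk derived above; the training data and learning rate are assumed values for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical training set: each row of X is an input x_i, d holds 0/1 targets.
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [2.0, 1.2]])
d = np.array([1, 1, 0, 1])

w = np.zeros(X.shape[1])
eta = 0.1                      # step size of the gradient ascent search

for _ in range(100):
    h = sigmoid(X @ w)         # unit outputs h(x_i)
    grad = X.T @ (d - h)       # dG/dw_k = sum_i (d_i - h(x_i)) * x_ik
    w += eta * grad            # ascend the gradient to maximize G(h, D)

print(w, sigmoid(X @ w))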

11. How can Bayesian learning be applied to spam filtering?

Sol: Bayesian learning, particularly the Naive Bayes classifier, is effectively used in spam
filtering due to its ability to handle the probabilistic classification of emails as either spam or
not spam. Here's how Bayesian learning is applied to spam filtering:

1. Feature Extraction: First, emails are analyzed to extract features such as specific words,
phrases, headers, and metadata. Common features might include the presence of words
like "free," "winner," or "click here," which are more likely to appear in spam emails.
2. Training: The Naive Bayes classifier is trained on a labeled dataset containing both spam
and non-spam emails. For each email, the classifier learns the probability of each feature
appearing in spam and non-spam messages. This involves calculating the likelihood of
each feature given the class (spam or non-spam) and the prior probability of each class.
3. Classification: When a new email arrives, the classifier uses Bayes' theorem to compute
the posterior probability that the email belongs to each class (spam or not spam). This is
done by multiplying the probabilities of the features (given the class) and the prior
probabilities of the classes. The class with the highest posterior probability is chosen as
the classification for the email.
4. Handling Unknown Words: Since new words may appear in emails that were not
present in the training set, Bayesian spam filters often employ techniques like Laplace
smoothing. This ensures that unknown words don't cause the probability to become
zero, which would otherwise heavily bias the classification.
5. Incremental Learning: Bayesian spam filters can be updated incrementally as new
emails are classified. This allows the filter to adapt to new types of spam and user
preferences over time.

Bayesian spam filtering is favored for its simplicity, efficiency, and the fact that it doesn't
require large amounts of data to perform well.
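A hedged end-to-end sketch of these steps using scikit-learn; the example emails and labels are invented, CountVectorizer performs the feature extraction, and MultinomialNB applies Bayes' theorem with Laplace smoothing (alpha=1.0) so that words unseen in one class do not zero out its posterior:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical training corpus.
emails = ["free winner click here", "claim your free prize now",
          "meeting agenda attached", "lunch tomorrow at noon"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()             # step 1: feature extraction (word counts)
X = vectorizer.fit_transform(emails)

clf = MultinomialNB(alpha=1.0)             # step 2: training with Laplace smoothing
clf.fit(X, labels)

new_email = vectorizer.transform(["free lunch winner"])
print(clf.predict(new_email))              # step 3: classification via Bayes' theorem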
12. How can Bayesian learning be integrated with decision trees?

Sol: Integrating Bayesian learning with decision trees can enhance the decision-making process
by incorporating probabilistic reasoning. This integration can be achieved through several
methods:

1. Bayesian Decision Trees: In this approach, each leaf node of the decision tree represents
a probability distribution over the possible outcomes rather than a single deterministic
outcome. The probabilities are calculated using Bayes' theorem, considering both prior
probabilities and the likelihood of observing the data given a particular class. This
allows the model to provide more nuanced predictions, reflecting the uncertainty
inherent in the data.
2. Bayesian Networks and Decision Trees: A Bayesian network can be used to represent
the dependencies between variables, while a decision tree can be used for classification
or regression. The decision tree can incorporate information from the Bayesian
network, using the network's structure and conditional probabilities to inform the splits
at each node of the tree. This approach leverages the strengths of both models: the
interpretability and hierarchical structure of decision trees and the probabilistic
reasoning of Bayesian networks.
3. Minimum Description Length (MDL) Principle: This principle can be used to integrate
Bayesian learning with decision trees by selecting the tree that minimizes the sum of the
description lengths of the tree (hypothesis) and the data given the tree. The MDL
principle is closely related to Bayesian reasoning and provides a way to avoid overfitting
by balancing model complexity and fit to the data .

These methods allow for the incorporation of prior knowledge and probabilistic reasoning into
the decision-making process, making the resulting models more robust and interpretable,
especially in cases with uncertain or incomplete data.

13. Analyze various features of Bayesian learning methods OR analyze the features of
Bayesian learning methods used for probabilistic learning

Sol: Bayesian learning methods are key in probabilistic learning due to several distinctive
features:

1. Incremental Learning: Bayesian methods allow each observed training example to


incrementally update the estimated probability of a hypothesis. This is in contrast to
methods that might discard a hypothesis entirely based on inconsistency with a single
example. This feature supports more nuanced adjustments and learning from data over
time.
2. Incorporation of Prior Knowledge: Bayesian learning integrates prior knowledge with
observed data to form the final probability of a hypothesis. This is done by setting a
prior probability for each hypothesis and a probability distribution over the data for
each hypothesis. This allows for more informed decision-making, especially when prior
knowledge is substantial.
3. Probabilistic Predictions: These methods can handle hypotheses that predict
probabilities rather than deterministic outcomes. This feature is particularly useful in
situations where uncertainty is inherent in the predictions.
4. Combining Predictions: When classifying new instances, Bayesian methods can combine
the predictions of multiple hypotheses, weighting them by their respective probabilities.
This leads to a more robust and reliable decision-making process.
5. Standard of Optimal Decision Making: Even when not directly applicable due to
computational constraints, Bayesian methods provide a benchmark for optimal
decision-making. They serve as a theoretical standard against which other practical
methods can be measured.

However, there are also practical difficulties in applying Bayesian methods, such as the need for
initial knowledge of many probabilities and the computational cost involved in determining the
optimal hypothesis in general cases .

14. Evaluate brute force bayes concept learning

Sol: Brute Force Bayes Concept Learning is a method used in machine learning to find the
maximum a posteriori (MAP) hypothesis using Bayes' theorem. This approach involves the
following steps:

1. Hypothesis Space (H): The algorithm considers a finite set of hypotheses defined over
the instance space X, aiming to learn a target concept c:X→{0,1}
2. Training Data (D): The learner receives a sequence of training examples
((x1, d1), (x2, d2), ..., (xm, dm)), where each xi is an instance from X and di is the
corresponding target value.
3. MAP Hypothesis Selection: For each hypothesis h in H, the posterior
P(h|D) = P(D|h) P(h) / P(D) is calculated, and the hypothesis hMAP with the highest
posterior probability is output.
Assumptions and Considerations

Practical Considerations:

 Computational Complexity: The brute force approach involves calculating the posterior
probability for each hypothesis, which can be computationally expensive for large
hypothesis spaces.
 MAP Hypotheses and Consistency: A consistent learner outputs a hypothesis with zero
errors on the training data, which under the uniform prior and noise-free assumption, is
also a MAP hypothesis .

15. Summarize the brute force MAP learning algorithm


SOL
The brute-force MAP learning algorithm is:
1. For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D).
2. Output the hypothesis hMAP with the highest posterior probability, hMAP = argmax_{h∈H} P(h|D).
This approach is guaranteed to output a MAP hypothesis, but it requires computing the posterior
for every hypothesis in H, which is computationally expensive for large hypothesis spaces.
16. With appropriate equations, evaluate the probability density function.

Gaussian (Normal) Distribution

The Gaussian distribution is defined by two parameters: the mean (μ) and the standard
deviation (σ). The probability density function for a Gaussian distribution is given by:

f( x | µ, σ ) = (1 / (σ√(2π))) · exp( −(x − µ)² / (2σ²) )

where:

• x is the variable,
• μ is the mean of the distribution,
• σ is the standard deviation,
• exp is the exponential function.
This PDF provides the likelihood of observing a particular value of x given the parameters μ and
σ.
17. Summarize the basic probability formulas used for bayes theorem and concept learning

1. Bayes' Theorem

Bayes' Theorem is a fundamental formula that relates the conditional and marginal probabilities
of random events. It is expressed as:

P(H | D) = P(D | H) · P(H) / P(D)

where:

• P(H|D) is the posterior probability: the probability of the hypothesis H given the data D.
• P(D|H) is the likelihood: the probability of observing the data D given that the
hypothesis H is true.
• P(H) is the prior probability: the initial probability of the hypothesis H before observing
the data.
• P(D) is the marginal likelihood or evidence: the total probability of the data D under all
possible hypotheses, calculated as P(D) = Σ_{h∈H} P(D | h) P(h).

2. Joint Probability

The joint probability of two events A and B is the probability of both events occurring
simultaneously. It is denoted as:

P(A,B)=P(A∩B)

This can be expressed in terms of conditional probability:


P(A,B)=P(A∣B)⋅P(B)=P(B∣A)⋅P(A)

3. Marginal Probability

The marginal probability of an event is the probability of the event occurring, irrespective of
other variables. For an event A, the marginal probability is:

P(A) = Σ_{b∈B} P(A, b)

where B is a set of events that partition the sample space.

4. Conditional Probability

Conditional probability is the probability of an event A given that another event B has occurred.
It is denoted as:

P(A | B) = P(A, B) / P(B)

This is used to express how the probability of A changes when we know B has occurred.

5. Maximum A Posteriori (MAP) Hypothesis

In Bayesian learning, the MAP hypothesis is the hypothesis H that maximizes the posterior
probability:

HMAP = argmax_{H} P(H | D)

This hypothesis is considered the best explanation of the observed data, considering both prior
knowledge and the likelihood of the data.

6. Maximum Likelihood Estimation (MLE)

The MLE is a method to estimate the parameters of a statistical model that maximizes the
likelihood of the observed data:

θ_MLE = argmax_{θ} P(D | θ)

where θ represents the parameters of the model, and D is the observed data.
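A small Python sketch tying these formulas together numerically, using an invented two-variable joint distribution:

# Hypothetical joint distribution P(H, D) over H in {h, not_h} and D in {d, not_d}.
joint = {("h", "d"): 0.12, ("h", "not_d"): 0.18,
         ("not_h", "d"): 0.28, ("not_h", "not_d"): 0.42}

# Marginal probabilities: sum the joint over the other variable.
p_h = sum(p for (hv, _), p in joint.items() if hv == "h")   # P(H=h) = 0.30
p_d = sum(p for (_, dv), p in joint.items() if dv == "d")   # P(D=d) = 0.40

# Conditional probability: P(D=d | H=h) = P(H=h, D=d) / P(H=h)
p_d_given_h = joint[("h", "d")] / p_h                       # 0.40

# Bayes' theorem: P(H=h | D=d) = P(D=d | H=h) * P(H=h) / P(D=d)
p_h_given_d = p_d_given_h * p_h / p_d                       # 0.30
print(p_h, p_d, p_d_given_h, p_h_given_d)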

18. Interpret the utilities of “numpy” and “matplotlib” modules in ml


In machine learning (ML), the Python libraries NumPy and Matplotlib are essential tools
for data manipulation, analysis, and visualization.

NumPy

NumPy (Numerical Python) is a library for numerical computing in Python. Its main utility in
ML includes:
1. Efficient Data Structures:
a. Arrays: NumPy provides the ndarray data structure, a highly efficient and flexible
array for storing numerical data. Unlike Python lists, NumPy arrays allow for
efficient computation and require less memory.
2. Mathematical Operations:
a. NumPy supports a wide range of mathematical operations, including element-wise
operations, linear algebra (e.g., matrix multiplication, eigenvalues), statistics (e.g.,
mean, median, standard deviation), and Fourier transforms. These operations are
optimized and can handle large datasets efficiently.
3. Data Manipulation:
a. NumPy allows for easy manipulation of data, such as reshaping arrays, slicing,
filtering, and broadcasting. These features make it easy to preprocess and transform
data before feeding it into machine learning algorithms.
4. Integration with Other Libraries:
a. NumPy arrays are the standard data format used in many other scientific and
machine learning libraries, such as Pandas, SciPy, TensorFlow, and Scikit-Learn. This
compatibility makes it easier to integrate different tools in an ML pipeline.


Matplotlib

Matplotlib is a plotting library in Python that is widely used for data visualization. Its utilities in
ML include:

1. Data Visualization:
a. Exploratory Data Analysis (EDA): Matplotlib allows for the creation of a wide variety of
plots (e.g., line plots, scatter plots, histograms, bar charts) to visualize data distributions,
trends, and patterns. This helps in understanding the dataset and identifying potential
issues such as outliers or missing values.
b. Model Evaluation: Visualization of model performance, such as plotting confusion
matrices, ROC curves, precision-recall curves, and learning curves, helps in assessing the
effectiveness of ML models.
2. Customization and Styling:
a. Matplotlib offers extensive customization options, enabling fine control over the
appearance of plots, including labels, legends, colors, and styles. This is useful for
creating professional-quality visualizations for reports and presentations.
3. Integration with Other Libraries:
a. Matplotlib works well with other libraries, such as NumPy for numerical data and
Pandas for data frames. It also integrates with Seaborn, a higher-level library built on
top of Matplotlib, for more complex visualizations.
4. Interactive Plots:
a. Matplotlib supports interactive plotting, which can be useful for exploring data
dynamically and making real-time adjustments to plots.
Together, NumPy and Matplotlib form a powerful combination for data handling and
visualization in machine learning workflows. They enable efficient data processing and
clear, informative visualizations, which are crucial for developing, debugging, and
communicating ML models.
19. I)create a numpy array from list of float data type
SOL
import numpy as np
float_list = [1.5, 2.5, 3.5, 4.5]
float_array = np.array(float_list, dtype=float)
print(float_array)

ii) create a numpy array with random values


SOL
import numpy as np
random_array = np.random.rand(5)
print(random_array)

iii)create a sequence of integers 0 to 10 with steps of 2


SOL
import numpy as np
sequence_array = np.arange(0, 11, 2)
print(sequence_array)

iv)create a sequence of 10 values in range 0 to 5


SOL
import numpy as np
sequence_array = np.linspace(0, 5, 10)
print(sequence_array)

v)draw a line plot using matplotlib functions and numpy array


SOL
import matplotlib.pyplot as plt
import numpy as np
x = np.array([0, 6])
y = np.array([0, 250])
plt.plot(x, y)
plt.show()
x = np.array([1, 2, 6, 8])
y = np.array([3, 8, 1, 10])
plt.plot(x, y,'*r--')
plt.title("Example")
plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")
plt.grid()
plt.show()
20. Consider the dataset shown below:

i) Using the naive Bayes approach, predict the class label for the test sample (A=0, B=1, C=0).
ii) Using the m-estimate approach, predict the class label of (A=0, B=1, C=0) with p = 1/2 and m = 4.
Solve the above by showing the appropriate formulas.
21. Apply Bayes rule to diagnose whether a new patient whose lab test is positive has
cancer or not. Given P(cancer) = 0.008, P(+|cancer) = 0.98, P(−|¬cancer) = 0.97.
OR
22. Consider a medical diagnosis problem in which there are two alternative hypotheses: 1.
that the patient has a particular form of cancer (+), and 2. that the patient does not (−).
A patient takes a lab test and the result comes back positive. The test returns a correct
positive result in only 98% of the cases in which the disease is actually present and a
correct negative result in only 97% of the cases in which the disease is not present .
Furthermore , .008 of the entire population have this cancer. Determine whether the
patient has cancer or not using MAP hypothesis.
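A Python sketch of the MAP computation for Questions 21 and 22, using the probabilities given above:

p_cancer = 0.008                     # P(cancer), given
p_not_cancer = 1 - p_cancer          # P(not cancer) = 0.992
p_pos_given_cancer = 0.98            # P(+|cancer), given
p_pos_given_not_cancer = 1 - 0.97    # P(+|not cancer) = 1 - P(-|not cancer) = 0.03

# Unnormalised posteriors P(+|h) * P(h) for the two hypotheses.
score_cancer = p_pos_given_cancer * p_cancer              # 0.98 * 0.008 = 0.0078
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.03 * 0.992 = 0.0298

h_map = "cancer" if score_cancer > score_not_cancer else "no cancer"   # hMAP = no cancer
p_cancer_given_pos = score_cancer / (score_cancer + score_not_cancer)  # ~0.21
print(h_map, round(p_cancer_given_pos, 3))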
23. Write a note on
• Gibbs algorithm:
SOL:
Although the Bayes optimal classifier obtains the best performance that can be achieved
from the given training data, it can be quite costly to apply. The expense is due to the
fact that it computes the posterior probability for every hypothesis in H and then
combines the predictions of each hypothesis to classify each new instance.
An alternative, less optimal method is the Gibbs algorithm (see Opper and Haussler
1991), defined as follows:
1. Choose a hypothesis h from H at random, according to the posterior probability
distribution over H.
2. Use h to predict the classification of the next instance x.
Given a new instance to classify, the Gibbs algorithm simply applies a hypothesis drawn
at random according to the current posterior probability distribution.
It can be shown that under certain conditions the expected misclassification error for
the Gibbs algorithm is at most twice the expected error of the Bayes optimal classifier.
More precisely, the expected value is taken over target concepts drawn at random
according to the prior probability distribution assumed by the learner. Under this
condition, the expected value of the error of the Gibbs algorithm is at worst twice the
expected value of the error of the Bayes optimal classifier.
• Bayes optimal classifier:
SOL:
Consider a hypothesis space containing three hypotheses, h1, h2, and h3. Suppose that the
posterior probabilities of these hypotheses given the training data are .4, .3, and .3
respectively. Thus, h1 is the MAP hypothesis. Suppose a new instance x is encountered,
which is classified positive by h1, but negative by h2 and h3. Taking all hypotheses into
account, the probability that x is positive is .4 (the probability associated with h1), and
the probability that it is negative is therefore .6. The most probable classification
(negative) in this case is different from the classification generated by the MAP
hypothesis.
In general, the most probable classification of the new instance is obtained by combining
the predictions of all hypotheses, weighted by their posterior probabilities. If the
possible classification of the new example can take on any value vj from some set V, then
the probability P(vj|D) that the correct classification for the new instance is vj is just

P(vj|D) = Σ_{hi∈H} P(vj|hi) P(hi|D)

The optimal classification of the new instance is the value vj for which P(vj|D) is
maximum:

Bayes optimal classification: argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)

Any system that classifies new instances according to Equation (6.18) is called a Bayes
optimal classifier, or Bayes optimal learner. No other classification method using the
same hypothesis space and same prior knowledge can outperform this method on
average. This method maximizes the probability that the new instance is classified
correctly, given the available data, hypothesis space, and prior probabilities over the
hypotheses.
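A minimal Python sketch of the h1/h2/h3 example above, contrasting the Bayes optimal classification with the prediction of the MAP hypothesis:

# Posterior probabilities P(h|D) and each hypothesis's classification of x.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}

# P(v|D) = sum over hypotheses of P(v|h) * P(h|D); here P(v|h) is 1 or 0.
p_v = {"+": 0.0, "-": 0.0}
for h, p in posteriors.items():
    p_v[predictions[h]] += p

bayes_optimal = max(p_v, key=p_v.get)        # "-" with probability 0.6
h_map = max(posteriors, key=posteriors.get)  # h1, which predicts "+"
print(p_v, bayes_optimal, predictions[h_map])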

• Consistent learners:
SOL:
• A learning algorithm is a consistent learner if it outputs a hypothesis that commits zero
errors over the training examples.
• Every consistent learner outputs a MAP hypothesis, if we assume a uniform prior
probability distribution over H (P(hi) = P(hj) for all i, j), and deterministic, noise free
training data (P(D|h) =1 if D and h are consistent, and 0 otherwise).

24. Describe the relationship between bayes theorem and the problem of concept learning
Bayes theorem provides a principled way to calculate the posterior probability of each
hypothesis given the training data, and we can use it as the basis for a straightforward learning
algorithm that calculates the probability of each possible hypothesis and outputs the most
probable (Brute-Force Bayes Concept Learning).
• Assume the learner considers some finite hypothesis space H defined over the instance space
X, in which the task is to learn some target concept c : X → {0,1}.
• Learner is given some sequence of training examples ((x1, d1) . . . (xm, dm)) where xi is some
instance from X and where di is the target value of xi (i.e., di = c(xi)).
• The sequence of target values are written as D = (d1 . . . dm)

25. discuss practical difficulties in Bayesian learning. Also state the bayes theorem with
appropriate equation
Practical difficulty in applying Bayesian methods
1. One practical difficulty in applying Bayesian methods is that they typically require initial
knowledge of many probabilities. When these probabilities are not known in advance they are
often estimated based on background knowledge, previously available data, and assumptions
about the form of the underlying distributions.
2. A second practical difficulty is the significant computational cost required to determine the
Bayes optimal hypothesis in the general case. In certain specialized situations, this
computational cost can be significantly reduced.
Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to
calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D)
and P(D|h):

P(h|D) = P(D|h) P(h) / P(D)

• P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
• P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed
independent of h, the less evidence D provides in support of h.
26. as you know, covid-19 tests are common nowadays, but some results of tests are not
true. Let's assume a diagnostic test has 99% accuracy and 60% of all people have covid-
19. If a patient tests positive, what is the probability that they actually have the disease?
I. Bayes optimal classifier
II. MAP hypothesis and consistent learners

To determine the probability that a patient actually has COVID-19 given that they tested
positive, we can use Bayes' Theorem. Let's break it down step by step:
Given Data:
• Test accuracy: 99%, taken here to mean both sensitivity and specificity are 99%.
• Sensitivity (True Positive Rate): P(T+ | D) = 0.99, the probability of testing positive when
the disease is present.
• Specificity (True Negative Rate): P(T− | ¬D) = 0.99, so the false positive rate is
P(T+ | ¬D) = 0.01.
• Prevalence of disease (prior probability of having COVID-19): P(D) = 0.60, so P(¬D) = 0.40.

Bayes' Theorem:

P(D | T+) = P(T+ | D) P(D) / [ P(T+ | D) P(D) + P(T+ | ¬D) P(¬D) ]
          = (0.99 × 0.60) / (0.99 × 0.60 + 0.01 × 0.40)
          = 0.594 / 0.598 ≈ 0.993

So, if a patient tests positive, the probability that they actually have COVID-19 is approximately
99.3%.
Bayes Optimal Classifier
The Bayes optimal classifier makes predictions based on the posterior probability P(D | T+)
.Given the high posterior probability (99.3%), the classifier would predict that the patient has
COVID-19 if they test positive.

MAP Hypothesis and Consistent Learners


-MAP Hypothesis (Maximum A Posteriori): In the context of this problem, the MAP hypothesis
would be the diagnosis (disease or no disease) that maximizes the posterior probability. Here,
given a positive test result, the MAP hypothesis is that the patient has COVID-19, as P(D | T+) is
99.3%.
- Consistent Learners: These are models that make predictions consistent with the observed
data. In this scenario, a consistent learner would also predict that a patient has COVID-19 given
a positive test result, in alignment with the high posterior probability
By using the Bayes optimal classifier and considering the MAP hypothesis, we conclude that a
positive test result indicates a very high likelihood (99.3%) that the patient actually has COVID-
19.

27. Apply the naive Bayes classifier to classify the instance
<outlook=sunny, temperature=cool, humidity=high, wind=strong> based on the
training data given below. Also, compute the m-estimate for
P(Wind=strong | PlayTennis=no) and P(Outlook=sunny | PlayTennis=no).

SOL: Refer notes


28. What are Bayesian belief nets? Where are they used?

Bayesian belief networks (BBNs), also known as Bayesian networks, are probabilistic graphical
models that capture relationships between random variables. Here’s a brief overview:

Graphical Representation:
 BBNs consist of interconnected nodes representing variables and directed edges
representing causal relationships.
 These networks explicitly model conditional dependence between variables.
Applications:
 BBNs are widely used in various domains:
o Healthcare: Diagnostics, personalized medicine, and treatment planning.
o Finance: Risk assessment, fraud detection, and portfolio optimization.
o Environmental Management: Modeling ecological systems and predicting
environmental impacts.
o Decision Making Under Uncertainty: Assessing options with incomplete
information.
o Anomaly Detection: Identifying unusual events.
o Automated Insight: Extracting insights from data.
o Prediction: Time series forecasting and event prediction.
In summary, BBNs provide a powerful framework for representing and analyzing complex
systems, especially when dealing with uncertainty.

29. Bayesian Belief Network is a graphical representation of different probabilistic


relationships among random variables in a particular set. It exploits conditional
independence among attributes. Due to its use of joint probability, the probability in a
Bayesian Belief Network is derived based on a condition, P(attribute | parent), i.e., the
probability of an attribute given its parent attribute.

Bayesian Belief Networks (BBNs), also known as Bayesian Networks or Belief Networks, are
graphical models that represent the probabilistic relationships among a set of random
variables. They use directed acyclic graphs (DAGs) where nodes represent the random
variables and edges represent the dependencies between them.

Key Features of Bayesian Belief Networks:


1. Graphical Representation: BBNs use DAGs to represent variables and their conditional
dependencies. Each node in the graph represents a random variable, and each directed
edge represents a probabilistic dependence between variables.
2. Conditional Independence: One of the main advantages of BBNs is their ability to
explicitly model the conditional independence relationships among variables. This
reduces the complexity of the joint probability distribution that needs to be specified,
making the model more manageable and interpretable.
3. Local Probability Distributions: For each variable in the network, a Conditional
Probability Distribution (CPD) is defined, which specifies the probability of the variable
given its parent variables. If a variable has no parents, it is associated with a prior
probability distribution.
4. Joint Probability Distribution: The joint probability distribution over all variables can be
factored into the product of the local probability distributions. This is expressed as:
P(X1, X2, ..., Xn) = ∏_{i} P(Xi | Parents(Xi))

where Parents(Xi) represents the set of parent variables for Xi in the network.
5. Inference: BBNs allow for both probabilistic inference and decision-making. Given
evidence (observed variables), the network can be used to compute posterior
probabilities for unobserved variables.

Example: Medical Diagnosis


Consider a BBN for diagnosing a disease. The nodes might represent symptoms, diseases,
and test results. The edges represent causal relationships, such as a disease causing certain
symptoms. Given some symptoms and test results, the BBN can be used to compute the
probability of different diseases.
Independence and Conditional Independence in BBNs
 Independence: Variables that are not connected in any way in the graph are
independent of each other.
 Conditional Independence: Given the parents of a variable, the variable is conditionally
independent of its non-descendants. This is a crucial aspect of BBNs that simplifies
computation and understanding.
Bayesian Belief Networks are powerful tools in various domains, including medical
diagnosis, fault detection, decision support systems, and more. They provide a structured
and intuitive way to model complex probabilistic relationships and perform reasoning
under uncertainty.
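A minimal Python sketch of the factored joint distribution for a tiny hypothetical two-node network Disease → Symptom; the probability tables are invented for illustration:

# Hypothetical two-node network: Disease -> Symptom.
p_disease = {True: 0.01, False: 0.99}                       # prior P(Disease)
p_symptom_given_disease = {True: {True: 0.9, False: 0.1},   # CPD P(Symptom | Disease)
                           False: {True: 0.2, False: 0.8}}

def joint(disease, symptom):
    """P(Disease, Symptom) = P(Disease) * P(Symptom | Parents(Symptom))."""
    return p_disease[disease] * p_symptom_given_disease[disease][symptom]

# Inference by enumeration: P(Disease=True | Symptom=True).
p_s = joint(True, True) + joint(False, True)
print(joint(True, True) / p_s)   # ~0.043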

30. Explain gradient ascent training of Bayesian networks.


 Objective: The goal is to maximize the likelihood of the observed data by adjusting the
parameters of the Bayesian network.
 Gradient Ascent: This is an iterative optimization algorithm used to find the maximum of
a function. In this context, it adjusts the parameters in the direction of the gradient of
the likelihood function.
 Steps:
a. Initialize Parameters: Start with initial values for the parameters.
b. Compute Gradient: Calculate the gradient of the likelihood function with respect
to the parameters.
c. Update Parameters: Adjust the parameters in the direction of the gradient.
d. Iterate: Repeat the process until convergence.

This method helps in finding the optimal parameters that best explain the observed data.

31. Explain the concept of EM algorithm. Discuss what are gaussian mixtures.

Expectation-Maximization (EM) Algorithm:

The Expectation-Maximization (EM) algorithm is a powerful technique used in machine


learning and statistics. It’s particularly useful for parameter estimation in situations where data
is incomplete or involves latent (hidden) variables. Here are the key points:
Objective: EM aims to find the maximum likelihood estimates (MLE) of model parameters when
dealing with incomplete data or latent variables.
Two Steps:
 Expectation (E-Step): In this step, we compute the expected value of the log-likelihood
function with respect to the latent variables. Essentially, we estimate the missing
information.
 Maximization (M-Step): In this step, we maximize the expected log-likelihood with
respect to the model parameters. We update the parameter estimates.

Applications: EM is commonly used in data clustering, computer vision, natural language


processing (NLP), and quantitative genetics.

Gaussian Mixture Models (GMMs):

A Gaussian mixture model (GMM) is a probabilistic model that represents data as a mixture of
several Gaussian distributions (also known as normal distributions). Here’s what you need to
know:

Motivation: GMMs are useful when data comes from multiple groups, and each group can be
well-modeled by a Gaussian distribution.
Example:
 Imagine we have data on book prices. Paperback book prices follow one Gaussian
distribution with a mean of $10.00 and a standard deviation of $1.00. Hardback book
prices follow another Gaussian distribution with a mean of $17.00 and a variance of
$1.50.
 However, the overall distribution of book prices is not a single Gaussian. It’s bimodal
because it combines both paperback and hardback prices.

Modeling Approach:
 GMMs assume that each data point belongs to one of several Gaussian components
(clusters).
 The overall distribution is a weighted sum of these Gaussian components.
 The parameters include the means, variances, and mixing coefficients (weights) for each
component.

Parameter Estimation:
 EM is commonly used to estimate the parameters of GMMs.
 The E-step computes the expected responsibilities (probabilities of data points
belonging to each component).
 The M-step updates the parameters based on these responsibilities.

Challenges:
 GMMs can capture complex data distributions but may suffer from local optima during
parameter estimation.

EM convergence depends on the properties of the training data.

EM and GMMs play crucial roles in understanding and modeling complex data distributions.
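A hedged sketch fitting a two-component Gaussian mixture with scikit-learn's GaussianMixture, which runs the EM algorithm internally; the synthetic "book prices" below are generated to mimic the paperback/hardback example, not real data:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic "book prices": a paperback-like cluster and a hardback-like cluster.
paperback = rng.normal(10.0, 1.0, size=200)
hardback = rng.normal(17.0, 1.2, size=150)
prices = np.concatenate([paperback, hardback]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)  # EM runs inside fit()
gmm.fit(prices)

print(gmm.weights_)      # estimated mixing coefficients
print(gmm.means_)        # estimated component means (roughly 10 and 17)
print(gmm.covariances_)  # estimated component variances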
