PGM Theory Notes

Uploaded by

J P

Evidential & Causal Reasoning, Markov Models, and Bayesian Networks

a. Evidential and Causal Reasoning with Examples


Evidential reasoning, also known as diagnostic reasoning, starts with observed evidence and
reasons about the possible causes. Causal reasoning starts with known causes and reasons
about their potential effects.
Let's illustrate these concepts with the familiar "Student" Bayesian Network example from
the sources. Imagine a simplified version with nodes "Intelligence" (I), "Grade" (G), and
"Letter of Recommendation" (L). The network structure implies intelligence influences
grades, which in turn influence the recommendation letter.
Evidential Reasoning Example: We observe a student received a strong letter of
recommendation (L = strong). Evidential reasoning leads us to infer that the student likely got
a good grade (G = A), and may be intelligent (I = high). We are reasoning "backward" from
the evidence to the potential causes.
Causal Reasoning Example: We know a student is highly intelligent (I = high). Causal
reasoning leads us to infer they are likely to get a good grade (G = A) and therefore a strong
letter of recommendation (L = strong). We are reasoning "forward" from cause to effect.
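The two directions of reasoning can be sketched numerically. The snippet below is a minimal
illustration for the chain I → G → L; every probability in it is a hypothetical value chosen
for the example, not a number from the sources. It computes P(I = high | L = strong) for the
evidential query and P(L = strong | I = high) for the causal query by brute-force enumeration.

```python
from itertools import product

# Hypothetical CPDs for the chain I -> G -> L (values are illustrative only).
P_I = {"high": 0.3, "low": 0.7}
P_G_given_I = {
    "high": {"A": 0.7, "B": 0.2, "C": 0.1},
    "low": {"A": 0.2, "B": 0.4, "C": 0.4},
}
P_L_given_G = {
    "A": {"strong": 0.9, "weak": 0.1},
    "B": {"strong": 0.6, "weak": 0.4},
    "C": {"strong": 0.1, "weak": 0.9},
}


def joint(i, g, l):
    """P(I=i, G=g, L=l) from the chain-rule factorization."""
    return P_I[i] * P_G_given_I[i][g] * P_L_given_G[g][l]


def prob(query, evidence=None):
    """P(query | evidence) by enumerating every joint assignment."""
    evidence = evidence or {}
    num = den = 0.0
    for i, g, l in product(P_I, "ABC", ("strong", "weak")):
        world = {"I": i, "G": g, "L": l}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(i, g, l)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den


# Evidential: reason from the letter back to intelligence.
evidential = prob({"I": "high"}, {"L": "strong"})
# Causal: reason from intelligence forward to the letter.
causal = prob({"L": "strong"}, {"I": "high"})
baseline_letter = prob({"L": "strong"})
```

With these assumed numbers, observing a strong letter raises the belief in high intelligence
above its prior, and knowing the student is intelligent raises the probability of a strong
letter above its baseline.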
b. Markov Models and Factor Tables
Markov Models are a type of probabilistic model where the probability of transitioning to a
future state depends only on the current state, not the entire history. This is called the
Markov Property.
Factor tables are essential for representing information in Markov Models. They encode:
• Transition probabilities: Each entry in the table represents the probability of
transitioning from one state to another.
• State information: The rows and columns of the table represent the possible states of
the system.
Justification for Factor Tables in Markov Models:
• Compact representation: Factor tables allow for a concise representation of the
transition probabilities between all possible states.
• Computational efficiency: This compact form makes it easier to compute
probabilities and perform inference tasks.
• Modularity: We can combine factor tables to represent more complex models.
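As a small illustration of a transition table in use, the sketch below defines a hypothetical
two-state weather chain (the states and numbers are assumptions made up for the example). It
shows how the table advances a distribution by one step and how the Markov property lets the
probability of a whole state sequence be computed from consecutive pairs only.

```python
# Hypothetical two-state weather chain; the numbers are illustrative only.
STATES = ("sunny", "rainy")
TRANSITION = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def step(dist):
    """Advance a distribution over states by one time step."""
    return {
        nxt: sum(dist[cur] * TRANSITION[cur][nxt] for cur in STATES)
        for nxt in STATES
    }


def sequence_probability(states, initial):
    """P(s_0, ..., s_T): only consecutive pairs matter (Markov property)."""
    p = initial[states[0]]
    for cur, nxt in zip(states, states[1:]):
        p *= TRANSITION[cur][nxt]
    return p


initial = {"sunny": 1.0, "rainy": 0.0}
after_two = step(step(initial))
```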
c. Application of Bayesian Networks: Medical Diagnosis
Bayesian networks are a powerful tool within probabilistic graphical models (PGMs) for
representing and reasoning with uncertain knowledge.
Medical Diagnosis is a key application area for Bayesian Networks.
Example:
• Network Structure: Nodes in the network represent concepts like diseases,
symptoms, and test results. Links represent probabilistic relationships between them,
like the likelihood a symptom is caused by a particular disease.
• Probabilistic Reasoning: Given observed symptoms (evidence), the network can be
used to calculate the probabilities of different diseases. This helps doctors narrow
down possible diagnoses and guide further testing.
• Benefits:
o Handles uncertainty: Medical diagnosis often involves dealing with
uncertainty about a patient's condition. Bayesian networks provide a
framework for quantifying and reasoning with this uncertainty.
o Combines evidence: It integrates various sources of evidence, like symptoms,
medical history, and test results, to provide a more informed diagnosis.
o Transparency: The network structure visually represents the relationships
between medical concepts, making the reasoning process more transparent.
This is just one example; Bayesian networks find applications in many areas like fault
diagnosis, bioinformatics, and risk assessment.

e. Marginal vs. Joint Distributions


The sources provide a clear distinction between marginal and joint probability distributions:
• Marginal distribution: The marginal distribution of a random variable focuses on
that variable in isolation, assigning probabilities to the different values it can take. It
is derived from the joint distribution by "summing out" or "marginalizing" the other
variables.
• Joint distribution: The joint distribution describes the probabilities of events
involving multiple random variables simultaneously. It assigns probabilities to each
possible combination of values for those variables.
Let's illustrate this with the example from the source, which discusses a joint distribution over the
variables "Intelligence" (with values "high" or "low") and "Grade" (with values "A", "B", or
"C").
• Joint distribution: The table in Figure 2.1 shows the joint probabilities for all
combinations of Intelligence and Grade. For example, P(Intelligence = high, Grade =
A) = 0.18.
• Marginal distributions:
o The marginal distribution of Intelligence assigns probabilities to
P(Intelligence = high) and P(Intelligence = low). These are found by summing
the probabilities along each column in Figure 2.1, resulting in P(Intelligence =
high) = 0.3 and P(Intelligence = low) = 0.7.
o The marginal distribution of Grade assigns probabilities to P(Grade = A),
P(Grade = B), and P(Grade = C). These are found by summing the
probabilities along each row in Figure 2.1, resulting in P(Grade = A) = 0.25,
P(Grade = B) = 0.37, and P(Grade = C) = 0.38.
Key points:
• The joint distribution contains more information than any of the marginal
distributions, as it captures the relationships between the variables.
• We can compute the marginal distribution of any variable from the joint distribution.
• The joint distribution must be consistent with the marginal distributions; the sum of
joint probabilities over all values of one variable must equal the marginal probability
of the other variable.
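Marginalization can be sketched in a few lines. In the table below, only P(Intelligence =
high, Grade = A) = 0.18 and the marginals are quoted in these notes; the remaining cells are
assumed values chosen to be consistent with those marginals, so treat them as illustrative.

```python
# Joint P(Intelligence, Grade). Only P(high, A) = 0.18 and the marginals are
# quoted in the notes; the other cells are assumed values consistent with them.
JOINT = {
    ("low", "A"): 0.07, ("high", "A"): 0.18,
    ("low", "B"): 0.28, ("high", "B"): 0.09,
    ("low", "C"): 0.35, ("high", "C"): 0.03,
}


def marginal(joint, axis):
    """Sum out every variable except the one at position `axis`."""
    result = {}
    for assignment, p in joint.items():
        key = assignment[axis]
        result[key] = result.get(key, 0.0) + p
    return result


p_intelligence = marginal(JOINT, 0)  # sums over Grade
p_grade = marginal(JOINT, 1)         # sums over Intelligence
```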

Maximal Cliques and Parameterization


What is a Maximal Clique?
A clique in a graph is a subset of nodes where every pair of nodes is connected by an edge. In
simpler terms, it's a fully connected subgraph.
A maximal clique is a clique that cannot be expanded any further. That is, there's no other
node in the graph that can be added to the clique while maintaining the fully connected
property.
The source gives the example of the induced subgraph K[C,D,I]. If every pair among C, D, and I
is connected by an edge, this subgraph is a clique. If, in addition, no other node of the graph
is connected to all three of them, so that adding any further node would break the fully
connected property, then {C, D, I} is a maximal clique.
Representing Parameterization Using Cliques
Parameterization in the context of graphical models refers to associating numerical values,
often called potentials, with the graph structure to represent the probability distribution.
These potentials encode the "compatibilities" between different values of the variables.
Cliques play a crucial role in parameterizing Markov Networks, a type of undirected
graphical model. Here's why and how:
• Factors and Cliques: We can represent the joint probability distribution as a product
of factors, each defined over a subset of variables. To ensure consistency with the
independence relationships implied by the graph structure, we associate factors only
with complete subgraphs, i.e., cliques.
• Clique Potentials: These factors associated with cliques are called clique potentials.
They capture the interactions between variables within the clique.
• Maximal Cliques for Efficiency: Since any complete subgraph is part of some
maximal clique, we can simplify parameterization by using potentials only on
maximal cliques. This is more efficient as it reduces the number of factors.
Example:
The source illustrates this with the "Misconception" example, a Markov Network with cliques
like {A,B}, {B,C}, {C,D}, and {A,D}. Each of these cliques can have its own potential,
defining a joint probability distribution over the variables in the network.
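To make the product-of-potentials idea concrete, here is a minimal sketch for a four-variable
loop with cliques {A,B}, {B,C}, {C,D}, and {A,D}. The potential values are assumptions for
illustration (these notes do not list them), and each variable is taken to be binary. The
joint distribution is the product of the four potentials divided by the partition function Z.

```python
from itertools import product

# Illustrative potentials for the four pairwise cliques (assumed values; the
# notes do not list them). 0/1 stand for the two values of each variable.
PHI_AB = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}
PHI_BC = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}
PHI_CD = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}
PHI_DA = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}


def unnormalized(a, b, c, d):
    """Product of the clique potentials for one full assignment."""
    return PHI_AB[a, b] * PHI_BC[b, c] * PHI_CD[c, d] * PHI_DA[d, a]


assignments = list(product((0, 1), repeat=4))
# The partition function Z turns the products into probabilities.
Z = sum(unnormalized(*x) for x in assignments)
P = {x: unnormalized(*x) / Z for x in assignments}
```

Note that an individual potential is not a probability: only the normalized product over all
cliques defines the distribution.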
Trade-offs:
While using maximal clique potentials is generally efficient, it can sometimes obscure the
underlying structure of the original set of factors. For example, a fully connected pairwise
Markov network would require an exponentially large potential on the single maximal clique
(the entire graph). In such cases, representing the parameterization with smaller factors
corresponding to pairs of variables would be more compact and efficient.
Additional Considerations:
• Factor Graphs: For finer-grained parameterization, we can use factor graphs, which
explicitly represent individual factors and their connections to variables, even if they
don't form maximal cliques.
• Feature-Based Representation: Another approach is to use features – functions of
the variables – and associate weights with these features to represent the potentials.
This offers even greater flexibility and expressiveness.
The choice of representation depends on the specific structure of the problem and the trade-
off between efficiency and the level of detail required in the parameterization.

Application of Bayesian Networks in Classification: The Naive Bayes Model


The sources discuss Bayesian networks extensively and highlight the Naive Bayes Model as a
prominent application for classification tasks.
What is the Naive Bayes Model?
The Naive Bayes Model is a simplified type of Bayesian network specifically designed for
classification. Its structure is simple:
• Class Node (C): This node represents the variable we want to predict. For example,
in spam detection, this could be "Spam" or "Not Spam".
• Feature Nodes (X1, X2,... Xn): These nodes represent the observed features of the
data. In the spam detection example, these could be words in the email.
The Key Assumption:
The Naive Bayes Model makes a strong conditional independence assumption: given the
class label, all features are independent of each other. This simplifies the model drastically,
allowing for efficient computation even with many features.
How it Works:
1. Training: We learn the probabilities from the training data:
o Prior probabilities: P(C) - the probability of each class.
o Conditional probabilities: P(Xi | C) - the probability of each feature given
each class.
2. Classification: Given a new instance with features (x1, x2,... xn), we calculate the
probability of it belonging to each class using Bayes' theorem:
P(C | x1, x2,... xn) ∝ P(C) * P(x1 | C) * P(x2 | C) * ... * P(xn | C)
3. Prediction: We assign the instance to the class with the highest probability.
Example: Spam Detection
Let's say we want to classify emails as spam or not spam. Here's how a Naive Bayes model
might work:
• Class node (C): Spam (yes/no)
• Feature nodes (X1, X2, X3): Presence of words like "free", "money", "urgent".
Training:
• From a dataset of labelled emails, we learn:
o P(Spam = yes), P(Spam = no)
o P("free" | Spam = yes), P("money" | Spam = yes), P("urgent" | Spam = yes)
o P("free" | Spam = no), P("money" | Spam = no), P("urgent" | Spam = no)
Classification:
• For a new email containing "free" and "urgent", we calculate:
o P(Spam = yes | "free", "urgent")
o P(Spam = no | "free", "urgent")
• We compare the two probabilities and classify the email accordingly.
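The classification step can be sketched as follows. All parameter values here are
hypothetical stand-ins for what training would produce, and the class names "spam" and "ham"
are labels chosen for the example. The sketch treats each word as a binary feature, so a word
that is absent from the email also contributes a factor, namely 1 minus its presence
probability.

```python
import math

# Hypothetical parameters, as if estimated from labelled emails.
PRIOR = {"spam": 0.4, "ham": 0.6}
# P(word present | class) for each feature word.
P_WORD = {
    "spam": {"free": 0.6, "money": 0.5, "urgent": 0.4},
    "ham": {"free": 0.1, "money": 0.1, "urgent": 0.05},
}


def posterior(words_present):
    """P(class | features) under the naive Bayes assumption."""
    log_scores = {}
    for cls in PRIOR:
        log_p = math.log(PRIOR[cls])
        for word, p in P_WORD[cls].items():
            # Absent words contribute too: P(word absent | class) = 1 - p.
            log_p += math.log(p if word in words_present else 1.0 - p)
        log_scores[cls] = log_p
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


result = posterior({"free", "urgent"})
prediction = max(result, key=result.get)
```

Working in log space is a design choice rather than a requirement: with many features the
raw product of probabilities can underflow, while the sum of logarithms does not.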
Advantages of Naive Bayes:
• Simple and efficient: Computationally inexpensive, even with many features.
• Performs well with limited data: Can be surprisingly effective, even with small
training datasets.
Limitations:
• Strong independence assumption: The assumption of feature independence is often
unrealistic.
• Sensitivity to irrelevant features: Performance can be affected by the presence of
irrelevant features.
Despite its limitations, the Naive Bayes Model remains a popular and useful tool for
classification, particularly in text processing and other domains with high-dimensional
data.
Note: While the sources don't provide detailed examples of Naive Bayes for spam detection,
this application is a classic use case and widely discussed in the machine learning literature,
which you might want to independently verify.

Plate Models

Plate models are a type of graphical model that can be used to represent complex
probabilistic models in a compact way. They are particularly useful for representing models
that involve multiple objects of the same type.
Plate models use the concept of a plate, which is a rectangular box that represents a set of
objects. Within a plate, we can define attributes, which are random variables that are
associated with each object in the set. For example, if we have a plate representing students,
we might have attributes for their intelligence, grade, and difficulty of their courses.
Connections between attributes in a plate model can represent probabilistic dependencies
between the corresponding random variables. These connections can be either directed or
undirected, depending on the type of model we are trying to represent.
Example of a Plate Model
The source provides an example of a plate model for a simplified Student example. In this
example, we have two plates: one representing students and one representing courses. The
attributes in the model are:
• Intelligence (I): An attribute of the Students plate, representing the intelligence of
each student.
• Grade (G): An attribute of both the Students and Courses plates, representing the
grade that a student receives in a course.
• Difficulty (D): An attribute of the Courses plate, representing the difficulty of each
course.
The connections in the model indicate that:
• A student's grade in a course depends on their intelligence and the difficulty of the
course.
• The intelligence of different students is independent.
• The difficulty of different courses is independent.
This plate model can be used to generate a ground Bayesian network, which is a Bayesian
network that represents the joint probability distribution over all the random variables in the
model. The ground Bayesian network for this plate model would have a node I(s) for each
student s, a node D(c) for each course c, and a node G(s, c) for each student-course pair in
which the student takes the course, with edges I(s) → G(s, c) and D(c) → G(s, c).
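Grounding a plate model can be sketched as a simple unrolling procedure. The function name
ground_network and the enrolments argument are hypothetical, introduced only for this
illustration; restricting Grade nodes to enrolled pairs is likewise an assumption of the
sketch.

```python
def ground_network(students, courses, enrolments):
    """Unroll the Student plate model into ground nodes and directed edges.

    `enrolments` is a set of (student, course) pairs; restricting Grade
    nodes to enrolled pairs is an assumption of this sketch.
    """
    nodes = [f"I({s})" for s in students] + [f"D({c})" for c in courses]
    edges = []
    for s, c in sorted(enrolments):
        grade = f"G({s},{c})"
        nodes.append(grade)
        edges.append((f"I({s})", grade))  # intelligence -> grade
        edges.append((f"D({c})", grade))  # difficulty -> grade
    return nodes, edges


nodes, edges = ground_network(
    students=["s1", "s2"],
    courses=["c1", "c2"],
    enrolments={("s1", "c1"), ("s1", "c2"), ("s2", "c1")},
)
```

Every Grade node has exactly two parents, and no edge links two Intelligence nodes or two
Difficulty nodes, which matches the independence statements listed above.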
Rule-Based CPD vs. Tree-Based CPD
Both rule-based and tree-based CPDs are ways of representing context-specific
independence in a CPD. This means that, in certain contexts (particular assignments to some
of the parents), the distribution of the variable no longer depends on the remaining parents,
so many rows of a full table CPD share the same distribution.
Tree-Based CPDs represent context-specific independence using a tree structure. The
internal nodes of the tree correspond to tests on the parent variables, and the leaves of the tree
correspond to different conditional probability distributions. To find the conditional
probability distribution for a particular instantiation of the parent variables, we traverse the
tree from the root to a leaf, following the branches that correspond to the values of the parent
variables.
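The traversal can be sketched with nested tuples and dictionaries. The variable J with parents
A, S, and L, the tree shape, and all numbers below are hypothetical choices made for the
illustration.

```python
# A tree-CPD for a hypothetical variable J with parents A, S and L.
# Structure and numbers are illustrative only.
# Internal nodes: (parent_to_test, {value: subtree}); leaves: P(J | context).
TREE = ("A", {
    0: {"j0": 0.8, "j1": 0.2},
    1: ("S", {
        1: {"j0": 0.1, "j1": 0.9},
        0: ("L", {
            0: {"j0": 0.9, "j1": 0.1},
            1: {"j0": 0.4, "j1": 0.6},
        }),
    }),
})


def lookup(tree, parents):
    """Walk from the root to a leaf, following the parents' values."""
    node = tree
    while isinstance(node, tuple):
        variable, children = node
        node = children[parents[variable]]
    return node


# With A = 0 the leaf is reached without ever testing S or L.
dist_a0 = lookup(TREE, {"A": 0, "S": 1, "L": 1})
dist_a1_s0_l1 = lookup(TREE, {"A": 1, "S": 0, "L": 1})
```

The tree stores four distributions where a full table over three binary parents would store
eight, because in the context A = 0 the child is independent of S and L.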
Rule-Based CPDs represent context-specific independence using a set of rules. Each rule
specifies a context, which is a partial assignment to the parent variables, and a conditional
probability distribution for the child variable given that context. The rules in a rule-based
CPD must be mutually exclusive and exhaustive, meaning that every possible instantiation
of the parent variables is covered by exactly one rule.
Differences
Here's a table summarizing the key differences:

Feature | Rule-Based CPD | Tree-Based CPD
Representation | Set of rules, each with a context and a conditional probability distribution | Tree structure with tests on parent variables at internal nodes and conditional distributions at leaves
Structure | Finer-grained, can represent individual entries in the CPD | Global representation, captures the entire CPD in a single data structure
Specificity | Can represent very specific contexts | Contexts are defined by paths in the tree, which can be less specific
Ease of Reasoning | Easier to reason about individual dependencies | Can be easier to visualize the overall structure of the CPD

Expected Log-Likelihood Metric


The expected log-likelihood is a metric used to evaluate the performance of a probabilistic
model on a dataset with missing values. It measures the average log-likelihood of the data,
taking into account the uncertainty arising from the missing values.
To compute the expected log-likelihood, we need to:
1. Estimate the distribution over the missing values: This can be done using any
inference algorithm, such as variable elimination or belief propagation.
2. Compute the log-likelihood of each data point: We use the estimated distribution
over the missing values to compute the likelihood of each data point.
3. Average the log-likelihoods: The expected log-likelihood is the average of the log-
likelihoods over all the data points.
The expected log-likelihood is often used as a scoring function for structure learning in
Bayesian networks with missing data. The idea is to find the network structure that
maximizes the expected log-likelihood of the data.
It's important to note that the expected log-likelihood is an approximation to the true log-
likelihood of the data, which cannot be computed directly because of the missing values.
However, it is a useful metric for evaluating and comparing different models.
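The three steps can be sketched for a tiny model. Everything below is an assumption made for
illustration: a two-variable network X → Y with binary values, made-up CPDs, and a dataset in
which X is sometimes missing (marked None). The quantity computed is the expected
complete-data log-likelihood, where each completion of a missing value is weighted by its
posterior probability.

```python
import math

# Hypothetical model X -> Y with binary variables (illustrative numbers).
P_X = {0: 0.6, 1: 0.4}
P_Y_GIVEN_X = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}


def joint(x, y):
    return P_X[x] * P_Y_GIVEN_X[x][y]


def posterior_x(y):
    """Step 1: distribution over a missing X given the observed Y."""
    weights = {x: joint(x, y) for x in P_X}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def expected_log_likelihood(data):
    """Steps 2 and 3: expected log-likelihood per instance, then the average."""
    total = 0.0
    for x, y in data:
        if x is None:  # X missing: weight each completion by its posterior
            q = posterior_x(y)
            total += sum(q[xv] * math.log(joint(xv, y)) for xv in P_X)
        else:  # fully observed: ordinary log-likelihood
            total += math.log(joint(x, y))
    return total / len(data)


data = [(0, 0), (1, 1), (None, 1), (None, 0)]
score = expected_log_likelihood(data)
```

For an instance with a missing value, this expected score never exceeds the log of the
marginal likelihood of the observed part, which is one way to see why it is not identical to
the true log-likelihood.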
Variable Elimination with Example
Variable elimination is an exact inference algorithm used in graphical models to compute
marginal probabilities and conditional probabilities. It works by systematically eliminating
variables from the joint probability distribution, one at a time, until we are left with a
distribution over only the query variables.
Here are the key steps:
1. Choose an elimination ordering: The order in which we eliminate variables can
significantly impact the efficiency of the algorithm.
2. Factor the joint distribution: We express the joint distribution as a product of
factors, where each factor involves a subset of the variables.
3. Eliminate variables: For each variable in the elimination ordering, we perform the
following:
o Collect relevant factors: Identify factors that involve the variable being
eliminated.
o Multiply the factors: Multiply the collected factors together to get a new
factor.
o Sum out the variable: Sum out the variable from the new factor to obtain a
factor that does not involve the eliminated variable.
4. Normalize: After eliminating all non-query variables, we normalize the resulting
factor to obtain the marginal probability distribution over the query variables.
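The steps above can be sketched as a small program. This is a minimal illustration rather
than an optimized implementation: it assumes binary variables, represents a factor as a pair
(scope, table), and runs on a hypothetical chain A → B → C with made-up CPDs to compute P(C).

```python
from itertools import product

# Hypothetical chain A -> B -> C over binary variables (illustrative CPDs).
DOMAIN = (0, 1)
# A factor is (scope, table): scope is a tuple of variable names, table maps
# assignments (tuples in scope order) to non-negative numbers.
P_A = (("A",), {(0,): 0.6, (1,): 0.4})
P_B_A = (("A", "B"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
P_C_B = (("B", "C"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5})


def multiply(f, g):
    """Factor product over the union of the two scopes."""
    scope = f[0] + tuple(v for v in g[0] if v not in f[0])
    table = {}
    for values in product(DOMAIN, repeat=len(scope)):
        row = dict(zip(scope, values))
        table[values] = (f[1][tuple(row[v] for v in f[0])]
                         * g[1][tuple(row[v] for v in g[0])])
    return scope, table


def sum_out(f, var):
    """Marginalize `var` out of factor f."""
    keep = tuple(v for v in f[0] if v != var)
    table = {}
    for values, p in f[1].items():
        row = dict(zip(f[0], values))
        key = tuple(row[v] for v in keep)
        table[key] = table.get(key, 0.0) + p
    return keep, table


def eliminate(factors, ordering):
    """Sum-product variable elimination; returns a normalized factor."""
    factors = list(factors)
    for var in ordering:
        relevant = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        if not relevant:
            continue
        combined = relevant[0]
        for f in relevant[1:]:
            combined = multiply(combined, f)
        factors.append(sum_out(combined, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    z = sum(result[1].values())
    return result[0], {k: v / z for k, v in result[1].items()}


scope, p_c = eliminate([P_A, P_B_A, P_C_B], ordering=["A", "B"])
```

Eliminating A first produces a factor over B alone, so no intermediate factor ever mentions
more than two variables; a poor ordering on a larger graph can create much bigger
intermediate factors.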
Example
The source provides an example of variable elimination applied to the "Extended-Student"
Bayesian network (Figure 9.8). The goal is to compute P(D). Here's a simplified explanation:
1. Elimination ordering: Let's say we choose the ordering {C, I, S, L, J, G}.
2. Factorization: The joint distribution is already factorized as a product of CPDs,
representing conditional probabilities in the Bayesian network.
3. Elimination process: We eliminate variables one by one, multiplying and summing
out as needed. For example, when eliminating C:
o Relevant factors: P(C) and P(D | C), the only CPDs that mention C (in the
Extended-Student network, C is a parent of D).
o Multiplication: We multiply these factors to get a new factor involving C and D.
o Summation: We sum out C from this factor, leaving a factor involving only D.
4. Continue eliminating: We repeat this process for the remaining variables in the
ordering.
5. Final result: After eliminating all variables except D, we are left with a factor
representing P(D).
Key points:
• Efficiency: The efficiency of variable elimination depends heavily on the chosen
elimination ordering and the structure of the graph. Choosing a good elimination
ordering can significantly reduce computation.
• Complexity: In the worst case, variable elimination can be computationally
expensive, especially for dense graphs with many connections between variables.
• Graph Transformations: Variable elimination can be viewed as a series of graph
transformations that simplify the graphical model while preserving the marginal
probabilities of interest.
This explanation, along with the detailed illustration in the sources, should provide a good
understanding of how variable elimination works.

D-separation in Bayesian Networks

D-separation, short for directed separation, is a concept used in Bayesian networks (BNs) to
determine conditional independence relationships between variables. It leverages the
graphical structure of the BN to efficiently identify independencies that hold for all
distributions factorizing over that graph structure. In simpler terms, D-separation allows us to
"read off" independencies directly from the BN graph without needing to analyse the specific
numerical values in the conditional probability distributions (CPDs).
The core idea behind D-separation is the notion of "flow of influence". We analyse how
information or influence might flow between variables along the paths or trails in the BN
graph. If the flow of influence between two variables, X and Y, is blocked given a set of
observed variables Z, then we say that X and Y are d-separated given Z. This implies that X
and Y are conditionally independent given Z in any distribution that factorises over the BN
structure.
Active Trails and Blocking
To determine whether influence can flow between variables, we need to consider the concept
of an active trail. A trail between two variables is a sequence of connected edges in the
graph, regardless of their direction.
A trail is considered active given a set of observed variables Z if:
• For every v-structure (a structure like X → W ← Y) along the trail, either W or one of
its descendants is in Z. Observing W or a descendant of W "activates" the
v-structure, allowing influence to flow through it.
• No other node along the trail (that is, no node that is not the centre of a v-structure
on the trail) is in Z. Observing any such node "blocks" the flow of influence.
D-separation is then defined as follows:
• Two sets of variables X and Y are d-separated given a set Z if there is no active trail
between any node in X and any node in Y given Z.
Example
Let's illustrate D-separation with an example using the Student BN.
Imagine the following structure, with edges D → G, I → G, I → S, and G → L:
• Difficulty (D): The difficulty of a course.
• Intelligence (I): The intelligence of a student.
• Grade (G): The grade a student receives in a course.
• SAT (S): The student's SAT score.
• Letter (L): Whether the student receives a recommendation letter.
Here are some D-separation statements and explanations:
1. D and I are d-separated. There is only one trail between D and I (D → G ← I),
which forms a v-structure. Since neither G nor any of its descendants are observed,
this trail is not active. Hence, D and I are d-separated, meaning they are marginally
independent.
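2. D and S are d-separated given I. The only trail connecting D and S is D → G ← I
→ S. Observing I blocks this trail, because I sits in the middle of G ← I → S and is not
the centre of a v-structure; in addition, the v-structure at G stays inactive because
neither G nor its descendant L is observed. Therefore, D and S are d-separated given I,
implying that the difficulty of a course and a student's SAT score are conditionally
independent given the student's intelligence. Observing G instead would not separate
them: G is the centre of the v-structure D → G ← I, so observing it activates the trail.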
3. D and S are not d-separated given L. Because L is a descendant of G, observing
L activates the v-structure D → G ← I. With I unobserved, the entire trail D → G ← I → S
becomes active, meaning that D and S are not d-separated given L. This suggests that
knowing whether a student received a recommendation letter can induce a dependence
between the course difficulty and the student's SAT score.
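The active-trail rules can be checked mechanically. The sketch below is a brute-force
illustration that enumerates every simple trail, which is only practical for small graphs. It
assumes the standard Student edges D → G, I → G, I → S, and G → L; the function names are
hypothetical.

```python
# Student network, assuming the standard edges D -> G <- I -> S and G -> L.
EDGES = [("D", "G"), ("I", "G"), ("I", "S"), ("G", "L")]


def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in EDGES:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def trails(start, end, path=None):
    """Every simple trail from start to end, ignoring edge direction."""
    path = path or [start]
    if start == end:
        yield path
        return
    for a, b in EDGES:
        for here, there in ((a, b), (b, a)):
            if here == start and there not in path:
                yield from trails(there, end, path + [there])


def is_active(trail, observed):
    """Check every interior node of the trail against the two rules."""
    for prev, mid, nxt in zip(trail, trail[1:], trail[2:]):
        collider = (prev, mid) in EDGES and (nxt, mid) in EDGES
        if collider:
            # A v-structure passes influence only if its middle node or
            # one of that node's descendants is observed.
            if not ({mid} | descendants(mid)) & observed:
                return False
        elif mid in observed:
            # Any other observed interior node blocks the trail.
            return False
    return True


def d_separated(x, y, observed=frozenset()):
    """True if no trail between x and y is active given `observed`."""
    observed = set(observed)
    return not any(is_active(t, observed) for t in trails(x, y))
```

For example, d_separated("D", "I") holds, while d_separated("D", "S", {"L"}) does not,
because observing L activates the v-structure at G.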
Soundness and Completeness of D-separation
• Soundness: If D-separation identifies X and Y as d-separated given Z, then X and Y
are guaranteed to be conditionally independent given Z in any distribution that
factorizes over the BN structure. This property assures us that D-separation won't
mistakenly identify independencies that don't hold.
• Completeness: The completeness of D-separation is a bit more nuanced. While D-
separation identifies all independencies that hold for every distribution factorizing
over a given BN structure, there might be additional independencies that hold for
specific distributions due to particular numerical values in the CPDs. These specific
independencies might not be identifiable solely through D-separation.
To address this, the concept of faithfulness is introduced. A distribution is faithful to a BN
structure if all independencies in the distribution are reflected in the D-separation properties
of the graph. In other words, for a faithful distribution, D-separation is a complete method for
identifying independencies. However, it's important to note that not all distributions are
faithful, so there might be cases where D-separation misses some independencies specific to a
particular distribution.
D-separation is a fundamental tool for understanding and reasoning with Bayesian networks.
It allows us to leverage the graphical structure to efficiently determine conditional
independencies and gain insights into the relationships between variables in the model.

Using HMMs for Part-of-Speech Tagging

Hidden Markov Models (HMMs) are widely used in natural language processing, particularly
for Part-of-Speech (POS) tagging, which involves assigning grammatical tags (like noun,
verb, adjective) to words in a sentence. HMMs offer a probabilistic framework to capture the
sequential dependencies inherent in language.
The Basic Model: Words as Observations, POS Tags as Hidden States
In the context of POS tagging, the HMM structure reflects the underlying assumption that a
word's POS tag depends on the tag of the immediately preceding word (in a first-order
HMM), and that each word depends only on its own tag.
• Hidden States: Each hidden state in the HMM represents a possible POS tag (e.g.,
noun, verb, adjective).
• Observations: The observed variables are the words in a sentence.
• Transition Probabilities: These probabilities capture the likelihood of transitioning
from one POS tag to another. For instance, the probability of transitioning from a
noun (NN) to a verb (VB) might be lower than from a determiner (DT) to a noun.
• Emission Probabilities: These probabilities reflect the likelihood of observing a
particular word given a certain POS tag. For example, the word "dog" is more likely
to be emitted from a noun state than a verb state.
Training and Decoding
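• Training: During training, the HMM's parameters (transition and emission
probabilities) are learned from a corpus of text where words are already tagged with
their correct POS tags. With a tagged corpus this amounts to counting: the
probabilities are estimated from relative frequencies (maximum likelihood), usually
with smoothing. The Baum-Welch algorithm, a special case of the
Expectation-Maximization (EM) algorithm, is needed only when the tags are not
observed in the training text.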
• Decoding: Once the HMM is trained, it can be used to tag new, unseen sentences. The
Viterbi algorithm, a dynamic programming algorithm, is typically used to find the
most likely sequence of POS tags for a given sentence, effectively decoding the
sentence's grammatical structure.
Example
Consider a simplified HMM trained on a small corpus to tag the sentence "The cat sat."

POS Tag (State)     Emission Probabilities                     Transition Probabilities
DT (Determiner)     P("The" | DT) = 0.8, P("cat" | ...
NN (Noun)           P("cat" | NN) = 0.6, P("sat" | ...
VB (Verb)           P("sat" | VB) = 0.9, P("The" | ...

Given this model, the Viterbi algorithm would determine the most probable tag sequence for
"The cat sat" among all possible tag combinations (DT-NN-VB, DT-VB-NN, etc.), scored by
their emission and transition probabilities. It does so by dynamic programming, keeping only
the best-scoring path into each tag at each word rather than enumerating every combination.
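A compact sketch of Viterbi decoding for this sentence follows. Only P("The" | DT) = 0.8,
P("cat" | NN) = 0.6, and P("sat" | VB) = 0.9 come from the table above; the start
probabilities, the transition probabilities, and the remaining emission entries are
hypothetical values filled in so that the example runs.

```python
import math

TAGS = ("DT", "NN", "VB")
# Hypothetical parameters: only P("The"|DT)=0.8, P("cat"|NN)=0.6 and
# P("sat"|VB)=0.9 come from the notes; everything else is assumed.
START = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.2, "VB": 0.7},
    "VB": {"DT": 0.5, "NN": 0.3, "VB": 0.2},
}
EMIT = {
    "DT": {"The": 0.8, "cat": 0.1, "sat": 0.1},
    "NN": {"The": 0.1, "cat": 0.6, "sat": 0.3},
    "VB": {"The": 0.05, "cat": 0.05, "sat": 0.9},
}


def viterbi(words):
    """Most likely tag sequence, computed in log space."""
    # best[t][tag] = (log-prob of the best path ending in tag, backpointer)
    best = [{t: (math.log(START[t] * EMIT[t][words[0]]), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prev = max(
                TAGS, key=lambda p: best[-1][p][0] + math.log(TRANS[p][tag])
            )
            score = best[-1][prev][0] + math.log(TRANS[prev][tag] * EMIT[tag][word])
            column[tag] = (score, prev)
        best.append(column)
    # Trace back from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


tags = viterbi(["The", "cat", "sat"])
```

The work per word is proportional to the square of the number of tags, whereas scoring every
tag sequence separately would grow exponentially with sentence length.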
Strengths of HMMs in POS Tagging
• Simplicity and Efficiency: HMMs are relatively simple to implement and can be
trained and used for decoding efficiently, even on large datasets.
• Ability to Capture Sequential Dependencies: HMMs effectively model the
dependence of a word's POS tag on the tags of previous words.
Limitations of HMMs
• Limited Contextual Information: Basic HMMs only consider a limited window of
previous tags (usually one or two).
• Difficulty Handling Long-Range Dependencies: HMMs struggle to capture
relationships between words that are far apart in a sentence.
Extensions and Alternatives
While basic HMMs are useful for POS tagging, more advanced models like Conditional
Random Fields (CRFs) and deep learning architectures can be employed to address some of
their limitations. These models allow for the incorporation of richer features and more
complex dependencies between words.
Relevance of HMMs
Despite their limitations, HMMs remain an important baseline model for POS tagging and
have been instrumental in the development of more sophisticated natural language processing
techniques.

a) Temporal Models

Temporal models are used to represent systems that evolve over time. A common example is
a patient in an intensive care unit, where sensor readings like heart rate and blood pressure
are taken at regular intervals, and the goal is to track the patient's state over time. Another
example is tracking a robot's location as it moves and gathers observations.
A temporal model can be constructed using a system state, which represents a snapshot of
the system's relevant attributes at a particular time. The system state is usually represented as
an assignment of values to a set of random variables. For example, in the robot tracking
example, the system state might include variables for the robot's location and orientation.
One way to represent temporal models is using dynamic Bayesian networks (DBNs). A
DBN consists of a sequence of time slices, each of which is a Bayesian network that
represents the system state at a particular time. The time slices are connected by edges that
represent the temporal dependencies between variables.
A simple example of a temporal model is the hidden Markov model (HMM). An HMM has
a single state variable and a single observation variable. For example, in POS tagging, the
hidden state represents the POS tag, and the observation is the word.
The HMM structure reflects the assumption that a word's POS tag depends on the tag of the
immediately preceding word. The HMM's parameters, which include transition
probabilities (the likelihood of transitioning from one POS tag to another) and emission
probabilities (the likelihood of observing a particular word given a certain POS tag), are
learned from a corpus of text where words are already tagged with their correct POS tags.
Once trained, an HMM can be used to tag new, unseen sentences. The Viterbi algorithm is
used to find the most likely sequence of POS tags for a given sentence.
b) Log-Linear Parameterisation
Log-linear parameterisation is a way of representing factors in a graphical model using a log-
linear model. In this representation, the factors are derived from a number of feature
functions. The advantage of this approach is that it can make certain types of structure more
explicit and easier to see.
A log-linear model is a statistical model in which the logarithm of the unnormalised
probability of an assignment is a linear function of a set of features. The features can be any
functions of the variables in the model. The probability of an assignment is then proportional
to the exponential of the weighted sum of the features, with a normalising constant ensuring
that the probabilities sum to one.
Example:
Let's consider the task of named-entity recognition (NER) in natural language processing. In
NER, we want to identify and classify named entities in text, such as people, organisations,
and locations.
We could represent this task using a log-linear model, where the features are functions of the
words in the text and their surrounding context. For example, one feature might be whether
the current word is capitalised. Another feature might be whether the previous word is "the".
We can define a feature function f(Yt, Xt) that equals 1 if the target variable Yt (the named
entity tag) is "B-Organization" and the context word Xt is "Times", and 0 otherwise. A
log-linear model for sequence labelling built from feature functions of this type is a
Conditional Random Field (CRF). A linear-chain CRF relates to the HMM in the same way
that the logistic CPD relates to the naive Bayes classifier: each is the conditional analogue
of the generative model.
The probability of a particular named entity tag sequence would then be proportional to the
exponential of the sum of the weighted features for that sequence. The weights of the features
are learned from training data.
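The scoring rule just described can be sketched in Python as follows. This is a simplified illustration: it uses only per-word features (a full CRF would also include features over adjacent tags), and the tag set, the three feature functions, and the weight values are assumptions made up for the example rather than learned quantities.

```python
import math
from itertools import product

TAGS = ["O", "B-Organization"]  # hypothetical two-tag scheme

def features(y, x):
    """Binary feature functions of tag y and word x (all hypothetical)."""
    return [
        1.0 if y == "B-Organization" and x == "Times" else 0.0,
        1.0 if y == "B-Organization" and x[:1].isupper() else 0.0,
        1.0 if y == "O" and x.islower() else 0.0,
    ]

WEIGHTS = [2.0, 1.0, 1.5]  # assumed values; in practice these are learned from data

def score(tag_seq, words):
    """Sum of weighted features over every position in the sequence."""
    return sum(
        w * f
        for y, x in zip(tag_seq, words)
        for w, f in zip(WEIGHTS, features(y, x))
    )

def sequence_probability(tag_seq, words):
    """exp(score), normalised over all possible tag sequences for `words`."""
    z = sum(math.exp(score(seq, words)) for seq in product(TAGS, repeat=len(words)))
    return math.exp(score(tag_seq, words)) / z

words = ["the", "Times"]
for seq in product(TAGS, repeat=len(words)):
    print(seq, round(sequence_probability(seq, words), 3))
```

The normaliser here is computed by brute-force enumeration, which is only feasible for very short sentences. Practical CRF implementations compute it with dynamic programming.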
This is just one example of how log-linear parameterisation can be used in graphical models.
It is a powerful and flexible technique that can be used to represent a wide variety of
probabilistic models.

Bayesian Networks for Classification

Bayesian networks can be applied effectively to classification problems. This involves using
the network to predict the probability of an instance belonging to a particular class based on
the values of its observed features.
Here's how it works:
• Structure: The Bayesian network structure encodes the relationships between the
class variable and the features. For example, in a medical diagnosis setting, the class
variable might be the presence or absence of a disease, and the features might be
symptoms or test results. The network structure would then represent the causal
relationships between the disease and its symptoms.
• Parameters: The parameters of the Bayesian network specify the conditional
probability distributions (CPDs) associated with each variable. For example, a CPD
for a symptom variable might specify the probability of observing that symptom given
the presence or absence of the disease.
• Training: The network's parameters are learned from a training dataset, which
consists of instances with known class labels. Maximum likelihood estimation (MLE)
chooses the parameter values that maximise the likelihood of the training data.
Bayesian estimation instead combines the data with a prior distribution over the
parameters.
• Classification: Once the Bayesian network is trained, it can be used to classify new
instances. This is done by inferring the probability of the class variable given the
values of the observed features. The network's structure and parameters allow it to
reason about the dependencies between the features and the class variable, providing a
more informed prediction than a simple model that treats the features independently.
Example: Medical Diagnosis
Let's consider a simplified example of diagnosing a patient's condition based on their
symptoms. We'll use a Bayesian network with the following variables:
• Flu (F): A binary variable indicating whether the patient has the flu (F = 1) or not (F
= 0).
• Fever (Fe): A binary variable indicating whether the patient has a fever (Fe = 1) or
not (Fe = 0).
• Cough (C): A binary variable indicating whether the patient has a cough (C = 1) or
not (C = 0).
The network structure is F → Fe, F → C, reflecting the assumption that flu can cause fever
and cough.
The parameters of the network are:
• P(F): The prior probability of having the flu.
• P(Fe | F): The probability of having a fever given the presence or absence of flu.
• P(C | F): The probability of having a cough given the presence or absence of flu.
These parameters can be learned from a dataset of patients with known flu status and their
corresponding symptoms.
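A minimal sketch of this learning step, assuming fully observed data and maximum likelihood estimation: each CPD entry is a relative frequency. The function name `fit_cpds` and the ten patient records are hypothetical, invented for illustration.

```python
from collections import Counter

def fit_cpds(records):
    """Estimate P(F), P(Fe=1 | F) and P(C=1 | F) by relative frequency (MLE).

    `records` is a list of (flu, fever, cough) tuples with 0/1 entries.
    Assumes both flu = 0 and flu = 1 occur at least once in the data.
    """
    n_flu = Counter(f for f, _, _ in records)
    n_fever = Counter(f for f, fe, _ in records if fe == 1)
    n_cough = Counter(f for f, _, c in records if c == 1)
    total = len(records)
    return {
        "P(F)": {f: n_flu[f] / total for f in (0, 1)},
        "P(Fe=1|F)": {f: n_fever[f] / n_flu[f] for f in (0, 1)},
        "P(C=1|F)": {f: n_cough[f] / n_flu[f] for f in (0, 1)},
    }

# Hypothetical patient records: (flu, fever, cough)
RECORDS = [
    (1, 1, 1), (1, 1, 0), (1, 1, 1), (1, 0, 1),
    (0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0), (0, 0, 0), (0, 0, 0),
]

cpds = fit_cpds(RECORDS)
print(cpds["P(F)"][1])       # 0.4
print(cpds["P(Fe=1|F)"][1])  # 0.75
```

With small datasets, plain relative frequencies can produce probabilities of exactly 0 or 1. Bayesian estimation with a prior (for example, adding pseudo-counts) is a common way to avoid this.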
Now, suppose we have a new patient who presents with a fever but no cough. We can use the
Bayesian network to calculate the probability of them having the flu:
P(F = 1 | Fe = 1, C = 0) ∝ P(F = 1) * P(Fe = 1 | F = 1) * P(C = 0 | F = 1)
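This follows from Bayes' rule together with the network structure, in which Fe and C are
conditionally independent given F. The right-hand side is unnormalised: computing the same
product for F = 0 and dividing each product by the sum of the two gives the posterior
probability.
By comparing P(F = 1 | Fe = 1, C = 0) to a threshold (for example, 0.5), we can classify the
patient as having flu or not.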
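To make the calculation concrete, here is a short Python sketch that evaluates the product for both values of F and normalises. All CPD values are hypothetical numbers chosen for illustration; they do not come from real clinical data.

```python
# Hypothetical parameters (assumed for illustration only).
P_FLU = 0.1                       # P(F = 1)
P_FEVER_GIVEN = {1: 0.9, 0: 0.05}  # P(Fe = 1 | F = f)
P_COUGH_GIVEN = {1: 0.8, 0: 0.1}   # P(C = 1 | F = f)

def bernoulli(p_one, value):
    """Probability of `value` (0 or 1) when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def flu_posterior(fever, cough):
    """P(F = 1 | Fe = fever, C = cough) for the network F -> Fe, F -> C."""
    unnorm = {}
    for f in (0, 1):
        unnorm[f] = (
            bernoulli(P_FLU, f)
            * bernoulli(P_FEVER_GIVEN[f], fever)
            * bernoulli(P_COUGH_GIVEN[f], cough)
        )
    # Normalise so the two posterior values sum to 1.
    return unnorm[1] / (unnorm[0] + unnorm[1])

posterior = flu_posterior(fever=1, cough=0)
print(round(posterior, 3))  # 0.308
print("flu" if posterior >= 0.5 else "no flu")  # no flu
```

Under these assumed numbers the unnormalised values are 0.1 * 0.9 * 0.2 = 0.018 for F = 1 and 0.9 * 0.05 * 0.9 = 0.0405 for F = 0, so the posterior is 0.018 / 0.0585, or about 0.31.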
Benefits of Bayesian Networks for Classification
• Handling Uncertainty: Bayesian networks explicitly model uncertainty, providing a
probability distribution over possible class labels.
• Explaining Predictions: The network structure makes the reasoning process
transparent, allowing us to understand how the model arrived at a particular
prediction.
• Incorporating Prior Knowledge: The network structure can be used to incorporate
prior knowledge about the relationships between variables.
• Handling Missing Data: Bayesian networks can handle missing data gracefully,
using inference techniques to marginalise over the missing values.
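As a hedged illustration of the missing-data point, the sketch below computes the flu posterior for the flu, fever, and cough network when only some symptoms are observed, by summing the joint probability over every value of the unobserved ones. The CPD values and function names are hypothetical, chosen only for the example.

```python
from itertools import product

# Hypothetical parameters (assumed for illustration only).
P_FLU = 0.1  # P(F = 1)
P_SYMPTOM_GIVEN_FLU = {  # P(symptom = 1 | F = f)
    "fever": {1: 0.9, 0: 0.05},
    "cough": {1: 0.8, 0: 0.1},
}

def joint(flu, symptoms):
    """Joint probability of a flu value and a full symptom assignment."""
    p = P_FLU if flu == 1 else 1.0 - P_FLU
    for name, value in symptoms.items():
        p_one = P_SYMPTOM_GIVEN_FLU[name][flu]
        p *= p_one if value == 1 else 1.0 - p_one
    return p

def posterior_with_missing(observed):
    """P(F = 1 | observed), marginalising over any symptom not in `observed`."""
    missing = [name for name in P_SYMPTOM_GIVEN_FLU if name not in observed]
    unnorm = {0: 0.0, 1: 0.0}
    for flu in (0, 1):
        # Sum over every possible completion of the missing symptoms.
        for values in product((0, 1), repeat=len(missing)):
            full = {**observed, **dict(zip(missing, values))}
            unnorm[flu] += joint(flu, full)
    return unnorm[1] / (unnorm[0] + unnorm[1])

# Cough was not recorded for this patient.
print(round(posterior_with_missing({"fever": 1}), 3))  # 0.667
```

In this particular structure the sum over a missing symptom equals 1 for each value of F, so the missing variable simply drops out of the product. The explicit enumeration is shown only to make the marginalisation visible.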
Limitations
• Computational Complexity: Inference in Bayesian networks can be computationally
expensive, especially for networks with a large number of variables or complex
structures.
• Structure Learning: Learning the optimal network structure from data can be
challenging, especially when the number of variables is large.
Overall, Bayesian networks offer a powerful and flexible framework for classification
problems, particularly in domains where uncertainty and dependencies between variables are
important considerations.
