PGM Theory Notes
Plate Models
Plate models are a type of graphical model that can be used to represent complex
probabilistic models in a compact way. They are particularly useful for representing models
that involve multiple objects of the same type.
Plate models use the concept of a plate, which is a rectangular box that represents a set of
objects. Within a plate, we can define attributes, which are random variables that are
associated with each object in the set. For example, if we have a plate representing students,
we might have attributes for their intelligence, grade, and difficulty of their courses.
Connections between attributes in a plate model can represent probabilistic dependencies
between the corresponding random variables. These connections can be either directed or
undirected, depending on the type of model we are trying to represent.
Example of a Plate Model
Consider an example of a plate model for a simplified Student scenario. In this
example, we have two plates: one representing students and one representing courses. The
attributes in the model are:
• Intelligence (I): An attribute of the Students plate, representing the intelligence of
each student.
• Grade (G): An attribute of both the Students and Courses plates, representing the
grade that a student receives in a course.
• Difficulty (D): An attribute of the Courses plate, representing the difficulty of each
course.
The connections in the model indicate that:
• A student's grade in a course depends on their intelligence and the difficulty of the
course.
• The intelligence of different students is independent.
• The difficulty of different courses is independent.
This plate model can be used to generate a ground Bayesian network, which is a Bayesian
network that represents the joint probability distribution over all the random variables in the
model. The ground Bayesian network has an Intelligence node for each student, a Difficulty
node for each course, and a Grade node for each (student, course) pair in which the student
takes the course, with edges from the corresponding Intelligence and Difficulty nodes into
each Grade node. A sketch of this grounding is shown below.
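As a rough illustration, here is a minimal sketch of how such a grounding could be built. The student and course names and the enrolment relation are made-up assumptions; each enrolled (student, course) pair gets its own Grade node with the corresponding Intelligence and Difficulty nodes as parents.

```python
# A minimal sketch of grounding the Student plate model. The student and
# course names and the enrolment relation below are made-up assumptions.
students = ["alice", "bob"]
courses = ["cs101", "cs102"]
enrolled = [("alice", "cs101"), ("alice", "cs102"), ("bob", "cs101")]

nodes, edges = [], []
for s in students:
    nodes.append(f"I({s})")          # one Intelligence node per student
for c in courses:
    nodes.append(f"D({c})")          # one Difficulty node per course
for s, c in enrolled:
    g = f"G({s},{c})"                # one Grade node per enrolled pair
    nodes.append(g)
    edges.append((f"I({s})", g))     # Grade depends on the student's Intelligence
    edges.append((f"D({c})", g))     # ...and on the course's Difficulty

print(len(nodes), "nodes,", len(edges), "edges")
for parent, child in edges:
    print(parent, "->", child)
```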
Rule-Based CPD vs. Tree-Based CPD
Both rule-based and tree-based CPDs are ways of representing context-specific
independence in a CPD. This means that the conditional probability distribution of a variable
can vary depending on the values of its parent variables.
Tree-Based CPDs represent context-specific independence using a tree structure. The
internal nodes of the tree correspond to tests on the parent variables, and the leaves of the tree
correspond to different conditional probability distributions. To find the conditional
probability distribution for a particular instantiation of the parent variables, we traverse the
tree from the root to a leaf, following the branches that correspond to the values of the parent
variables.
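As a minimal sketch, with made-up parents (Grade, SAT), child values, and numbers, a tree-based CPD can be read as nested tests that end in a leaf distribution; note that when Grade is "high" the SAT test is never reached, which is exactly a context-specific independence:

```python
# A minimal sketch of a tree-based CPD, with made-up parents (Grade, SAT),
# made-up child values, and made-up numbers.
def tree_cpd(grade, sat):
    """Traverse from the root to a leaf, following the parents' values."""
    if grade == "high":                                 # root test on Grade
        return {"letter": 0.9, "no_letter": 0.1}        # leaf: SAT never consulted
    if sat == "high":                                   # second test on SAT
        return {"letter": 0.6, "no_letter": 0.4}        # leaf
    return {"letter": 0.1, "no_letter": 0.9}            # leaf

print(tree_cpd("high", "low"))   # SAT is irrelevant in the grade=high context
```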
Rule-Based CPDs represent context-specific independence using a set of rules. Each rule
specifies a context, which is a partial assignment to the parent variables, and a conditional
probability distribution for the child variable given that context. The rules in a rule-based
CPD must be mutually exclusive and exhaustive, meaning that every possible instantiation
of the parent variables is covered by exactly one rule.
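The same hypothetical CPD can be written in rule-based form, with each rule pairing a context with a distribution; the assertion below checks the mutual-exclusivity-and-exhaustiveness requirement:

```python
# The same hypothetical CPD as a rule-based CPD: each rule is (context,
# distribution), and the contexts must be mutually exclusive and exhaustive.
rules = [
    ({"grade": "high"},               {"letter": 0.9, "no_letter": 0.1}),
    ({"grade": "low", "sat": "high"}, {"letter": 0.6, "no_letter": 0.4}),
    ({"grade": "low", "sat": "low"},  {"letter": 0.1, "no_letter": 0.9}),
]

def rule_cpd(assignment):
    """Return the distribution of the unique rule whose context matches."""
    matches = [dist for context, dist in rules
               if all(assignment[var] == val for var, val in context.items())]
    assert len(matches) == 1, "rules must be mutually exclusive and exhaustive"
    return matches[0]

print(rule_cpd({"grade": "high", "sat": "low"}))   # matches the first rule only
```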
Differences
Here's a summary of the key differences between the two representations:
• Representation: A rule-based CPD is a set of rules, each with a context and a conditional
probability distribution; a tree-based CPD is a tree structure with tests on parent variables
at internal nodes and conditional distributions at the leaves.
• Specificity: Rules can represent very specific contexts; in a tree, contexts are defined by
paths from the root, which can be less specific.
• Ease of reasoning: Rules make it easier to reason about individual dependencies; a tree
can make it easier to visualize the overall structure of the CPD.
D-Separation
D-separation, short for directed separation, is a concept used in Bayesian networks (BNs) to
determine conditional independence relationships between variables. It leverages the
graphical structure of the BN to efficiently identify independencies that hold for all
distributions factorizing over that graph structure. In simpler terms, D-separation allows us to
"read off" independencies directly from the BN graph without needing to analyse the specific
numerical values in the conditional probability distributions (CPDs).
The core idea behind D-separation is the notion of "flow of influence". We analyse how
information or influence might flow between variables along the paths or trails in the BN
graph. If the flow of influence between two variables, X and Y, is blocked given a set of
observed variables Z, then we say that X and Y are d-separated given Z. This implies that X
and Y are conditionally independent given Z in any distribution that factorises over the BN
structure.
Active Trails and Blocking
To determine whether influence can flow between variables, we need to consider the concept
of an active trail. A trail between two variables is a sequence of connected edges in the
graph, regardless of their direction.
A trail is considered active given a set of observed variables Z if:
• For every v-structure along the trail (a segment of the form X → W ← Y), the middle
node W or one of its descendants is in the observed set Z. Observing W or one of its
descendants "activates" the v-structure, allowing influence to flow through it.
• No other node along the trail is in Z. Observing any non-collider node on the trail
"blocks" the flow of influence.
D-separation is then defined as follows:
• Two sets of variables X and Y are d-separated given a set Z if there is no active trail
between any node in X and any node in Y given Z.
Example
Let's illustrate D-separation with an example using the classic Student BN.
Imagine the following structure:
• Difficulty (D): The difficulty of a course.
• Intelligence (I): The intelligence of a student.
• Grade (G): The grade a student receives in a course.
• SAT (S): The student's SAT score.
• Letter (L): Whether the student receives a recommendation letter.
Here are some D-separation statements and explanations:
1. D and I are d-separated. There is only one trail between D and I (D → G ← I),
which forms a v-structure. Since neither G nor any of its descendants are observed,
this trail is not active. Hence, D and I are d-separated, meaning they are marginally
independent.
2. D and S are not d-separated given G. The only trail connecting D and S is D → G ← I
→ S. Observing G activates the v-structure D → G ← I, and I, the only other internal
node on the trail, is not observed, so the trail is active. Therefore D and S are not
d-separated given G: once the student's grade is observed, the difficulty of the course
and the student's SAT score become dependent (the "explaining away" effect).
3. D and S are also not d-separated given L. Because L is a descendant of G, observing
L activates the v-structure D → G ← I. The entire trail D → G ← I → S becomes
active, meaning that D and S are not d-separated given L. This shows that knowing
whether a student received a recommendation letter can induce a dependence between
the course difficulty and the student's SAT score.
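To make the criterion concrete, here is a minimal sketch of a d-separation test. It uses the standard ancestral-moral-graph criterion (X and Y are d-separated given Z exactly when removing Z disconnects them in the moralised ancestral graph), which is equivalent to the active-trail definition above. The dictionary encoding of the Student network is an assumption of the sketch, and the three checks mirror the three statements above.

```python
# A minimal sketch of a d-separation test using the ancestral-moral-graph
# criterion; the dict encoding of the Student network is an assumption.

def ancestors(graph, nodes):
    """All ancestors of `nodes`, including the nodes themselves."""
    parents = {v: set() for v in graph}
    for u, children in graph.items():
        for c in children:
            parents[c].add(u)
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(graph, xs, ys, zs):
    """graph maps each node to a list of its children (a DAG)."""
    keep = ancestors(graph, set(xs) | set(ys) | set(zs))
    # Moralise the ancestral subgraph: drop edge directions and "marry"
    # parents that share a child.
    und = {v: set() for v in keep}
    for u in keep:
        for c in graph[u]:
            if c in keep:
                und[u].add(c)
                und[c].add(u)
    for v in keep:
        pars = [u for u in keep if v in graph[u]]
        for i in range(len(pars)):
            for j in range(i + 1, len(pars)):
                und[pars[i]].add(pars[j])
                und[pars[j]].add(pars[i])
    # Remove the observed nodes, then test whether xs can still reach ys.
    blocked = set(zs)
    frontier, seen = [x for x in xs if x not in blocked], set(xs)
    while frontier:
        v = frontier.pop()
        if v in ys:
            return False            # unblocked connection found
        for w in und[v]:
            if w not in seen and w not in blocked:
                seen.add(w)
                frontier.append(w)
    return True

# Student network: D -> G <- I, I -> S, G -> L
student = {"D": ["G"], "I": ["G", "S"], "G": ["L"], "S": [], "L": []}
print(d_separated(student, {"D"}, {"I"}, set()))   # True:  statement 1
print(d_separated(student, {"D"}, {"S"}, {"G"}))   # False: statement 2
print(d_separated(student, {"D"}, {"S"}, {"L"}))   # False: statement 3
```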
Soundness and Completeness of D-separation
• Soundness: If D-separation identifies X and Y as d-separated given Z, then X and Y
are guaranteed to be conditionally independent given Z in any distribution that
factorizes over the BN structure. This property assures us that D-separation won't
mistakenly identify independencies that don't hold.
• Completeness: The completeness of D-separation is a bit more nuanced. While D-
separation identifies all independencies that hold for every distribution factorizing
over a given BN structure, there might be additional independencies that hold for
specific distributions due to particular numerical values in the CPDs. These specific
independencies might not be identifiable solely through D-separation.
To address this, the concept of faithfulness is introduced. A distribution is faithful to a BN
structure if all independencies in the distribution are reflected in the D-separation properties
of the graph. In other words, for a faithful distribution, D-separation is a complete method for
identifying independencies. However, it's important to note that not all distributions are
faithful, so there might be cases where D-separation misses some independencies specific to a
particular distribution.
D-separation is a fundamental tool for understanding and reasoning with Bayesian networks.
It allows us to leverage the graphical structure to efficiently determine conditional
independencies and gain insights into the relationships between variables in the model.
Hidden Markov Models for POS Tagging
Hidden Markov Models (HMMs) are widely used in natural language processing, particularly
for Part-of-Speech (POS) tagging, which involves assigning grammatical tags (like noun,
verb, adjective) to words in a sentence. HMMs offer a probabilistic framework to capture the
sequential dependencies inherent in language.
The Basic Model: Words as Observations, POS Tags as Hidden States
In the context of POS tagging, the HMM structure reflects the underlying assumption that the
probability of a word's POS tag depends on the immediately preceding tag (a first-order
Markov assumption).
• Hidden States: Each hidden state in the HMM represents a possible POS tag (e.g.,
noun, verb, adjective).
• Observations: The observed variables are the words in a sentence.
• Transition Probabilities: These probabilities capture the likelihood of transitioning
from one POS tag to another. For instance, the probability of transitioning from a
noun (NN) to a verb (VB) might be lower than from a determiner (DT) to a noun.
• Emission Probabilities: These probabilities reflect the likelihood of observing a
particular word given a certain POS tag. For example, the word "dog" is more likely
to be emitted from a noun state than a verb state.
Training and Decoding
• Training: During training, the HMM's parameters (transition and emission
probabilities) are learned from a corpus of text. When the corpus is already tagged
with the correct POS tags, the parameters can be estimated directly by counting
tag-to-tag and tag-to-word frequencies (maximum likelihood estimation); when the
tags are unobserved, the Baum-Welch algorithm, a special case of the
Expectation-Maximization (EM) algorithm, is used instead (a counting sketch follows
this list).
• Decoding: Once the HMM is trained, it can be used to tag new, unseen sentences. The
Viterbi algorithm, a dynamic programming algorithm, is typically used to find the
most likely sequence of POS tags for a given sentence, effectively decoding the
sentence's grammatical structure.
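As a minimal sketch of the supervised (counting) case, using a tiny made-up tagged corpus rather than a real one:

```python
# A minimal sketch of supervised HMM parameter estimation by counting. The
# tiny tagged corpus below is a made-up assumption; real taggers use large
# annotated corpora and smoothing for unseen words.
from collections import Counter, defaultdict

corpus = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VB")],
    [("the", "DT"), ("dog", "NN"), ("ran", "VB")],
]

trans_counts = defaultdict(Counter)   # counts of tag -> next tag
emit_counts = defaultdict(Counter)    # counts of tag -> word

for sentence in corpus:
    for i, (word, tag) in enumerate(sentence):
        emit_counts[tag][word] += 1
        if i + 1 < len(sentence):
            trans_counts[tag][sentence[i + 1][1]] += 1

def normalise(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

trans_p = {tag: normalise(c) for tag, c in trans_counts.items()}
emit_p = {tag: normalise(c) for tag, c in emit_counts.items()}
print(trans_p["DT"])   # {'NN': 1.0}
print(emit_p["NN"])    # {'cat': 0.5, 'dog': 0.5}
```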
Example
Consider a simplified HMM trained on a small corpus to tag the sentence "The cat sat."
Given this model, the Viterbi algorithm would determine the most probable tag sequence for
"The cat sat" by searching over all possible tag sequences (DT-NN-VB, DT-VB-NN, etc.)
with dynamic programming, rather than enumerating them one by one, and scoring them
using the emission and transition probabilities, as in the sketch below.
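A minimal Viterbi sketch for this example, with made-up transition and emission probabilities rather than values estimated from any real corpus, might look like this:

```python
# A minimal Viterbi sketch for "The cat sat". The tag set and the
# transition/emission probabilities are made-up assumptions.
import math

tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {                                  # P(next tag | current tag)
    "DT": {"DT": 0.05, "NN": 0.85, "VB": 0.10},
    "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
    "VB": {"DT": 0.40, "NN": 0.40, "VB": 0.20},
}
emit_p = {                                   # P(word | tag)
    "DT": {"the": 0.9, "cat": 0.05, "sat": 0.05},
    "NN": {"the": 0.05, "cat": 0.8, "sat": 0.15},
    "VB": {"the": 0.05, "cat": 0.15, "sat": 0.8},
}

def viterbi(words):
    # delta[t][tag] = best log-probability of any tag sequence ending in `tag`
    # at position t; back[t][tag] remembers the previous tag on that best path.
    delta = [{t: math.log(start_p[t]) + math.log(emit_p[t][words[0]]) for t in tags}]
    back = [{}]
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev, score = max(
                ((p, delta[-1][p] + math.log(trans_p[p][t])) for p in tags),
                key=lambda pair: pair[1],
            )
            scores[t] = score + math.log(emit_p[t][w])
            pointers[t] = prev
        delta.append(scores)
        back.append(pointers)
    # Trace back from the best final tag.
    best = max(tags, key=lambda t: delta[-1][t])
    path = [best]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "cat", "sat"]))   # expected: ['DT', 'NN', 'VB']
```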
Strengths of HMMs in POS Tagging
• Simplicity and Efficiency: HMMs are relatively simple to implement and can be
trained and used for decoding efficiently, even on large datasets.
• Ability to Capture Sequential Dependencies: HMMs effectively model the
dependence of a word's POS tag on the tags of previous words.
Limitations of HMMs
• Limited Contextual Information: Basic HMMs only consider a limited window of
previous tags (usually one or two).
• Difficulty Handling Long-Range Dependencies: HMMs struggle to capture
relationships between words that are far apart in a sentence.
Extensions and Alternatives
While basic HMMs are useful for POS tagging, more advanced models like Conditional
Random Fields (CRFs) and deep learning architectures can be employed to address some of
their limitations. These models allow for the incorporation of richer features and more
complex dependencies between words.
Relevance of HMMs
Despite their limitations, HMMs remain an important baseline model for POS tagging and
have been instrumental in the development of more sophisticated natural language processing
techniques.
a) Temporal Models
Temporal models are used to represent systems that evolve over time. A common example is
a patient in an intensive care unit, where sensor readings like heart rate and blood pressure
are taken at regular intervals, and the goal is to track the patient's state over time. Another
example is tracking a robot's location as it moves and gathers observations.
A temporal model can be constructed using a system state, which represents a snapshot of
the system's relevant attributes at a particular time. The system state is usually represented as
an assignment of values to a set of random variables. For example, in the robot tracking
example, the system state might include variables for the robot's location and orientation.
One way to represent temporal models is using dynamic Bayesian networks (DBNs). A
DBN consists of a sequence of time slices, each of which is a Bayesian network that
represents the system state at a particular time. The time slices are connected by edges that
represent the temporal dependencies between variables.
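As a minimal sketch, unrolling a hypothetical two-slice template (State_t → Obs_t within a slice, State_t → State_{t+1} between slices) simply copies the slice for each time step and adds the inter-slice edges:

```python
# A minimal sketch of unrolling a DBN into a ground network. The two-slice
# template below is a made-up assumption.
T = 4                                                  # number of time slices
edges = []
for t in range(T):
    edges.append((f"State_{t}", f"Obs_{t}"))           # intra-slice edge
    if t + 1 < T:
        edges.append((f"State_{t}", f"State_{t+1}"))   # inter-slice edge

for parent, child in edges:
    print(parent, "->", child)
```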
A simple example of a temporal model is the hidden Markov model (HMM). An HMM has
a single state variable and a single observation variable in each time slice. For example, in
POS tagging, the hidden state represents the POS tag, and the observation is the word.
The HMM structure reflects the assumption that the probability of a word's POS tag depends
on the immediately preceding tag. The HMM's parameters, which include transition
probabilities (the likelihood of transitioning from one POS tag to another) and emission
probabilities (the likelihood of observing a particular word given a certain POS tag), are
learned from a corpus of text where words are already tagged with their correct POS tags.
Once trained, an HMM can be used to tag new, unseen sentences. The Viterbi algorithm is
used to find the most likely sequence of POS tags for a given sentence.
b) Log-Linear Parameterisation
Log-linear parameterisation is a way of representing factors in a graphical model using a log-
linear model. In this representation, the factors are derived from a number of feature
functions. The advantage of this approach is that it can make certain types of structure more
explicit and easier to see.
A log-linear model is a statistical model in which the log of the (unnormalised) probability
of an event is a linear function of a set of features. The features can be any functions of the
variables in the model. The probability of an event is then proportional to the exponential of
a weighted linear combination of the features, normalised over all possible events.
Example:
Let's consider the task of named-entity recognition (NER) in natural language processing. In
NER, we want to identify and classify named entities in text, such as people, organisations,
and locations.
We could represent this task using a log-linear model, where the features are functions of the
words in the text and their surrounding context. For example, one feature might be whether
the current word is capitalised. Another feature might be whether the previous word is "the".
We can define a feature function f(Yt, Xt) that equals 1 if the target variable Yt (the named
entity tag) is "B-Organization" and the context word Xt is "Times", and 0 otherwise. A
log-linear model for sequence labelling built from feature functions of this kind is essentially
a linear-chain Conditional Random Field (CRF), the conditional analogue of an HMM, much
as logistic CPDs are the conditional analogue of the naive Bayes classifier.
The probability of a particular named entity tag sequence would then be proportional to the
exponential of the sum of the weighted features for that sequence. The weights of the features
are learned from training data.
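As a minimal sketch of a single-position log-linear factor for this NER setting (ignoring the chain structure over neighbouring tags), with made-up feature functions and weights:

```python
# A minimal sketch of a log-linear factor for the NER example. The tag set,
# feature functions, and weights are made-up assumptions; the distribution
# over tags is proportional to exp(weighted sum of features).
import math

TAGS = ["B-Organization", "B-Person", "O"]

def features(tag, word, prev_word):
    """Binary feature functions f_i(Y_t, X_t), pairing the tag with the context."""
    return {
        "entity_and_capitalised": float(tag != "O" and word[0].isupper()),
        "entity_and_prev_the":    float(tag != "O" and prev_word.lower() == "the"),
        "org_and_Times":          float(tag == "B-Organization" and word == "Times"),
    }

weights = {"entity_and_capitalised": 0.5, "entity_and_prev_the": 0.3, "org_and_Times": 2.0}

def tag_distribution(word, prev_word):
    scores = {t: sum(weights[k] * v for k, v in features(t, word, prev_word).items())
              for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())        # partition function
    return {t: math.exp(s) / z for t, s in scores.items()}

print(tag_distribution("Times", "the"))   # mass shifts toward B-Organization
```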
This is just one example of how log-linear parameterisation can be used in graphical models.
It is a powerful and flexible technique that can be used to represent a wide variety of
probabilistic models.
Bayesian Networks for Classification
Bayesian networks can be applied effectively to classification problems. This involves using
the network to predict the probability of an instance belonging to a particular class based on
the values of its observed features.
Here's how it works:
• Structure: The Bayesian network structure encodes the relationships between the
class variable and the features. For example, in a medical diagnosis setting, the class
variable might be the presence or absence of a disease, and the features might be
symptoms or test results. The network structure would then represent the causal
relationships between the disease and its symptoms.
• Parameters: The parameters of the Bayesian network specify the conditional
probability distributions (CPDs) associated with each variable. For example, a CPD
for a symptom variable might specify the probability of observing that symptom given
the presence or absence of the disease.
• Training: The network's parameters are learned from a training dataset, which
consists of instances with known class labels. The goal of training is to find the
parameter values that maximise the likelihood of the training data. This can be
achieved using techniques like maximum likelihood estimation (MLE) or Bayesian
estimation.
• Classification: Once the Bayesian network is trained, it can be used to classify new
instances. This is done by inferring the probability of the class variable given the
values of the observed features. The network's structure and parameters allow it to
reason about the dependencies between the features and the class variable, providing a
more informed prediction than a simple model that treats the features independently.
Example: Medical Diagnosis
Let's consider a simplified example of diagnosing a patient's condition based on their
symptoms. We'll use a Bayesian network with the following variables:
• Flu (F): A binary variable indicating whether the patient has the flu (F = 1) or not (F
= 0).
• Fever (Fe): A binary variable indicating whether the patient has a fever (Fe = 1) or
not (Fe = 0).
• Cough (C): A binary variable indicating whether the patient has a cough (C = 1) or
not (C = 0).
The network structure is F → Fe, F → C, reflecting the assumption that flu can cause fever
and cough.
The parameters of the network are:
• P(F): The prior probability of having the flu.
• P(Fe | F): The probability of having a fever given the presence or absence of flu.
• P(C | F): The probability of having a cough given the presence or absence of flu.
These parameters can be learned from a dataset of patients with known flu status and their
corresponding symptoms.
Now, suppose we have a new patient who presents with a fever but no cough. We can use the
Bayesian network to calculate the probability of them having the flu:
P(F = 1 | Fe = 1, C = 0) ∝ P(F = 1) * P(Fe = 1 | F = 1) * P(C = 0 | F = 1)
This calculation uses Bayes' rule and takes into account the dependencies between the
variables encoded in the network structure.
By comparing this probability to a threshold, we can classify the patient as having flu or not.
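A minimal sketch of this calculation, with made-up CPT values standing in for parameters that would normally be estimated from patient data:

```python
# A minimal sketch of the flu example. The CPT numbers are made-up
# assumptions; in practice they would be learned from patient records.

# Prior and conditional probability tables for the network F -> Fe, F -> C.
p_f = {1: 0.10, 0: 0.90}                                  # P(F)
p_fe = {1: {1: 0.80, 0: 0.20}, 0: {1: 0.10, 0: 0.90}}     # P(Fe | F)
p_c = {1: {1: 0.70, 0: 0.30}, 0: {1: 0.20, 0: 0.80}}      # P(C | F)

def posterior_flu(fe, c):
    """P(F = 1 | Fe = fe, C = c) by enumerating the two values of F."""
    joint = {f: p_f[f] * p_fe[f][fe] * p_c[f][c] for f in (0, 1)}
    return joint[1] / (joint[0] + joint[1])

prob = posterior_flu(fe=1, c=0)
print(f"P(flu | fever, no cough) = {prob:.3f}")
print("flu" if prob > 0.5 else "no flu")   # classify by thresholding at 0.5
```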
Benefits of Bayesian Networks for Classification
• Handling Uncertainty: Bayesian networks explicitly model uncertainty, providing a
probability distribution over possible class labels.
• Explaining Predictions: The network structure makes the reasoning process
transparent, allowing us to understand how the model arrived at a particular
prediction.
• Incorporating Prior Knowledge: The network structure can be used to incorporate
prior knowledge about the relationships between variables.
• Handling Missing Data: Bayesian networks can handle missing data gracefully,
using inference techniques to marginalise over the missing values.
Limitations
• Computational Complexity: Inference in Bayesian networks can be computationally
expensive, especially for networks with a large number of variables or complex
structures.
• Structure Learning: Learning the optimal network structure from data can be
challenging, especially when the number of variables is large.
Overall, Bayesian networks offer a powerful and flexible framework for classification
problems, particularly in domains where uncertainty and dependencies between variables are
important considerations.