
ECE 368 Course Review

Probabilistic Reasoning
2023
1. What is Probabilistic Reasoning
• Why? Decision-Making with Uncertain Information
• Events not directly observable, Measurement errors, …
• Analytics Pipelines: Observe, Analyze, Decide, Act
• Probabilistic Inference: Classification, Regression, and Learning

Pipeline diagram: Events → Observe → (Data) → Analyze → (Evidence) → Decide → Action, with Learning and Knowledge Accumulation as a feedback loop.


Data, Learning and Model Development

• Data: Scalar, Vector, Temporal, Spatial, Composite


• Training: Use data to develop models
• Supervised: Labeled Dataset (input and output data)
• Unsupervised: Unlabeled dataset (input data only)
• Periodic Retraining
• Reinforcement (state, action, reward)

Training: Dataset → Training → Model
Inference: Data → Model → Inference


Bayesian Statistics and Frequentist Statistics

P(H_i | D) = P(D | H_i) P(H_i) / P(D)  ∝  P(D | H_i) P(H_i)

Frequentist:
• A priori probability not used
• Hypotheses not usually the result of an experiment
• Objective assessment of evidence
• Confidence intervals & p-values
• A posteriori probability not used
• Less computationally intensive

Bayesian:
• A priori probability over hypotheses
• Must know or construct a "subjective" prior
• Can explore different priors
• Computationally intensive
• Learns as data accumulate
• A posteriori enables decisions
Learning Objectives
1. Joint distributions, marginals, conditionals and Bayes rule.
2. Vector-based probabilistic models, e.g., jointly Gaussian vectors,
binomials, multinomials, conjugate priors
3. Hypothesis Testing: Naïve Bayes, Gaussian Discriminants, Likelihood
Ratio test, Bayesian Testing, Type I/II errors, cost function
4. Estimation: Likelihood function; Linear regression, Bayesian, LMS
5. Graphical models, message-passing inference.
6. Hidden Markov Models (HMMs), the forward-backward algorithm
and Viterbi algorithm
Maximum Likelihood & Bayesian Parameter Estimation

• Sample Mean and Convergence Properties


• Estimating parameters of a Random Variable
• ML Estimation and Frequentist Interpretation
• MAP Estimation and Bayesian Interpretation
• Bayesian Least Mean Square (LMS) Estimation
Parameter Estimation
• Assume IID sequence of RVs
• Estimate a parameter of X

• Example: Bernoulli RV
Properties of Estimators
• Estimation error and Bias
Maximum Likelihood Estimation
• Likelihood
Log Likelihood Function
• Log likelihood
MLE Bernoulli RV
MLE Bernoulli RV
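As a quick illustration (not from the original slides; the data below are made up), a minimal Python sketch of the Bernoulli ML estimate, which reduces to the sample mean of the 0/1 observations:

```python
import numpy as np

# Hypothetical IID Bernoulli(p) sample (made-up data for illustration)
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# ML estimate of p maximizes the log likelihood
#   log L(p) = k*log(p) + (n - k)*log(1 - p),  k = sum(x), n = len(x).
# Setting the derivative to zero gives p_ML = k / n, the sample mean.
p_ml = x.mean()
print(f"p_ML = {p_ml:.3f}")   # 0.700 for this sample
```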
Laplace: Will the Sun Rise Tomorrow?
Frequentist vs. Bayesian
Estimation Using Conditional Expectation
Bayes Inference
• Prior Distribution on θ
• Conditional distribution given θ
Maximum A Posteriori Probability Rule
MAP Estimate for Binomial with Beta Prior
MAP Rule for Prediction
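A small numerical sketch (illustrative values only) of the Beta–Bernoulli conjugate update, the resulting MAP estimate, and the posterior-mean prediction; with a uniform Beta(1, 1) prior the predictive probability reduces to Laplace's rule of succession (k + 1)/(n + 2):

```python
# Beta(a, b) prior on the Bernoulli parameter theta; k successes in n trials.
# (Illustrative numbers; with a = b = 1 the predictive mean reduces to
#  Laplace's rule of succession (k + 1) / (n + 2).)
a, b = 1.0, 1.0
n, k = 10, 7

# Posterior is Beta(a + k, b + n - k) by conjugacy.
a_post, b_post = a + k, b + (n - k)

theta_map = (a_post - 1) / (a_post + b_post - 2)   # posterior mode (MAP)
theta_lms = a_post / (a_post + b_post)             # posterior mean (LMS / predictive)

print(f"MAP estimate: {theta_map:.3f}")                    # 0.700 for the uniform prior
print(f"Posterior mean / P[next = 1]: {theta_lms:.3f}")    # (k+1)/(n+2) = 0.667
```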
Maximum Likelihood & Bayesian Parameter Estimation
More on ML, MAP, LMS Estimators
• Comparison of ML, MAP, LMS & Conditional Expectation
• Poisson RV with Gamma Prior
• Gaussian RV with Gaussian Prior
• Multinomial RV with Dirichlet Prior
Frequentist and Bayesian Inference

• Frequentist • Bayesian
LMS & Conditional Expectation
Important Conjugate Priors
Sample Variance
MAP for Gaussian RV with Gaussian Prior
ML Estimator for Multinomial RV
MAP for Multinomial RV with Dirichlet Prior
Estimation of Gaussian Vectors
• Gaussian Vector Estimation Problems
• Conditional Gaussian Distributions
• Marginal Gaussian Distributions
• Gaussian Systems
• ML Estimation
• MAP Estimation

• Bishop: Section 2.3


• Murphy: Section 4.3, 4.4
Conditional pdf of 2D Gaussian
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)

             = exp{ −[x − ρ_{X,Y}(σ_1/σ_2)(y − μ_2) − μ_1]² / [2(1 − ρ_{X,Y}²)σ_1²] } / √(2π σ_1²(1 − ρ_{X,Y}²))

• X given Y=y is a Gaussian RV with mean & variance

E[X | Y = y] = ρ_{X,Y}(σ_1/σ_2)(y − μ_2) + μ_1        VAR[X | Y = y] = (1 − ρ_{X,Y}²) σ_1²

• Max of pdf of X given Y=y at E[X|y]


• Least Mean Square (LMS) estimate also at E[X|y]
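A minimal sketch (parameter values are made up) that evaluates the conditional mean and variance above; the conditional mean is both the MAP and the LMS estimate of X given Y = y:

```python
import numpy as np

# Illustrative 2D Gaussian parameters (made up for the example)
mu1, mu2 = 1.0, -2.0       # means of X and Y
s1, s2 = 2.0, 1.0          # standard deviations sigma_1, sigma_2
rho = 0.8                  # correlation coefficient rho_{X,Y}

def conditional_x_given_y(y):
    """Mean and variance of X | Y = y for a jointly Gaussian (X, Y)."""
    mean = mu1 + rho * (s1 / s2) * (y - mu2)
    var = (1.0 - rho**2) * s1**2
    return mean, var

m, v = conditional_x_given_y(y=0.0)
print(f"E[X|Y=0] = {m:.3f}, VAR[X|Y=0] = {v:.3f}")
# The conditional mean is both the MAP and the LMS estimate of X given Y = y.
```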
Conditional Gaussian Distributions

X and Y are now VECTOR random variables


MAP Estimators
Gaussian Systems
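For the vector case, a hedged sketch using the standard partitioned-covariance formulas (cf. Bishop, Section 2.3): E[X|Y=y] = μ_X + Σ_XY Σ_YY⁻¹ (y − μ_Y) and Cov[X|Y=y] = Σ_XX − Σ_XY Σ_YY⁻¹ Σ_YX. All numbers below are illustrative:

```python
import numpy as np

# Illustrative jointly Gaussian vectors X (2-dim) and Y (2-dim); values made up.
mu_x = np.array([0.0, 1.0])
mu_y = np.array([2.0, -1.0])
S_xx = np.array([[2.0, 0.3], [0.3, 1.0]])
S_yy = np.array([[1.5, 0.2], [0.2, 1.2]])
S_xy = np.array([[0.5, 0.1], [0.0, 0.4]])   # Cov(X, Y)

def conditional_gaussian(y):
    """Mean and covariance of X | Y = y for jointly Gaussian vectors."""
    K = S_xy @ np.linalg.inv(S_yy)      # gain matrix
    mean = mu_x + K @ (y - mu_y)        # conditional mean = MAP = LMS estimate
    cov = S_xx - K @ S_xy.T             # conditional covariance
    return mean, cov

m, C = conditional_gaussian(np.array([1.0, 0.0]))
print("E[X|Y=y] =", m)
print("Cov[X|Y=y] =\n", C)
```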
Hypothesis Testing

• Binary Hypothesis Testing


• Likelihood Ratio Test & Neyman Pearson Lemma
• Significance Testing
• Bayesian Hypothesis Testing
• MAP Rule
• Minimum Cost Decisions
• ROC Curves
• Naïve Bayes’ Classifier Next Week
• ML Estimator
• Laplace Smoothing
Binary Hypothesis Testing
Likelihood Ratio Test

The ML decision rule compares the likelihood ratio L(x) to 1; other decision rules result from comparing L(x) to other thresholds.
The corresponding log-likelihood ratio test compares log L(x) to the log of the original threshold.
We obtain a class of decision rules by varying the threshold.
The ML rule corresponds to threshold γ = 1, placing the decision boundary where the two pdfs are equal. As the threshold γ approaches infinity, α approaches zero and β approaches 1; as γ approaches zero, α goes to 1 and β goes to zero.
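An illustrative sketch (not from the slides) of a likelihood ratio test for two unit-variance Gaussian hypotheses, sweeping the threshold γ to trace out α and β (the basis of the ROC curve):

```python
import numpy as np
from scipy.stats import norm

# Two simple hypotheses on a scalar observation x (illustrative parameters):
#   H0: X ~ N(0, 1),   H1: X ~ N(2, 1)
f0 = norm(loc=0.0, scale=1.0)
f1 = norm(loc=2.0, scale=1.0)

def lrt_errors(gamma):
    """Type I error (alpha) and type II error (beta) of the LRT: decide H1 if L(x) > gamma."""
    # For these Gaussians, L(x) = f1(x)/f0(x) = exp(2x - 2) is increasing in x,
    # so L(x) > gamma is equivalent to x > t with t = (log(gamma) + 2) / 2.
    t = (np.log(gamma) + 2.0) / 2.0
    alpha = 1.0 - f0.cdf(t)   # P[decide H1 | H0 true]
    beta = f1.cdf(t)          # P[decide H0 | H1 true]
    return alpha, beta

for g in [0.1, 1.0, 10.0]:    # gamma = 1 is the ML rule
    a, b = lrt_errors(g)
    print(f"gamma = {g:5.1f}: alpha = {a:.3f}, beta = {b:.3f}")
```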
Neyman Pearson Lemma
Explanation of the Derivation:
• First, assume there is a rule that achieves type I error alpha.
• Next, consider minimizing the type II error, given that the rule attains type I error alpha.
• This involves minimizing the type II error under a constraint on the type I error, leading to the Lagrangian expression.
• The expression is minimized by assigning to the acceptance region all values of x for which the integrand on the previous page is negative; this implies a likelihood ratio test with threshold lambda.
• Finally, pick lambda so that the type I error constraint is met.
Bayesian Hypothesis Testing
Bayesian binary hypothesis tests can be designed to minimize the average cost of the decision rule.
• Example 1: The cost is the probability of error, i.e., the prior-weighted sum of the type I and type II errors.
• General case: assign costs C00 and C11 to correct decisions and costs C01 and C10 to the two types of error.
• Both cases are solved by likelihood ratio tests (see the next two charts).
Minimum Cost Decisions

See Section 8.6.2 in Leon-Garcia for proof


Bayesian K-ary Hypothesis tests extend the binary case to K hypotheses
• ML, MAP and Minimum cost decision rules can be derived
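A minimal sketch of the minimum-cost threshold, assuming the usual convention that C_ij is the cost of deciding H_i when H_j is true, so the LRT threshold is γ = π0(C10 − C00) / (π1(C01 − C11)); the numbers below are illustrative:

```python
# Minimum-cost Bayesian binary test as a likelihood ratio test (illustrative values).
# Convention assumed here: C_ij is the cost of deciding H_i when H_j is true.
pi0, pi1 = 0.7, 0.3            # prior probabilities of H0 and H1
C00, C11 = 0.0, 0.0            # costs of correct decisions
C10, C01 = 1.0, 5.0            # C10: decide H1 when H0 true; C01: decide H0 when H1 true

# Decide H1 when L(x) = f(x|H1)/f(x|H0) exceeds this threshold:
gamma = pi0 * (C10 - C00) / (pi1 * (C01 - C11))
print(f"LRT threshold gamma = {gamma:.3f}")   # 0.467 for these numbers

# With C10 = C01 = 1 and C00 = C11 = 0 the average cost is the probability of
# error and the threshold reduces to pi0/pi1 (the MAP rule).
```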
Naïve Bayes Assumption
Gaussian Discriminant Analysis

• Consider Gaussian classes c in C


• with priors
• Mean & Covariance Matrices
• Given data X decide which class c is present
• Bayesian Hypothesis Test (Classification)
Case 1: Equal Covariance Matrices
Case 2: Unequal Covariance Matrices
Quadratic Discriminant
Training Gaussian Parameters
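A compact sketch (made-up data) tying the preceding slides together: ML training of per-class priors, means, and covariances, followed by classification with the quadratic (unequal-covariance) discriminant:

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """ML estimates of prior, mean, and covariance for each class label in y."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),                       # prior P[c]
                     Xc.mean(axis=0),                        # mean vector
                     np.cov(Xc, rowvar=False, bias=True))    # ML covariance
    return params

def classify(x, params):
    """MAP classification: maximize log prior + Gaussian log likelihood."""
    best, best_score = None, -np.inf
    for c, (prior, mu, Sigma) in params.items():
        d = x - mu
        score = (np.log(prior)
                 - 0.5 * np.log(np.linalg.det(Sigma))
                 - 0.5 * d @ np.linalg.solve(Sigma, d))
        if score > best_score:
            best, best_score = c, score
    return best

# Made-up 2D training data for two classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gaussian_classes(X, y)
print(classify(np.array([2.5, 2.0]), params))   # expected: class 1
```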
Linear Regression
• Gauss and Least Squares Method
• Regression to the Mean
• Linear Regression: Orthogonal Projection
• Linear Regression: Curve-Fitting
• Bayesian Regression
• Regularization and Ridge Regression
Linear Regression
Linear Regression: Curve Fitting
Bayesian Regression
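An illustrative sketch (made-up data) of regularized least-squares curve fitting; the ridge solution coincides with the MAP estimate under a zero-mean Gaussian prior on the weights:

```python
import numpy as np

# Made-up noisy samples of a smooth function
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# Polynomial design matrix (degree-5 features)
Phi = np.vander(x, N=6, increasing=True)

lam = 1e-3   # regularization strength; noise variance / prior variance in the MAP view
# Ridge / MAP solution: w = (Phi^T Phi + lam I)^(-1) Phi^T t
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
print("fitted weights:", np.round(w, 3))
```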
Discrete-Time Markov Chains

ECE 368

Reference: Leon-Garcia, Probability, Statistics, and Random Processes, Chapter 11
Markov Chain

• A discrete-valued random sequence Xn is a


Markov chain if the future of the process given
the present is independent of the past, that is,
• if Xn is discrete-valued,

P[X_{n+1} = x_{n+1} | X_n = x_n, X_{n−1} = x_{n−1}, …, X_1 = x_1] = P[X_{n+1} = x_{n+1} | X_n = x_n]

(future given present and past = future given present)
Discrete-Time Markov Chains
• Let Xn be a discrete-time integer-valued
Markov chain that starts at n = 0 with pmf
p_j(0) ≜ P[X_0 = j],   j = 0, 1, 2, …

• The joint pmf for the first n + 1 values of the


process is
P[X_n = i_n, …, X_0 = i_0] = P[X_n = i_n | X_{n−1} = i_{n−1}] ⋯ P[X_1 = i_1 | X_0 = i_0] P[X_0 = i_0]

• Joint pmf for a sequence is product of


– probability for the initial state
– probabilities for subsequent one-step state transitions.
Homogeneous Transition Probabilities
• Assume that the one-step state transition
probabilities are fixed and do not change with
time, that is,
P[X_{n+1} = j | X_n = i] = p_ij   for all n
• Xn is said to have homogeneous transition
probabilities.

• The joint pmf for Xn,…, X0 is then given by

P[X_n = i_n, …, X_0 = i_0] = p_{i_{n−1} i_n} ⋯ p_{i_0 i_1} p_{i_0}(0)
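A small sketch (made-up transition matrix) evaluating the joint pmf of a short state sequence as the initial probability times the product of one-step transition probabilities:

```python
import numpy as np

# Illustrative 3-state chain (made-up numbers)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
p0 = np.array([1.0, 0.0, 0.0])      # initial pmf p(0)

def path_probability(states):
    """P[X_n = i_n, ..., X_0 = i_0] = p_{i0}(0) * product of one-step transitions."""
    prob = p0[states[0]]
    for i, j in zip(states[:-1], states[1:]):
        prob *= P[i, j]
    return prob

print(path_probability([0, 1, 1, 2]))   # 1.0 * 0.3 * 0.6 * 0.3 = 0.054
```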
Transition Probability Matrix
• Xn is completely specified by the initial pmf pi(0)
and the matrix of one-step transition probabi-
lities P, or transition probability matrix:
P = ⎡ p_00  p_01  p_02  ⋯ ⎤
    ⎢ p_10  p_11  p_12  ⋯ ⎥
    ⎢  ⋮     ⋮     ⋮       ⎥
    ⎢ p_i0  p_i1   ⋯       ⎥
    ⎣  ⋮     ⋮             ⎦
• Note that each row of P must add to 1 since
1 = Σ_j P[X_{n+1} = j | X_n = i] = Σ_j p_ij
n-Step Transition Probabilities

• The matrix of two-step transition probabilities P(2) is:

P(2) = P(1) P(1) = P^2

• Using the preceding arguments, P(n) is found by multiplying P(n − 1) by P:

P(n) = P(n − 1) P,   or   P(n) = P^n

• The n-step transition probability matrix is the nth power of the


one-step transition probability matrix.
The State Probabilities

• Let p(n) = {p_j(n)} denote the row vector of state probabilities at time n. The probability p_j(n) is related to p(n − 1) by

p_j(n) = Σ_i P[X_n = j | X_{n−1} = i] P[X_{n−1} = i] = Σ_i p_ij p_i(n − 1)
The State Probabilities II
• p(n) is obtained by multiplying the row vector p(n – 1) by
the matrix P:

p(n) = p(n - 1)P

• Similarly, p_j(n) is related to p(0) by

p_j(n) = Σ_i P[X_n = j | X_0 = i] P[X_0 = i] = Σ_i p_ij(n) p_i(0)

p(n) = p(0) P^n
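A minimal sketch (same style of made-up matrix) of the n-step transition matrix P(n) = P^n and the state-probability relation p(n) = p(0) P^n:

```python
import numpy as np

# Illustrative 3-state transition matrix (made-up numbers)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
p0 = np.array([1.0, 0.0, 0.0])

n = 4
Pn = np.linalg.matrix_power(P, n)   # n-step transition probabilities P(n) = P^n
pn = p0 @ Pn                        # state probabilities p(n) = p(0) P^n
print(Pn)
print(pn, pn.sum())                 # rows of P^n and p(n) each sum to 1
```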
Steady-State Probabilities

• Many Markov chains settle into stationary


behavior after the process has been running for
a long time; the initial state becomes irrelevant.
• As n → ∞, P^n approaches a matrix in which all the rows are equal to the same pmf, i.e.,

p_ij(n) → π_j   for all i

p_j(n) = Σ_i π_j p_i(0) = π_j Σ_i p_i(0) = π_j
Steady-State Probabilities II

• Consequently, the probability of state j approaches a constant independent of time and of the initial state probabilities:

p_j(n) → π_j   for all j

• We say the system reaches “equilibrium” or “steady state.”


• The steady state pmf summarizes our knowledge about the process after it
has been running for a long time.
Steady-State Probabilities III
• If a Markov chain has a steady state, then the
steady state pmf π ≜ {π_j} is found by noting that as
n → ∞, pj(n) → πj and pi(n – 1) → πi, so

p j = å pijp i
i

p = pP

åp
i
i =1

• We refer to π as the stationary state pmf of the Markov chain.
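A short sketch (made-up matrix) that finds the stationary pmf by solving π = πP together with the normalization constraint:

```python
import numpy as np

# Illustrative 3-state transition matrix (made-up numbers)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
K = P.shape[0]

# Solve pi = pi P together with sum(pi) = 1:
# (P^T - I) pi = 0, with one equation replaced by the normalization constraint.
A = P.T - np.eye(K)
A[-1, :] = 1.0
b = np.zeros(K)
b[-1] = 1.0
pi = np.linalg.solve(A, b)
print("stationary pmf:", np.round(pi, 4))

# Check: for an ergodic chain, every row of P^n approaches pi as n grows.
print(np.round(np.linalg.matrix_power(P, 50), 4))
```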


Recurrence & Classes

• The behavior of a Markov chain is determined by


its transition matrix.
• The states in a discrete-time Markov chain can
be divided into one or more separate classes,
where each class is of a different type.
• The long-term behavior of a Markov chain is
related to the types of its state classes.
Classes of States

• State j is accessible from state i if there is a sequence


of transitions from i to j that has non-zero probability.
• States i and j communicate if they are accessible from each other; we write i ↔ j.
– A state always communicates with itself.
• If i<->j and j<->k, then i<->k.
• Two states belong to the same class if they
communicate with each other.
• States in the same class share the same fate.
Classes of States II
• Two different classes must be completely disjoint, that is, they
cannot have any state in common.

• The states of a Markov chain consist of one or more disjoint


classes
• A Markov chain that consists of a single class is said to be
irreducible.
Recurrence Properties

• Start a Markov chain at state i.


• State i is said to be recurrent if the process returns to the state
with probability one, i.e.,
fi = P[ever returning to state i] = 1

• State i is transient if
fi < 1.

• Start a Markov chain in a transient state. The state does not


reoccur after some finite number of returns.
Irreducible Markov Chains

• If a Markov chain is irreducible then either all its states


are transient or recurrent.
• If the # of states in the chain is finite, it is impossible for
all the states to be transient.
• Thus the states of a finite-state, irreducible Markov
chain are all recurrent.
Periodic & Aperiodic Classes

• State i has period d if it can only reoccur at times that


are multiples of d.
• It can be shown that all the states in a class have the
same period.
• An irreducible Markov chain is said to be aperiodic if
the states in its single class have period one.
Stationary Probabilities & Limiting Probabilities

• The stationary state pmf is defined by the


global balance equations:

π_j = Σ_i p_ij π_i,      Σ_i π_i = 1

• The stationary state probabilities correspond to the long-term


proportion of time spent in the states.
Theorem 1
For an irreducible, aperiodic, and positive recurrent Markov
chain:

lim_{n→∞} p_ij(n) = π_j,   for all j,

where π_j is the unique nonnegative solution of the global balance equations.

• For these Markov chains, the state probabilities approach


steady state values that are independent of the initial
condition.
• These Markov chains are called ergodic.
Theorem 2
For an irreducible, periodic, and positive recurrent Markov chain
with period d,

lim_{n→∞} p_jj(nd) = d π_j,   for all j,

where π_j is the unique nonnegative solution of the global balance equations.

• For these periodic Markov chains, visits to a state are confined to 1/d of the time instants, so the long-term proportion of time spent in the state is 1/d of the limiting probability at its recurrence times.
Types of Classes
State j is either:
• Transient (π_j = 0), or
• Recurrent, which is either
  – Null recurrent (π_j = 0), or
  – Positive recurrent (π_j > 0), which in turn is either
      Aperiodic:  lim_{n→∞} p_jj(n) = π_j, or
      Periodic with period d:  lim_{n→∞} p_jj(nd) = d π_j
Bayesian Networks
Conditional Independence
Random Fields
Inference on Markov Chains:
Brute Force:
General Case:
Inference of Maximum Likelihood Sequence
Summary: Inference on graphical models
Hidden Markov Model HMM
Viterbi Algorithm
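As a hedged illustration of the recursion behind the Viterbi algorithm (not taken from the slides; the HMM parameters below are made up), a compact log-domain sketch:

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most likely hidden state sequence for an HMM (log-domain Viterbi).

    log_pi: initial log probabilities, shape (S,)
    log_A:  transition log probabilities, shape (S, S)
    log_B:  emission log probabilities,  shape (S, num_observation_symbols)
    """
    S, T = log_pi.shape[0], len(obs)
    delta = np.full((T, S), -np.inf)     # best log score of a path ending in state s at time t
    psi = np.zeros((T, S), dtype=int)    # back-pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A       # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Made-up 2-state, 2-symbol HMM
pi = np.array([0.6, 0.4]); A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], np.log(pi), np.log(A), np.log(B)))   # [0, 0, 1, 1]
```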
Expectation Maximization
ECE 368
Estimating Gaussian Mixture Model
Estimating HMM
