[Probability] Hidden Markov Model

Notations

States are hidden, observations are noisy and just partial reflections of the states.

Computations

Inference
– How to compute likelihood
– How to compute most likely (hidden) state sequence

Learning
– How to estimate parameters that maximize (multiple) observation sequences

Computing likelihood

Goal: computing likelihood of a state at time t+1 given all observations to time t+1

Fill up table of alpha [nxT]

Inference in HMMs

Goal: computing the most likely path of the states given observation to time T

Keyword: Viterbi path, Viterbi algorithm

Fill up table of l [nxT], which is the log probability of most likely t-step “path” that ends in state i at time t.

Start with the base case at t=1, find relationship between time t and time t+1 and recurse to T.

Base case for maximum log probability

Recursion for maximum log probability

Computing Viterbi path

Most likely state at time t given stat j at time t+1 with observations to time t.

Recursion gives us the Viterbi path

Learning in HMMs

Goal: given sequence of observations, estimate parameters (initial state, transition matrix, emission matrix) to maximize the likelihood.

Keyword: “forward-backward” algorithm (Baum-Welch)

Time complexity of this process is O(n^2T)

Gaussian Mixture Model (GMM)

Easy to understand ML estimation of Gaussian Mixture Model by making an analogy with K-means clustering, and EM algorithm is just an extension of it to cover the incomplete data case.

Belief Updating

Recursion relation for real-time updating of beliefs based on incoming evidence. Very useful for agents that must monitor their environments in real-time

Keyword: Kalman filtering

[Probability] EM Algorithm

EM Algorithm

• For incomplete data case of Maximum Likelihood
• Auxiliary function
• No learning rate
• Monotonic convergence (each iteration improves L)
• Comparison with other numerical methods
• Gradient Descent: Asymptotic but not monotonic convergence
• Newton’s Method: Monotonic convergence, but highly unstable

Insight

We can find an auxiliary function that guarantees monotonic convergence

Algorithm

1. Initialize CPT to any value (non-zero)
2. Repeat following until convergence

[Probability] Maximum Likelihood

Belief Network = DAG + CPTs, but CPTs are not always known. If CPTs cannot be given by the experts, we may want to learn them from examples.

Maximum Likelihood Estimation (MLE)

• Simplest form of BN in learning
• Choose (estimate) model (DAG + CPTs) to maximize P(observed data|DAG+CPTs)

Case I

Fixed DAG, Discrete nodes, Lookup CPTs, Complete data

Properties

• Asymptotically correct
• Problematic in non-asymptotic regine (sparse data)
• 0 if count of pair is 0
• undefined if count of parent is 0

Case II

Fixed DAG, Parametric CPTs, Complete data

II-A Gaussian CPT (Linear Regression)

Ill-posed when:

• Input dimensionality exceeds number of examples, d > T
• Inputs not in general position
• option is to go for minimum norm solution

II-B Sigmoid CPT (Logistic Regression)

Case III

Fixed DAG, Discrete nodes, Lookup CPTs, Incomplete data

EM Algorithm (Next post)

[Probability] Approximate inference in BN

Exact inference is NP-hard, so approximate methods are best option for loopy BN

Stochastic simulation in BN

Belief network defines “generative model”, so we can easily draw samples from the joint distribution.

However, it is hard to draw samples from posterior distribution

Rejection sampling

• Generate N joint samples from joint distribution.
• Count number of samples N(Q=q, E=e) and N(E=e)
• Estimate P(Q=q|E=e) ~ N(Q=q, E=e) / N(E=e)

This will converge as N reaches infinity, but it is very inefficient:

• Discards all samples without E=e
• Takes large number of samples for queries with unlikely evidence

Likelihood weighting

• Instantiate evidence nodes instead of sampling them
• Weight each sample using CPTs
• For evidence nodes in numerator and denominator. P(E=e|Parent(E))

This will also converge as N reaches infinity.
Likelihood weighting is much faster than Rejection sampling since all samples are used. However, it is still slow for rare events. MCMC Simulation

• Fix evidence nodes to observed values
• Initialize non-evidence nodes to any values
• Repeat N times:
• Pick non-evidence nodes at random
• Use Bayes rule to compute P(X=x|Bx) where Bx is fixed to current values
• Resample X~P(X=x|Bx)

MCMC simulation is about reformulating the sampling as a Markov Chain between samples… This term is rather ambiguous, but the following link seems helpful!

LW sample non-evidence nodes from P(X|Parent(X))
MCMC sample non-evidence nodes from P(X|Bx)

[Probability] Exact inference in loopy BN

Node clustering

• Merge nodes to form polytrees
• Merge CPTs

Catch

• Size of merged node values = size of merged CPT grows exponential to (NO. VALUES)^(MERGED NODES)
• Polytree algorithm is linear in CPT size
• Choosing optimal (efficient inference) clustering (make polytree) scheme is hard to answer

Cutset conditioning

• Instantiate node so that remaining nodes form a polytree
• Apply polytree algorithm separately on each subtree

Catch

• Cutset: set of instantiated nodes
• Number of sub-polytrees grows exponentially with the size of cutset
• Choosing optimal cutset is also hard to answer

[Probability] Polytree

Polytree

• (1) singly connected, (2) no loops, (3) at most one path between any two nodes
• Efficient algorithms for exact inference

Notes

• Termination
• Root node does not have parents, so probability will be given.
• Leaf node does not have children, so no recursive call can happen down stream.
• Computing Evidence node’s probability is trivial, because it can be done by recursive call of marginalization. (Trivial)
• Runtime analysis
• Linear in number of nodes and in size of CPTs.
• Problem solving in general
• Start off with Bayes rule if the order is wierd, Product rule to remove something, Marginalization if something is missing…

[Probability] Inference

Types of inference

Diagnostic reasoning: to infer causes from effects
ex) P(B=1|M=1), given B–>A–>M

Causal reasoning: from causes to effects
ex) P(M=1|B=1), given B–>A–>M

ex) P(E=1|J=1,B=1) < P(E=1|J=1), given B,E–>A–>J,M

* Earthquake being true would be less likely when you already know that Burglary was what may have caused John to call.
* “Explaining away” is a common pattern of reasoning in which confirmation of one cause reduces the need to invoke other causes

Mixing causes/effects: P(B=1,M=1|A=1,E=1)

Strategy for inference

Bayes rule: express P(Q|E) in terms of conditional probabilities that respect DAG ordering

Product rule: express joint predictions in terms of individual predictions

Marginalization: introduce nodes so that we predict children from parents

Marginal or conditional independence: use to remove terms on RHS of conditioning bar

[Probability] Conditional Probability Table (CPT)

Lookup Table

Table with 2^k rows can specify an arbitrary CPT, where k is number of parents of a node.

However, this is too complicated

Deterministic Node

“AND” gate will check whether all of them are 1,
“OR” gate will check whether all of them are 0.

However, this is too simple

Noisy-OR

Use k numbers to parametrize all 2^k entries. More parents are equal to 1, the more probable that Y=1.

However, this only takes into account excitation!

Sigmoid CPT

Use k real numbers to parametrize 2^k elements in CPT Also known as logistic regression in statistics,
activation function in neural networks.

Different from Noisy-OR in that this can mix both inhibition and excitation!

[Probability] Conditional Independence

Conditional Independence

A node X is conditionally independent of its non-parent ancestors given its parents.

Example

Top researchers in computer architecture publish their research to both ASPLOS and ISCA, so getting paper accepted to them is not marginally independent.

However, if it is some specific researcher we are talking about, getting paper accepted to them is an independent event, ergo conditionally independent.

[Probability] d-Separation

Serial Connections p(a,b) = p(a) sum_c p(c|a)p(b|c) = p(a)p(b|a) ~= p(a)p(b), so a and b are not marginally independent (causal chain)

if c is given,

p(a,b|c) = p(a,b,c)/p(c) = p(a)p(c|a)p(b|c)/p(c) = p(a|c)p(b|c), so a and b are conditionally independent given c (d-Separable by case I)

Diverging Connections p(a,b) = sum_c p(a|c)p(b|c)p(c) ~= p(a)p(b), so a and b are not marginally independent (common cause)

if c is given,
p(a,b|c) = p(a,b,c)/p(c) = p(a|c)p(b|c), so a and b are conditionally independent given c (d-Separable by case II)

Converging Connections

p(a,b) = p(a)p(b), so a and b are marginally independent (unobserved common effect)

if c is given,

p(a,b|c) = p(a,b,c)/p(c) = p(a)p(b)p(c|a,b)/p(c) ~= p(a|c)p(b|c), so a and b are conditionally dependent given c

Also, if any of c’s child is given, same applies. (d-Separable by case III)