Probabilistic Graphical Models
Lecture 8: State-Space Models
Based on slides by Richard Zemel
Sequential data
Turn attention to sequential data
– Time-series: stock market, speech, video analysis
– Ordered: text, gene
Simple example: Dealer A is fair; Dealer B is not
[State diagram: two states A and B; self-loops labeled C=t (keep current dealer), cross-arrows labeled C=h (switch dealer)]
Process (let Z be dealer A or B):
Loop until tired:
1. Flip coin C, use it to decide whether to switch dealer
2. Chosen dealer rolls die, record result
Fully observable formulation: data is sequence of dealer selections
AAAABBBBAABBBBBBBAAAAABBBBB
Simple example: Markov model
• If underlying process unknown, can construct model to predict
next letter in sequence
• In general, product rule expresses joint distribution for sequence
• First-order Markov chain: each observation independent of all
previous observations except most recent
• ML parameter estimates are easy
• Each pair of outputs is a training case; in this example:
P(X_t = B | X_{t-1} = A) = #[t s.t. X_t = B, X_{t-1} = A] / #[t s.t. X_{t-1} = A]
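As a quick illustration (not from the slides), a minimal Python sketch of these count-based ML estimates, applied to the dealer sequence above; the function name markov_ml is just for this example.

```python
from collections import Counter

def markov_ml(seq):
    """ML transition estimates for a first-order Markov chain over symbols."""
    pair_counts = Counter(zip(seq[:-1], seq[1:]))   # counts of (x_{t-1}, x_t) pairs
    prev_counts = Counter(seq[:-1])                 # counts of x_{t-1}
    return {(a, b): c / prev_counts[a] for (a, b), c in pair_counts.items()}

P = markov_ml("AAAABBBBAABBBBBBBAAAAABBBBB")
print(P[("A", "B")])   # estimate of P(X_t = B | X_{t-1} = A)
```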
Higher-order Markov models
• Consider example of text
• Can capture some regularities with bigrams (e.g., q nearly
always followed by u, very rarely by j) – probability of a
letter given just its preceding letter
• But probability of a letter depends on more than just
previous letter
• Can formulate as second-order Markov model (trigram
model)
• Need to take care: many counts may be zero in training
dataset
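A hedged sketch of one standard way to deal with zero counts, add-alpha (Laplace) smoothing for a trigram model; the smoothing constant and alphabet here are illustrative assumptions, not from the slides.

```python
from collections import Counter

def trigram_probs(text, alpha=1.0, alphabet="abcdefghijklmnopqrstuvwxyz "):
    """Second-order (trigram) estimates P(x_t | x_{t-2}, x_{t-1}) with add-alpha smoothing."""
    tri = Counter(zip(text, text[1:], text[2:]))   # counts of letter triples
    bi = Counter(zip(text, text[1:]))              # counts of letter pairs
    V = len(alphabet)
    def p(c, prev2, prev1):
        return (tri[(prev2, prev1, c)] + alpha) / (bi[(prev2, prev1)] + alpha * V)
    return p

p = trigram_probs("the quick brown fox jumps over the lazy dog")
print(p("e", "t", "h"))   # P('e' | preceding letters 't','h'); nonzero even for unseen triples
```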
Character recognition: transition probabilities
[Figure: matrix of transition probabilities between letters]
Hidden Markov model (HMM)
• Return to casino example -- now imagine that we do not observe
ABBAA, but instead just the sequence of die rolls (1-6)
• Generative process:
Loop until tired:
1. Flip coin C (Z = A or B)
2. Chosen dealer rolls die, record result X
Z is now a hidden state variable – a 1st-order Markov chain generates the
state sequence (path), governed by transition matrix A
Observations governed by emission probabilities, which convert the state
path into a sequence of observable symbols or vectors
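A minimal generative sketch of this casino HMM; the switch probability and the loaded-die distribution are illustrative assumptions (the slides do not give the exact numbers).

```python
import numpy as np

rng = np.random.default_rng(0)
p_switch = 0.3                                    # assumed P(C = h): switch dealer
emit = {"A": np.ones(6) / 6,                      # fair dealer
        "B": np.array([.1, .1, .1, .1, .1, .5])}  # assumed loaded die

def sample_casino(T, z="A"):
    """Sample a hidden dealer path z_{1:T} and the observed die rolls x_{1:T}."""
    zs, xs = [], []
    for _ in range(T):
        zs.append(z)
        xs.append(int(rng.choice(6, p=emit[z])) + 1)  # die roll 1..6
        if rng.random() < p_switch:                   # coin flip decides whether to switch
            z = "B" if z == "A" else "A"
    return zs, xs

print(sample_casino(10))
```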
Relationship to other models
• Can think of HMM as:
– Markov chain with stochastic measurements
– Mixture model with states coupled across time
• Hidden state is 1st-order Markov, but output not Markov of any
order
• Future is independent of past given present, but conditioning on
observations couples hidden states
Character Recognition Example
Which letters are these?
HMM: Character Recognition Example
Context matters: recognition easier based on sequence of
characters
How to apply HMM to this character string?
Main elements: states? emission, transition probabilities?
HMM: Semantics
[Figure: chain of hidden states z1 → z2 → z3 → z4 → z5, each zt ∈ {a, ..., z}, each emitting an observed character image xt]
Need 3 distributions:
1. Initial state: P(Z1)
2. Transition model: P(Zt|Zt-1)
3. Observation model (emission probabilities): P(Xt|Zt)
HMM: Main tasks
• Joint probabilities of hidden states and outputs:
P(x, z) = P(z_1) P(x_1 | z_1) \prod_{t=2}^{T} P(z_t | z_{t-1}) P(x_t | z_t)
• Three problems
1. Computing probability of observed sequence: forward-
backward algorithm [good for recognition]
2. Infer most likely hidden state sequence: Viterbi algorithm
[useful for interpretation]
3. Learning parameters: Baum-Welch algorithm (version of
EM)
Fully observed HMM
Learning fully observed HMM (observe both X and Z) is easy:
1. Initial state: P(Z1) – proportion of words that start with each
letter
2. Transition model: P(Zt|Zt-1) – proportion of times a given
letter follows another (bigram statistics)
3. Observation model (emission probabilities): P(Xt|Zt) – how
often particular image represents specific character, relative
to all images
But still have to do inference at test time: work out states given
observations
HMMs often used where hidden states are identified: words in
speech recognition; activity recognition; spatial position of
rat; genes; POS tagging
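A small sketch of this fully observed learning, assuming integer-coded states and symbols; all names and the toy data are illustrative, not from the slides.

```python
import numpy as np

def fully_observed_hmm(sequences, n_states, n_symbols):
    """ML estimates from fully observed (state, symbol) sequences, by counting."""
    pi = np.zeros(n_states)                 # initial-state counts
    A = np.zeros((n_states, n_states))      # transition counts
    B = np.zeros((n_states, n_symbols))     # emission counts
    for zs, xs in sequences:
        pi[zs[0]] += 1
        for t in range(len(zs)):
            B[zs[t], xs[t]] += 1
            if t > 0:
                A[zs[t - 1], zs[t]] += 1
    # normalize counts into probabilities
    return pi / pi.sum(), A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)

# toy data: states 0/1, symbols 0..5
seqs = [([0, 0, 1, 1, 0], [2, 3, 5, 5, 1]), ([1, 1, 0], [5, 4, 0])]
print(fully_observed_hmm(seqs, n_states=2, n_symbols=6))
```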
HMM: Inference tasks
Important to infer distributions over hidden states:
§ If states are interpretable, infer interpretations
§ Also essential for learning
Can break down hidden state inference tasks to solve (each
based on all observations up to current time, X0:t)
1. Filtering: compute posterior over current hidden state:
P(Zt| X0:t)
2. Prediction: compute posterior over future hidden state:
P(Zt+k| X0:t)
3. Smoothing: compute posterior over past hidden state:
P(Zk| X0:t), 0<k<t
4. Fixed-lag smoothing: P(Zt-a| X0:t): compute posterior
over hidden state a few steps back
Filtering, Smoothing & Prediction
P(Z_t | X_{1:t}) = P(Z_t | X_t, X_{1:t-1})
                 ∝ P(X_t | Z_t, X_{1:t-1}) P(Z_t | X_{1:t-1})
                 = P(X_t | Z_t) P(Z_t | X_{1:t-1})
                 = P(X_t | Z_t) \sum_{z_{t-1}} P(Z_t | z_{t-1}, X_{1:t-1}) P(z_{t-1} | X_{1:t-1})
                 = P(X_t | Z_t) \sum_{z_{t-1}} P(Z_t | z_{t-1}) P(z_{t-1} | X_{1:t-1})
Filtering: for online estimation of state
Pr(state) ∝ observation probability × transition-model prediction
Smoothing: post hoc estimation of state (similar computation)
Prediction is filtering, but with no new evidence:
P(Z_{t+k} | X_{1:t}) = \sum_{z_{t+k-1}} P(Z_{t+k} | z_{t+k-1}) P(z_{t+k-1} | X_{1:t})
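A minimal numpy sketch of one step of this filtering recursion for a discrete HMM, assuming a transition matrix A and emission matrix B; the toy numbers are illustrative.

```python
import numpy as np

def filter_step(alpha_prev, A, B, x_t):
    """One filtering update: P(Z_t | X_{1:t}) from P(Z_{t-1} | X_{1:t-1})."""
    predict = A.T @ alpha_prev          # sum_{z_{t-1}} P(Z_t | z_{t-1}) P(z_{t-1} | X_{1:t-1})
    post = B[:, x_t] * predict          # multiply by observation probability P(X_t | Z_t)
    return post / post.sum()            # normalize

# toy 2-state, 6-symbol model (illustrative numbers)
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.vstack([np.full(6, 1 / 6), [.1, .1, .1, .1, .1, .5]])
belief = np.array([0.5, 0.5])
for x in [5, 5, 2, 5]:                  # observed die rolls (0-indexed faces)
    belief = filter_step(belief, A, B, x)
print(belief)
```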
HMM: Maximum likelihood
Having observed some dataset, use ML to learn the parameters
of the HMM
Need to marginalize over the latent variables:
p(X | \theta) = \sum_Z p(X, Z | \theta)
Difficult:
– does not factorize over time steps
– involves generalization of a mixture model
Approach: utilize EM for learning
Focus first on how to do inference efficiently
Forward recursion (α)
Clever recursion can compute huge sum efficiently
Backward recursion (β)
α(zt,j): total inflow of prob. to node (t,j)
β(zt,j): total outflow of prob. from node (t,j)
Forward-Backward algorithm
Estimate hidden state given observations
One forward pass to compute all α(zt,i), one backward
pass to compute all β(zt,i): total cost O(K2T)
Can compute likelihood at any time t based on α (zt,j)
and β(zt,j)
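A compact sketch of the forward and backward passes for a discrete HMM; this unscaled version is fine for short sequences (long sequences need scaling or log-space), and the parameter values are illustrative.

```python
import numpy as np

def forward_backward(pi, A, B, xs):
    """Alpha/beta recursions; returns posteriors P(z_t | x_{1:T}) and the likelihood."""
    T, K = len(xs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[:, xs[0]]
    for t in range(1, T):                       # forward pass: O(K^2 T)
        alpha[t] = B[:, xs[t]] * (A.T @ alpha[t - 1])
    for t in range(T - 2, -1, -1):              # backward pass: O(K^2 T)
        beta[t] = A @ (B[:, xs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                # P(x_{1:T})
    gamma = alpha * beta / likelihood           # P(z_t | x_{1:T}) for all t
    return gamma, likelihood

pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.vstack([np.full(6, 1 / 6), [.1, .1, .1, .1, .1, .5]])
gamma, L = forward_backward(pi, A, B, [5, 5, 2, 5, 5])
print(gamma, L)
```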
Baum-Welch training algorithm: Summary
Can estimate HMM parameters using maximum
likelihood
If state path known, then parameter estimation easy
Instead must estimate states, update parameters, re-
estimate states, etc. -- Baum-Welch (form of EM)
State estimation via forward-backward, also need
transition statistics (see next slide)
Update parameters (transition matrix A, emission
parameters) to maximize likelihood
Transition statistics
Need statistics for adjacent time-steps:
Expected number of transitions from state i to state j that
begin at time t-1, given the observations
Can be computed with the same α(zt,j) and β(zt,j)
recursions
Parameter updates
Initial state distribution: expected counts in state k at time 1
Estimate transition probabilities:
Emission probabilities are expected number of times observe
symbol in particular state:
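A hedged sketch of the corresponding M-step updates, assuming the expected counts gamma (state posteriors) and xi (pairwise transition posteriors) have already been computed with forward-backward; the function and argument names are illustrative.

```python
import numpy as np

def baum_welch_m_step(gammas, xis, xs_list, n_symbols):
    """M-step: re-estimate (pi, A, B) from expected counts.
    gammas[s][t, k] = P(z_t = k | x)            (from forward-backward, sequence s)
    xis[s][t, i, j] = P(z_t = i, z_{t+1} = j | x)
    xs_list[s]      = observed symbol sequence s (integer-coded)."""
    K = gammas[0].shape[1]
    pi = sum(g[0] for g in gammas)                       # expected counts in state k at time 1
    A = sum(xi.sum(axis=0) for xi in xis)                # expected number of i -> j transitions
    B = np.zeros((K, n_symbols))
    for g, xs in zip(gammas, xs_list):
        for t, x in enumerate(xs):
            B[:, x] += g[t]                              # expected emissions of symbol x per state
    return pi / pi.sum(), A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)
```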
Using HMMs for recognition
Can train an HMM to classify a sequence:
1. train a separate HMM per class
2. evaluate prob. of unlabelled sequence under each
HMM
3. classify: HMM with highest likelihood
Assumes can solve two problems:
1. estimate model parameters given some training
sequences (we can find local maximum of
parameter space near initial position)
2. given model, can evaluate prob. of a sequence
Probability of observed sequence
Want to determine if given observation sequence is likely
under the model (for learning, or recognition)
Compute marginals to evaluate prob. of observed seq.: sum
across all paths of joint prob. of observed outputs and state
Take advantage of factorization to avoid exponential cost (#paths = K^T)
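One common way to evaluate this likelihood without underflow is a scaled forward pass; a minimal sketch (the scaling trick is standard, the parameter values are illustrative).

```python
import numpy as np

def log_likelihood(pi, A, B, xs):
    """log P(x_{1:T}) via a scaled forward pass: renormalize alpha at each step
    and accumulate the log of the normalizers."""
    alpha = pi * B[:, xs[0]]
    logp = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for x in xs[1:]:
        alpha = B[:, x] * (A.T @ alpha)
        logp += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return logp

pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.vstack([np.full(6, 1 / 6), [.1, .1, .1, .1, .1, .5]])
print(log_likelihood(pi, A, B, [5, 5, 2, 5]))
```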
Variants on basic HMM
• Input-output HMM
– Have additional observed variables U
• Semi-Markov HMM
– Improve model of state duration
• Autoregressive HMM
– Allow observations to depend on some previous
observations directly
• Factorial HMM
– Expand dim. of latent state
State Space Models
Instead of discrete latent state of the HMM, model Z as a
continuous latent variable
Standard formulation: linear-Gaussian (LDS), with (hidden
state Z, observation Y, other variables U)
– Transition model is linear
z_t = A_t z_{t-1} + B_t u_t + \epsilon_t
– with Gaussian noise
\epsilon_t \sim N(0, Q_t)
– Observation model is linear
y_t = C_t z_t + D_t u_t + \delta_t
– with Gaussian noise
\delta_t \sim N(0, R_t)
Model parameters typically independent of time: stationary
Kalman Filter
Algorithm for filtering in linear-Gaussian state space model
Everything is Gaussian, so can compute updates exactly
Dynamics update: predict next belief state
p(z_t | y_{1:t-1}, u_{1:t}) = \int N(z_t | A_t z_{t-1} + B_t u_t, Q_t) N(z_{t-1} | \mu_{t-1}, \Sigma_{t-1}) dz_{t-1}
                            = N(z_t | \mu_{t|t-1}, \Sigma_{t|t-1})
\mu_{t|t-1} = A_t \mu_{t-1} + B_t u_t
\Sigma_{t|t-1} = A_t \Sigma_{t-1} A_t^T + Q_t
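A direct transcription of this dynamics update as a small sketch; the matrix names follow the slide notation.

```python
import numpy as np

def kalman_predict(mu, Sigma, A, B, u, Q):
    """Dynamics update: map belief N(mu, Sigma) to predicted belief N(mu_pred, Sigma_pred)."""
    mu_pred = A @ mu + B @ u               # mu_{t|t-1} = A_t mu_{t-1} + B_t u_t
    Sigma_pred = A @ Sigma @ A.T + Q       # Sigma_{t|t-1} = A_t Sigma_{t-1} A_t^T + Q_t
    return mu_pred, Sigma_pred
```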
Kalman Filter: Measurement Update
Key step: update hidden state given new measurement:
p(z_t | y_{1:t}, u_{1:t}) ∝ p(y_t | z_t, u_t) p(z_t | y_{1:t-1}, u_{1:t})
First term a bit complicated, but can apply various identities
(such as the matrix inversion lemma, Bayes rule), obtain:
p(z_t | y_{1:t}, u_{1:t}) = N(z_t | \mu_t, \Sigma_t)
The mean update depends on Kalman gain matrix K, and the
residual or innovation r = y – E[y]
\mu_t = \mu_{t|t-1} + K_t r_t
K_t = \Sigma_{t|t-1} C_t^T S_t^{-1}
\hat{y}_t = E[y_t | y_{1:t-1}, u_t] = C_t \mu_{t|t-1} + D_t u_t
S_t = cov[r_t | y_{1:t-1}, u_{1:t}] = C_t \Sigma_{t|t-1} C_t^T + R_t
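A matching sketch of the measurement update, using the innovation, its covariance, and the Kalman gain as defined above; the posterior-covariance line (I − K C) Σ is the standard form, included here as an assumption since the slide does not show it.

```python
import numpy as np

def kalman_update(mu_pred, Sigma_pred, C, D, u, R, y):
    """Measurement update: fold observation y into the predicted belief from kalman_predict."""
    y_hat = C @ mu_pred + D @ u                  # predicted observation E[y_t | y_{1:t-1}, u_t]
    r = y - y_hat                                # innovation / residual
    S = C @ Sigma_pred @ C.T + R                 # innovation covariance S_t
    K = Sigma_pred @ C.T @ np.linalg.inv(S)      # Kalman gain K_t
    mu = mu_pred + K @ r                         # posterior mean
    Sigma = Sigma_pred - K @ C @ Sigma_pred      # posterior covariance, (I - K C) Sigma_pred
    return mu, Sigma
```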
Kalman Filter: Extensions
Learning similar to HMM
– Need to solve inference problem – local posterior marginals
for latent variables
– Use Kalman smoothing instead of forward-backward in E
step, re-derive updates in M step
Many extensions and elaborations
– Non-linear models: extended KF, unscented KF
– Non-Gaussian noise
– More general posteriors (multi-modal, discrete, etc.)
– Large systems with sparse structure (sparse information
filter)
Viterbi decoding
How to choose single best path through state space?
Choose state with largest probability at each time t: maximize
expected number of correct states
But this may not be the best path, i.e., the single path with the
highest probability of generating the data
To find best path – Viterbi decoding, form of dynamic
programming (forward-backward algorithm)
Same recursions, but replace ∑ with max (“brace” example)
Forward: retain best path into each node at time t
Backward: retrace path back from state where most
probable path ends
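A minimal sketch of Viterbi decoding in log space for a discrete HMM; the parameters reuse the illustrative casino numbers, and the function name is just for this example.

```python
import numpy as np

def viterbi(pi, A, B, xs):
    """Most likely state path argmax_z P(z, x) via max-product dynamic programming."""
    T, K = len(xs), len(pi)
    delta = np.log(pi) + np.log(B[:, xs[0]])      # log-prob of best path ending in each state
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)       # scores[i, j]: best path into i, then i -> j
        back[t] = scores.argmax(axis=0)           # best predecessor of each state j
        delta = scores.max(axis=0) + np.log(B[:, xs[t]])
    path = [int(delta.argmax())]                  # state where most probable path ends
    for t in range(T - 1, 0, -1):                 # retrace the best path backward
        path.append(int(back[t, path[-1]]))
    return path[::-1]

A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.vstack([np.full(6, 1 / 6), [.1, .1, .1, .1, .1, .5]])
print(viterbi(np.array([0.5, 0.5]), A, B, [5, 5, 2, 0, 5, 5]))
```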