
State-Space Models in HMMs and Kalman Filters

1. State-space models are probabilistic graphical models used to model sequential data using hidden states. 2. Hidden Markov models (HMMs) are state-space models where the hidden states form a Markov chain and observations depend only on the current state. 3. The forward-backward and Viterbi algorithms can be used for inference in HMMs to compute state probabilities and find the most likely state sequence. The Baum-Welch algorithm is used for learning model parameters.


Probabilistic Graphical Models

Lecture 8: State-Space Models

Based on slides by Richard Zemel




Sequential data
Turn attention to sequential data
–  Time-series: stock market, speech, video analysis
–  Ordered: text, genes

Simple example: Dealer A is fair; Dealer B is not

[Diagram: two-state transition diagram over dealers A and B; coin outcome C=h/C=t determines whether to stay or switch]

Process (let Z be dealer A or B):
Loop until tired:
1.  Flip coin C, use it to decide whether to switch dealer
2.  Chosen dealer rolls die, record result

Fully observable formulation: data is sequence of dealer selections

AAAABBBBAABBBBBBBAAAAABBBBB
Simple example: Markov model

•  If underlying process unknown, can construct model to predict next letter in sequence
•  In general, product rule expresses joint distribution for sequence
•  First-order Markov chain: each observation independent of all previous observations except most recent
•  ML parameter estimates are easy
•  Each pair of outputs is a training case; in this example:
P(Xt = B | Xt-1 = A) = #[t s.t. Xt = B, Xt-1 = A] / #[t s.t. Xt-1 = A]
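The counting estimate above can be sketched in a few lines of plain Python, using the dealer sequence from the earlier slide:

```python
from collections import Counter

# Observed dealer sequence from the fully observable example
seq = "AAAABBBBAABBBBBBBAAAAABBBBB"

# Each adjacent pair (x_{t-1}, x_t) is one training case
pairs = Counter(zip(seq, seq[1:]))

# ML estimate: P(Xt = B | Xt-1 = A) = #(A followed by B) / #(A followed by anything)
n_from_A = sum(c for (prev, _), c in pairs.items() if prev == "A")
p_B_given_A = pairs[("A", "B")] / n_from_A
print(p_B_given_A)
```

Each of the three runs of A's ends in an A→B transition, so the estimate is 3/11.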
Higher-order Markov models
•  Consider example of text
•  Can capture some regularities with bigrams (e.g., q nearly always followed by u, very rarely by j) – probability of a letter given just its preceding letter
•  But probability of a letter depends on more than just previous letter
•  Can formulate as second-order Markov model (trigram model)
•  Need to take care: many counts may be zero in training dataset
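One standard way to handle the zero-count problem is add-one (Laplace) smoothing; a minimal sketch on a hypothetical toy corpus:

```python
from collections import Counter
import string

corpus = "the quick brown fox jumps over the lazy dog"  # hypothetical training text
letters = string.ascii_lowercase

# Trigram model: P(x_t | x_{t-2}, x_{t-1}), counting within words only
tri = Counter()
bi = Counter()
for word in corpus.split():
    for a, b, c in zip(word, word[1:], word[2:]):
        tri[(a, b, c)] += 1
        bi[(a, b)] += 1

def p_trigram(a, b, c, alpha=1.0):
    # Add-one smoothing: unseen trigrams get small but nonzero probability
    return (tri[(a, b, c)] + alpha) / (bi[(a, b)] + alpha * len(letters))

print(p_trigram("t", "h", "e"))  # seen trigram
print(p_trigram("t", "h", "z"))  # unseen trigram, still nonzero
```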

Character recognition: Transition probabilities
Hidden Markov model (HMM)
•  Return to casino example -- now imagine that we do not observe ABBAA, but instead just the sequence of die rolls (1-6)

•  Generative process:
Loop until tired:
1. Flip coin C (Z = A or B)
2. Chosen dealer rolls die, record result X

Z is now hidden state variable – 1st-order Markov chain generates state sequence (path), governed by transition matrix A

Observations governed by emission probabilities, which convert state path into sequence of observable symbols or vectors
Relationship to other models
•  Can think of HMM as:
–  Markov chain with stochastic measurements
–  Mixture model with states coupled across time

•  Hidden state is 1st-order Markov, but output not Markov of any order
•  Future is independent of past given present, but conditioning on observations couples hidden states
Character Recognition Example

Which letters are these?



HMM: Character Recognition Example

Context matters: recognition easier based on sequence of characters
How to apply HMM to this character string?
Main elements: states? emission, transition probabilities?

HMM: Semantics

[Diagram: chain z1 → z2 → z3 → z4 → z5, each zt ∈ {a,...,z}, with each zt emitting an observed character image xt]

Need 3 distributions:
1.  Initial state: P(Z1)
2.  Transition model: P(Zt|Zt-1)
3.  Observation model (emission probabilities): P(Xt|Zt)

HMM: Main tasks

•  Joint probabilities of hidden states and outputs:

P(x, z) = P(z1) P(x1|z1) ∏_{t=2}^{T} P(zt|zt−1) P(xt|zt)
•  Three problems
1.  Computing probability of observed sequence: forward-
backward algorithm [good for recognition]
2.  Infer most likely hidden state sequence: Viterbi algorithm
[useful for interpretation]
3.  Learning parameters: Baum-Welch algorithm (version of
EM)
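The joint probability P(x, z) is a straightforward product; a minimal numpy sketch for the casino example, with hypothetical transition and emission numbers:

```python
import numpy as np

# Hypothetical HMM for the casino example: states A=0 (fair), B=1 (loaded)
pi = np.array([0.5, 0.5])                         # initial state P(Z1)
A = np.array([[0.9, 0.1],                         # transition P(Zt|Zt-1), rows = previous state
              [0.1, 0.9]])
B = np.vstack([np.full(6, 1 / 6),                 # fair die: uniform over faces 1..6
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])   # loaded die favours 6

def joint_prob(z, x):
    """P(x, z) = P(z1)P(x1|z1) * prod_t P(zt|zt-1)P(xt|zt); x are die faces 1..6."""
    p = pi[z[0]] * B[z[0], x[0] - 1]
    for t in range(1, len(z)):
        p *= A[z[t - 1], z[t]] * B[z[t], x[t] - 1]
    return p

print(joint_prob([0, 0, 1], [3, 2, 6]))
```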
Fully observed HMM
Learning fully observed HMM (observe both X and Z) is easy:
1.  Initial state: P(Z1) – proportion of words starting with each letter
2.  Transition model: P(Zt|Zt-1) – proportion of times a given letter follows another (bigram statistics)
3.  Observation model (emission probabilities): P(Xt|Zt) – how often a particular image represents a specific character, relative to all images

But still have to do inference at test time: work out states given
observations

HMMs often used where hidden states are identified: words in
speech recognition; activity recognition; spatial position of
rat; genes; POS tagging
HMM: Inference tasks
Important to infer distributions over hidden states:
§  If states are interpretable, infer interpretations
§  Also essential for learning

Can break down hidden state inference into tasks (each based on all observations up to current time, X1:t):
1.  Filtering: compute posterior over current hidden state: P(Zt|X1:t)
2.  Prediction: compute posterior over future hidden state: P(Zt+k|X1:t)
3.  Smoothing: compute posterior over past hidden state: P(Zk|X1:t), 0<k<t
4.  Fixed-lag smoothing: compute posterior over hidden state a few steps back: P(Zt−a|X1:t)
Filtering, Smoothing & Prediction
P(Zt|X1:t) = P(Zt|Xt, X1:t−1)
∝ P(Xt|Zt, X1:t−1) P(Zt|X1:t−1)
= P(Xt|Zt) P(Zt|X1:t−1)
= P(Xt|Zt) ∑_{zt−1} P(Zt|zt−1, X1:t−1) P(zt−1|X1:t−1)
= P(Xt|Zt) ∑_{zt−1} P(Zt|zt−1) P(zt−1|X1:t−1)

Filtering: for online estimation of state
Pr(state) = observation probability × transition-model prediction

Smoothing: post hoc estimation of state (similar computation)

Prediction is filtering, but with no new evidence:
P(Zt+k|X1:t) = ∑_{zt+k−1} P(Zt+k|zt+k−1) P(zt+k−1|X1:t)
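The filtering recursion maps directly to code; a minimal sketch with hypothetical casino parameters (two states, die faces 1..6):

```python
import numpy as np

# Hypothetical parameters: state 0 = fair dealer, state 1 = loaded dealer
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.vstack([np.full(6, 1 / 6), [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

def filter_posteriors(x):
    """P(Zt | X1:t) for each t; x is a sequence of die faces 1..6."""
    belief = pi * B[:, x[0] - 1]
    belief /= belief.sum()                # normalize: the ∝ in the derivation
    out = [belief]
    for xt in x[1:]:
        pred = A.T @ belief               # sum over z_{t-1}: gives P(Zt | X1:t-1)
        belief = B[:, xt - 1] * pred      # multiply in observation probability
        belief /= belief.sum()
        out.append(belief)
    return np.array(out)

post = filter_posteriors([6, 6, 6, 1])
print(post[-1])                           # belief after a run of sixes, then a one
```

Repeated sixes push the posterior toward the loaded dealer; the final one pulls it back slightly.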
HMM: Maximum likelihood
Having observed some dataset, use ML to learn the parameters
of the HMM

Need to marginalize over the latent variables:
p(X|θ) = ∑_Z p(X, Z|θ)
Difficult:
–  does not factorize over time steps
–  involves generalization of a mixture model

Approach: utilize EM for learning



Focus first on how to do inference efficiently
Forward recursion (α)

Clever recursion can compute huge sum efficiently


Backward recursion (β)

α(zt,j): total inflow of prob. to node (t,j)


β(zt,j): total outflow of prob. from node (t,j)
Forward-Backward algorithm
Estimate hidden state given observations

One forward pass to compute all α(zt,i), one backward pass to compute all β(zt,i): total cost O(K²T)

Can compute likelihood at any time t based on α(zt,j) and β(zt,j)
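The two passes can be sketched as follows, with scaling to avoid underflow (hypothetical two-state parameters as before):

```python
import numpy as np

pi = np.array([0.5, 0.5])                         # hypothetical parameters
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.vstack([np.full(6, 1 / 6), [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

def forward_backward(x):
    """Posteriors P(Zt | X1:T) via scaled alpha/beta recursions; O(K^2 T)."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    scale = np.zeros(T)
    alpha[0] = pi * B[:, x[0] - 1]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):                         # forward: inflow of probability
        alpha[t] = B[:, x[t] - 1] * (A.T @ alpha[t - 1])
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):                # backward: outflow of probability
        beta[t] = A @ (B[:, x[t + 1] - 1] * beta[t + 1]) / scale[t + 1]
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    loglik = np.log(scale).sum()                  # likelihood from the scale factors
    return gamma, loglik

gamma, ll = forward_backward([6, 6, 1, 6])
print(gamma)
```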
Baum-Welch training algorithm: Summary

Can estimate HMM parameters using maximum likelihood
If state path known, then parameter estimation easy
Instead must estimate states, update parameters, re-estimate states, etc. -- Baum-Welch (form of EM)
State estimation via forward-backward; also need transition statistics (see next slide)
Update parameters (transition matrix A, emission parameters) to maximize likelihood
Transition statistics
Need statistics for adjacent time-steps:

Expected number of transitions from state i to state j that begin at time t-1, given the observations
Can be computed with the same α(zt,j) and β(zt,j)
recursions
Parameter updates
Initial state distribution: expected counts in state k at time 1

Estimate transition probabilities:

Emission probabilities are expected number of times observing each symbol in a particular state
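One full Baum-Welch iteration can be sketched as below: an E-step computing the expected state counts (γ) and transition counts (ξ), then an M-step normalizing them. Parameters are the same hypothetical two-state setup; the unscaled α/β recursions are fine for a short sequence:

```python
import numpy as np

pi = np.array([0.5, 0.5])                         # hypothetical parameters
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.vstack([np.full(6, 1 / 6), [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

def baum_welch_step(x):
    """One EM iteration: E-step (unscaled alpha/beta), M-step parameter updates."""
    T, K = len(x), len(pi)
    obs = np.array([B[:, xt - 1] for xt in x])    # obs[t, k] = P(x_t | z_t = k)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * obs[0]
    for t in range(1, T):
        alpha[t] = obs[t] * (A.T @ alpha[t - 1])
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (obs[t + 1] * beta[t + 1])
    gamma = alpha * beta                          # expected state counts
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi[t, i, j]: expected transition i -> j between times t and t+1
    xi = alpha[:-1, :, None] * A[None] * (obs[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    # M-step: normalize expected counts into new parameters
    pi_new = gamma[0]
    A_new = xi.sum(axis=0)
    A_new /= A_new.sum(axis=1, keepdims=True)
    B_new = np.zeros_like(B)
    for t, xt in enumerate(x):
        B_new[:, xt - 1] += gamma[t]
    B_new /= B_new.sum(axis=1, keepdims=True)
    return pi_new, A_new, B_new

pi2, A2, B2 = baum_welch_step([6, 6, 6, 1, 2, 6])
```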
Using HMMs for recognition
Can train an HMM to classify a sequence:
1. train a separate HMM per class
2. evaluate prob. of unlabelled sequence under each
HMM
3. classify: HMM with highest likelihood

Assumes can solve two problems:
1. estimate model parameters given some training
sequences (we can find local maximum of
parameter space near initial position)
2. given model, can evaluate prob. of a sequence
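The recognition recipe above amounts to comparing forward-pass log-likelihoods; a minimal sketch with two hypothetical single-state "HMMs" (fair vs loaded die):

```python
import numpy as np

def log_likelihood(x, pi, A, B):
    """log P(x) via the scaled forward pass."""
    belief = pi * B[:, x[0] - 1]
    ll = np.log(belief.sum())
    belief /= belief.sum()
    for xt in x[1:]:
        belief = B[:, xt - 1] * (A.T @ belief)
        ll += np.log(belief.sum())
        belief /= belief.sum()
    return ll

# Hypothetical class models: degenerate one-state HMMs for illustration
fair = (np.array([1.0]), np.array([[1.0]]), np.full((1, 6), 1 / 6))
loaded = (np.array([1.0]), np.array([[1.0]]), np.array([[0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]))

x = [6, 6, 6, 2, 6]
scores = {name: log_likelihood(x, *m) for name, m in [("fair", fair), ("loaded", loaded)]}
label = max(scores, key=scores.get)    # classify: model with highest likelihood
print(label)
```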
Probability of observed sequence
Want to determine if given observation sequence is likely
under the model (for learning, or recognition)

Compute marginals to evaluate prob. of observed seq.: sum
across all paths of joint prob. of observed outputs and state

Take advantage of factorization to avoid exponential cost (#paths = K^T)


Variants on basic HMM
•  Input-output HMM
–  Have additional observed variables U

•  Semi-Markov HMM
–  Improve model of state duration

•  Autoregressive HMM
–  Allow observations to depend on some previous
observations directly

•  Factorial HMM
–  Expand dim. of latent state
State Space Models
Instead of discrete latent state of the HMM, model Z as a continuous latent variable
Standard formulation: linear-Gaussian (LDS), with hidden state Z, observation Y, other variables U
–  Transition model is linear, with Gaussian noise:
zt = At zt−1 + Bt ut + εt,  εt ∼ N(0, Qt)
–  Observation model is linear, with Gaussian noise:
yt = Ct zt + Dt ut + δt,  δt ∼ N(0, Rt)

Model parameters typically independent of time: stationary
Kalman Filter
Algorithm for filtering in linear-Gaussian state space model
Everything is Gaussian, so can compute updates exactly

Dynamics update: predict next belief state

p(zt|y1:t−1, u1:t) = ∫ N(zt|At zt−1 + Bt ut, Qt) N(zt−1|µt−1, Σt−1) dzt−1
= N(zt|µt|t−1, Σt|t−1)

µt|t−1 = At µt−1 + Bt ut
Σt|t−1 = At Σt−1 Atᵀ + Qt
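The dynamics update is just two matrix equations; a minimal numpy sketch with a hypothetical 1-D constant-velocity model (state = [position, velocity], control u = acceleration):

```python
import numpy as np

dt = 1.0
A_t = np.array([[1.0, dt], [0.0, 1.0]])   # transition matrix (hypothetical)
B_t = np.array([[0.0], [dt]])             # control input matrix
Q_t = 0.01 * np.eye(2)                    # process noise covariance

def predict(mu, Sigma, u):
    """mu_{t|t-1} = A mu + B u ;  Sigma_{t|t-1} = A Sigma A^T + Q"""
    mu_pred = A_t @ mu + B_t @ u
    Sigma_pred = A_t @ Sigma @ A_t.T + Q_t
    return mu_pred, Sigma_pred

mu, Sigma = np.array([0.0, 1.0]), 0.1 * np.eye(2)
mu_pred, Sigma_pred = predict(mu, Sigma, np.array([0.0]))
print(mu_pred)    # position advances by velocity; uncertainty grows by Q
```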
Kalman Filter: Measurement Update
Key step: update hidden state given new measurement:
p(zt|y1:t, u1:t) ∝ p(yt|zt, ut) p(zt|y1:t−1, u1:t)

First term a bit complicated, but can apply various identities (such as the matrix inversion lemma, Bayes rule) to obtain:
p(zt|y1:t, u1:t) = N(zt|µt, Σt)

The mean update depends on the Kalman gain matrix Kt and the residual or innovation rt = yt − E[yt]:
µt = µt|t−1 + Kt rt
Kt = Σt|t−1 Ctᵀ St⁻¹
ŷt = E[yt|y1:t−1, u1:t] = Ct µt|t−1 + Dt ut
St = cov[rt|y1:t−1, u1:t] = Ct Σt|t−1 Ctᵀ + Rt
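The measurement update in code, continuing the hypothetical position/velocity model (observe position only; no control feed-through, i.e. D = 0 for brevity):

```python
import numpy as np

C_t = np.array([[1.0, 0.0]])              # observe position only (hypothetical)
R_t = np.array([[0.25]])                  # measurement noise covariance

def measurement_update(mu_pred, Sigma_pred, y):
    r = y - C_t @ mu_pred                            # innovation r = y - E[y]
    S = C_t @ Sigma_pred @ C_t.T + R_t               # innovation covariance
    K = Sigma_pred @ C_t.T @ np.linalg.inv(S)        # Kalman gain
    mu = mu_pred + K @ r
    Sigma = (np.eye(len(mu_pred)) - K @ C_t) @ Sigma_pred
    return mu, Sigma

mu_pred, Sigma_pred = np.array([1.0, 1.0]), 0.2 * np.eye(2)
mu, Sigma = measurement_update(mu_pred, Sigma_pred, np.array([1.4]))
print(mu)    # mean pulled toward the measurement, weighted by the gain
```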
Kalman Filter: Extensions
Learning similar to HMM
–  Need to solve inference problem – local posterior marginals
for latent variables
–  Use Kalman smoothing instead of forward-backward in E
step, re-derive updates in M step

Many extensions and elaborations


–  Non-linear models: extended KF, unscented KF
–  Non-Gaussian noise
–  More general posteriors (multi-modal, discrete, etc.)
–  Large systems with sparse structure (sparse information
filter)
Viterbi decoding
How to choose single best path through state space?
Choose state with largest probability at each time t: maximize
expected number of correct states
But this may not be the best path, i.e., the one with highest likelihood of generating the data

To find best path – Viterbi decoding, form of dynamic programming (forward-backward algorithm)
Same recursions, but replace ∑ with max ("brace" example)
Forward: retain best path into each node at time t
Backward: retrace path back from state where most probable path ends
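Replacing the sum with a max (and keeping backpointers) gives the Viterbi recursion; a sketch in log space, with the hypothetical casino parameters:

```python
import numpy as np

pi = np.array([0.5, 0.5])                         # hypothetical parameters
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.vstack([np.full(6, 1 / 6), [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

def viterbi(x):
    """Most likely state path: max-product recursion with backpointers."""
    T, K = len(x), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, x[0] - 1]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA            # scores[i, j]: best way into j via i
        back[t] = scores.argmax(axis=0)           # retain best path into each node
        delta = scores.max(axis=0) + logB[:, x[t] - 1]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # backward: retrace best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

print(viterbi([1, 2, 6, 6, 6, 6]))
```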
