
State Space Analysis and Kalman Filter

The document introduces state space analysis as a unified methodology for time series problems, focusing on inferring unobserved vectors from observed data. It presents a systematic treatment beginning with simple models, such as the local level model, and progresses to more complex linear and nonlinear models, covering both classical and Bayesian perspectives. The book aims to make the concepts accessible to readers with basic knowledge of statistics and matrix algebra, while also providing computational techniques and examples from real data.


1 Introduction

1.1 Basic ideas of state space analysis


State space modelling provides a unified methodology for treating a wide range
of problems in time series analysis. In this approach it is assumed that the
development over time of the system under study is determined by an unobserved
series of vectors α1 , . . . , αn , with which are associated a series of observations
y1 , . . . , yn ; the relation between the αt ’s and the yt ’s is specified by the state
space model. The main purpose of state space analysis is to infer the relevant
properties of the αt ’s from a knowledge of the observations y1 , . . . , yn . Other
purposes include forecasting, signal extraction and estimation of parameters.
This book presents a systematic treatment of this approach to problems of time
series analysis.
Our starting point when deciding the structure of the book was that we
wanted to make the basic ideas of state space analysis easy to understand for
readers with no previous knowledge of the approach. We felt that if we had begun
the book by developing the theory step by step for a general state space model,
the underlying ideas would be obscured by the complicated appearance of many
of the formulae. We therefore decided instead to devote Chapter 2 of the book
to a particularly simple example of a state space model, the local level model,
and to develop as many as possible of the basic state space techniques for this
model. Our hope is that this will enable readers new to the techniques to gain
insights into the ideas behind state space methodology that will help them when
working through the greater complexities of the treatment of the general case.
With this purpose in mind, we introduce topics such as Kalman filtering, state
smoothing, disturbance smoothing, simulation smoothing, missing observations,
forecasting, initialisation, maximum likelihood estimation of parameters and
diagnostic checking for the local level model. We present the results from both
classical and Bayesian standpoints. We demonstrate how the basic theory that
is needed for both cases can be developed from elementary results in regression
theory.

1.2 Linear models


Before going on to develop the theory for the general model, we present a series
of examples that show how the linear state space model relates to problems
of practical interest. This is done in Chapter 3 where we begin by showing how
structural time series models can be put into state space form. By structural time
series models we mean models in which the observations are made up of trend,
seasonal, cycle and regression components plus error. We go on to put Box–
Jenkins ARIMA models into state space form, thus demonstrating that these
models are special cases of state space models. Next we discuss the history of
exponential smoothing and show how it relates to simple forms of state space and
ARIMA models. We follow this by considering various aspects of regression with
or without time-varying coefficients or autocorrelated errors. We also present a
treatment of dynamic factor analysis. Further topics discussed are the simultaneous
modelling of series from different sources, benchmarking, continuous time models
and spline smoothing in discrete and continuous time. These considerations apply
to minimum variance linear unbiased systems and to Bayesian treatments as well
as to classical models.
Chapter 4 begins with a set of four lemmas from elementary multivariate
regression which provides the essentials of the theory for the general linear state
space model from both a classical and a Bayesian standpoint. These have the
useful property that they produce the same results for Gaussian assumptions of
the model and for linear minimum variance criteria where the Gaussian assump-
tions are dropped. The implication of these results is that we only need to prove
formulae for classical models assuming normality and they remain valid for lin-
ear minimum variance and for Bayesian assumptions. The four lemmas lead to
derivations of the Kalman filter and smoothing recursions for the estimation of
the state vector and its conditional variance matrix given the data. We also derive
recursions for estimating the observation and state disturbances. We derive the
simulation smoother which is an important tool in the simulation methods we
employ later in the book. We show that allowance for missing observations and
forecasting are easily dealt with in the state space framework.
Computational algorithms in state space analyses are mainly based on recur-
sions, that is, formulae in which we calculate the value at time t + 1 from earlier
values for t, t − 1, . . . , 1. The question of how these recursions are started up at
the beginning of the series is called initialisation; it is dealt with in Chapter 5. We
give a general treatment in which some elements of the initial state vector have
known distributions while others are diffuse, that is, treated as random variables
with infinite variance, or are treated as unknown constants to be estimated by
maximum likelihood.
Chapter 6 discusses further computational aspects of filtering and smooth-
ing and begins by considering the estimation of a regression component of the
model and intervention components. It next considers the square root filter and
smoother which may be used when the Kalman filter and smoother show signs
of numerical instability. It goes on to discuss how multivariate time series can
be treated as univariate series by bringing elements of the observational vec-
tors into the system one at a time, with computational savings relative to the
multivariate treatment in some cases. Further modifications are discussed where
the observation vector is high-dimensional. The chapter concludes by discussing
computer algorithms.
In Chapter 7, maximum likelihood estimation of parameters is considered
both for the case where the distribution of the initial state vector is known and
for the case where at least some elements of the vector are diffuse or are treated
as fixed and unknown. The use of the score vector and the EM algorithm is
discussed. The effect of parameter estimation on variance estimation is examined.
Up to this point the exposition has been based on the classical approach
to inference in which formulae are worked out on the assumption that param-
eters are known, while in applications unknown parameter values are replaced
by appropriate estimates. In Bayesian analysis the parameters are treated as
random variables with a specified or a noninformative prior joint density, which
necessitates treatment by simulation techniques that are not introduced until
Chapter 13. The first part of Chapter 13 considers a Bayesian analysis of the linear
Gaussian model both for the case where the prior density is proper and for the
case where it is noninformative. We give formulae from which the posterior
mean can be calculated for functions of the state vector, either by numeri-
cal integration or by simulation. We restrict attention to functions which, for
given values of the parameters, can be calculated by the Kalman filter and
smoother.
In Chapter 8 we illustrate the use of the methodology by applying the tech-
niques that have been developed to a number of analyses based on real data.
These include a study of the effect of the seat belt law on road accidents in Great
Britain, forecasting the number of users logged on to an Internet server, fitting
acceleration against time for a simulated motorcycle accident and a dynamic
factor analysis for the term structure of US interest rates.

1.3 Non-Gaussian and nonlinear models


Part II of the book extends the treatment to state space models which are not
both linear and Gaussian. Chapter 9 illustrates the range of non-Gaussian and
nonlinear models that can be analysed using the methods of Part II. This includes
exponential family models such as the Poisson distribution for the conditional
distribution of the observations given the state. It also includes heavy-tailed dis-
tributions for the observational and state disturbances, such as the t-distribution
and mixtures of normal densities. Departures from linearity of the models are
studied for cases where the basic state space structure is preserved. Financial
models such as stochastic volatility models are investigated from the state space
point of view.
Chapter 10 considers approximate methods for analysis of non-Gaussian
and nonlinear models, that is, extended Kalman filter methods and unscented
methods. It also discusses approximate methods based on first and second order
Taylor expansions. We show how to calculate the conditional mode of the state
given the observations for the non-Gaussian model by iterated use of the Kalman
filter and smoother. We then find the linear Gaussian model with the same
conditional mode given the observations.
The simulation techniques for exact handling of non-Gaussian and nonlinear
models are based on importance sampling and are described in Chapter 11.
We use the conditional density of the state given the obser-
vations for an approximating linear Gaussian model as the importance density.
We draw random samples from this density for the simulation using the simu-
lation smoother described in Chapter 4. To improve efficiency we introduce two
antithetic variables intended to balance the simulation sample for location and
scale.
In Chapter 12 we emphasise the fact that simulation for time series can be
done sequentially, that is, instead of selecting an entire new sample for each time
point t, which is the method suggested in Section 12.2, we fix the sample at the
values previously obtained at time . . . , t−2, t−1, and choose a new value at time
t only. New recursions are required for the resulting simulations. This method is
called particle filtering.
In Chapter 13 we discuss the use of importance sampling for the estimation of
parameters in Bayesian analysis for models of Part I and Part II. An alternative
simulation technique is Markov chain Monte Carlo. We prefer to use importance
sampling for the problems considered in this book but a brief description is given
for comparative purposes.
We provide examples in Chapter 14 which illustrate the methods that have
been developed in Part II for analysing observations using non-Gaussian and
nonlinear state space models. The illustrations include the monthly number of
van drivers killed in road accidents in Great Britain, outlying observations in
quarterly gas consumption, the volatility of exchange rate returns and analysis
of the results of the annual boat race between teams of the universities of Oxford
and Cambridge.

1.4 Prior knowledge


Only basic knowledge of statistics and matrix algebra is needed in order to under-
stand the theory in this book. In statistics, an elementary knowledge is required
of the conditional distribution of a vector y given a vector x in a multivariate
normal distribution; the central results needed from this area for much of the
theory of the book are stated in the lemmas in Section 4.2. Little previous knowl-
edge of time series analysis is required beyond an understanding of the concepts
of a stationary time series and the autocorrelation function. In matrix algebra
all that is needed is matrix multiplication and inversion of matrices, together
with basic concepts such as rank and trace.

1.5 Notation
Although a large number of mathematical symbols are required for the exposition
of the theory in this book, we decided to confine ourselves to the standard
English and Greek alphabets. The effect of this is that we occasionally need to
use the same symbol more than once; we have aimed however at ensuring that
the meaning of the symbol is always clear from the context. We present below a
list of the main conventions we have employed.

• The same symbol 0 is used to denote zero, a vector of zeros or a matrix of
zeros.
• The symbol Ik denotes an identity matrix of dimension k.
• We use the generic notation p(·), p(·, ·), p(·|·) to denote a probability density,
a joint probability density and a conditional probability density.
• If x is a random vector, not necessarily normal, with mean vector µ and
variance matrix V , we write x ∼ (µ, V ).
• If x is a random vector which is normally distributed with mean vector µ and
variance matrix V , we write x ∼ N(µ, V ).
• If x is a random variable with the chi-squared distribution with ν degrees of
freedom, we write x ∼ χ2ν .
• We use the same symbol Var(x) to denote the variance of a scalar random
variable x and the variance matrix of a random vector x.
• We use the same symbol Cov(x, y) to denote the covariance between scalar
random variables x and y, between a scalar random variable x and a random
vector y, and between random vectors x and y.
• The symbol E(x|y) denotes the conditional expectation of x given y; similarly
for Var(x|y) and Cov(x, y|z) for random vectors x, y and z.
• The symbol diag(a1 , . . . , ak ) denotes the ℓ × ℓ block diagonal matrix with the
nonsingular matrices a1 , . . . , ak down the leading diagonal and zeros elsewhere,
where ℓ = rank(a1 ) + · · · + rank(ak ).

1.6 Other books on state space methods


Without claiming complete coverage, we list here a number of books which
contain treatments of state space methods.
First we mention three early books written from an engineering standpoint:
Jazwinski (1970), Sage and Melsa (1971) and Anderson and Moore (1979).
A later book from a related standpoint is Young (1984).
Books written from the standpoint of statistics and econometrics include
Harvey (1989), who gives a comprehensive state space treatment of structural
time series models together with related state space material, West and Harrison
(1997), who give a Bayesian treatment with emphasis on forecasting, Kitagawa
and Gersch (1996) and Kim and Nelson (1999). A complete Bayesian treatment
for specific classes of time series models including the state space model is given
by Frühwirth-Schnatter (2006). Fundamental and rigorous statistical treatments
of classes of hidden Markov models, which include our nonlinear non-Gaussian
state space model of Part II, are presented in Cappé, Moulines and Rydén
(2005). An introductory and elementary treatment of state space methods from
a practitioners' perspective is provided by Commandeur and Koopman (2007).
More general books on time series analysis and related topics which cover
partial treatments of state space topics include Brockwell and Davis (1987) (39
pages on state space out of about 570), Chatfield (2003) (14 pages out of about
300), Harvey (1993) (48 pages out of about 300), Hamilton (1994) (37 pages on
state space out of about 800 pages) and Shumway and Stoffer (2000) (112 pages
out of about 545 pages). The monograph of Jones (1993) on longitudinal models
has three chapters on state space (66 pages out of about 225). The book by
Fahrmeir and Tutz (1994) on multivariate analysis based on generalised linear
modelling has a chapter on state space models (48 pages out of about 420).
Finally, the book by Teräsvirta, Tjøstheim and Granger (2011) on nonlinear
time series modelling has one chapter on the treatment of nonlinear state space
models (32 pages out of about 500 pages).
Books on time series analysis and similar topics with minor treatments of
state space analysis include Granger and Newbold (1986) and Mills (1993). We
mention finally the book edited by Doucet, De Freitas and Gordon (2001) which
contains a collection of articles on Monte Carlo (particle) filtering and the book
edited by Akaike and Kitagawa (1999) which contains 6 chapters (88 pages) on
illustrations of state space analysis out of a total of 22 chapters (385 pages).

1.7 Website for the book


We will maintain a website for the book at
[Link]
for data, code, corrections and other relevant information. We will be grateful
to readers who send us their comments and point out errors in the book so that
corrections can be placed on the site.
2 Local level model

2.1 Introduction
The purpose of this chapter is to introduce the basic techniques of state space
analysis, such as filtering, smoothing, initialisation and forecasting, in terms of
a simple example of a state space model, the local level model. This is intended
to help beginners grasp the underlying ideas more quickly than they would if
we were to begin the book with a systematic treatment of the general case. We
shall present results from both the classical and Bayesian perspectives, assuming
normality, and also from the standpoint of minimum variance linear unbiased
estimation when the normality assumption is dropped.
A time series is a set of observations y1 , . . . , yn ordered in time. The basic
model for representing a time series is the additive model

yt = µt + γt + εt ,        t = 1, . . . , n.        (2.1)

Here, µt is a slowly varying component called the trend, γt is a periodic
component of fixed period called the seasonal and εt is an irregular compo-
nent called the error or disturbance. In general, the observation yt and the
other variables in (2.1) can be vectors but in this chapter we assume they
are scalars. In many applications, particularly in economics, the components
combine multiplicatively, giving

yt = µt γt εt .        (2.2)

By taking logs however and working with logged values model (2.2) reduces to
model (2.1), so we can use model (2.1) for this case also.
To develop suitable models for µt and γt we need the concept of a random
walk . This is a scalar series αt determined by the relation αt+1 = αt + ηt where
the ηt ’s are independent and identically distributed random variables with zero
means and variances ση2 .
Consider a simple form of model (2.1) in which µt = αt where αt is a random
walk, no seasonal is present and all random variables are normally distributed.
We assume that εt has constant variance σε2 . This gives the model
 
yt = αt + εt ,        εt ∼ N(0, σε2 ),
αt+1 = αt + ηt ,      ηt ∼ N(0, ση2 ),        (2.3)
for t = 1, . . . , n where the εt ’s and ηt ’s are all mutually independent and are
independent of α1 . This model is called the local level model. Although it has a
simple form, this model is not an artificial special case and indeed it provides the
basis for the analysis of important real problems in practical time series analysis;
for example, the local level model provides the basis for our analysis of the Nile
data that we start in Subsection 2.2.5. It exhibits the characteristic structure
of state space models in which there is a series of unobserved values α1 , . . . , αn ,
called the states, which represents the development over time of the system
under study, together with a set of observations y1 , . . . , yn which are related to
the αt ’s by the state space model (2.3). The object of the methodology that
we shall develop is to infer relevant properties of the αt ’s from a knowledge of
the observations y1 , . . . , yn . The model (2.3) is suitable for both classical and
Bayesian analysis. Where the εt ’s and the ηt ’s are not normally distributed
we obtain equivalent results from the standpoint of minimum variance linear
unbiased estimation.
We assume initially that α1 ∼ N(a1 ,P1 ) where a1 and P1 are known and
that σε2 and ση2 are known. Since random walks are non-stationary the model
is non-stationary. By non-stationary here we mean that distributions of random
variables yt and αt depend on time t.
For applications of model (2.3) to real series, we need to compute quantities
such as the mean of αt given y1 , . . . , yt−1 or the mean of αt given y1 , . . . , yn ,
together with their variances; we also need to fit the model to data by calculating
maximum likelihood estimates of the parameters σε2 and ση2 . In principle, this
could be done by using standard results from multivariate normal theory as
described in books such as Anderson (2003). In this approach the observations yt
generated by the local level model are represented as the n×1 vector Yn such that
Yn ∼ N(1a1 , Ω),    with    Yn = (y1 , . . . , yn )′ ,    1 = (1, . . . , 1)′ ,    Ω = 11′ P1 + Σ,        (2.4)

where the (i, j)th element of the n × n matrix Σ is given by


       ⎧ (i − 1)ση2 ,           i < j,
Σij =  ⎨ σε2 + (i − 1)ση2 ,     i = j,        i, j = 1, . . . , n,        (2.5)
       ⎩ (j − 1)ση2 ,           i > j,

which follows since the local level model implies that


yt = α1 + Σ_{j=1}^{t−1} ηj + εt ,        t = 1, . . . , n.        (2.6)

Starting from this knowledge of the distribution of Yn , estimation of conditional
means, variances and covariances is in principle a routine matter using standard
results in multivariate analysis based on the properties of the multivariate normal
distribution. However, because of the serial correlation between the observations
yt , the routine computations rapidly become cumbersome as n increases. This
naive approach to estimation can be improved upon considerably by using the
filtering and smoothing techniques described in the next three sections. In effect,
these techniques provide efficient computing algorithms for obtaining the same
results as those derived by multivariate analysis theory. The remaining sections
of this chapter deal with other important issues such as fitting the local level
model and forecasting future observations.
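As a concrete sketch of this direct (and deliberately naive) approach, the Python fragment below builds Σ and Ω as in (2.4) and (2.5) for a simulated local level series and computes E(αt |Yn ) and Var(αt |Yn ) by standard multivariate normal conditioning. The parameter values, the seed and all variable names are our own illustrative choices, not taken from the text.

```python
import numpy as np

# Illustrative parameter values (assumptions, not from the text).
n, a1, P1, sig_eps2, sig_eta2 = 50, 0.0, 10.0, 2.0, 0.5
rng = np.random.default_rng(0)

# Simulate a local level series as in (2.3).
alpha = np.empty(n)
alpha[0] = rng.normal(a1, np.sqrt(P1))
for t in range(n - 1):
    alpha[t + 1] = alpha[t] + rng.normal(0.0, np.sqrt(sig_eta2))
y = alpha + rng.normal(0.0, np.sqrt(sig_eps2), n)

# Sigma per (2.5): Sigma_ij = sig_eps2*(i == j) + (min(i, j) - 1)*sig_eta2, 1-based.
i = np.arange(1, n + 1)
Sigma = (np.minimum.outer(i, i) - 1) * sig_eta2 + np.eye(n) * sig_eps2
Omega = P1 * np.ones((n, n)) + Sigma          # (2.4)

# Condition alpha_t on Y_n by standard multivariate normal theory.
t = 25
c = P1 + sig_eta2 * np.minimum(t - 1, i - 1)  # Cov(alpha_t, y_j), from (2.6)
mean_t = a1 + c @ np.linalg.solve(Omega, y - a1)            # E(alpha_t | Y_n)
var_t = P1 + (t - 1) * sig_eta2 - c @ np.linalg.solve(Omega, c)  # Var(alpha_t | Y_n)
print(mean_t, var_t)
```

Each evaluation requires solving an n × n linear system built from the full covariance matrix, which is exactly the computational burden that the filtering and smoothing recursions of the following sections avoid.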

2.2 Filtering
2.2.1 The Kalman filter
The object of filtering is to update our knowledge of the system each time a
new observation yt is brought in. We shall first develop the theory of filtering for
the local level model (2.3) where the εt ’s and ηt ’s are assumed normal from the
standpoint of classical analysis. Since in this case all distributions are normal,
conditional joint distributions of one set of observations given another set are
also normal. Let Yt−1 be the vector of observations (y1 , . . . , yt−1 ) for t = 2, 3, . . .
and assume that the conditional distribution of αt given Yt−1 is N(at ,Pt ) where
at and Pt are known. Assume also that the conditional distribution of αt given
Yt is N(at|t , Pt|t ). The distribution of αt+1 given Yt is N(at+1 , Pt+1 ). Our object
is to calculate at|t , Pt|t , at+1 and Pt+1 when yt is brought in. We refer to at|t as
the filtered estimator of the state αt and at+1 as the one-step ahead predictor of
αt+1 . Their respective associated variances are Pt|t and Pt+1 .
An important part is played by the one-step ahead prediction error vt of yt ,
defined by vt = yt − at for t = 1, . . . , n. We have

E(vt |Yt−1 ) = E(αt + εt − at |Yt−1 ) = at − at = 0,
Var(vt |Yt−1 ) = Var(αt + εt − at |Yt−1 ) = Pt + σε2 ,
E(vt |αt , Yt−1 ) = E(αt + εt − at |αt , Yt−1 ) = αt − at ,        (2.7)
Var(vt |αt , Yt−1 ) = Var(αt + εt − at |αt , Yt−1 ) = σε2 ,

for t = 2, . . . , n. When Yt is fixed, Yt−1 and yt are fixed so Yt−1 and vt are fixed
and vice versa. Consequently, p(αt |Yt ) = p(αt |Yt−1 , vt ). We have

p(αt |Yt−1 , vt ) = p(αt , vt |Yt−1 ) / p(vt |Yt−1 )
                 = p(αt |Yt−1 ) p(vt |αt , Yt−1 ) / p(vt |Yt−1 )        (2.8)
                 = constant × exp(−Q/2),
where

Q = (αt − at )2 /Pt + (vt − αt + at )2 /σε2 − vt2 /(Pt + σε2 )
  = (1/Pt + 1/σε2 )(αt − at )2 − 2(αt − at )vt /σε2 + [1/σε2 − 1/(Pt + σε2 )]vt2        (2.9)
  = ((Pt + σε2 )/(Pt σε2 )) [αt − at − Pt vt /(Pt + σε2 )]2 .

Thus

p(αt |Yt ) = N( at + Pt vt /(Pt + σε2 ),  Pt σε2 /(Pt + σε2 ) ).        (2.10)
But at|t and Pt|t have been defined such that p(αt |Yt ) = N(at|t , Pt|t ). It follows
that
at|t = at + Pt vt /(Pt + σε2 ),        (2.11)
Pt|t = Pt σε2 /(Pt + σε2 ).            (2.12)

Since at+1 = E(αt+1 |Yt ) = E(αt + ηt |Yt ) and Pt+1 = Var(αt+1 |Yt ) = Var(αt +
ηt |Yt ) from (2.3), we have

at+1 = E(αt |Yt ) = at|t ,
Pt+1 = Var(αt |Yt ) + ση2 = Pt|t + ση2 ,

giving

at+1 = at + Pt vt /(Pt + σε2 ),             (2.13)
Pt+1 = Pt σε2 /(Pt + σε2 ) + ση2 ,          (2.14)

for t = 2, . . . , n. For t = 1 we delete the symbol Yt−1 in the above derivation
and we find that all results from (2.7) to (2.13) hold for t = 1 as well as for
t = 2, . . . , n.
In order to make these results consistent with the treatment of filtering for the
general linear state space model in Subsection 4.3.1, we introduce the notation

Ft = Var(vt |Yt−1 ) = Pt + σε2 , Kt = Pt /Ft ,

where Ft is referred to as the variance of the prediction error vt and Kt is known
as the Kalman gain. Using (2.11) to (2.14) we can then write the full set of
relations for updating from time t to time t + 1 in the form
vt = yt − at ,             Ft = Pt + σε2 ,
at|t = at + Kt vt ,        Pt|t = Pt (1 − Kt ),          (2.15)
at+1 = at + Kt vt ,        Pt+1 = Pt (1 − Kt ) + ση2 ,

for t = 1, . . . , n, where Kt = Pt / Ft .
We have assumed that a1 and P1 are known; however, more general initial
specifications for a1 and P1 will be dealt with in Section 2.9. Relations (2.15)
constitute the celebrated Kalman filter for the local level model. It should be
noted that Pt depends only on σε2 and ση2 and does not depend on Yt−1 . We
include the case t = n in (2.15) for convenience even though an+1 and Pn+1
are not normally needed for anything except forecasting. A set of relations such
as (2.15) which enables us to calculate quantities for t + 1 given those for t is
called a recursion.
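The recursion (2.15) translates directly into a few lines of code. The sketch below is a minimal Python implementation for the local level model; the function name and interface are our own, not the book's.

```python
import numpy as np

def local_level_filter(y, a1, P1, sig_eps2, sig_eta2):
    """Kalman filter (2.15) for the local level model (2.3).

    Returns the one-step predictions a_t, their variances P_t,
    the prediction errors v_t and the error variances F_t.
    """
    n = len(y)
    a = np.empty(n + 1)
    P = np.empty(n + 1)
    v = np.empty(n)
    F = np.empty(n)
    a[0], P[0] = a1, P1
    for t in range(n):
        v[t] = y[t] - a[t]                  # v_t = y_t - a_t
        F[t] = P[t] + sig_eps2              # F_t = P_t + sigma_eps^2
        K = P[t] / F[t]                     # Kalman gain K_t
        a[t + 1] = a[t] + K * v[t]          # a_{t+1} = a_t + K_t v_t
        P[t + 1] = P[t] * (1 - K) + sig_eta2
    return a[:-1], P[:-1], v, F
```

As noted above, Pt (and hence Ft and Kt ) depends only on σε2 , ση2 and P1 and not on the data, so the whole variance path could in principle be computed before any observation arrives.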

2.2.2 Regression lemma


The above derivation of the Kalman filter can be regarded as an application of
a regression lemma for the bivariate normal distribution. Suppose that x and y
are jointly normally distributed variables with
      ( x )   ( µx )            ( x )   ( σx2   σxy )
    E (   ) = (    ) ,      Var (   ) = (           ) ,
      ( y )   ( µy )            ( y )   ( σxy   σy2 )

with means µx and µy , variances σx2 and σy2 , and covariance σxy . The joint
distribution is
p(x, y) = p(y) p(x|y),
by the definition of the conditional density p(x|y). But it can also be verified by
direct multiplication. We have
 
p(x, y) = (1/(2πσy √A)) exp{ −(y − µy )2 /(2σy2 ) − [x − µx − σxy σy−2 (y − µy )]2 /(2A) },

where A = σx2 − σxy2 /σy2 . It follows that the conditional distribution of x given y
is normal, with a variance that does not depend on y, with mean and variance given by

E(x|y) = µx + (σxy /σy2 )(y − µy ),        Var(x|y) = σx2 − σxy2 /σy2 .

To apply this lemma to the Kalman filter, let vt = yt − at and keep Yt−1 fixed.
Take x = αt so that µx = at and y = vt . It follows that µy = E(vt ) = 0. Then,
σx2 = Var(αt ) = Pt , σy2 = Var(vt ) = Var(αt − at + εt ) = Pt + σε2 and σxy = Pt .
We obtain the conditional distribution for αt given vt by

E(αt |vt ) = at|t = at + Pt (yt − at )/(Pt + σε2 ),    Var(αt |vt ) = Pt|t = Pt σε2 /(Pt + σε2 ).
In a similar way we can obtain the equations for at+1 and Pt+1 by application
of this regression lemma.
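The identity p(x, y) = p(y) p(x|y) underlying the lemma is easy to verify numerically. The sketch below, with arbitrarily chosen moments of our own, evaluates the bivariate normal density directly and compares it with the product of p(y) and the conditional density implied by the lemma's formulae for E(x|y) and Var(x|y).

```python
import numpy as np

# Arbitrary illustrative moments (our own choices).
mu_x, mu_y = 1.0, -0.5
sx2, sy2, sxy = 2.0, 1.5, 0.8

def joint_pdf(x, y):
    """Bivariate normal density p(x, y) evaluated directly."""
    V = np.array([[sx2, sxy], [sxy, sy2]])
    d = np.array([x - mu_x, y - mu_y])
    return np.exp(-0.5 * d @ np.linalg.solve(V, d)) / (
        2 * np.pi * np.sqrt(np.linalg.det(V)))

def factorised_pdf(x, y):
    """p(y) * p(x|y) using the lemma's conditional mean and variance."""
    A = sx2 - sxy**2 / sy2                    # Var(x|y)
    m = mu_x + (sxy / sy2) * (y - mu_y)       # E(x|y)
    p_y = np.exp(-0.5 * (y - mu_y)**2 / sy2) / np.sqrt(2 * np.pi * sy2)
    p_x_given_y = np.exp(-0.5 * (x - m)**2 / A) / np.sqrt(2 * np.pi * A)
    return p_y * p_x_given_y

print(joint_pdf(0.0, 0.0), factorised_pdf(0.0, 0.0))
```

The two evaluations agree at every point, which is the "direct multiplication" check mentioned in the text.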

2.2.3 Bayesian treatment


To analyse the local level model from a Bayesian standpoint, we assume that
the data are generated by model (2.3). In this approach αt and yt are regarded
as a parameter and a constant, respectively. Before the observation yt is taken,
the prior distribution of αt is p(αt |Yt−1 ). The likelihood of αt is p(yt |αt , Yt−1 ).
The posterior distribution of αt given yt is, by Bayes' theorem, proportional
to the product of these. In particular we have
p(αt |Yt−1 , yt ) = p(αt |Yt−1 ) p(yt |αt , Yt−1 ) / p(yt |Yt−1 ).
Since yt = αt + εt we have E(yt |Yt−1 ) = at and Var(yt |Yt−1 ) = Pt + σε2 , so
p(αt |Yt−1 , yt ) = constant × exp(−Q/2),
where
Q = (αt − at )2 /Pt + (yt − αt )2 /σε2 − (yt − at )2 /(Pt + σε2 )
  = ((Pt + σε2 )/(Pt σε2 )) [αt − at − Pt (yt − at )/(Pt + σε2 )]2 .        (2.16)
This is a normal density which we denote by N(at|t , Pt|t ). Thus the posterior
mean and variance are
at|t = at + Pt (yt − at )/(Pt + σε2 ),
Pt|t = Pt σε2 /(Pt + σε2 ),                (2.17)
which are the same as (2.11) and (2.12) on putting vt = yt − at . The case
t = 1 has the same form. Similarly, the posterior density of αt+1 given yt
is p(αt+1 |Yt−1 , yt ) = p(αt + ηt |Yt−1 , yt ), which is normal with mean at|t and
variance Pt|t + ση2 . Denoting this by N(at+1 , Pt+1 ), we have
at+1 = at|t = at + Pt (yt − at )/(Pt + σε2 ),
Pt+1 = Pt|t + ση2 = Pt σε2 /(Pt + σε2 ) + ση2 ,        (2.18)
which are, of course, the same as (2.13) and (2.14). It follows that the Kalman
filter from a Bayesian point of view has the same form (2.15) as the Kalman
filter from the standpoint of classical inference. This is an important result; as
will be seen in Chapter 4 and later chapters, many inference results for the state
αt are the same whether approached from a classical or a Bayesian standpoint.
2.2.4 Minimum variance linear unbiased treatment


In some situations, some workers object to the assumption of normality in
model (2.3) on the grounds that the observed time series they are concerned
with do not appear to behave in a way that corresponds with the normal distri-
bution. In these circumstances an alternative approach is to treat the filtering
problem as a problem in the estimation of αt and αt+1 given Yt and to confine
attention to estimates that are linear unbiased functions of yt ; we then choose
those estimates that have minimum variance. We call these estimates minimum
variance linear unbiased estimates (MVLUE).
Taking first the case of αt , we seek an estimate ᾱt which has the linear form
ᾱt = β + γyt where β and γ are constants given Yt−1 and which is unbiased in
the sense that the estimation error ᾱt − αt has zero mean in the conditional joint
distribution of αt and yt given Yt−1 . We therefore have

E(ᾱt − αt |Yt−1 ) = E(β + γyt − αt |Yt−1 )
                  = β + γat − at = 0,                 (2.19)

so β = at (1 − γ) which gives ᾱt = at + γ(yt − at ). Thus ᾱt − αt = γ(αt − at +
εt ) − (αt − at ). Now Cov(αt − at + εt , αt − at ) = Pt so we have

Var(ᾱt − αt |Yt−1 ) = γ 2 (Pt + σε2 ) − 2γPt + Pt
                    = (Pt + σε2 )[γ − Pt /(Pt + σε2 )]2 + Pt − Pt2 /(Pt + σε2 ).        (2.20)

This is minimised when γ = Pt /(Pt + σε2 ) which gives

ᾱt = at + Pt (yt − at )/(Pt + σε2 ),                (2.21)
Var(ᾱt − αt |Yt−1 ) = Pt σε2 /(Pt + σε2 ).          (2.22)

Similarly, if we estimate αt+1 given Yt−1 by the linear function ᾱ∗t+1 = β ∗ + γ ∗ yt
and require this to have the unbiasedness property E(ᾱ∗t+1 − αt+1 |Yt−1 ) = 0, we
find that β ∗ = at (1 − γ ∗ ) so ᾱ∗t+1 = at + γ ∗ (yt − at ). By the same argument as
for ᾱt we find that Var(ᾱ∗t+1 − αt+1 |Yt−1 ) is minimised when γ ∗ = Pt /(Pt + σε2 ),
giving

ᾱ∗t+1 = at + Pt (yt − at )/(Pt + σε2 ),                     (2.23)
Var(ᾱ∗t+1 − αt+1 |Yt−1 ) = Pt σε2 /(Pt + σε2 ) + ση2 .      (2.24)
Pt + σε2
We have therefore shown that the estimates ᾱt and ᾱ∗t+1 given by the MVLUE
approach and their variances are exactly the same as the values at|t , at+1 , Pt|t
and Pt+1 in (2.11) to (2.14) that are obtained by assuming normality, both from
a classical and from a Bayesian standpoint. It follows that the values given by the
Kalman filter recursion (2.15) are MVLUE. We shall show in Subsection 4.3.1
that the same is true for the general linear Gaussian state space model (4.12).

2.2.5 Illustration
In this subsection we shall illustrate the output of the Kalman filter using obser-
vations from the river Nile. The data set consists of a series of readings of the
annual flow volume at Aswan from 1871 to 1970. The series has been analysed by
Cobb (1978) and Balke (1993). We analyse the data using the local level model (2.3) with a1 = 0, P1 = 10⁷, σε² = 15,099 and ση² = 1,469.1. The values for a1 and P1 were chosen arbitrarily for illustrative purposes. The values for σε² and ση² are the maximum likelihood estimates which we obtain in Subsection 2.10.3. The values of at together with the raw data, Pt, vt and Ft, for t = 2, …, n, given by the Kalman filter, are presented graphically in Fig. 2.1.


Fig. 2.1 Nile data and output of Kalman filter: (i) data (dots), filtered state at (solid
line) and its 90% confidence intervals (light solid lines); (ii) filtered state variance Pt ;
(iii) prediction errors vt ; (iv) prediction variance Ft .
The most obvious feature of the four graphs is that Pt and Ft converge rapidly to constant values, which confirms that the local level model has a steady state solution; for discussion of the concept of a steady state see Section 2.11. The fitted local level model was found to converge numerically to a steady state in around 25 updates of Pt, although the graph of Pt suggests that the steady state was attained after around 10 updates.
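The filter recursion (2.15) is short enough to sketch directly in code. The sketch below uses the parameter values quoted above for the Nile analysis, but with a simulated series standing in for the actual data, so the numbers are illustrative only:

```python
import numpy as np

def local_level_filter(y, a1, P1, s2_eps, s2_eta):
    """Kalman filter recursion (2.15) for the local level model."""
    n = len(y)
    a = np.zeros(n + 1)   # predicted state means a_t
    P = np.zeros(n + 1)   # predicted state variances P_t
    v = np.zeros(n)       # one-step prediction errors v_t
    F = np.zeros(n)       # prediction error variances F_t
    a[0], P[0] = a1, P1
    for t in range(n):
        v[t] = y[t] - a[t]
        F[t] = P[t] + s2_eps
        K = P[t] / F[t]                     # Kalman gain K_t
        a[t + 1] = a[t] + K * v[t]
        P[t + 1] = P[t] * (1 - K) + s2_eta
    return a[:n], P[:n], v, F

# simulated series standing in for the Nile data (illustrative only)
rng = np.random.default_rng(0)
level = 1120 + np.cumsum(rng.normal(0, np.sqrt(1469.1), 100))
y = level + rng.normal(0, np.sqrt(15099.0), 100)
a, P, v, F = local_level_filter(y, a1=0.0, P1=1e7, s2_eps=15099.0, s2_eta=1469.1)
```

Running the recursion reproduces the behaviour noted above: Pt settles to a constant after a modest number of updates.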

2.3 Forecast errors


The Kalman filter residual vt = yt − at and its variance Ft are the one-step
ahead forecast error and the one-step ahead forecast error variance of yt given
Yt−1 as defined in Section 2.2. The forecast errors v1 , . . . , vn are sometimes called
innovations because they represent the new part of yt that cannot be predicted
from the past for t = 1, . . . , n. We shall make use of vt and Ft for a variety of
results in the next sections. It is therefore important to study them in detail.

2.3.1 Cholesky decomposition


First we show that v1 , . . . , vn are mutually independent. The joint density of
y1 , . . . , yn is

    p(y1, …, yn) = p(y1) ∏_{t=2}^{n} p(yt | Yt−1).                      (2.25)

We then transform from y1, …, yn to v1, …, vn. Since each vt equals yt minus a linear function of y1, …, yt−1 for t = 2, …, n, the Jacobian is one. From (2.25) and making the substitution we have


    p(v1, …, vn) = ∏_{t=1}^{n} p(vt),                                   (2.26)

since p(v1 ) = p(y1 ) and p(vt ) = p(yt |Yt−1 ) for t = 2, . . . , n. Consequently, the
vt ’s are independently distributed.
We next show that the forecast errors vt are effectively obtained from a
Cholesky decomposition of the observation vector Yn . The Kalman filter recur-
sions compute the forecast error vt as a linear function of the initial mean a1
and the observations y1 , . . . , yt since

v 1 = y1 − a1 ,
v2 = y2 − a1 − K1 (y1 − a1 ),
v3 = y3 − a1 − K2 (y2 − a1 ) − K1 (1 − K2 )(y1 − a1 ), and so on.

It should be noted that Kt depends neither on the initial mean a1 nor on the observations y1, …, yn; it depends only on the initial state variance P1 and the disturbance variances σε² and ση². Using the definitions in (2.4), we have
Substituting the covariance terms into this and taking into account the definition
(2.34) leads directly to

rt−1 = rt , α̂t = at + Pt rt−1 , t = τ, . . . , τ ∗ − 1. (2.55)

The consequence is that we can use the original state smoother (2.37) for all t
by taking Kt = 0, and hence Lt = 1, at the missing time points. This device
applies to any missing observation within the sample period. In the same way
the equations for the variance of the state error and the smoothed disturbances
can be obtained by putting Kt = 0 at missing time points.

2.7.1 Illustration
Here we consider the Nile data and the same local level model as before; however,
we treat the observations at time points 21, . . . , 40 and 61, . . . , 80 as missing.
The Kalman filter is applied first and the output vt , Ft , at and Pt is stored for
t = 1, . . . , n. Then, the state smoothing recursions are applied. The first two
graphs in Fig. 2.5 are the Kalman filter values of at and Pt , respectively. The
last two graphs are the smoothing output α̂t and Vt , respectively.
Note that the application of the Kalman filter to missing observations can
be regarded as extrapolation of the series to the missing time points, while
smoothing at these points is effectively interpolation.
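The device of setting Kt = 0 at missing time points amounts to a one-line change in the filter. A minimal sketch for the local level model follows, with the Nile parameter values but invented data:

```python
import numpy as np

def filter_with_missing(y, a1, P1, s2_eps, s2_eta):
    """Local level Kalman filter where y may contain np.nan for missing values.

    At a missing time point we put K_t = 0 (hence L_t = 1), so that
    a_{t+1} = a_t and P_{t+1} = P_t + sigma_eta^2.
    """
    n = len(y)
    a = np.zeros(n + 1)
    P = np.zeros(n + 1)
    a[0], P[0] = a1, P1
    for t in range(n):
        if np.isnan(y[t]):                   # missing: pure extrapolation
            a[t + 1] = a[t]
            P[t + 1] = P[t] + s2_eta
        else:
            F = P[t] + s2_eps
            K = P[t] / F
            a[t + 1] = a[t] + K * (y[t] - a[t])
            P[t + 1] = P[t] * (1 - K) + s2_eta
    return a[1:], P[1:]

y = np.array([1130.0, 1070.0, np.nan, np.nan, 990.0, 1010.0])
a, P = filter_with_missing(y, a1=0.0, P1=1e7, s2_eps=15099.0, s2_eta=1469.1)
```

Across the missing stretch the state estimate stays constant while its variance grows by ση² per step, which is the extrapolation behaviour described above.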

2.8 Forecasting
Let ȳn+j be the minimum mean square error forecast of yn+j given the time
series y1 , . . . , yn for j = 1, 2, . . . , J with J as some pre-defined positive integer.
By minimum mean square error forecast here we mean the function ȳn+j of
y1 , . . . , yn which minimises E[(yn+j − ȳn+j )2 |Yn ]. Then ȳn+j = E(yn+j |Yn ). This
follows immediately from the well-known result that if x is a random variable
with mean µ the value of λ that minimises E(x−λ)2 is λ = µ; see Exercise 4.14.3.
The variance of the forecast error is denoted by F̄n+j = Var(yn+j |Yn ). The theory
of forecasting for the local level model turns out to be surprisingly simple; we
merely regard forecasting as filtering the observations y1 , . . . , yn , yn+1 , . . . , yn+J
using the recursion (2.15) and treating the last J observations yn+1 , . . . , yn+J as
missing, that is, taking Kt = 0 in (2.15).
Letting ān+j = E(αn+j |Yn ) and P̄n+j = Var(αn+j |Yn ), it follows immediately
from equation (2.54) with τ = n + 1 and τ ∗ = n + J in §2.7 that

ān+j+1 = ān+j , P̄n+j+1 = P̄n+j + ση2 , j = 1, . . . , J − 1,


with ān+1 = an+1 and P̄n+1 = Pn+1 obtained from the Kalman filter (2.15).
Furthermore, we have

ȳn+j = E(yn+j |Yn ) = E(αn+j |Yn ) + E(εn+j |Yn ) = ān+j ,


F̄n+j = Var(yn+j |Yn ) = Var(αn+j |Yn ) + Var(εn+j |Yn ) = P̄n+j + σε2 ,

for j = 1, . . . , J. The consequence is that the Kalman filter can be applied for
t = 1, . . . , n + J where we treat the observations at times n + 1, . . . , n + J as
missing. Thus we conclude that forecasts and their error variances are delivered
by applying the Kalman filter in a routine way with Kt = 0 for t = n+1, . . . , n+J.
The same property holds for the general linear Gaussian state space model as
we shall show in Section 4.11. For a Bayesian treatment a similar argument can
be used to show that the posterior mean and variance of the forecast of yn+j is
obtained by treating yn+1 , . . . , yn+j as missing values, for j = 1, . . . , J.
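Forecasting by treating yn+1, …, yn+J as missing can be sketched directly; the series below is simulated and purely illustrative:

```python
import numpy as np

def forecast_local_level(y, J, a1, P1, s2_eps, s2_eta):
    """Forecast y_{n+1}, ..., y_{n+J} by running the Kalman filter over the
    sample and then continuing it with K_t = 0, i.e. treating the future
    observations as missing."""
    a, P = a1, P1
    for yt in y:                             # ordinary filtering pass
        F = P + s2_eps
        K = P / F
        a, P = a + K * (yt - a), P * (1 - K) + s2_eta
    # beyond t = n the state forecast stays at a_{n+1}; its variance grows
    abar = np.full(J, a)                     # abar_{n+j} for j = 1, ..., J
    Pbar = P + s2_eta * np.arange(J)         # Pbar_{n+j} = P_{n+1} + (j-1) s2_eta
    return abar, Pbar + s2_eps               # ybar_{n+j} and Fbar_{n+j}

rng = np.random.default_rng(1)
y = 1000 + rng.normal(0, 50, 40)
ybar, Fbar = forecast_local_level(y, J=10, a1=0.0, P1=1e7,
                                  s2_eps=15099.0, s2_eta=1469.1)
```

As the theory above shows, the observation forecasts are constant over the horizon while the forecast error variance increases linearly in j.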

2.8.1 Illustration
The Nile data set is now extended by 30 missing observations allowing the com-
putation of forecasts for the observations y101 , . . . , y130 . Only the Kalman filter


Fig. 2.6 Nile data and output of forecasting: (i) data (dots), state forecast at and
50% confidence intervals; (ii) state variance Pt ; (iii) observation forecast E(yt |Yt−1 );
(iv) observation forecast variance Ft .
is required. The graphs in Fig. 2.6 contain ān+j, P̄n+j, ȳn+j and F̄n+j, respectively, for j = 1, …, J with J = 30. The confidence interval for E(yn+j | Yn) is ȳn+j ± k √F̄n+j, where k is determined by the required probability of inclusion; in Fig. 2.6 this probability is 50%.

2.9 Initialisation
We assumed in our treatment of the linear Gaussian model in previous sections
that the distribution of the initial state α1 is N(a1 , P1 ) where a1 and P1 are
known. We now consider how to start up the filter (2.15) when nothing is known
about the distribution of α1 , which is the usual situation in practice. In this
situation it is reasonable to represent α1 as having a diffuse prior density, that
is, fix a1 at an arbitrary value and let P1 → ∞. From (2.15) we have
    v1 = y1 − a1,        F1 = P1 + σε²,

and, by substituting into the equations for a2 and P2 in (2.15), it follows that

    a2 = a1 + [P1/(P1 + σε²)] (y1 − a1),                                (2.56)

    P2 = P1 [1 − P1/(P1 + σε²)] + ση²
       = [P1/(P1 + σε²)] σε² + ση².                                     (2.57)

Letting P1 → ∞, we obtain a2 = y1 and P2 = σε² + ση²; we can then proceed normally with the Kalman filter (2.15) for t = 2, …, n. This process is called diffuse initialisation of the Kalman filter and the resulting filter is called the diffuse Kalman filter. We note the interesting fact that the same values of at and Pt for t = 2, …, n can be obtained by treating y1 as fixed and taking α1 ∼ N(y1, σε²). Specifically, we have a1|1 = y1 and P1|1 = σε². It follows from (2.18) for t = 1 that a2 = y1 and P2 = σε² + ση². This is intuitively reasonable in the absence of information about the marginal distribution of α1 since (y1 − α1) ∼ N(0, σε²).
We also need to take account of the diffuse distribution of the initial state
α1 in the smoothing recursions. It is shown above that the filtering equations
for t = 2, . . . , n are not affected by letting P1 → ∞. Therefore, the state and
disturbance smoothing equations are also not affected for t = n, . . . , 2 since these
only depend on the Kalman filter output. From (2.37), the smoothed mean of
the state α1 is given by

    α̂1 = a1 + P1 [ v1/(P1 + σε²) + (1 − P1/(P1 + σε²)) r1 ]
       = a1 + [P1/(P1 + σε²)] v1 + [P1/(P1 + σε²)] σε² r1.
Letting P1 → ∞, we obtain α̂1 = a1 + v1 + σε² r1, and by substituting for v1 we have

    α̂1 = y1 + σε² r1.
The smoothed conditional variance of the state α1 given Yn is, from (2.43),

    V1 = P1 − P1² [ 1/(P1 + σε²) + (1 − P1/(P1 + σε²))² N1 ]
       = P1 [1 − P1/(P1 + σε²)] − σε⁴ [P1/(P1 + σε²)]² N1
       = [P1/(P1 + σε²)] σε² − σε⁴ [P1/(P1 + σε²)]² N1.

Letting P1 → ∞, we obtain V1 = σε² − σε⁴ N1.


The smoothed means of the disturbances for t = 1 are given by

    ε̂1 = σε² u1,    with  u1 = v1/(P1 + σε²) − [P1/(P1 + σε²)] r1,

and η̂1 = ση² r1. Letting P1 → ∞, we obtain ε̂1 = −σε² r1. Note that r1 depends on the Kalman filter output for t = 2, …, n. The smoothed variances of the disturbances for t = 1 depend on D1 and N1, of which only D1 is affected by P1 → ∞; using (2.47),

    D1 = 1/(P1 + σε²) + [P1/(P1 + σε²)]² N1.

Letting P1 → ∞, we obtain D1 = N1 and therefore Var(ε̂1) = σε⁴ N1. The variance of the smoothed estimate of η1 remains unaltered as Var(η̂1) = ση⁴ N1.
The initial smoothed state α̂1 under diffuse conditions can also be obtained by assuming that y1 is fixed and α1 = y1 − ε1 where ε1 ∼ N(0, σε²). For example, for the smoothed mean of the state at t = 1, we now have only n − 1 varying yt's, so that

    α̂1 = a1 + Σ_{j=2}^{n} [Cov(α1, vj)/Fj] vj,

with a1 = y1. It follows from (2.56) that a2 = a1 = y1. Further, v2 = y2 − a2 = α2 + ε2 − y1 = α1 + η1 + ε2 − y1 = −ε1 + η1 + ε2. Consequently, Cov(α1, v2) = Cov(−ε1, −ε1 + η1 + ε2) = σε². We therefore have from (2.32),

    α̂1 = a1 + (σε²/F2) v2 + [(1 − K2)σε²/F3] v3 + [(1 − K2)(1 − K3)σε²/F4] v4 + · · ·
       = y1 + σε² r1,
as before with r1 as defined in (2.34) for t = 1. The equations for the remaining
α̂t ’s are the same as previously. The same results may be obtained by Bayesian
arguments.
Use of a diffuse prior for initialisation is the approach preferred by most time
series analysts in the situation where nothing is known about the initial value
α1 . However, some workers find the diffuse approach uncongenial because they
regard the assumption of an infinite variance as unnatural since all observed
time series have finite values. From this point of view an alternative approach
is to assume that α1 is an unknown constant to be estimated from the data by
maximum likelihood. The simplest form of this idea is to estimate α1 by maxi-
mum likelihood from the first observation y1 . Denote this maximum likelihood
estimate by α̂1 and its variance by Var(α̂1 ). We then initialise the Kalman filter
by taking a1|1 = α̂1 and P1|1 = Var(α̂1 ). Since when α1 is fixed y1 ∼ N(α1 , σε2 ),
we have α̂1 = y1 and Var(α̂1 ) = σε2 . We therefore initialise the filter by taking
a1|1 = y1 and P1|1 = σε2 . But these are the same values as we obtain by assum-
ing that α1 is diffuse. It follows that we obtain the same initialisation of the
Kalman filter by representing α1 as a random variable with infinite variance as
by assuming that it is fixed and unknown and estimating it from y1 . We shall
show that a similar result holds for the general linear Gaussian state space model
in Subsection 5.7.3.
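The equivalence just described is easy to verify numerically: one Kalman update started from a very large P1 reproduces a2 = y1 and P2 = σε² + ση² to within rounding. A minimal sketch, with the Nile variance estimates and an arbitrary first observation:

```python
import numpy as np

def first_update(y1, a1, P1, s2_eps, s2_eta):
    """One step of the Kalman filter (2.15) from the prior (a1, P1)."""
    F1 = P1 + s2_eps
    K1 = P1 / F1
    a2 = a1 + K1 * (y1 - a1)
    P2 = P1 * (1 - K1) + s2_eta
    return a2, P2

s2_eps, s2_eta, y1 = 15099.0, 1469.1, 1120.0
# as P1 grows, a2 -> y1 and P2 -> sigma_eps^2 + sigma_eta^2
a2, P2 = first_update(y1, a1=0.0, P1=1e12, s2_eps=s2_eps, s2_eta=s2_eta)
```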

2.10 Parameter estimation


We now consider the fitting of the local level model to data from the standpoint of
classical inference. In effect, this amounts to deriving formulae on the assumption
that the additional parameters are known and then replacing these by their
maximum likelihood estimates. Bayesian treatments will be considered for the
general linear Gaussian model in Chapter 13. Parameters in state space models
are often called hyperparameters, possibly to distinguish them from elements of
state vectors which can plausibly be thought of as random parameters; however,
in this book we shall just call them additional parameters, since with the usual
meaning of the word parameter this is what they are. We will discuss methods
for calculating the loglikelihood function and the maximisation of it with respect
to the additional parameters, σε2 and ση2 .

2.10.1 Loglikelihood evaluation


Since

    p(y1, …, yt) = p(Yt−1) p(yt | Yt−1),

for t = 2, …, n, the joint density of y1, …, yn can be expressed as

    p(Yn) = ∏_{t=1}^{n} p(yt | Yt−1),
where p(y1 | Y0) = p(y1). Now p(yt | Yt−1) = N(at, Ft) and vt = yt − at, so on taking logs and assuming that a1 and P1 are known, the loglikelihood is given by

    log L = log p(Yn) = −(n/2) log 2π − (1/2) Σ_{t=1}^{n} [log Ft + vt²/Ft].    (2.58)

The exact loglikelihood can therefore be constructed easily from the Kalman
filter (2.15).
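The prediction error decomposition (2.58) is a one-pass by-product of the filter; a minimal sketch, with arbitrary illustrative parameter values:

```python
import numpy as np

def loglik_local_level(y, a1, P1, s2_eps, s2_eta):
    """Loglikelihood (2.58) accumulated from the Kalman filter output."""
    n = len(y)
    a, P = a1, P1
    loglik = -0.5 * n * np.log(2 * np.pi)
    for yt in y:
        v = yt - a                           # prediction error v_t
        F = P + s2_eps                       # prediction variance F_t
        loglik -= 0.5 * (np.log(F) + v * v / F)
        K = P / F
        a, P = a + K * v, P * (1 - K) + s2_eta
    return loglik

rng = np.random.default_rng(2)
y = 10 + rng.normal(0, 1, 5)
ll = loglik_local_level(y, a1=0.0, P1=100.0, s2_eps=2.0, s2_eta=0.5)
```

A useful check is that this agrees with the direct multivariate normal density of Yn built from the representation (2.4).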
Alternatively, let us derive the loglikelihood for the local level model from the representation (2.4). This gives

    log L = −(n/2) log 2π − (1/2) log|Ω| − (1/2)(Yn − a1 1)′ Ω⁻¹ (Yn − a1 1),    (2.59)

which follows from the multivariate normal distribution Yn ∼ N(a1 1, Ω). Using results from §2.3.1, v = C(Yn − a1 1) with |C| = 1, so that Ω = C⁻¹F(C′)⁻¹ and Ω⁻¹ = C′F⁻¹C; it follows that

    log|Ω| = log|C⁻¹||F||(C′)⁻¹| = log|F|,

and

    (Yn − a1 1)′ Ω⁻¹ (Yn − a1 1) = v′F⁻¹v.

Substitution and use of the results log|F| = Σ_{t=1}^{n} log Ft and v′F⁻¹v = Σ_{t=1}^{n} vt²/Ft lead directly to (2.58).
The loglikelihood in the diffuse case is derived as follows. All terms in (2.58) remain finite as P1 → ∞ with Yn fixed except the term for t = 1. It thus seems reasonable to remove the influence of P1 as P1 → ∞ by defining the diffuse loglikelihood as

    log Ld = lim_{P1→∞} [ log L + (1/2) log P1 ]
           = −(1/2) lim_{P1→∞} [ log(F1/P1) + v1²/F1 ] − (n/2) log 2π
             − (1/2) Σ_{t=2}^{n} [log Ft + vt²/Ft]
           = −(n/2) log 2π − (1/2) Σ_{t=2}^{n} [log Ft + vt²/Ft],       (2.60)

since F1/P1 → 1 and v1²/F1 → 0 as P1 → ∞. Note that vt and Ft remain finite as P1 → ∞ for t = 2, …, n.
Since P1 does not depend on σε2 and ση2 , the values of σε2 and ση2 that maximise
log L are identical to the values that maximise log L + 12 log P1 . As P1 → ∞,
these latter values converge to the values that maximise log Ld because first and
second derivatives with respect to σε2 and ση2 converge, and second derivatives are
finite and strictly negative. It follows that the maximum likelihood estimators
of σε2 and ση2 obtained by maximising (2.58) converge to the values obtained by
maximising (2.60) as P1 → ∞.
We estimate the unknown parameters σε2 and ση2 by maximising expres-
sion (2.58) or (2.60) numerically according to whether a1 and P1 are known
or unknown. In practice it is more convenient to maximise numerically with
respect to the quantities ψε = log σε2 and ψη = log ση2 . An efficient algorithm for
numerical maximisation is implemented in the STAMP 8.3 package of Koopman,
Harvey, Doornik and Shephard (2010). This optimisation procedure is based on
the quasi-Newton scheme BFGS for which details are given in Subsection 7.3.2.

2.10.2 Concentration of loglikelihood


It can be advantageous to re-parameterise the model prior to maximisation in order to reduce the dimensionality of the numerical search for the estimation of the parameters. For example, for the local level model we can put q = ση²/σε² to obtain the model

    yt = αt + εt,            εt ∼ N(0, σε²),
    αt+1 = αt + ηt,          ηt ∼ N(0, qσε²),

and estimate the pair σε², q in preference to σε², ση². Put Pt* = Pt/σε² and Ft* = Ft/σε²; from (2.15) and Section 2.9, we have

    vt = yt − at,            Ft* = Pt* + 1,
    at+1 = at + Kt vt,       P*t+1 = Pt*(1 − Kt) + q,

where Kt = Pt/Ft = Pt*/Ft* for t = 2, …, n, and these relations are initialised with a2 = y1 and P2* = 1 + q. Note that Ft* depends on q but not on σε². The loglikelihood (2.60) then becomes

    log Ld = −(n/2) log 2π − [(n − 1)/2] log σε² − (1/2) Σ_{t=2}^{n} [log Ft* + vt²/(σε² Ft*)].    (2.61)

By maximising (2.61) with respect to σε², for given F2*, …, Fn*, we obtain

    σ̂ε² = [1/(n − 1)] Σ_{t=2}^{n} vt²/Ft*.                              (2.62)

The value of log Ld obtained by substituting σ̂ε² for σε² in (2.61) is called the concentrated diffuse loglikelihood and is denoted by log Ldc, giving

    log Ldc = −(n/2) log 2π − (n − 1)/2 − [(n − 1)/2] log σ̂ε² − (1/2) Σ_{t=2}^{n} log Ft*.    (2.63)

This is maximised with respect to q by a one-dimensional numerical search.
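The concentrated search can be sketched as follows. A crude grid search over ψ = log q stands in for the quasi-Newton scheme used in the text, and the data are simulated, so the estimates are illustrative only:

```python
import numpy as np

def concentrated_loglik(psi, y):
    """Concentrated diffuse loglikelihood (2.63) as a function of psi = log q.

    Returns (log L_dc, the estimate of sigma_eps^2 from (2.62))."""
    q = np.exp(psi)
    n = len(y)
    a, Pstar = y[0], 1.0 + q              # diffuse start: a2 = y1, P2* = 1 + q
    ssq, logF = 0.0, 0.0
    for t in range(1, n):
        Fstar = Pstar + 1.0
        v = y[t] - a
        ssq += v * v / Fstar
        logF += np.log(Fstar)
        K = Pstar / Fstar
        a, Pstar = a + K * v, Pstar * (1 - K) + q
    s2_eps_hat = ssq / (n - 1)            # (2.62)
    llc = (-0.5 * n * np.log(2 * np.pi) - 0.5 * (n - 1)
           - 0.5 * (n - 1) * np.log(s2_eps_hat) - 0.5 * logF)
    return llc, s2_eps_hat

rng = np.random.default_rng(3)
level = np.cumsum(rng.normal(0, 1.0, 200))
y = level + rng.normal(0, 3.0, 200)       # true q = 1/9
psis = np.linspace(-6.0, 2.0, 161)
vals = [concentrated_loglik(p, y)[0] for p in psis]
psi_hat = psis[int(np.argmax(vals))]
q_hat = np.exp(psi_hat)
```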


Table 2.1 Estimation of parameters of local level model by maximum likelihood.

    Iteration      q         ψ        Score     Loglikelihood
    0              1         0        −3.32     −495.68
    1              0.0360    −3.32     0.93     −492.53
    2              0.0745    −2.60     0.25     −492.10
    3              0.0974    −2.32    −0.001    −492.07
    4              0.0973    −2.33     0.0      −492.07

2.10.3 Illustration
The estimates of the variances σε² and ση² = qσε² for the Nile data are obtained by maximising the concentrated diffuse loglikelihood (2.63) with respect to ψ, where q = exp(ψ). In Table 2.1 the iterations of the BFGS procedure are reported, starting with ψ = 0. The relative percentage change of the loglikelihood decreases very rapidly and convergence is achieved after 4 iterations. The final estimate of ψ is −2.33 and hence the estimate of q is q̂ = 0.097. The estimate of σε² given by (2.62) is 15,099, which implies that the estimate of ση² is σ̂η² = q̂σ̂ε² = 0.097 × 15,099 = 1,469.1.

2.11 Steady state


We now consider whether the Kalman filter (2.15) converges to a steady state as n → ∞. This will be the case if Pt converges to a positive value, P̄ say. Obviously, we would then have Ft → P̄ + σε² and Kt → P̄/(P̄ + σε²). To check whether there is a steady state, put Pt+1 = Pt = P̄ in (2.15) and verify whether the resulting equation in P̄ has a positive solution. The equation is

    P̄ = P̄ [1 − P̄/(P̄ + σε²)] + ση²,

which reduces to the quadratic

    x² − xq − q = 0,                                                    (2.64)

where x = P̄/σε² and q = ση²/σε², with the solution

    x = [q + √(q² + 4q)] / 2.

This is positive when q > 0, which holds for nontrivial models. The other solution to (2.64) is inapplicable since it is negative for q > 0. Thus all nontrivial local level models have a steady state solution.
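The steady-state solution is immediate to compute; a sketch using the Nile estimate of q from Subsection 2.10.3, which also checks that P̄ is a fixed point of the variance recursion in (2.15):

```python
import numpy as np

def steady_state_x(q):
    """Positive root of x^2 - x q - q = 0 in (2.64)."""
    return (q + np.sqrt(q * q + 4.0 * q)) / 2.0

s2_eps = 15099.0
q = 1469.1 / s2_eps                     # Nile estimate of the signal-noise ratio
Pbar = steady_state_x(q) * s2_eps       # steady-state value of P_t
# one further update of the variance recursion should leave Pbar unchanged
P_next = Pbar * (1.0 - Pbar / (Pbar + s2_eps)) + q * s2_eps
```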
3 Linear state space models

3.1 Introduction
The general linear Gaussian state space model can be written in a variety of
ways; we shall use the form

yt = Zt αt + εt , εt ∼ N(0, Ht ),
(3.1)
αt+1 = Tt αt + Rt ηt , ηt ∼ N(0, Qt ), t = 1, . . . , n,

where yt is a p × 1 vector of observations called the observation vector and αt is an unobserved m × 1 vector called the state vector. The idea underlying the
model is that the development of the system over time is determined by αt
according to the second equation of (3.1), but because αt cannot be observed
directly we must base the analysis on observations yt . The first equation of (3.1)
is called the observation equation and the second is called the state equation. The
matrices Zt , Tt , Rt , Ht and Qt are initially assumed to be known and the error
terms εt and ηt are assumed to be serially independent and independent of each
other at all time points. Matrices Zt and Tt−1 can be permitted to depend on
y1 , . . . , yt−1 . The initial state vector α1 is assumed to be N(a1 , P1 ) independently
of ε1 , . . . , εn and η1 , . . . , ηn , where a1 and P1 are first assumed known; we will
consider in Chapter 5 how to proceed in the absence of knowledge of a1 and
P1 . In practice, some or all of the matrices Zt , Ht , Tt , Rt and Qt will depend
on elements of an unknown parameter vector ψ, the estimation of which will be
considered in Chapter 7. The same model is used for a classical and a Bayesian
analysis. The general linear state space model is the same as (3.1) except that
the error densities are written as εt ∼ (0, Ht ) and ηt ∼ (0, Qt ), that is, the
normality assumption is dropped.
The first equation of (3.1) has the structure of a linear regression model
where the coefficient vector αt varies over time. The second equation represents a
first order vector autoregressive model, the Markovian nature of which accounts
for many of the elegant properties of the state space model. The local level
model (2.3) considered in the last chapter is a simple special case of (3.1). In
many applications Rt is the identity. In others, one could define η*t = Rt ηt and Q*t = Rt Qt R′t and proceed without explicit inclusion of Rt, thus making the model look simpler. However, if Rt is m × r with r < m and Qt is nonsingular, there is an obvious advantage in working with the nonsingular ηt rather than the singular η*t. We assume that Rt is a subset of the columns of Im; in this case Rt is called a selection matrix since it selects the rows of the state equation that have nonzero disturbance terms; however, much of the theory remains valid if Rt is a general m × r matrix.
Model (3.1) provides a powerful tool for the analysis of a wide range of
problems. In this chapter we shall give substance to the general theory to be
presented in Chapter 4 by describing a number of important applications of the
model to problems in time series analysis and in spline smoothing analysis.
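Model (3.1) is straightforward to simulate, which is often a useful first step when checking an implementation. The sketch below assumes time-invariant system matrices, a common special case rather than the general formulation:

```python
import numpy as np

def simulate_lgssm(n, Z, T, R, H, Q, a1, P1, seed=0):
    """Simulate y_1, ..., y_n from the linear Gaussian state space model (3.1)
    with time-invariant system matrices."""
    rng = np.random.default_rng(seed)
    p = Z.shape[0]
    r = Q.shape[0]
    alpha = rng.multivariate_normal(a1, P1)          # alpha_1 ~ N(a1, P1)
    y = np.empty((n, p))
    for t in range(n):
        eps = rng.multivariate_normal(np.zeros(p), H)
        y[t] = Z @ alpha + eps                       # observation equation
        eta = rng.multivariate_normal(np.zeros(r), Q)
        alpha = T @ alpha + R @ eta                  # state equation
    return y

# the local level model (2.3) as the simplest special case: all matrices 1x1
y = simulate_lgssm(50, Z=np.eye(1), T=np.eye(1), R=np.eye(1),
                   H=np.array([[15099.0]]), Q=np.array([[1469.1]]),
                   a1=np.zeros(1), P1=np.array([[1.0]]))
```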

3.2 Univariate structural time series models


A structural time series model is one in which the trend, seasonal and error
terms in the basic model (2.1), plus other relevant components, are modelled
explicitly. In this section we shall consider structural models for the case where
yt is univariate; we shall extend this to the case where yt is multivariate in
Section 3.3. A detailed discussion of structural time series models, together with
further references, has been given by Harvey (1989).

3.2.1 Trend component


The local level model considered in Chapter 2 is a simple form of a structural
time series model. By adding a slope term νt , which is generated by a random
walk, we obtain the model
 
yt = µt + εt , εt ∼ N 0, σε2 ,
 
µt+1 = µt + νt + ξt , ξt ∼ N 0, σξ2 , (3.2)
 
νt+1 = νt + ζt , ζt ∼ N 0, σζ2 .

This is called the local linear trend model. If ξt = ζt = 0 then νt+1 = νt = ν,


say, and µt+1 = µt + ν so the trend is exactly linear and (3.2) reduces to the
deterministic linear trend plus noise model. The form (3.2) with σξ2 > 0 and
σζ2 > 0 allows the trend level and slope to vary over time.
Applied workers sometimes complain that the series of values of µt obtained
by fitting this model does not look smooth enough to represent their idea of
what a trend should look like. This objection can be met by setting σξ2 = 0
at the outset and fitting the model under this restriction. Essentially the same
effect can be obtained by using in place of the second and third equation of (3.2)
the model ∆2 µt+1 = ζt , i.e. µt+1 = 2µt − µt−1 + ζt where ∆ is the first difference
operator defined by ∆xt = xt − xt−1 . This and its extension ∆r µt = ζt for r > 2
have been advocated for modelling trend in state space models in a series of
papers by Young and his collaborators under the name integrated random walk
models; see, for example, Young, Lane, Ng and Palmer (1991). We see that (3.2)
can be written in the form

    yt = (1  0) (µt  νt)′ + εt,

    ( µt+1 )   ( 1  1 ) ( µt )   ( ξt )
    ( νt+1 ) = ( 0  1 ) ( νt ) + ( ζt ),

which is a special case of (3.1).
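The system matrices of this special case can be written down directly. The sketch below also confirms that with ξt = ζt = 0 the state recursion generates an exactly linear trend; the starting values are arbitrary:

```python
import numpy as np

# system matrices of the local linear trend model (3.2) in the form (3.1)
Z = np.array([[1.0, 0.0]])       # y_t = mu_t + eps_t
T = np.array([[1.0, 1.0],        # mu_{t+1} = mu_t + nu_t + xi_t
              [0.0, 1.0]])       # nu_{t+1} = nu_t + zeta_t

# with xi_t = zeta_t = 0 the recursion gives mu_t = mu_1 + (t - 1) nu
state = np.array([5.0, 0.5])     # mu_1 = 5, slope nu = 0.5
mu = []
for t in range(10):
    mu.append((Z @ state)[0])
    state = T @ state
```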

3.2.2 Seasonal component


To model the seasonal term γt in (2.1), suppose there are s ‘months’ per ‘year’. Thus for monthly data s = 12, for quarterly data s = 4 and for daily data, when modelling the weekly pattern, s = 7. If the seasonal pattern is constant over time, the seasonal values for months 1 to s can be modelled by the constants γ*1, …, γ*s where Σ_{j=1}^{s} γ*j = 0. For the jth ‘month’ in ‘year’ i we have γt = γ*j where t = s(i − 1) + j for i = 1, 2, … and j = 1, …, s. It follows that Σ_{j=0}^{s−1} γt+1−j = 0, so γt+1 = −Σ_{j=1}^{s−1} γt+1−j with t = s − 1, s, … . In practice we often wish to allow the seasonal pattern to change over time. A simple way to achieve this is to add an error term ωt to this relation, giving the model

    γt+1 = −Σ_{j=1}^{s−1} γt+1−j + ωt,        ωt ∼ N(0, σω²),           (3.3)

for t = 1, …, n, where initialisation at t = 1, …, s − 1 will be taken care of later by our general treatment of the initialisation question in Chapter 5. An
alternative suggested by Harrison and Stevens (1976) is to denote the effect of
season j at time t by γjt and then let γjt be generated by the quasi-random walk

γj,t+1 = γjt + ωjt , t = (i − 1)s + j, i = 1, 2, . . . , j = 1, . . . , s, (3.4)

with an adjustment to ensure that each successive set of s seasonal components sums to zero; see Harvey (1989, §2.3.4) for details of the adjustment.
It is often preferable to express the seasonal in a trigonometric form, one version of which, for a constant seasonal, is

    γt = Σ_{j=1}^{[s/2]} (γ̃j cos λj t + γ̃*j sin λj t),    λj = 2πj/s,   j = 1, …, [s/2],   (3.5)

where [a] is the largest integer ≤ a and where the quantities γ̃j and γ̃*j are given constants. For a time-varying seasonal this can be made stochastic by replacing γ̃j and γ̃*j by the random walks

    γ̃j,t+1 = γ̃jt + ω̃jt,    γ̃*j,t+1 = γ̃*jt + ω̃*jt,    j = 1, …, [s/2],   t = 1, …, n,   (3.6)
where ω̃jt and ω̃*jt are independent N(0, σω²) variables; for details see Young, Lane, Ng and Palmer (1991). An alternative trigonometric form is the quasi-random walk model

    γt = Σ_{j=1}^{[s/2]} γjt,                                           (3.7)

where

    γj,t+1  = γjt cos λj + γ*jt sin λj + ωjt,
    γ*j,t+1 = −γjt sin λj + γ*jt cos λj + ω*jt,    j = 1, …, [s/2],     (3.8)

in which the ωjt and ω*jt terms are independent N(0, σω²) variables. We can show that when the stochastic terms in (3.8) are zero, the values of γt defined by (3.7) are periodic with period s by taking

    γjt  = γ̃j cos λj t + γ̃*j sin λj t,
    γ*jt = −γ̃j sin λj t + γ̃*j cos λj t,

which are easily shown to satisfy the deterministic part of (3.8). The required
result follows since γt defined by (3.5) is periodic with period s. In effect, the
deterministic part of (3.8) provides a recursion for (3.5).
The advantage of (3.7) over (3.6) is that the contributions of the errors ωjt and ω*jt are not amplified in (3.7) by the trigonometric functions cos λj t and sin λj t. We regard (3.3) as the main time domain model and (3.7) as the main frequency domain model for the seasonal component in structural time series analysis. A more detailed discussion of seasonal models is presented in Proietti (2000). In particular, he shows that the seasonal model in trigonometric form, with specific variance restrictions for ωjt and ω*jt, is equivalent to the quasi-random walk seasonal model (3.4).
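The periodicity argument for (3.7)–(3.8) can be checked numerically by iterating the deterministic part of the recursion; the starting values below are arbitrary:

```python
import numpy as np

def trig_seasonal(s, gamma1, gamma1_star, n):
    """Iterate the deterministic part of recursion (3.8) and return the
    gamma_t values from (3.7).

    gamma1 and gamma1_star hold starting values gamma_{j1}, gamma*_{j1}
    for j = 1, ..., [s/2]."""
    lam = 2.0 * np.pi * np.arange(1, s // 2 + 1) / s
    g = np.array(gamma1, dtype=float)
    gs = np.array(gamma1_star, dtype=float)
    out = np.empty(n)
    for t in range(n):
        out[t] = g.sum()                               # (3.7)
        g, gs = (g * np.cos(lam) + gs * np.sin(lam),   # (3.8) with omega = 0
                 -g * np.sin(lam) + gs * np.cos(lam))
    return out

# quarterly pattern (s = 4), arbitrary starting values
gamma = trig_seasonal(4, [1.0, 0.3], [0.5, 0.0], 12)
```

Each step rotates the pair (γjt, γ*jt) by the angle λj, so after s steps every component returns to its starting value and γt repeats with period s.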

3.2.3 Basic structural time series model


Each of the four seasonal models of the previous subsection can be combined
with either of the trend models to give a structural time series model and all
these can be put in the state space form (3.1). For example, for the local linear
trend model (3.2) together with model (3.3) we have the observation equation

y t = µ t + γ t + εt , t = 1, . . . , n. (3.9)

To represent the model in state space form, we take the state vector as

    αt = (µt  νt  γt  γt−1  ⋯  γt−s+2)′,


4 Filtering, smoothing and forecasting

4.1 Introduction
In this chapter and the following three chapters we provide a general treatment
from both classical and Bayesian perspectives of the linear Gaussian state space
model (3.1). The observations yt will be treated as multivariate. For much of
the theory, the development is a straightforward extension to the general case
of the treatment of the simple local level model in Chapter 2. We also consider
linear unbiased estimates in the non-normal case.
In Section 4.2 we present some elementary results in multivariate regres-
sion theory which provide the foundation for our treatment of Kalman filtering
and smoothing later in the chapter. We begin by considering a pair of jointly
distributed random vectors x and y. Assuming that their joint distribution is
normal, we show in Lemma 1 that the conditional distribution of x given y is
normal and we derive its mean vector and variance matrix. We shall show in
Section 4.3 that these results lead directly to the Kalman filter. For workers who
do not wish to assume normality we derive in Lemma 2 the minimum variance
linear unbiased estimate of x given y. For those who prefer a Bayesian approach
we derive in Lemma 3, under the assumption of normality, the posterior den-
sity of x given an observed value of y. Finally in Lemma 4, while retaining the
Bayesian approach, we drop the assumption of normality and derive a quasi-
posterior density of x given y, with a mean vector which is linear in y and which
has minimum variance matrix.
All four lemmas can be regarded as representing in appropriate senses the
regression of x on y. For this reason in all cases the mean vectors and variance
matrices are the same. We shall use these lemmas to derive the Kalman filter and
smoother in Sections 4.3 and 4.4. Because the mean vectors and variance matrices
are the same, we need only use one of the four lemmas to derive the results that
we need; the results so obtained then remain valid under the conditions assumed
under the other three lemmas.
Denote the set of observations y1 , . . . , yt by Yt . In Section 4.3 we will
derive the Kalman filter, which is a recursion for calculating at|t = E(αt |Yt ),
at+1 = E(αt+1 |Yt ), Pt|t = Var(αt |Yt ) and Pt+1 = Var(αt+1 |Yt ) given at and
Pt . The derivation requires only elementary properties of multivariate regres-
sion theory derived in Lemmas 1 to 4. We also investigate some properties of
state estimation errors and one-step ahead forecast errors. In Section 4.4 we use
the output of the Kalman filter and the properties of forecast errors to obtain
recursions for smoothing the series, that is, calculating the conditional mean and
variance matrix of αt , for t = 1, . . . , n, n+1, given all the observations y1 , . . . , yn .
Estimates of the disturbance vectors εt and ηt and their error variance matrices
given all the data are derived in Section 4.5. Covariance matrices of smoothed
estimators are considered in Section 4.7. The weights associated with filtered
and smoothed estimates of functions of the state and disturbance vectors are
discussed in Section 4.8. Section 4.9 describes how to generate random samples
for purposes of simulation from the smoothed densities of the state and distur-
bance vectors given the observations. The problem of missing observations is
considered in Section 4.10 where we show that with the state space approach
the problem is easily dealt with by means of simple modifications of the Kalman
filter and the smoothing recursions. Section 4.11 shows that forecasts of observa-
tions and state can be obtained simply by treating future observations as missing
values; these results are of special significance in view of the importance of fore-
casting in much practical time series work. A comment on varying dimensions
of the observation vector is given in Section 4.12. Finally, in Section 4.13 we
consider a general matrix formulation of the state space model.

4.2 Basic results in multivariate regression theory


In this section we present some basic results in elementary multivariate regression
theory that we shall use for the development of the theory for the linear Gaussian
state space model (3.1) and its non-Gaussian version with εt ∼ (0, Ht ) and
ηt ∼ (0, Qt ). We shall present the results in a general form before embarking
on the state space theory because we shall need to apply them in a variety of
different contexts and it is preferable to prove them once only in general rather
than to produce a series of similar ad hoc proofs tailored to specific situations.
A further point is that this form of presentation exposes the intrinsically simple
nature of the mathematical theory underlying the state space approach to time
series analysis. Readers who are prepared to take for granted the results in
Lemmas 1 to 4 below can skip the proofs and go straight to Section 4.3.
Suppose that x and y are jointly normally distributed random vectors with

$$
\mathrm{E}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix},
\qquad
\mathrm{Var}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{xy}' & \Sigma_{yy} \end{pmatrix},
\qquad (4.1)
$$

where Σyy is assumed to be a nonsingular matrix.


Lemma 1 The conditional distribution of x given y is normal with mean vector

E(x|y) = µx + ΣxyΣyy⁻¹(y − µy),    (4.2)

and variance matrix

Var(x|y) = Σxx − ΣxyΣyy⁻¹Σxy′.    (4.3)

Proof. Let z = x − ΣxyΣyy⁻¹(y − µy). Since the transformation from (x, y) to (y, z)
is linear and (x, y) is normally distributed, the joint distribution of y and z is
normal. We have

E(z) = µx,

Var(z) = E[(z − µx)(z − µx)′] = Σxx − ΣxyΣyy⁻¹Σxy′,    (4.4)

Cov(y, z) = E[y(z − µx)′] = E[y(x − µx)′ − y(y − µy)′Σyy⁻¹Σxy′] = 0.    (4.5)
Using the result that if two vectors are normal and uncorrelated they are inde-
pendent, we infer from (4.5) that z is distributed independently of y. Since
the distribution of z does not depend on y its conditional distribution given
y is the same as its unconditional distribution, that is, it is normal with
mean vector µx and variance matrix (4.4) which is the same as (4.3). Since
z = x − ΣxyΣyy⁻¹(y − µy), it follows that the conditional distribution of x given
y is normal with mean vector (4.2) and variance matrix (4.3).
Formulae (4.2) and (4.3) are well known in regression theory. An early proof
in a state space context is given in Åström (1970, Chapter 7, Theorem 3.2). The
proof given here is based on the treatment given by Rao (1973, §8a.2(v)). A par-
tially similar proof is given by Anderson (2003, Theorem 2.5.1). A quite different
proof in a state space context is given by Anderson and Moore (1979, Example
3.2) which is repeated by Harvey (1989, Appendix to Chapter 3); some details
of this proof are given in Exercise 4.14.1.
We can regard Lemma 1 as representing the regression of x on y in a mul-
tivariate normal distribution. It should be noted that Lemma 1 remains valid
when Σyy is singular if the symbol Σyy⁻¹ is interpreted as a generalised inverse; see
the treatment in Rao (1973). Åström (1970) pointed out that if the distribution
of (x, y) is singular we can always derive a nonsingular distribution by making a
projection on the hyperplanes where the mass is concentrated. The fact that the
conditional variance Var(x|y) given by (4.3) does not depend on y is a property
special to the multivariate normal distribution and does not generally hold for
other distributions.
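Formulae (4.2) and (4.3) are easy to verify numerically. The following sketch (Python with NumPy; the covariance blocks are arbitrary illustrative values, not from the text) checks (4.3) against the standard identity that Var(x|y) equals the inverse of the x-block of the precision matrix Σ⁻¹.

```python
import numpy as np

# Joint covariance of (x, y) partitioned as in (4.1); illustrative values.
Sxx = np.array([[2.0, 0.3], [0.3, 1.5]])
Sxy = np.array([[0.5, 0.2], [0.1, 0.4]])
Syy = np.array([[1.0, 0.2], [0.2, 2.0]])
Sigma = np.block([[Sxx, Sxy], [Sxy.T, Syy]])

# Conditional variance from (4.3).
cond_var = Sxx - Sxy @ np.linalg.inv(Syy) @ Sxy.T

# Equivalent route: Var(x|y) is the inverse of the x-block of the precision matrix.
precision = np.linalg.inv(Sigma)
cond_var_precision = np.linalg.inv(precision[:2, :2])

print(np.allclose(cond_var, cond_var_precision))  # True
```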
We now consider the estimation of x when x is unknown and y is known, as
for example when y is an observed vector. Under the assumptions of Lemma 1
we take as our estimate of x the conditional expectation x̂ = E(x|y), that is,

x̂ = µx + ΣxyΣyy⁻¹(y − µy).    (4.6)

This has estimation error x̂ − x, so x̂ is conditionally unbiased in the sense that
E(x̂ − x|y) = x̂ − E(x|y) = 0. It is also obviously unconditionally unbiased in the
sense that E(x̂ − x) = 0. The unconditional error variance matrix of x̂ is
Var(x̂ − x) = Var[ΣxyΣyy⁻¹(y − µy) − (x − µx)] = Σxx − ΣxyΣyy⁻¹Σxy′.    (4.7)

Expressions (4.6) and (4.7) are, of course, the same as (4.2) and (4.3) respectively.
We now consider the estimation of x given y when the assumption that (x, y)
is normally distributed is dropped. We assume that the other assumptions of
Lemma 1 are retained. Let us restrict our attention to estimates x̄ that are
linear in the elements of y, that is, we shall take

x̄ = β + γy,

where β is a fixed vector and γ is a fixed matrix of appropriate dimensions. The


estimation error is x̄ − x. If E(x̄ − x) = 0, we say that x̄ is a linear unbiased
estimate (LUE) of x given y. If there is a particular value x∗ of x̄ such that

Var(x̄ − x) − Var(x∗ − x),

is non-negative definite for all LUEs x̄ we say that x∗ is a minimum variance


linear unbiased estimate (MVLUE) of x given y. Note that the mean vectors
and variance matrices here are unconditional and not conditional given y as
were considered in Lemma 1. An MVLUE for the non-normal case is given by
the following lemma.
Lemma 2 Whether (x, y) is normally distributed or not, the estimate x̂ defined
by (4.6) is an MVLUE of x given y and its error variance matrix is given by (4.7).
Proof. Since x̄ is an LUE, we have

E(x̄ − x) = E(β + γy − x)
= β + γµy − µx = 0.

It follows that β = µx − γµy and therefore

x̄ = µx + γ(y − µy ). (4.8)

Thus

Var(x̄ − x) = Var[µx + γ(y − µy) − x]
    = Var[γ(y − µy) − (x − µx)]
    = γΣyyγ′ − γΣxy′ − Σxyγ′ + Σxx
    = Var[(γ − ΣxyΣyy⁻¹)y] + Σxx − ΣxyΣyy⁻¹Σxy′.    (4.9)

Let x̂ be the value of x̄ obtained by putting γ = ΣxyΣyy⁻¹ in (4.8). Then
x̂ = µx + ΣxyΣyy⁻¹(y − µy) and from (4.9), it follows that

Var(x̂ − x) = Σxx − ΣxyΣyy⁻¹Σxy′.

We can therefore rewrite (4.9) in the form

Var(x̄ − x) = Var[(γ − ΣxyΣyy⁻¹)y] + Var(x̂ − x),    (4.10)

which holds for all LUEs x̄. Since Var[(γ − ΣxyΣyy⁻¹)y] is non-negative definite,
the lemma is proved.
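The decomposition (4.10) lends itself to a numerical illustration: for any γ other than ΣxyΣyy⁻¹, the extra term Var[(γ − ΣxyΣyy⁻¹)y] makes the error variance matrix larger in the non-negative definite sense. A sketch (Python with NumPy; the covariance blocks are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(7)
Sxx = np.array([[2.0, 0.3], [0.3, 1.5]])
Sxy = np.array([[0.5, 0.2], [0.1, 0.4]])
Syy = np.array([[1.0, 0.2], [0.2, 2.0]])

def err_var(gamma):
    # Var(xbar - x) from (4.9) for the linear estimate xbar = beta + gamma y.
    return gamma @ Syy @ gamma.T - gamma @ Sxy.T - Sxy @ gamma.T + Sxx

gamma_star = Sxy @ np.linalg.inv(Syy)   # the MVLUE choice of gamma
best = err_var(gamma_star)              # equals (4.7)
for _ in range(100):
    gamma = gamma_star + rng.normal(0.0, 1.0, size=gamma_star.shape)
    diff = err_var(gamma) - best        # equals (gamma - gamma*) Syy (gamma - gamma*)'
    assert np.linalg.eigvalsh(diff).min() >= -1e-10   # non-negative definite
print("gamma* dominates every perturbation")
```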

The MVLUE property of the vector estimate x̂ implies that arbitrary linear
functions of elements of x̂ are minimum variance linear unbiased estimates
of the corresponding linear functions of the elements of x. Lemma 2 can be
regarded as an analogue for multivariate distributions of the Gauss–Markov the-
orem for least squares regression of a dependent variable on fixed regressors. For
a treatment of the Gauss–Markov theorem, see, for example, Davidson and Mac-
Kinnon (1993, Chapter 3). Lemma 2 is proved in the special context of Kalman
filtering by Duncan and Horn (1972) and by Anderson and Moore (1979, §3.2).
However, their treatments lack the brevity and generality of Lemma 2 and its
proof.
Lemma 2 is highly significant for workers who prefer not to assume normal-
ity as the basis for the analysis of time series on the grounds that many real
time series have distributions that appear to be far from normal; however, the
MVLUE criterion is regarded as acceptable as a basis for analysis by many of
these workers. We will show later in the book that many important results in
state space analysis such as Kalman filtering and smoothing, missing observation
analysis and forecasting can be obtained by using Lemma 1; Lemma 2 shows that
these results also satisfy the MVLUE criterion. A variant of Lemma 2 is to for-
mulate it in terms of minimum mean square error matrix rather than minimum
variance unbiasedness; this variant is dealt with in Exercise 4.14.4.
Other workers prefer to treat inference problems in state space time series
analysis from a Bayesian point of view instead of from the classical standpoint
for which Lemmas 1 and 2 are appropriate. We therefore consider basic results
in multivariate regression theory that will lead us to a Bayesian treatment of the
linear Gaussian state space model.
Suppose that x is a parameter vector with prior density p(x) and that y is an
observational vector with density p(y) and conditional density p(y|x). Suppose
further that the joint density of x and y is the multivariate normal density p(x, y).
Then the posterior density of x given y is

p(x|y) = p(x, y)/p(y) = p(x)p(y|x)/p(y).    (4.11)

We shall use the same notation as in (4.1) for the first and second moments of
x and y. The equation (4.11) is a form of Bayes Theorem.

Lemma 3 The posterior density of x given y is normal with posterior mean


vector (4.2) and posterior variance matrix (4.3).

Proof. Using (4.11), the proof follows immediately from Lemma 1.

A general Bayesian treatment of state space time series analysis is given by


West and Harrison (1997, §17.2.2) in which an explicit proof of our Lemma 3 is
given. Their results emerge in a different form from our (4.2) and (4.3) which, as
they point out, can be converted to ours by an algebraical identity; see Exercise
4.14.5 for details.
We now drop the assumption of normality and derive a Bayesian-type ana-
logue of Lemma 2. Let us introduce a broader concept of a posterior density
than the traditional one by using the term ‘posterior density’ to refer to any
density of x given y and not solely to the conditional density of x given y. Let
us consider posterior densities whose mean x̄ given y is linear in the elements of
y, that is, we take x̄ = β + γy where β is a fixed vector and γ is a fixed matrix;
we say that x̄ is a linear posterior mean. If there is a particular value x∗ of x̄
such that
Var(x̄ − x) − Var(x∗ − x)

is non-negative definite for all linear posterior means x̄, we say that x∗ is a
minimum variance linear posterior mean estimate (MVLPME) of x given y.

Lemma 4 The linear posterior mean x̂ defined by (4.6) is an MVLPME and its
error variance matrix is given by (4.7).

Proof. Taking expectations with respect to density p(y) we have

E(x̄) = µx = E(β + γy) = β + γµy ,

from which it follows that β = µx − γµy and hence that (4.8) holds. Let x̂ be the
value of x̄ obtained by putting γ = ΣxyΣyy⁻¹ in (4.8). It follows as in the proof of
Lemma 2 that (4.10) applies so the lemma is proved.

The four lemmas in this section have an important common property.


Although each lemma starts from a different criterion, they all finish up with
distributions which have the same mean vector (4.2) and the same variance
matrix (4.3). The significance of this result is that formulae for the Kalman fil-
ter, its associated smoother and related results throughout Part I of the book are
exactly the same whether an individual worker wishes to start from a criterion
of classical inference, minimum variance linear unbiased estimation or Bayesian
inference.

4.3 Filtering
4.3.1 Derivation of the Kalman filter
For convenience we restate the linear Gaussian state space model (3.1) here as

yt = Zt αt + εt , εt ∼ N(0, Ht ),
αt+1 = Tt αt + Rt ηt , ηt ∼ N(0, Qt ), t = 1, . . . , n, (4.12)
α1 ∼ N(a1 , P1 ),

where details are given below (3.1). At various points we shall drop the normality
assumptions in (4.12). Let Yt−1 denote the set of past observations y1 , . . . , yt−1
for t = 2, 3, . . . while Y0 indicates that there is no prior observation before t = 1.
In our treatments below, we will define Yt by the stacked vector (y1′, . . . , yt′)′. Starting
at t = 1 in (4.12) and building up the distributions of αt and yt recursively, it is
easy to show that p(yt |α1 , . . . , αt , Yt−1 ) = p(yt |αt ) and p(αt+1 |α1 , . . . , αt , Yt ) =
p(αt+1 |αt ). In Table 4.1 we give the dimensions of the vectors and matrices of
the state space model.
In this section we derive the Kalman filter for model (4.12) for the case where
the initial state α1 is N(a1 , P1 ) where a1 and P1 are known. We shall base the
derivation on classical inference using Lemma 1. It follows from Lemmas 2 to 4
that the basic results are also valid for minimum variance linear unbiased estima-
tion and for Bayesian-type inference with or without the normality assumption.
Returning to the assumption of normality, our object is to obtain the condi-
tional distributions of αt and αt+1 given Yt for t = 1, . . . , n. Let at|t = E(αt |Yt ),
at+1 = E(αt+1 |Yt ), Pt|t = Var(αt |Yt ) and Pt+1 = Var(αt+1 |Yt ). Since all distri-
butions are normal, it follows from Lemma 1 that conditional distributions of
subsets of variables given other subsets of variables are also normal; the distri-
butions of αt given Yt and αt+1 given Yt are therefore given by N(at|t , Pt|t ) and
N(at+1 , Pt+1 ). We proceed inductively; starting with N(at , Pt ), the distribution
of αt given Yt−1 , we show how to calculate at|t , at+1 , Pt|t and Pt+1 from at and
Pt recursively for t = 1, . . . , n.
Let

vt = yt − E(yt |Yt−1 ) = yt − E(Zt αt + εt |Yt−1 ) = yt − Zt at . (4.13)

Table 4.1 Dimensions of state space


model (4.12).

Vector Matrix

yt p×1 Zt p×m
αt m×1 Tt m×m
εt p×1 Ht p×p
ηt r×1 Rt m×r
Qt r×r
a1 m×1 P1 m×m

Thus vt is the one-step ahead forecast error of yt given Yt−1 . When Yt−1 and
vt are fixed then Yt is fixed and vice versa. Thus E(αt |Yt ) = E(αt |Yt−1 , vt ). But
E(vt |Yt−1 ) = E(yt − Zt at |Yt−1 ) = E(Zt αt + εt − Zt at |Yt−1 ) = 0. Consequently,
E(vt) = 0 and Cov(yj, vt) = E[yj E(vt|Yt−1)′] = 0 for j = 1, . . . , t − 1. Also,

at|t = E(αt |Yt ) = E(αt |Yt−1 , vt ),


at+1 = E(αt+1 |Yt ) = E(αt+1 |Yt−1 , vt ).

Now apply Lemma 1 in Section 4.2 to the conditional joint distribution of αt


and vt given Yt−1 , taking x and y in Lemma 1 as αt and vt here. This gives

at|t = E(αt|Yt−1) + Cov(αt, vt)[Var(vt)]⁻¹vt,    (4.14)

where Cov and Var refer to covariance and variance in the conditional joint
distributions of αt and vt given Yt−1 . Here, E(αt |Yt−1 ) = at by definition of at
and
  
Cov(αt, vt) = E[αt(Ztαt + εt − Ztat)′ |Yt−1]
    = E[αt(αt − at)′Zt′ |Yt−1] = PtZt′,    (4.15)

by definition of Pt . Let

Ft = Var(vt|Yt−1) = Var(Ztαt + εt − Ztat|Yt−1) = ZtPtZt′ + Ht.    (4.16)

Then
at|t = at + PtZt′Ft⁻¹vt.    (4.17)
By (4.3) of Lemma 1 in Section 4.2 we have

Pt|t = Var(αt|Yt) = Var(αt|Yt−1, vt)
    = Var(αt|Yt−1) − Cov(αt, vt)[Var(vt)]⁻¹ Cov(αt, vt)′
    = Pt − PtZt′Ft⁻¹ZtPt.    (4.18)

We assume that Ft is nonsingular; this assumption is normally valid in well-


formulated models, but in any case it is relaxed in Section 6.4. Relations (4.17)
and (4.18) are sometimes called the updating step of the Kalman filter.
We now develop recursions for at+1 and Pt+1 . Since αt+1 = Tt αt + Rt ηt , we
have

at+1 = E(Ttαt + Rtηt|Yt)
    = Tt E(αt|Yt),    (4.19)

Pt+1 = Var(Ttαt + Rtηt|Yt)
    = Tt Var(αt|Yt)Tt′ + RtQtRt′,    (4.20)

for t = 1, . . . , n.
Substituting (4.17) into (4.19) gives

at+1 = Ttat|t = Ttat + Ktvt,    t = 1, . . . , n,    (4.21)

where
Kt = TtPtZt′Ft⁻¹.    (4.22)
The matrix Kt is referred to as the Kalman gain. We observe that at+1 has been
obtained as a linear function of the previous value at and the forecast error vt
of yt given Yt−1 . Substituting from (4.18) and (4.22) in (4.20) gives

Pt+1 = TtPt(Tt − KtZt)′ + RtQtRt′,    t = 1, . . . , n.    (4.23)

Relations (4.21) and (4.23) are sometimes called the prediction step of the
Kalman filter.
The recursions (4.17), (4.21), (4.18) and (4.23) constitute the celebrated
Kalman filter for model (4.12). They enable us to update our knowledge of
the system each time a new observation comes in. It is noteworthy that we
have derived these recursions by simple applications of the standard results of
multivariate normal regression theory contained in Lemma 1. The key advantage
of the recursions is that we do not have to invert a (pt × pt) matrix to fit the
model each time the tth observation comes in for t = 1, . . . , n; we only have to
invert the (p × p) matrix Ft and p is generally much smaller than n; indeed, in
the most important case in practice, p = 1. Although relations (4.17), (4.21),
(4.18) and (4.23) constitute the forms in which the multivariate Kalman filter
recursions are usually presented, we shall show in Section 6.4 that variants of
them in which elements of the observational vector yt are brought in one at a
time, rather than the entire vector yt , are in general computationally superior.
We infer from Lemma 2 that when the observations are not normally dis-
tributed and we restrict attention to estimates which are linear in yt and
unbiased, and also when matrices Zt and Tt do not depend on previous yt ’s,
then under appropriate assumptions the values of at|t and at+1 given by the
filter minimise the variance matrices of the estimates of αt and αt+1 given Yt .
These considerations emphasise the point that although our results are obtained
under the assumption of normality, they have a wider validity in the sense of
minimum variance linear unbiased estimation when the variables involved are
not normally distributed. It follows from the discussion just after the proof of
Lemma 2 that the estimates are also minimum error variance linear estimates.
From the standpoint of Bayesian inference we note that, on the assump-
tion of normality, Lemma 3 implies that the posterior densities of αt and αt+1
given Yt are normal with mean vectors (4.17) and (4.21) and variance matrices
(4.18) and (4.23), respectively. We therefore do not need to provide a sepa-
rate Bayesian derivation of the Kalman filter. If the assumption of normality

is dropped, Lemma 4 demonstrates that the Kalman filter, as we have derived


it, provides quasi-posterior mean vectors and variance matrices with minimum
variance linear unbiased interpretations.

4.3.2 Kalman filter recursion


For convenience we collect together the filtering equations

vt = yt − Ztat,              Ft = ZtPtZt′ + Ht,
at|t = at + PtZt′Ft⁻¹vt,     Pt|t = Pt − PtZt′Ft⁻¹ZtPt,    (4.24)
at+1 = Ttat + Ktvt,          Pt+1 = TtPt(Tt − KtZt)′ + RtQtRt′,

for t = 1, . . . , n, where Kt = TtPtZt′Ft⁻¹, with a1 and P1 as the mean vector and
variance matrix of the initial state vector α1 . The recursion (4.24) is called the
Kalman filter. Once at|t and Pt|t are computed, it suffices to adopt the relations

at+1 = Ttat|t,    Pt+1 = TtPt|tTt′ + RtQtRt′,

for predicting the state vector αt+1 and its variance matrix at time t. In Table 4.2
we give the dimensions of the vectors and matrices of the Kalman filter equations.
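A direct transcription of (4.24) is short. The sketch below (Python with NumPy) assumes time-invariant system matrices for brevity and applies the filter to a simulated local level series; it is illustrative only, not code from the text.

```python
import numpy as np

def kalman_filter(ys, Z, H, T, R, Q, a1, P1):
    """One pass of recursion (4.24); system matrices held fixed over time."""
    a, P = a1.astype(float), P1.astype(float)
    filtered, predicted, innovations = [], [], []
    for y in ys:
        v = y - Z @ a                        # one-step ahead forecast error
        F = Z @ P @ Z.T + H                  # F_t = Z_t P_t Z_t' + H_t
        PZF = P @ Z.T @ np.linalg.inv(F)
        a_filt = a + PZF @ v                 # updating step (4.17)
        P_filt = P - PZF @ Z @ P             # (4.18)
        a = T @ a_filt                       # prediction step: a_{t+1} = T_t a_{t|t}
        P = T @ P_filt @ T.T + R @ Q @ R.T   # P_{t+1} = T_t P_{t|t} T_t' + R_t Q_t R_t'
        filtered.append((a_filt, P_filt))
        predicted.append((a, P))
        innovations.append((v, F))
    return filtered, predicted, innovations

# Local level model: Z = T = R = 1, with simulated data.
rng = np.random.default_rng(0)
level = np.cumsum(rng.normal(0.0, 0.5, 100))
ys = [np.array([mu + rng.normal(0.0, 1.0)]) for mu in level]
one = np.array([[1.0]])
filtered, predicted, innovations = kalman_filter(
    ys, Z=one, H=one, T=one, R=one, Q=np.array([[0.25]]),
    a1=np.zeros(1), P1=np.array([[10.0]]))
```

Note that Pt|t never exceeds the preceding Pt, as (4.18) requires.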

4.3.3 Kalman filter for models with mean adjustments


It is sometimes convenient to include mean adjustments in the state space model
(4.12) giving the form

yt = Zt αt + dt + εt , εt ∼ N (0, Ht ),
αt+1 = Tt αt + ct + Rt ηt , ηt ∼ N (0, Qt ), (4.25)
α1 ∼ N (a1 , P1 ),

where p × 1 vector dt and m × 1 vector ct are known and may change over
time. Indeed, Harvey (1989) employs (4.25) as the basis for the treatment of the
linear Gaussian state space model. While the simpler model (4.12) is adequate
for most purposes, it is worth while presenting the Kalman filter for model (4.25)
explicitly for occasional use.

Table 4.2 Dimensions of Kalman


filter.

Vector Matrix

vt p×1 Ft p×p
Kt m×p
at m×1 Pt m×m
at|t m×1 Pt|t m×m

Defining at = E(αt |Yt−1 ) and Pt = Var(αt |Yt−1 ) as before and assuming that
dt can depend on Yt−1 and ct can depend on Yt , the Kalman filter for (4.25)
takes the form
vt = yt − Ztat − dt,         Ft = ZtPtZt′ + Ht,
at|t = at + PtZt′Ft⁻¹vt,     Pt|t = Pt − PtZt′Ft⁻¹ZtPt,    (4.26)
at+1 = Ttat|t + ct,          Pt+1 = TtPt|tTt′ + RtQtRt′,
for t = 1, . . . , n. The reader can easily verify this result by going through the
argument leading from (4.19) to (4.23) step by step for model (4.25) in place of
model (4.12).

4.3.4 Steady state


When dealing with a time-invariant state space model in which the system matri-
ces Zt , Ht , Tt , Rt , and Qt are constant over time, the Kalman recursion for Pt+1
converges to a constant matrix P̄ which is the solution to the matrix equation
P̄ = TP̄T′ − TP̄Z′F̄⁻¹ZP̄T′ + RQR′,

where F̄ = ZP̄Z′ + H. The solution that is reached after convergence to P̄ is
referred to as the steady state solution of the Kalman filter. Use of the steady
state after convergence leads to considerable computational savings because the
recursive computations for Ft , Kt , Pt|t and Pt+1 are no longer required.
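For the local level model the convergence to P̄ can be seen directly: with Z = T = R = 1 the recursion for Pt+1 is scalar and its fixed point solves P̄² − QP̄ − QH = 0. A minimal sketch (Python; H, Q and the starting value are arbitrary illustrative choices):

```python
import numpy as np

H, Q = 1.0, 0.25
P = 10.0                            # arbitrary starting value P_1
for _ in range(200):
    P = P - P * P / (P + H) + Q     # recursion (4.23) specialised to this model

# Positive root of Pbar^2 - Q*Pbar - Q*H = 0, the steady-state equation.
Pbar = 0.5 * (Q + np.sqrt(Q * Q + 4.0 * Q * H))
print(abs(P - Pbar) < 1e-10)  # True
```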

4.3.5 State estimation errors and forecast errors


Define the state estimation error as
xt = αt − at,    with Var(xt) = Pt,    (4.27)
as for the local level model in Subsection 2.3.2. We now investigate how these
errors are related to each other and to the one-step ahead forecast errors vt =
yt −E(yt |Yt−1 ) = yt −Zt at . Since vt is the part of yt that cannot be predicted from
the past, the vt ’s are sometimes referred to as innovations. It follows immediately
from the Kalman filter relations and the definition of xt that
vt = yt − Ztat
    = Ztαt + εt − Ztat
    = Ztxt + εt,    (4.28)
and
xt+1 = αt+1 − at+1
    = Ttαt + Rtηt − Ttat − Ktvt
    = Ttxt + Rtηt − KtZtxt − Ktεt
    = Ltxt + Rtηt − Ktεt,    (4.29)

where Kt = TtPtZt′Ft⁻¹ and Lt = Tt − KtZt; these recursions are similar to


(2.31) for the local level model. Analogously to the state space relations

yt = Ztαt + εt,    αt+1 = Ttαt + Rtηt,

we obtain the innovation analogue of the state space model, that is,

vt = Ztxt + εt,    xt+1 = Ltxt + Rtηt − Ktεt,    (4.30)

with x1 = α1 − a1 , for t = 1, . . . , n. The recursion for Pt+1 can be derived more


easily than in Subsection 4.3.1 by the steps

Pt+1 = Var(xt+1) = E[(αt+1 − at+1)xt+1′]
    = E(αt+1xt+1′)
    = E[(Ttαt + Rtηt)(Ltxt + Rtηt − Ktεt)′]
    = TtPtLt′ + RtQtRt′,

since Cov(xt , ηt ) = 0. Relations (4.30) will be used for deriving the smoothing
recursions in the next section.
We finally show that the one-step ahead forecast errors are independent of
each other using the same arguments as in Subsection 2.3.1. The joint density
of the observational vectors y1 , . . . , yn is


p(y1, . . . , yn) = p(y1) ∏_{t=2}^n p(yt|Yt−1).

Transforming from yt to vt = yt − Ztat we have

p(v1, . . . , vn) = ∏_{t=1}^n p(vt),

since p(y1 ) = p(v1 ) and the Jacobian of the transformation is unity because each
vt is yt minus a linear function of y1 , . . . , yt−1 for t = 2, . . . , n. Consequently
v1 , . . . , vn are independent of each other, from which it also follows that vt , . . . , vn
are independent of Yt−1 .
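The innovation representation (4.30) can be checked exactly on simulated data: propagating the state estimation error xt by (4.29) alongside the Kalman filter must reproduce vt = Ztxt + εt at every step, up to rounding. A sketch for the local level model (Python; the parameter values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n, H, Q, a1, P1 = 50, 1.0, 0.25, 0.0, 10.0

# Simulate the local level model, keeping the disturbances.
eps = rng.normal(0.0, np.sqrt(H), n)
eta = rng.normal(0.0, np.sqrt(Q), n)
alpha = np.empty(n)
alpha[0] = a1 + np.sqrt(P1) * rng.normal()
for t in range(n - 1):
    alpha[t + 1] = alpha[t] + eta[t]
y = alpha + eps

# Run the filter while propagating x_t = alpha_t - a_t by (4.29);
# identity (4.28), here v_t = x_t + eps_t, must hold at every step.
a, P, x = a1, P1, alpha[0] - a1
for t in range(n):
    v, F = y[t] - a, P + H
    assert abs(v - (x + eps[t])) < 1e-8      # (4.28) with Z_t = 1
    K = P / F                                # Kalman gain; L_t = 1 - K_t
    x = (1.0 - K) * x + eta[t] - K * eps[t]  # (4.29)
    a, P = a + K * v, P * (1.0 - K) + Q
checks_passed = True
```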

4.4 State smoothing


4.4.1 Introduction
We now derive the conditional density of αt given the entire series y1 , . . . , yn
for t = 1, . . . , n. We do so by assuming normality and using Lemma 1, noting
from Lemmas 2 to 4 that the mean vectors and variance matrices we obtain are
7 Maximum likelihood estimation of parameters

7.1 Introduction
So far we have developed methods for estimating parameters which can be placed
in the state vector of model (4.12). In virtually all applications in practical work
the models depend on additional parameters which have to be estimated from
the data; for example, in the local level model (2.3) the variances σε² and ση² are
unknown and need to be estimated. In classical analyses, these additional param-
eters are assumed to be fixed but unknown whereas in Bayesian analyses they
are assumed to be random variables. Because of the differences in assumptions
the treatment of the two cases is not the same. In this chapter we deal with clas-
sical analyses in which the additional parameters are fixed and are estimated by
maximum likelihood. The Bayesian treatment for these parameters is discussed
as part of a general Bayesian discussion of state space models in Chapter 13 of
Part II.
For the linear Gaussian model we shall show that the likelihood can be cal-
culated by a routine application of the Kalman filter, even when the initial state
vector is fully or partially diffuse. We also give the details of the computation
of the likelihood when the univariate treatment of multivariate observations is
adopted as suggested in Section 6.4. We go on to consider how the loglikelihood
can be maximised by means of iterative numerical procedures. An important part
in this process is played by the score vector and we show how this is calculated,
both for the case where the initial state vector has a known distribution and for
the diffuse case. A useful device for maximisation of the loglikelihood in some
cases, particularly in the early stages of maximisation, is the EM algorithm; we
give details of this for the linear Gaussian model. We go on to consider biases
in estimates due to errors in parameter estimation. The chapter ends with a
discussion of some questions of goodness-of-fit and diagnostic checks.

7.2 Likelihood evaluation


7.2.1 Loglikelihood when initial conditions are known
We first assume that the initial state vector has density N(a1 , P1 ) where a1 and
P1 are known. The likelihood is
L(Yn) = p(y1, . . . , yn) = p(y1) ∏_{t=2}^n p(yt|Yt−1),

where Yt = (y1′, . . . , yt′)′. In practice we generally work with the loglikelihood


log L(Yn) = ∑_{t=1}^n log p(yt|Yt−1),    (7.1)

where p(y1 |Y0 ) = p(y1 ). For model (3.1), E(yt |Yt−1 ) = Zt at . Putting vt = yt −
Zt at , Ft = Var(yt |Yt−1 ) and substituting N(Zt at , Ft ) for p(yt |Yt−1 ) in (7.1), we
obtain

log L(Yn) = −(np/2) log 2π − (1/2) ∑_{t=1}^n (log|Ft| + vt′Ft⁻¹vt).    (7.2)

The quantities vt and Ft are calculated routinely by the Kalman filter (4.24)
so log L(Yn ) is easily computed from the Kalman filter output. We assume that
Ft is nonsingular for t = 1, . . . , n. If this condition is not satisfied initially it is
usually possible to redefine the model so that it is satisfied. The representation
(7.2) of the loglikelihood was first given by Schweppe (1965). Harvey (1989, §3.4)
refers to it as the prediction error decomposition.
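The prediction error decomposition (7.2) can be verified against a direct evaluation of the joint Gaussian density of y1, . . . , yn. For the local level model the covariance of the stacked observations is available in closed form, Cov(ys, yt) = P1 + Q(min(s, t) − 1) + H·1{s = t}, so both routes must give the same loglikelihood. A sketch (Python with NumPy; the parameter values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, H, Q, a1, P1 = 25, 1.0, 0.3, 0.0, 2.0

# Simulate a local level series.
alpha = a1 + np.sqrt(P1) * rng.normal()
y = np.empty(n)
for t in range(n):
    y[t] = alpha + np.sqrt(H) * rng.normal()
    alpha += np.sqrt(Q) * rng.normal()

# Loglikelihood via the Kalman filter and (7.2), with p = 1.
a, P, loglik = a1, P1, 0.0
for t in range(n):
    v, F = y[t] - a, P + H
    loglik -= 0.5 * (np.log(2 * np.pi) + np.log(F) + v * v / F)
    K = P / F
    a, P = a + K * v, P * (1.0 - K) + Q

# Direct evaluation of the joint density, using the closed-form covariance.
idx = np.arange(1, n + 1)
Omega = P1 + Q * (np.minimum.outer(idx, idx) - 1) + H * np.eye(n)
sign, logdet = np.linalg.slogdet(Omega)
direct = -0.5 * (n * np.log(2 * np.pi) + logdet
                 + (y - a1) @ np.linalg.solve(Omega, y - a1))
print(np.isclose(loglik, direct))
```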

7.2.2 Diffuse loglikelihood


We now consider the case where some elements of α1 are diffuse. As in Section 5.1,
we assume that α1 = a + Aδ + R0 η0 where a is a known constant vector, δ ∼
N(0, κIq ), η0 ∼ N(0, Q0 ) and A R0 = 0, giving α1 ∼ N(a1 , P1 ) where P1 =
κP∞ + P∗ and κ → ∞. From (5.6) and (5.7),

Ft = κF∞,t + F∗,t + O(κ⁻¹)  with  F∞,t = ZtP∞,tZt′,    (7.3)

where by definition of d, P∞,t = 0 for t = d + 1, . . . , n. The number of diffuse elements


in α1 is q which is the dimensionality of vector δ. Thus the loglikelihood (7.2) will
contain a term −½q log 2πκ, so log L(Yn) will not converge as κ → ∞. Following
de Jong (1991), we therefore define the diffuse loglikelihood as
log Ld(Yn) = lim_{κ→∞} [log L(Yn) + (q/2) log κ]

and we work with log Ld (Yn ) in place of log L(Yn ) for estimation of unknown
parameters in the diffuse case. Similar definitions for the diffuse loglikelihood
function have been adopted by Harvey and Phillips (1979) and Ansley and Kohn
(1986). As in Section 5.2, and for the same reasons, we assume that F∞,t is
