Motion Code for Time Series Analysis
1 Introduction
Noisy time series analysis is a fundamental problem that attracts substantial
effort from the research community. However, unlike images, videos, text, or
tabular data, finding a suitable mathematical concept to represent and study
time series is a complicated task. For instance, consider data consisting of
2 different groups of time series, which have unequal lengths between 80 and 95
data points and receive 2 distinct colors (see Figure 1). Each group corresponds
to the audio data of a particular word's pronunciation, spoken by different
people with distinct accents. To capture the underlying shape of these time
2 Minh, Bajaj
Fig. 1: (a): 2 collections of time series based on pronunciation audio data. Each
collection shows the pronunciation of the word Absorptivity or Anything; (b)
and (c): the most informative timestamps for the audio data of the word
Absorptivity (red) and the word Anything (blue).
series, traditional methods view them as ordered vectors and leverage
distance-based [11], feature-based [18], or shapelet-based [2] techniques. However,
such methods are inadequate, particularly in this example where both groups of
time series are highly mixed and the shapes of many red series resemble the
blue ones (see Figure 1a). As alternative solutions, several frameworks propose
treating time series as sequential data and then utilizing recurrent neural
networks like LSTM-FCN [12] or advanced convolutional networks on time series
like ROCKET [5]. Unfortunately, with limited data size (about 90 data points
in this case), it is particularly challenging for these techniques to learn
higher-order correlations between timestamps and capture the exact underlying
signals of the time series. Other options, such as dictionary-based [26] or
interval-based [6] methods that rely on empirical statistics gathered along the
entire data set, are generally unreliable: the collected statistics can deviate
significantly from the true values due to noise, which limits their ability to
separate noise from the underlying signals. Such separation ability is crucial for
time series data partially corrupted by broken recording devices, distorted
transmissions, or privacy policies. These challenges motivate us to model time
series collections and their underlying signals via abstract stochastic processes.
The stochastic viewpoint alone, however, is not sufficient for data composed of
multiple underlying dynamical models. Many research works [24,7,3,23] use
different types of stochastic processes to model time series. Nonetheless, their
approaches are limited to a single dynamical model and often focus solely on a
single series. Our framework, in contrast, is capable of handling multiple time
series by introducing a variational definition of the most informative
timestamps (see Definition 4). Building upon the stochastic modeling viewpoint, to
handle multiple dynamics at once, we assign a signature vector called a motion
code to each dynamical model and take a joint learning approach to optimize all
motion codes together. Augmented by sparse learning techniques, our approach
prevents overfitting to a particular stochastic process and provides enough
discrimination to yield distinctive features for individual dynamics.
The final model, called Motion Code, is therefore able to learn across many
underlying processes in a holistic and robust manner. Additionally, thanks to
stochastic process modeling, Motion Code provides interpretable dynamics
features and handles missing data and varying-length time series. To summarize,
our contributions include:
1. A learning model called Motion Code that jointly learns across different
collections of noisy time series data. Motion Code explicitly models the
underlying stochastic process and is capable of capturing and separating
noise from the core signals.
2. Motion Code solves time series classification and forecasting simultaneously,
without the need for separate models for each task.
3. An interpretable feature for a collection of noisy time series, called the most
informative timestamps, that utilizes variational inference to learn the
underlying stochastic process from the given data collection.
Time series classification. For time series classification, there is a variety of
classical techniques, including distance-based [11], interval-based [6],
dictionary-based [26], shapelet-based [2], and feature-based [18] methods, as
well as ensemble models [14,22] that combine individual classifiers. In terms of
deep learning [10], there are convolutional neural networks [12,5], modified
versions of residual networks [19], and auto-encoders [9].
Input: Training time series data consist of samples that belong to exactly $L$
($L \in \mathbb{N}$, $L \ge 2$) underlying stochastic processes $\{G_k\}_{k=1}^{L}$.
More specifically, for each $k \in \overline{1, L}$, let $C_k = \{y^{i,k}\}_{i=1}^{B_k}$
be the sample set consisting of $B_k$ time series $y^{i,k}$, all of which are
samples from the $k$-th stochastic process $G_k$. Here each time series
$y^{i,k} = (y_t^{i,k})_{t \in T_{i,k}}$ has the timestamp set $T_{i,k} \subset \mathbb{R}_{+}$.
Each real variable $y_t^{i,k} \in \mathbb{R}$ at time $t \in T_{i,k}$ is called a
data point of $y^{i,k}$. We also associate the time series $y^{i,k}$ with its data
point vector $(y_t^{i,k})_{t \in T_{i,k}} \in \mathbb{R}^{|T_{i,k}|}$. The training
data include $L$ collections of time series $\{C_k\}_{k=1}^{L}$ corresponding to
the processes $\{G_k\}_{k=1}^{L}$.
Tasks and required outputs: The main task is to produce a model $M$ that
jointly learns the processes $\{G_k\}_{k=1}^{L}$ from the given data
$\{C_k\}_{k=1}^{L}$. Moreover, the parameters of $M$ must be learning-transferable
to the following tasks:
1. Classification: At test time, given a new time series $y = \{y_t\}_{t \in T}$ with
timestamps $T$, classify $y$ into the closest group among the $L$ possible groups.
2. Forecasting: Suppose a time series $y$ has the underlying stochastic process
$G_k$ for a particular $k \in \overline{1, L}$. Given future timestamps $T$, predict $\{y_t\}_{t \in T}$.
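The input format above can be sketched concretely. The snippet below is an illustrative assumption, not the paper's code: `make_collection` and the sine/cosine stand-in signals are ours, used only to show $L = 2$ collections of unequal-length, irregularly-timestamped series.

```python
import numpy as np

def make_collection(signal, n_series, length_range, noise_std, rng):
    """Sample n_series noisy, unequal-length series from one underlying signal."""
    series = []
    for _ in range(n_series):
        n = int(rng.integers(length_range[0], length_range[1] + 1))
        t = np.sort(rng.uniform(0.0, 1.0, n))     # timestamp set T_{i,k} in R+
        y = signal(t) + noise_std * rng.standard_normal(n)
        series.append((t, y))                     # (timestamps, data points)
    return series

rng = np.random.default_rng(0)
signals = [np.sin, np.cos]                        # two underlying dynamics (L = 2)
collections = [make_collection(g, 5, (80, 95), 0.3, rng) for g in signals]
lengths = [len(y) for coll in collections for _, y in coll]
print(len(collections), min(lengths) >= 80, max(lengths) <= 95)
```

Each series carries its own timestamp set, which is what later allows uneven lengths and missing data to be handled without padding.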
modeling of this specific stochastic process for Motion Code is expressed through
the three assumptions below, which are later used to derive a practical algorithm
(Algorithm 1) for noisy datasets.
Assumption 2: For $k \in \overline{1, L}$, assume that the data $(y^{i,k})_{T_{i,k}}$
has the Gaussian distribution with mean
$(g_k)_{T_{i,k}} = ((g_k)_t)_{t \in T_{i,k}} \in \mathbb{R}^{|T_{i,k}|}$ and covariance
matrix $\sigma^2 I_{|T_{i,k}|}$. Here, $I_n$ is the $n \times n$ identity matrix, and
$g_k$ is the underlying signal of the stochastic process $G_k$. Moreover, the
constant $\sigma^2 \in \mathbb{R}_{+}$ is the noise variance of the sample data
around the underlying signals.
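Assumption 2 can be checked numerically on a toy example. This is a sketch with a stand-in signal (the sine function and all constants are our illustrative assumptions): data on timestamps $T$ is drawn from $\mathcal{N}(g_T, \sigma^2 I)$, so the empirical mean should recover $g_T$ and the per-point variance should approach $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda t: np.sin(2 * np.pi * t)   # stand-in for the underlying signal g_k
T = np.linspace(0.0, 1.0, 90)         # timestamp set
sigma = 0.3                           # noise standard deviation (variance 0.09)

# Draw many independent noisy series around the signal, per Assumption 2.
samples = g(T) + sigma * rng.standard_normal((5000, T.size))
emp_mean_err = np.abs(samples.mean(axis=0) - g(T)).max()
emp_var = samples.var(axis=0).mean()  # pooled empirical variance
print(emp_mean_err < 0.05, abs(emp_var - sigma ** 2) < 0.01)
```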
reconstructed using only this subset. The visualization of the most informative
timestamps is provided in Figure 2 and further discussed in Section 3.
To concretely define the most informative timestamps, we first introduce the
generalized evidence lower bound function (GELB) in Definition 3. We then
define the most informative timestamps as the maximizers of this GELB
function in Definition 4. Finally, the assumptions from Section 2.2 allow
simplifying the calculation of the most informative timestamps with a
concrete formula shown in Lemma 1.
$$\mathcal{L}(C, G, S^m, \phi) := \frac{1}{B} \sum_{i=1}^{B} \int p(g_{T_i} \mid g_{S^m})\, \phi(g_{S^m}) \log \frac{p(y^i \mid g_{T_i})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{T_i}\, dg_{S^m} \tag{3}$$

Again, the vectors $g_{T_i}$ and $g_{S^m}$ are the signal vectors
$(g(t))_{t \in T_i} \in \mathbb{R}^{|T_i|}$ and $(g(t))_{t \in S^m} \in \mathbb{R}^{|S^m|}$
on the timestamps $T_i$ of $y^i$ and on $S^m$, respectively.
Also define the function $\mathcal{L}^{\max}$ such that
$\mathcal{L}^{\max}(C, G, S^m) := \max_{\phi} \mathcal{L}(C, G, S^m, \phi)$.
Hence, $(S^m)^*$ can be found by maximizing $\mathcal{L}^{\max}$ over all possible $S^m$.
Our Motion Code's training loss function depends heavily on $\mathcal{L}^{\max}$.
Lemma 1 below provides a closed-form formula for $\mathcal{L}^{\max}$ that helps
derive Motion Code's Algorithm 1.
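The quality of a candidate set $S^m$ is governed by how well the inducing timestamps explain the full kernel. A small numerical sketch of the residual trace $\mathrm{Tr}(K_{TT} - Q_{TT})$ that appears in Lemma 1 below (the RBF kernel, lengthscale, and uniform timestamp grids are our illustrative assumptions):

```python
import numpy as np

def rbf(a, b, ls=0.1):
    """Illustrative RBF kernel matrix between two timestamp vectors."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

T = np.linspace(0.0, 1.0, 80)   # timestamps of one series

def residual_trace(S):
    """Tr(K_TT - Q_TT) with Q_TT = K_TS (K_SS)^{-1} K_ST: unexplained signal."""
    K_TT, K_TS, K_SS = rbf(T, T), rbf(T, S), rbf(S, S)
    Q_TT = K_TS @ np.linalg.solve(K_SS + 1e-8 * np.eye(len(S)), K_TS.T)
    return np.trace(K_TT - Q_TT)

few = residual_trace(np.linspace(0.0, 1.0, 3))    # 3 inducing timestamps
many = residual_trace(np.linspace(0.0, 1.0, 12))  # 12 inducing timestamps
print(few > many > -1e-6)   # more (well-placed) timestamps explain more
```

The residual trace is nonnegative and shrinks as the inducing timestamps cover the series better, which is exactly what maximizing the GELB over $S^m$ exploits.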
Lemma 1. Suppose we are given a collection $C$ of $B$ noisy time series
$\{y^i\}_{i=1}^{B}$ sampled from a nice Gaussian process $G = \{g(t)\}_{t \ge 0}$
with the underlying signal $g$. Further assume that the data $(y^i)_{T_i}$ with
timestamps $T_i$ has the Gaussian distribution $\mathcal{N}(g_{T_i}, \sigma^2 I_{|T_i|})$
for a given $\sigma > 0$. These assumptions directly follow from Section 2.2 for a
single dynamical model with noise variance constant $\sigma^2$. Further assume
that we are given a fixed $m$-element timestamp set $S^m$. For $i \in \overline{1, B}$,
recall from the notations that $K_{T_i T_i}$, $K_{S^m T_i}$, and $K_{T_i S^m}$ are
kernel matrices between the timestamps of $y^i$ and $S^m$. Also define the
$|T_i| \times |T_i|$ matrix $Q_{T_i T_i} := K_{T_i S^m} (K_{S^m S^m})^{-1} K_{S^m T_i}$
for $i \in \overline{1, B}$. Hence, across all time series, we have the following data
vector $Y$ and joint matrix $Q_{C,G}$:

$$Y = \begin{pmatrix} y^1 \\ \vdots \\ y^B \end{pmatrix}, \quad Q_{C,G} = \begin{pmatrix} Q_{T_1 T_1} & & \\ & \ddots & \\ & & Q_{T_B T_B} \end{pmatrix} \tag{5}$$

Then the function $\mathcal{L}^{\max}$ defined in Definition 4 has the following closed form:

$$\mathcal{L}^{\max}(C, G, S^m) = \log p_{\mathcal{N}}(Y \mid 0,\, B\sigma^2 I + Q_{C,G}) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) \tag{6}$$

where $\Sigma = \Lambda^{-1}$ with $\Lambda := K_{S^m S^m} + \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} K_{S^m T_i} K_{T_i S^m}$.

Proof. The proof is given in the Supplementary Materials.
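Under the lemma's assumptions, the closed form (6) can be transcribed numerically. The sketch below uses an illustrative RBF kernel and synthetic series; the helper names (`L_max`, `log_gauss`) and all constants are ours, not the paper's implementation.

```python
import numpy as np

def rbf(a, b, ls=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def log_gauss(y, cov):
    """Log density of a zero-mean multivariate Gaussian N(0, cov) at y."""
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (y.size * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

def L_max(series, S, sigma):
    """Closed form (6). series: list of (timestamps T_i, values y_i)."""
    B = len(series)
    K_SS = rbf(S, S) + 1e-6 * np.eye(len(S))      # jitter for stability
    blocks, trace = [], 0.0
    for T_i, _ in series:
        K_TS = rbf(T_i, S)
        Q = K_TS @ np.linalg.solve(K_SS, K_TS.T)  # Q_{T_i T_i}
        blocks.append(Q)
        trace += np.trace(rbf(T_i, T_i) - Q)
    Y = np.concatenate([y for _, y in series])
    Q_CG = np.zeros((Y.size, Y.size))             # block-diagonal Q_{C,G}
    o = 0
    for Q in blocks:
        m = Q.shape[0]
        Q_CG[o:o + m, o:o + m] = Q
        o += m
    cov = B * sigma ** 2 * np.eye(Y.size) + Q_CG
    return log_gauss(Y, cov) - trace / (2.0 * sigma ** 2 * B)

rng = np.random.default_rng(2)
series = []
for _ in range(3):
    T_i = np.sort(rng.uniform(0.0, 1.0, int(rng.integers(20, 25))))
    series.append((T_i, np.sin(2 * np.pi * T_i) + 0.3 * rng.standard_normal(T_i.size)))

S_good = np.linspace(0.0, 1.0, 8)        # spread-out candidate timestamps
S_bad = np.array([0.0, 0.05, 0.1])       # clustered, uninformative candidates
print(L_max(series, S_good, 0.3) > L_max(series, S_bad, 0.3))
```

Spread-out inducing timestamps explain more of the kernel (smaller trace penalty, better fit), so they score higher, which matches the intuition behind the most informative timestamps.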
With the core concept of the most informative timestamps outlined in
Section 2.3, we can now describe the Motion Code learning framework in detail:
$$\widehat{S^{m,k}} := \sigma(G(z_k)) \in \mathbb{R}^m, \text{ where } \sigma \text{ is the standard sigmoid function.} \tag{8}$$

Training loss function: The goal is to make $\widehat{S^{m,k}}$ approximate the
true $S^{m,k}$, which is the maximizer of $\mathcal{L}^{\max}$ with an explicit
formula given in Equation (6). As a result, we want to maximize
$\mathcal{L}^{\max}(C_k, G_k, \widehat{S^{m,k}})$ for all $k$, leading to the
training loss. Here the last term is the regularization term for the motion codes
$z_k$ with hyperparameter $\lambda$. An explicit training algorithm is given below
(see Algorithm 1).
Input: $L$ collections of time series data $C_k = \{y^{i,k}\}_{i=1}^{B_k}$, where
the series $y^{i,k}$ has
Kernel choice: For implementation, we use a spectral kernel of the form
$K^{\eta}(t, s) := \sum_{j=1}^{J} \alpha_j \exp(-0.5\, \beta_j |t - s|^2)$ for
parameters $\eta = (\alpha_1, \cdots, \alpha_J, \beta_1, \cdots, \beta_J)$.
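A direct transcription of this kernel as a sketch. Note one assumption on our part: the extracted text leaves it ambiguous whether the mixture weight is $\alpha_j$ or $\alpha_j^2$; we use $\alpha_j$ with nonnegative values, which keeps the resulting kernel matrix positive semidefinite.

```python
import numpy as np

def spectral_kernel(t, s, alphas, betas):
    """K_eta(t, s) = sum_j alpha_j * exp(-0.5 * beta_j * |t - s|^2)."""
    d2 = (np.asarray(t)[:, None] - np.asarray(s)[None, :]) ** 2
    return sum(a * np.exp(-0.5 * b * d2) for a, b in zip(alphas, betas))

t = np.linspace(0.0, 1.0, 5)
# Illustrative parameters: one slow and one fast frequency component.
K = spectral_kernel(t, t, alphas=[1.0, 0.5], betas=[10.0, 200.0])
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-10)  # symmetric, PSD
```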
3 Experiments
Table 1: Classification accuracy (in percentage) for 7 time series classification
algorithms: DTW, TSF, RISE, BOSS, BOSS-E, catch22, and our Motion Code.

Data sets  DTW  TSF  RISE  BOSS  BOSS-E  catch22  Motion Code
Chinatown 54.23 61.22 65.6 47.81 41.69 55.39 66.47
ECGFiveDays 54.47 58.07 59.35 50.06 58.42 52.85 66.55
FreezerSmallTrain 52.42 54.28 53.79 50 50.95 53.58 70.25
GunPointOldVersusYoung 92.7 99.05 98.73 87.62 93.33 98.41 91.11
HouseTwenty 57.98 57.14 42.02 52.1 57.98 45.38 70.59
InsectEPGRegularTrain 100 100 83.13 99.2 91.97 95.98 100
ItalyPowerDemand 57.05 68.71 65.79 52.77 53.26 55.88 72.5
Lightning7 21.92 28.77 26.03 12.33 28.77 24.66 31.51
MoteStrain 56.47 61.1 61.5 53.83 53.51 57.19 72.68
PowerCons 78.33 92.22 85.56 65.56 77.22 80 92.78
SonyAIBORobotSurface2 63.27 67.68 69.78 48.06 61.91 64.43 75.97
Sound 50 87.5 62.5 68.75 62.5 50 87.5
SineCurves 100 100 100 62.71 100 100 100
UWaveGestureLibraryAll 78.25 83.67 79.79 12.23 74.87 47.38 80.18
We acquire 12 sensor and device time series data sets from the publicly available
UCR data sets [1]. We prepare 2 additional data sets named Sound [21] and
SineCurves. Each data set consists of multiple collections of time series, and
each collection has a unique label. For each of the 14 data sets, we add Gaussian
noise with standard deviation σ = 0.3 · A, where A is the maximum possible
absolute value of all data points. The inherent noise in the original data sets is
not adequate for robust testing, since even simpler algorithms like DTW perform
very well (near or over 90% accuracy) on many data sets. The added noise
thus serves as an adversarial factor to stress-test the robustness of time series
algorithms. We then run the Motion Code algorithm on these 14 data sets
to extract optimal parameters for downstream tasks, including classification and
forecasting. All experiments are executed on an Nvidia A100 GPU.
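The noise-injection protocol above can be sketched as follows (synthetic series stand in for the UCR data; the construction of `dataset` is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in data set: 10 series of 80-95 points each.
dataset = [rng.uniform(-2.0, 2.0, int(rng.integers(80, 96))) for _ in range(10)]

# A = maximum absolute value over all data points; noise std = 0.3 * A.
A = max(np.abs(s).max() for s in dataset)
noisy = [s + rng.normal(0.0, 0.3 * A, s.size) for s in dataset]

added = np.concatenate([n - s for n, s in zip(noisy, dataset)])
print(abs(added.std() - 0.3 * A) < 0.1 * A)  # pooled noise std matches 0.3 * A
```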
Table 2: Classification accuracy (in percentage) for 7 time series classification
algorithms: Shapelet, Teaser, SVC, LSTM-FCN, Rocket, Hive-Cote 2, and
our Motion Code. Error means the run on the given data set failed.

Data sets  Shapelet  Teaser  SVC  LSTM-FCN  Rocket  Hive-Cote 2  Motion Code
Chinatown 61.22 Error 56.27 66.47 62.97 61.52 66.47
ECGFiveDays 52.61 Error 49.71 53.54 56.79 55.75 66.55
FreezerSmallTrain 50 50.11 50 50 52.67 58.18 70.25
GunPointOldVersusYoung 74.6 89.52 52.38 52.38 90.48 98.73 91.11
HouseTwenty 57.14 53.78 45.38 57.98 58.82 59.66 70.59
InsectEPGRegularTrain 44.58 100 85.94 100 46.18 100 100
ItalyPowerDemand 62.49 63.17 49.85 61.61 70.36 72.98 72.5
Lightning7 20.55 21.92 26.03 17.81 27.4 32.88 31.51
MoteStrain 47.76 Error 50.64 56.55 68.85 56.95 72.68
PowerCons 74.44 51.11 77.78 68.33 87.22 90 92.78
SonyAIBORobotSurface2 69.36 68.84 61.7 63.27 74.71 78.49 75.97
Sound 68.75 Error 62.5 56.25 75 75 87.5
SineCurves 100 96.61 67.8 100 100 100 100
UWaveGestureLibraryAll 49.5 26.52 Error 12.67 83.45 78.5 80.18
For forecasting, each of the 14 data sets is divided into two parts: 80% of the data
points form the training set, and the remaining 20% future data points are set
aside for testing. For Motion Code, we output a single prediction for all series in
the same collection. We choose 5 algorithms as our baselines: Exponential
smoothing [8], ARIMA [20], State space model [7], TBATS [4], and Last
seen. Last seen is a simple algorithm that uses the previous values to predict the
next time steps. Their implementations are included in the statsmodels library
[28] and the sktime library [17]. We run the 5 baseline algorithms with individual
predictions
Table 3: Average root mean-square error (RMSE) for 6 time series forecasting
algorithms: Exponential smoothing, ARIMA, State-space model, Last seen,
TBATS, and Motion Code on 14 data sets. Values shown are the RMSE between
prediction and ground truth, averaged over the data points.

Data sets  Exp. Smoothing  ARIMA  State space  Last seen  TBATS  Motion Code
Chinatown Error 1079 775.96 723.1 633.04 518.49
ECGFiveDays 0.34 0.43 1.58 0.19 0.17 0.27
FreezerSmallTrain 0.88 0.58 0.93 0.57 0.56 0.74
GunPointOldVersusYoung 60.38 128.44 59.83 41.51 20.94 417.94
HouseTwenty 1117 3386 730.51 497.3 560.88 648.27
InsectEPGRegularTrain 0.043 0.095 0.25 0.019 0.02 0.048
ItalyPowerDemand Error 2.02 2.37 1.24 0.96 0.67
Lightning7 1.7 2.85 1.7 1.35 1.35 1.08
MoteStrain 1.11 1.52 1.09 1.01 0.88 0.82
PowerCons 3.38 4.85 4.41 1.77 1.72 1.15
SonyAIBORobotSurface2 2.79 2.01 3.21 1.39 1.52 2.26
Sound 0.087 0.27 0.086 0.1 0.059 0.085
SineCurves 3.5 4.37 3.53 1.53 1.63 1.07
UWaveGestureLibraryAll 4.37 5 4.45 0.98 1.44 0.98
and provide the comparison results in Table 3. Even without making individual
predictions, Motion Code still surpasses the other methods on 7 of the 14 data sets.
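The Last seen baseline and the 80/20 split described above can be sketched as follows (a synthetic series is used; this is an illustration of the protocol, not the paper's evaluation code):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(t.size)

cut = int(0.8 * len(y))                  # 80% train, 20% held-out future
train, test = y[:cut], y[cut:]
pred = np.full_like(test, train[-1])     # "Last seen": repeat the last value
rmse = np.sqrt(np.mean((pred - test) ** 2))
print(pred.shape == test.shape, rmse > 0.0)
```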
Interpretable feature: Despite several noisy time series deviating from the
common mean, the points at the most informative timestamps $S^{m,k}$ form
a skeleton approximation of the underlying stochastic process. All the important
twists and turns are consistently captured by the corresponding points at these
timestamps (see Figure 2). Those points create a feature that helps visualize the
underlying dynamics with explicit global behaviors, such as increasing,
decreasing, or staying still, unlike the original complex time series collections
with no visible common patterns among series.
Uneven length and missing data handling: For each time series, Motion
Code processes individual data points and their timestamps one by one. Hence,
the time series fed into the algorithm can have different timestamp sets. We
include an experiment to illustrate that even with missing data, Motion Code
still learns the underlying dynamics sufficiently well. In the Sound data, we
randomly remove about 10% of the data points from each series and retrain the
model on the modified data. The time series in the new data set have both
different lengths and out-of-sync timestamp values. The resulting model still
gives the same accuracy and an accurate skeleton approximation (see Figure 1),
showing Motion Code's effectiveness for uneven-length and missing data.
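The missing-data perturbation can be sketched as follows (synthetic series; the drop rate is made to vary slightly per series, which is our illustrative choice to produce both unequal lengths and out-of-sync timestamps):

```python
import numpy as np

rng = np.random.default_rng(5)
T = np.linspace(0.0, 1.0, 90)
series = [np.sin(2 * np.pi * T) + 0.1 * rng.standard_normal(T.size)
          for _ in range(4)]

subsampled = []
for y in series:
    keep_n = int(rng.integers(78, 84))             # ~10% of 90 points removed
    keep = np.sort(rng.choice(T.size, size=keep_n, replace=False))
    subsampled.append((T[keep], y[keep]))          # per-series timestamp set

lengths = [len(t) for t, _ in subsampled]
print(all(78 <= n <= 83 for n in lengths))
```

Because each series keeps its own timestamp array, no padding or alignment step is needed downstream.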
Fig. 3: Two-dimensional features extracted from individual time series in the data
set SineCurves via separate sparse Gaussian process models.
5 Conclusion
In this work, we employ variational inference and sparse stochastic process
modeling to develop an integrated framework called Motion Code. The method
can perform time series forecasting simultaneously with classification across
different collections of time series data, while most other current methods only
focus on one task at a time. Our Motion Code model is particularly robust to
noise and achieves competitive performance against other popular time series
classification and forecasting algorithms. Moreover, as demonstrated in Section 4,
Motion Code provides an interpretable feature that effectively captures the
core information of the underlying stochastic process. Finally, our method can
handle varying-length time series and missing data, while many other methods
fail to do so. In future work, we plan to generalize Motion Code with
non-Gaussian priors adapted to time series from various application domains.
References
1. Bagnall, A., Lines, J., Bostrom, A., Large, J., Keogh, E.: The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov. 31(3), 606–660 (2017)
2. Bostrom, A., Bagnall, A.: Binary shapelet transform for multiclass time series classification. In: Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXII, pp. 24–46. Springer Berlin Heidelberg, Berlin, Heidelberg (2017)
3. Cao, Y., Brubaker, M.A., Fleet, D.J., Hertzmann, A.: Efficient optimization for sparse gaussian process regression. In: Advances in Neural Information Processing Systems, vol. 26. Curran Associates, Inc. (2013)
4. De Livera, A.M., Hyndman, R.J., Snyder, R.D.: Forecasting time series with complex seasonal patterns using exponential smoothing. J. Am. Stat. Assoc. 106(496), 1513–1527 (2011)
5. Dempster, A., Petitjean, F., Webb, G.I.: ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels. Data Min. Knowl. Discov. 34(5), 1454–1495 (2020)
6. Deng, H., Runger, G., Tuv, E., Vladimir, M.: A time series forest for classification and feature extraction. Inf. Sci. 239, 142–153 (2013)
7. Durbin, J.: Time Series Analysis by State Space Methods: Second Edition. Oxford University Press (2012)
8. Holt, C.C.: Forecasting seasonals and trends by exponentially weighted moving averages. Int. J. Forecast. 20(1), 5–10 (2004)
9. Hu, Q., Zhang, R., Zhou, Y.: Transfer learning for short-term wind speed prediction with deep neural networks. Renew. Energy 85, 83–95 (2016)
10. Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A.: Deep learning for time series classification: a review. Data Min. Knowl. Discov. 33(4), 917–963 (2019)
11. Jeong, Y.S., Jeong, M.K., Omitaomu, O.A.: Weighted dynamic time warping for time series classification. Pattern Recognit. 44(9), 2231–2240 (2011)
12. Karim, F., Majumdar, S., Darabi, H., Harford, S.: Multivariate LSTM-FCNs for time series classification. Neural Netw. 116, 237–245 (2019)
13. Lim, B., Zohren, S.: Time-series forecasting with deep learning: a survey. Philos. Trans. A Math. Phys. Eng. Sci. 379(2194), 20200209 (2021)
14. Lines, J., Taylor, S., Bagnall, A.: HIVE-COTE: The hierarchical vote collective of transformation-based ensembles for time series classification. In: 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE (2016)
15. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1-3), 503–528 (1989)
16. Liu, Z., Zhu, Z., Gao, J., Xu, C.: Forecast methods for time series data: A survey. IEEE Access 9, 91896–91912 (2021)
17. Löning, M., Bagnall, A., Ganesh, S., Kazakov, V., Lines, J., Király, F.J.: sktime: A Unified Interface for Machine Learning with Time Series. arXiv e-prints arXiv:1909.07872 (Sep 2019)
18. Lubba, C.H., Sethi, S.S., Knaute, P., Schultz, S.R., Fulcher, B.D., Jones, N.S.: catch22: CAnonical Time-series CHaracteristics: Selected through highly comparative time-series analysis. Data Min. Knowl. Discov. 33(6), 1821–1852 (2019)
19. Ma, Q., Shen, L., Chen, W., Wang, J., Wei, J., Yu, Z.: Functional echo state network for time series classification. Inf. Sci. 373, 1–20 (2016)
20. Malki, Z., Atlam, E.S., Ewis, A., Dagnew, G., Alzighaibi, A.R., ELmarhomy, G., Elhosseini, M.A., Hassanien, A.E., Gad, I.: ARIMA models for predicting the end of COVID-19 pandemic and the risk of second rebound. Neural Comput. Appl. 33(7), 2929–2948 (2021)
21. Media, F.: Forvo: The pronunciation guide. [Link] (2022), accessed: 2022-04-01
22. Middlehurst, M., Large, J., Flynn, M., Lines, J., Bostrom, A., Bagnall, A.: HIVE-COTE 2.0: a new meta ensemble for time series classification. Mach. Learn. 110(11-12), 3211–3243 (2021)
23. Moss, H.B., Ober, S.W., Picheny, V.: Inducing point allocation for sparse gaussian processes in high-throughput bayesian optimisation. In: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 206, pp. 5213–5230. PMLR (2023)
24. Qi, Y., Abdel-Gawad, A.H., Minka, T.P.: Sparse-posterior gaussian processes for general likelihoods. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pp. 450–457 (2010)
25. Rasmussen, C.E., Williams, C.K.: Gaussian processes for machine learning. MIT Press (2006)
26. Schäfer, P.: The BOSS is concerned with time series classification in the presence of noise. Data Min. Knowl. Discov. 29(6), 1505–1530 (2015)
27. Schäfer, P., Leser, U.: TEASER: early and accurate time series classification. Data Min. Knowl. Discov. 34(5), 1336–1362 (2020)
28. Seabold, S., Perktold, J.: Statsmodels: Econometric and statistical modeling with Python. In: Proceedings of the 9th Python in Science Conference, pp. 92–96 (2010)
29. Titsias, M.: Variational learning of inducing variables in sparse gaussian processes. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 5, pp. 567–574. PMLR (2009)
A Proof of Lemma 1
Lemma 1. Suppose we are given a collection $C$ of $B$ noisy time series
$\{y^i\}_{i=1}^{B}$ sampled from a nice Gaussian process $G = \{g(t)\}_{t \ge 0}$
with the underlying signal $g$. Further assume that the data $(y^i)_{T_i}$ with
timestamps $T_i$ has the Gaussian distribution $\mathcal{N}(g_{T_i}, \sigma^2 I_{|T_i|})$
for a given $\sigma > 0$. These assumptions directly follow from Section 2.2 for a
single dynamical model with noise variance constant $\sigma^2$. Further assume
that we are given a fixed $m$-element timestamp set $S^m$. For $i \in \overline{1, B}$,
recall from the notations that $K_{T_i T_i}$, $K_{S^m T_i}$, and $K_{T_i S^m}$ are
kernel matrices between the timestamps of $y^i$ and $S^m$. Also define the
$|T_i| \times |T_i|$ matrix $Q_{T_i T_i} := K_{T_i S^m} (K_{S^m S^m})^{-1} K_{S^m T_i}$
for $i \in \overline{1, B}$. Hence, across all time series, we have the following data
vector $Y$ and joint matrix $Q_{C,G}$:

$$Y = \begin{pmatrix} y^1 \\ \vdots \\ y^B \end{pmatrix}, \quad Q_{C,G} = \begin{pmatrix} Q_{T_1 T_1} & & \\ & \ddots & \\ & & Q_{T_B T_B} \end{pmatrix} \tag{5}$$

Then the function $\mathcal{L}^{\max}$ defined in Definition 4 has the following closed form:

$$\mathcal{L}^{\max}(C, G, S^m) = \log p_{\mathcal{N}}(Y \mid 0,\, B\sigma^2 I + Q_{C,G}) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) \tag{6}$$

where $\Sigma = \Lambda^{-1}$ with $\Lambda := K_{S^m S^m} + \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} K_{S^m T_i} K_{T_i S^m}$.
Proof. Define the conditional mean signal vector $\alpha_i = \mathbb{E}[g_{T_i} \mid g_{S^m}]$.
From Equation (2), $\alpha_i = K_{T_i S^m}(K_{S^m S^m})^{-1} g_{S^m}$. Let
$A := (\alpha_1; \cdots; \alpha_B)$ be the combined mean signal vector. With these
notations, following the derivation from [29], the individual terms in Equation (3)
can be simplified as follows:

$$\int p(g_{T_i} \mid g_{S^m})\,\phi(g_{S^m}) \log \frac{p(y^i \mid g_{T_i})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{T_i}\, dg_{S^m}$$

$$= \int \phi(g_{S^m}) \left[ \int p(g_{T_i} \mid g_{S^m}) \log p(y^i \mid g_{T_i})\, dg_{T_i} + \log \frac{p(g_{S^m})}{\phi(g_{S^m})} \right] dg_{S^m}$$

$$= \int \phi(g_{S^m}) \left[ \log p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I_{|T_i|}) - \frac{1}{2\sigma^2} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) + \log \frac{p(g_{S^m})}{\phi(g_{S^m})} \right] dg_{S^m}$$

$$= \int \phi(g_{S^m}) \log \frac{p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I_{|T_i|})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{S^m} - \frac{1}{2\sigma^2} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$
Using this equation for the individual terms, we upper-bound the function
$\mathcal{L}(C, G, S^m, \phi)$ from Equation (3) as follows:

$$\mathcal{L}(C, G, S^m, \phi) = \frac{1}{B} \sum_{i=1}^{B} \int \phi(g_{S^m}) \log \frac{p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I)\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$

$$= \int \phi(g_{S^m}) \log \left( \left( \prod_{i=1}^{B} p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I) \right)^{1/B} \frac{p(g_{S^m})}{\phi(g_{S^m})} \right) dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$

$$\le \log \int \left( \prod_{i=1}^{B} p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I) \right)^{1/B} p(g_{S^m})\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$

$$= \log \int p_{\mathcal{N}}(Y \mid A, B\sigma^2 I)\, p(g_{S^m})\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$

$$= \log p_{\mathcal{N}}(Y \mid 0,\, B\sigma^2 I + Q_{C,G}) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$
The only inequality in this bound is due to Jensen's inequality. The upper
bound no longer depends on the variational distribution $\phi$ and depends only
on the timestamps in $S^m$. As a result, by the definition of $\mathcal{L}^{\max}$,
we obtain Equation (6). Moreover, equality holds when:

$$\phi^*(g_{S^m}) \propto \prod_{i=1}^{B} p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I)^{1/B}\, p(g_{S^m})$$

$$\propto \exp\left( \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} (y^i)^T K_{T_i S^m} (K_{S^m S^m})^{-1} g_{S^m} - \frac{1}{2} (g_{S^m})^T \left( \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} (K_{S^m S^m})^{-1} K_{S^m T_i} K_{T_i S^m} (K_{S^m S^m})^{-1} + (K_{S^m S^m})^{-1} \right) g_{S^m} \right)$$

⊓⊔
Fig. 5: Forecasting with uncertainty on the PowerCons data set for 5 forecasting
algorithms: ARIMA, State-space model, Last seen, TBATS, and Motion Code.
We train the algorithms on the time horizon [0, 0.8] and test on [0.8, 1].
PowerCons includes two collections of time series: power consumption over the
warm and cold seasons. Exponential Smoothing is not included as it does not
output uncertainty estimates.