Motion Code: Robust Time Series Classification and Forecasting
via Sparse Variational Multi-Stochastic Processes Learning
Chandrajit Bajaj^{1,3} and Minh Nguyen^{2,3}

^1 Department of Computer Science, The University of Texas at Austin
^2 Department of Mathematics, The University of Texas at Austin
^3 Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin

arXiv:2402.14081v2, 24 Apr 2024

Abstract. Despite being extensively studied, time series classification
and forecasting on noisy data remain highly difficult. The main challenges
lie in finding suitable mathematical concepts to describe time series and
in effectively separating noise from the true signals. Instead of treating
a time series as a static vector or a data sequence, as is common in
previous methods, we introduce a novel framework that considers each
time series, not necessarily of fixed length, as a sample realization of
a continuous-time stochastic process. Such a mathematical model explicitly
captures the data dependence across several timestamps and detects
hidden time-dependent signals amid noise. However, since the underlying
data are often composed of several distinct dynamics, modeling with a
single stochastic process is not sufficient. To handle such settings,
we first assign each dynamics a signature vector. We then propose the
abstract concept of the most informative timestamps to infer a sparse
approximation of the individual dynamics based on their assigned vectors.
The final model, referred to as Motion Code, contains parameters
that fully capture the different underlying dynamics in an integrated
manner, allowing unmixing classification and sub-type-specific
forecasting simultaneously. Extensive experiments on noisy sensor and
device time series data demonstrate Motion Code's competitiveness
against time series classification and forecasting benchmarks.

1 Introduction
Noisy time series analysis is a fundamental problem that attracts substantial
effort from the research community. However, unlike images, videos, text, or
tabular data, finding a suitable mathematical concept to represent and study
time series is a complicated task. For instance, consider data consisting of
2 different groups of time series, which have unequal lengths between 80 and 95
data points and are drawn in 2 distinct colors (see Figure 1). Each group corresponds
to the audio data of a particular word's pronunciation, spoken by different
people with distinct accents. To capture the underlying shape of these time
2 Minh, Bajaj

Fig. 1: (a) 2 collections of time series based on pronunciation audio data; each
collection shows the pronunciation of the word Absorptivity or Anything. (b)
and (c): The most informative timestamps for the audio data of the words
Absorptivity (red) and Anything (blue).

series, traditional methods view them as ordered vectors and leverage techniques
that are distance-based [11], feature-based [18], or shapelet-based [2]. However,
such methods are inadequate, particularly in this example, where the two groups of
time series are highly mixed and the shapes of many red series resemble those in
blue (see Figure 1a). As alternative solutions, several frameworks propose
treating time series as sequential data and then utilizing recurrent neural
networks like LSTM-FCN [12] or advanced convolutional networks on time series
like ROCKET [5]. Unfortunately, with limited data size (about 90 data points
in this case), it is particularly challenging for these techniques to learn higher-
order correlations between several timestamps and capture the exact underlying
signals of the time series. Other options, such as dictionary-based [26] or interval-
based [6] methods that rely on empirical statistics gathered along the entire data
set, are generally unreliable: the collected statistics can deviate significantly
from the true values due to noise, which limits their ability to separate noise
from the underlying signals. Such separation ability is especially crucial for
time series data partially corrupted by broken recording devices, distorted
transmissions, or privacy policies. These challenges motivate us to model time
series collections and their underlying signals via abstract stochastic processes.

First, we propose viewing each time series as an instance of a common stochastic
process governing the dynamics of its series group. With this viewpoint,
an individual time series instance may contain a large amount of noise;
yet, given our carefully designed method for fitting the underlying stochastic
process, the true dynamics can still be recovered from these noisy instances.
For example, the scatter points in Figure 1b and Figure 1c show the skeleton
approximations learned by our model for the underlying stochastic processes.
While the original dataset is mixed, each skeleton approximation uncovers a
hidden common feature among the time series of the same group. In fact, every
global movement, including going up or down, sharply increasing or decreasing,
or staying still, is precisely observed by those
approximations. Because stochastic processes operate at this level of generality,
the learned model not only records the core signals but also captures the relevant
statistical dependence between multiple data points.

The stochastic viewpoint alone is not sufficient for data composed of multiple
underlying dynamical models. Many research works [24,7,3,23] use different types
of stochastic processes to model time series. Nonetheless, their approaches are
limited to one single dynamical model and often focus solely on a single
series. In contrast, our framework is capable of handling multiple time
series by introducing the variational definition of the most informative
timestamps (see Definition 4). Building upon the stochastic modeling viewpoint, to
handle multiple dynamics at once, we assign a signature vector called a motion
code to each dynamical model and take a joint learning approach that optimizes all
motion codes together. Augmented by sparse learning techniques, our approach
prevents overfitting to any particular stochastic process and provides enough
discrimination to yield distinctive features between individual dynamics.

The final model, called Motion Code, is therefore able to learn across many
underlying processes in a holistic and robust manner. Additionally, thanks to
stochastic process modeling, Motion Code can provide interpretable dynamics
features and handle missing data and varying-length time series. To summarize, our
contributions include:

1. A learning model called Motion Code that jointly learns across different
collections of noisy time series data. Motion Code explicitly models the
underlying stochastic process and is capable of capturing and separating
noise from the core signals.
2. Motion Code allows solving time series classification and forecasting si-
multaneously without the need for separate models for each task.
3. An interpretable feature for a collection of noisy time series called the most
informative timestamps that utilizes variational inference to learn the
underlying stochastic process from the given data collection.

1.1 Related works

Time series classification. For time series classification, there is a variety of
classical techniques, including distance-based [11], interval-based [6], dictionary-
based [26], shapelet-based [2], and feature-based [18] methods, as well as ensemble
models [14,22] that combine individual classifiers. In terms of deep learning [10],
there are convolutional neural networks [12,5], modified versions of residual
networks [19], and auto-encoders [9].

Time series forecasting. For forecasting, researchers have developed several
approaches such as exponential smoothing [8], TBATS [4], ARIMA [20],
probabilistic state-space models [7], and deep learning frameworks [13,16].

Stochastic modeling of time series. Gaussian processes [25] have been
frequently used to model continuous-time series. To reduce the computational cost,
the sparse Gaussian process [29] was proposed to approximate the true posterior
with pseudo-training examples. Along this line of work, several different versions
of generative models together with approximate/exact inference [24,3,23] have
been developed. While these works are restricted to one single series, our
learning approach applies to general multi-time-series collections.

The rest of the paper is organized as follows: Section 2 provides mathematical
and algorithmic details of our Motion Code framework. Section 3 includes
experiments and benchmarks for Motion Code on time series classification
and forecasting. Section 4 presents a discussion advocating the benefits of our
learning framework.

2 Motion Code: Joint learning on collection of time series


2.1 Problem statement
We first formulate the time series problem in the context of stochastic processes.

Input: The training time series data consist of samples that belong to exactly L
(L ∈ N, L ≥ 2) underlying stochastic processes {G_k}_{k=1}^{L}. More specifically,
for each k ∈ 1, L, let C_k = {y^{i,k}}_{i=1}^{B_k} be the sample set consisting of
B_k time series y^{i,k}, all of which are samples from the k-th stochastic process
G_k. Here each time series y^{i,k} = (y^{i,k}_t)_{t ∈ T_{i,k}} has the timestamps
set T_{i,k} ⊂ R^+. Each real variable y^{i,k}_t ∈ R at time t ∈ T_{i,k} is called
a data point of y^{i,k}. We also associate the time series y^{i,k} with its data
point vector (y^{i,k}_t)_{t ∈ T_{i,k}} ∈ R^{|T_{i,k}|}. The training data comprise
L collections of time series {C_k}_{k=1}^{L} corresponding to the processes
{G_k}_{k=1}^{L}.

Tasks and required outputs: The main task is to produce a model M that
jointly learns the processes {G_k}_{k=1}^{L} from the given data {C_k}_{k=1}^{L}.
Moreover, the parameters of M must be transferable to the following tasks:
1. Classification: At test time, given a new time series y = (y_t)_{t ∈ T} with
timestamps T, classify y into the closest group among the L possible groups.
2. Forecasting: Suppose a time series y has the underlying stochastic process
G_k for a particular k ∈ 1, L. Given future timestamps T, predict (y_t)_{t ∈ T}.
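As a concrete picture of this input format, the following sketch (hypothetical helper names, not the authors' code) builds L = 2 collections of noisy, unequal-length series from two underlying signals:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_collection(num_series, signal, noise_std=0.3):
    """Sample noisy realizations (T_ik, y_ik) of one underlying signal g_k."""
    collection = []
    for _ in range(num_series):
        n = int(rng.integers(80, 96))          # unequal lengths, as in Fig. 1
        t = np.sort(rng.uniform(0.0, 1.0, n))  # timestamps T_{i,k} in R+
        y = signal(t) + noise_std * rng.standard_normal(n)
        collection.append((t, y))              # one series y^{i,k}
    return collection

# L = 2 underlying processes G_k, with B_k = 5 sample series each
data = [
    make_collection(5, lambda t: np.sin(2 * np.pi * t)),
    make_collection(5, lambda t: np.cos(2 * np.pi * t)),
]
```

Note that each series carries its own timestamp set, which is what lets the framework accept missing data and varying lengths later on.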

2.2 Three assumptions underlying Motion Code


The Motion Code framework models time series via stochastic processes and
introduces a novel concept, the most informative timestamps, to handle
multi-time-series collections. To simplify the mathematical formulation and make
explicit computations feasible, we work with a particular type of stochastic
process called a nice kernelized Gaussian process (see Definition 1 and
Definition 2). The modeling of this specific stochastic process for Motion Code
is expressed through the three assumptions below, which are later used to derive
a practical algorithm (Algorithm 1) for noisy datasets.

Definition 1. A stochastic process G := {g(t)}_{t ≥ 0} is a kernelized Gaussian
process [25] with mean function µ : R → R and positive-definite kernel
function K : R × R → R iff the joint distribution of the signal vector
g_T = (g(t))_{t ∈ T} on any timestamps set T is Gaussian and is characterized by:

\[
p(g_T) = p\big((g(t))_{t \in T}\big) = \mathcal{N}(\mu_T, K_{TT}) \tag{1}
\]

Here µ_T is the mean vector (µ(t))_{t ∈ T}, and K_{TT} is the positive-definite
|T| × |T| kernel matrix (K(t, s))_{t, s ∈ T}. N(µ, Σ) denotes a Gaussian
distribution with mean µ and covariance matrix Σ. The random function g is called
the underlying signal of the stochastic process G.

Notation: Throughout this paper, for ordered sets T and S of timestamps, we
use the notation g_T for the signal vector (g(t))_{t ∈ T} ∈ R^{|T|}, µ_T for the
mean vector (µ(t))_{t ∈ T} ∈ R^{|T|}, and K_{TS} for the |T|-by-|S| kernel matrix
(K(t, s))_{t ∈ T, s ∈ S}.
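Equation (1) can be exercised numerically. The sketch below (illustrative RBF kernel and parameter values, not the paper's fitted ones) draws one signal vector g_T from a zero-mean kernelized Gaussian process:

```python
import numpy as np

def rbf_kernel(t, s, beta=50.0):
    """K(t, s) = exp(-0.5 * beta * (t - s)^2), evaluated pairwise."""
    return np.exp(-0.5 * beta * (t[:, None] - s[None, :]) ** 2)

def sample_signal(T, seed=0, jitter=1e-8):
    """Draw g_T ~ N(mu_T, K_TT) with mu = 0, as in Eq. (1)."""
    rng = np.random.default_rng(seed)
    K = rbf_kernel(T, T) + jitter * np.eye(len(T))  # jitter keeps K positive-definite
    return rng.multivariate_normal(np.zeros(len(T)), K)

T = np.linspace(0.0, 1.0, 50)
g_T = sample_signal(T)  # one realization of the underlying signal on T
```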

Definition 2. (Nice Gaussian process) A kernelized Gaussian process G =
{g(t)}_{t ≥ 0} with signal g, kernel K, and mean µ is nice if, for any two disjoint
timestamps sets T and S, the corresponding mean vectors are approximately
proportional: µ_T ≈ K_{TS} K_{SS}^{-1} µ_S. In such cases, the conditional
distribution of g_T given g_S is Gaussian with the following mean and variance:

\[
p(g_T \mid g_S, T, S) = \mathcal{N}\big(K_{TS} K_{SS}^{-1} g_S,\; K_{TT} - K_{TS} K_{SS}^{-1} K_{ST}\big) \tag{2}
\]
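The conditional in Equation (2) is a standard Gaussian conditioning step. A minimal numerical sketch (zero-mean case, illustrative kernel) is:

```python
import numpy as np

def rbf(t, s, beta=50.0):
    return np.exp(-0.5 * beta * (t[:, None] - s[None, :]) ** 2)

def conditional(T, S, g_S, kernel=rbf, jitter=1e-10):
    """Mean and covariance of g_T | g_S per Eq. (2), for a zero-mean nice GP."""
    K_TS = kernel(T, S)
    K_SS = kernel(S, S) + jitter * np.eye(len(S))
    A = np.linalg.solve(K_SS, K_TS.T).T   # A = K_TS K_SS^{-1}
    mean = A @ g_S                        # K_TS K_SS^{-1} g_S
    cov = kernel(T, T) - A @ K_TS.T       # K_TT - K_TS K_SS^{-1} K_ST
    return mean, cov

# conditioning on 10 observed values reconstructs the signal elsewhere
S = np.linspace(0.0, 1.0, 10)
T = np.linspace(0.05, 0.95, 19)
mean, cov = conditional(T, S, np.sin(2 * np.pi * S))
```

With a smooth signal such as the sine above, the conditional mean closely interpolates the held-out timestamps, which is the mechanism the skeleton approximations of Figure 1 rely on.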

Assumption 1: We explicitly model each underlying stochastic process Gk by


a kernelized Gaussian process (see Definition 1).

Assumption 2: For k ∈ 1, L, assume that the data (y^{i,k}_t)_{t ∈ T_{i,k}} has a
Gaussian distribution with mean (g_k)_{T_{i,k}} = ((g_k)(t))_{t ∈ T_{i,k}} ∈
R^{|T_{i,k}|} and covariance matrix σ I_{|T_{i,k}|}. Here, I_n is the n × n
identity matrix, and g_k is the underlying signal of the stochastic process G_k.
Moreover, the constant σ ∈ R^+ is the noise variance of the sample data around
the underlying signals.

Assumption 3: Assume that Gk is nice (see Definition 2) so that the condi-


tional distribution of signal vectors can be simplified to Equation (2).

2.3 The most informative timestamps


In this section, we develop the core mathematical concept behind Motion Code
called the most informative timestamps. The most informative timestamps
of a time series collection generalize the concept of inducing points of a single
time series that is introduced in [29]. They are a small subset of timestamps
that minimizes the mismatch between the original data and the information
reconstructed using only this subset. The visualization of the most informative
timestamps is provided in Figure 2 and is further discussed in Section 3.
To concretely define the most informative timestamps, we first introduce the
generalized evidence lower bound function (GELB) in Definition 3. We then
define the most informative timestamps as the maximizers of this GELB
function in Definition 4. Finally, the assumptions from Section 2.2 allow
simplifying the calculation of the most informative timestamps to a concrete
formula, shown in Lemma 1.

Definition 3. Suppose we are given a stochastic process G = {g(t)}_{t ≥ 0} and a
collection of time series C = {y^i}_{i=1}^{B} consisting of B independent time
series y^i sampled from G. Each series y^i = (y^i_t)_{t ∈ T_i} consists of
N_i = |T_i| data points and is called a realization of G. Let m be a fixed
positive integer. We define the generalized evidence lower bound function
L = L(C, G, S^m, ϕ) as a function of the data collection C, the stochastic
process G, the m-element timestamps set S^m = {s_1, ..., s_m} ⊂ R^+, and a
variational distribution ϕ on R^m:

\[
\mathcal{L}(C, G, S^m, \phi) := \frac{1}{B}\sum_{i=1}^{B} \int p(g_{T_i} \mid g_{S^m})\,\phi(g_{S^m})\, \log \frac{p(y^i \mid g_{T_i})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{T_i}\, dg_{S^m} \tag{3}
\]

Again, the vectors g_{T_i} and g_{S^m} are the signal vectors
(g(t))_{t ∈ T_i} ∈ R^{|T_i|} on the timestamps T_i of y^i and
(g(t))_{t ∈ S^m} ∈ R^{|S^m|} on S^m.

Definition 4. For a fixed m ∈ N, the m-element set (S^m)^* ⊂ R^+ is said to
be the most informative timestamps with respect to a noisy time series
collection C of a stochastic process G if there exists a variational distribution
ϕ^* on R^m so that:

\[
\big((S^m)^*, \phi^*\big) = \operatorname*{arg\,max}_{S^m,\, \phi}\; \mathcal{L}(C, G, S^m, \phi) \tag{4}
\]

Also define the function L^max such that L^max(C, G, S^m) := max_ϕ L(C, G, S^m, ϕ).
Hence, (S^m)^* can be found by maximizing L^max over all possible S^m.

Our Motion Code’s training loss function depends heavily on the function Lmax .
Lemma 1 below provides a closed-form formula for Lmax to help derive the
Motion Code’s Algorithm 1.
 B
Lemma 1. Suppose we’re given a collection C of B noisy time series y i i=1
sampled from a nice Gaussian process G = {g(t)}t≥0 with the underlying signal
g. Further assume that the data (yi )Ti with timestamps Ti has the Gaussian
distribution N (gTi , σI|Ti | ) for a given σ > 0. These assumptions directly follow
from Section 2.2 for a single dynamical model with noise variance constant σ.
Further assume that we’re given a fixed m-elements timestamps set S m . For
i ∈ 1, B, recall from the notations that KTi Ti , KS m Ti , and KTi S m are kernel
matrices between timestamps of y i and S m . Also define the |Ti |-by-|Ti | matrix
Title Suppressed Due to Excessive Length 7

QTi Ti := KTi S m (KS m S m )−1 KS m Ti for i ∈ 1, B. Hence, across all time series,
we have the following data vector Y and the joint matrix QC,G :
 1  
y QT1 T1 0 0
Y =  ...  , QC,G =  0 . . . (5)
   
0 
yB 0 0 QTB TB

Then the function Lmax defined in Definition 4 has the following closed-form:
B
1 X
Lmax (C, G, S m ) = log pN (Y |0, Bσ 2 I +QC,G )− T r(KTi Ti −QTi Ti ) (6)
2σ 2 B i=1

where pN (X|µ, Σ) denotes the density function of a Gaussian random variable X


with mean µ and covariance matrix Σ. Furthermore, the optimal variational dis-
tribution ϕ∗ = arg maxϕ L(C, G, S m , ϕ). is a Gaussian distribution of the form:
B
! !
∗ −2 1 X
ϕ (gS m ) = N σ KS m S m Σ KS m Ti y i , KS m S m ΣKS m S m (7)
B i=1

σ −2 PB
where Σ = Λ−1 with Λ := KS m S m + i=1 KS Ti KTi S .
m m
B
Proof. The proof is given in the Supplementary Materials.
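Lemma 1's closed form can be evaluated directly. The sketch below is ours, not the authors' implementation: it assumes a zero-mean process, an illustrative RBF kernel, and transcribes Eq. (6) term by term for a toy collection:

```python
import numpy as np

def rbf(t, s, beta=50.0):
    return np.exp(-0.5 * beta * (t[:, None] - s[None, :]) ** 2)

def log_normal_density(x, cov):
    """log p_N(x | 0, cov)."""
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(cov, x))

def L_max(collection, S, sigma, kernel=rbf, jitter=1e-8):
    """Closed-form L^max of Eq. (6) for one collection and candidate timestamps S."""
    B = len(collection)
    K_SS = kernel(S, S) + jitter * np.eye(len(S))
    blocks, trace_term, ys = [], 0.0, []
    for t, y in collection:
        K_tS = kernel(t, S)
        Q = K_tS @ np.linalg.solve(K_SS, K_tS.T)   # Q_{T_i T_i}
        blocks.append(Q)
        trace_term += np.trace(kernel(t, t) - Q)   # Tr(K_{T_i T_i} - Q_{T_i T_i})
        ys.append(y)
    Y = np.concatenate(ys)
    N = len(Y)
    Q_CG = np.zeros((N, N))                        # block-diagonal joint matrix, Eq. (5)
    ofs = 0
    for Q in blocks:
        n = Q.shape[0]
        Q_CG[ofs:ofs + n, ofs:ofs + n] = Q
        ofs += n
    cov = B * sigma ** 2 * np.eye(N) + Q_CG
    return log_normal_density(Y, cov) - trace_term / (2 * sigma ** 2 * B)

# toy usage: two noisy realizations of the same signal, 8 candidate timestamps
t1, t2 = np.linspace(0.0, 1.0, 30), np.linspace(0.0, 1.0, 25)
coll = [(t1, np.sin(2 * np.pi * t1)), (t2, np.sin(2 * np.pi * t2) + 0.1)]
value = L_max(coll, np.linspace(0.0, 1.0, 8), sigma=0.5)
```

Maximizing this quantity over the candidate set S is precisely how the most informative timestamps are selected.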

2.4 Motion Code Learning

With the core concept of the most informative timestamps outlined in
Section 2.3, we can now describe the Motion Code learning framework in detail:

Model and parameters: Let the k-th stochastic process G_k be modeled by a
nice Gaussian process with kernel function K^{η_k} parameterized by η_k for
k ∈ 1, L. Next, we normalize all timestamps to the interval [0, 1]. We choose a
fixed m ∈ N as the number of most informative timestamps, and a fixed latent
dimension d ∈ N. We model the most informative timestamps S^{m,k} of the k-th
stochastic process G_k (with data collection C_k) jointly for all k ∈ 1, L by a
common map G : R^d → R^m. More concretely, we choose L different d-dimensional
vectors z_1, ..., z_L ∈ R^d called motion codes, and model S^{m,k} by:

\[
\widehat{S}^{m,k} := \sigma(G(z_k)) \in \mathbb{R}^m, \quad \text{where } \sigma \text{ is the standard sigmoid function.} \tag{8}
\]

We approximate G by a linear map specified by the parameter matrix Θ so that
G(z_k) ≈ Θ z_k. Hence, Motion Code includes 3 types of parameters:

1. Kernel parameters η := (η_1, ..., η_L) for the underlying Gaussian processes G_k.
2. Motion codes z := (z_1, ..., z_L) with z_i ∈ R^d.
3. The joint map parameter Θ with dimension m × d.
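The parameterization in Eq. (8) is just a sigmoid applied to a linear map of each motion code; in code (shapes follow the paper's hyperparameters, initial values here are random placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

m, d, L = 10, 2, 3                   # informative timestamps, latent dim, processes
rng = np.random.default_rng(0)
Theta = rng.standard_normal((m, d))  # joint map parameter, shared across processes
Z = rng.standard_normal((L, d))      # one motion code z_k per process

S_hat = sigmoid(Z @ Theta.T)         # Eq. (8): row k is the predicted S^{m,k} in (0, 1)^m
```

The sigmoid keeps every predicted timestamp inside the normalized horizon [0, 1], so the candidate timestamps stay valid throughout optimization.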

Training loss function: The goal is to make Ŝ^{m,k} approximate the true S^{m,k},
which is the maximizer of L^max with the explicit formula given in Equation (6).
As a result, we want to maximize L^max(C_k, G_k, Ŝ^{m,k}) for all k, leading to the
following loss function:

\[
\mathcal{U}(\eta, z, \Theta) = -\sum_{k=1}^{L} \mathcal{L}^{\max}(C_k, G_k, \widehat{S}^{m,k}) + \lambda \sum_{k=1}^{L} \lVert z_k \rVert_2^2 \tag{9}
\]

Here the last term is the regularization term for the motion codes z_k with
hyperparameter λ. An explicit training algorithm is given below (see Algorithm 1).

Algorithm 1 Motion Code training algorithm

Input: L collections of time series data C_k = {y^{i,k}}_{i=1}^{B_k}, where the
series y^{i,k} has timestamps T_{i,k}. Additional hyperparameters include the
number of most informative timestamps m, the motion code dimension d, the maximum
number of iterations M, and the stopping threshold ϵ.
Output: Parameters η, z, Θ described above that optimize the loss function U(η, z, Θ).

1: Initialize η and z to constant vectors of ones, and Θ to the constant matrix
   whose every column is the arithmetic sequence between 0.1 and 0.9.
2: repeat
3:   Use the current parameters η, z to calculate the predicted most informative
     timestamps for the k-th stochastic process: Ŝ^{m,k} = σ(Θ z_k).
4:   For k ∈ 1, L and i ∈ 1, B_k, calculate K^{η_k}_{S^{m,k} S^{m,k}},
     K^{η_k}_{S^{m,k} T_{i,k}}, K^{η_k}_{T_{i,k} S^{m,k}}, the corresponding
     matrices Q, and Q_{C,G} defined in Lemma 1.
5:   Use the above results to calculate L^max(C_k, G_k, Ŝ^{m,k}) given by
     Equation (6) via an automatic differentiation framework for k ∈ 1, L.
6:   Calculate the loss U(η, z, Θ) and its differential via automatic differentiation.
7:   Update the parameters (η, z, Θ) using the Limited-memory
     Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [15].
8: until the number of iterations exceeds M or the training loss decreases by less than ϵ.
9: Output the final (η, z, Θ).

2.5 Classification and Forecasting with Motion Code

We use the trained parameters η, z, Θ from Algorithm 1 to perform time series
forecasting and classification. We need an intermediate step of finding
preliminary predictions that give us the predicted mean signal
p_k = E[(g_k)_T] ∈ R^{|T|}. With such p_k, we can then perform the forecasting
and classification tasks.

Preliminary predictions: For a given k ∈ 1, L, the predicted distribution of
the signal vector (g_k)_T is calculated by marginalizing over the signal
(g_k)_{S^{m,k}} on the most informative timestamps S^{m,k} of the process G_k:

\[
p((g_k)_T) = \int p\big((g_k)_T \mid (g_k)_{S^{m,k}}\big)\, \phi^*\big((g_k)_{S^{m,k}}\big)\, d(g_k)_{S^{m,k}} \tag{10}
\]

where the optimal variational distribution ϕ^* is a Gaussian distribution with the
following mean µ_k and covariance matrix A_k (based on Lemma 1):

\[
\mu_k = \sigma^{-2} K^{\eta_k}_{S^{m,k} S^{m,k}} \Sigma \left(\frac{1}{B_k}\sum_{i=1}^{B_k} K^{\eta_k}_{S^{m,k} T_{i,k}} y^i\right), \qquad
A_k = K^{\eta_k}_{S^{m,k} S^{m,k}} \Sigma K^{\eta_k}_{S^{m,k} S^{m,k}} \tag{11}
\]

where Σ = Λ^{-1} with Λ := K^{η_k}_{S^{m,k} S^{m,k}} + (σ^{-2}/B_k) Σ_{i=1}^{B_k}
K^{η_k}_{S^{m,k} T_{i,k}} K^{η_k}_{T_{i,k} S^{m,k}}.

The convolution in Equation (10) only involves the product of two Gaussians
and can be calculated explicitly as:

\[
p((g_k)_T) = \mathcal{N}\Big( K^{\eta_k}_{T S^{m,k}} (K^{\eta_k}_{S^{m,k} S^{m,k}})^{-1} \mu_k,\;
K^{\eta_k}_{TT} - K^{\eta_k}_{T S^{m,k}} (K^{\eta_k}_{S^{m,k} S^{m,k}})^{-1} K^{\eta_k}_{S^{m,k} T}
+ K^{\eta_k}_{T S^{m,k}} (K^{\eta_k}_{S^{m,k} S^{m,k}})^{-1} A_k (K^{\eta_k}_{S^{m,k} S^{m,k}})^{-1} K^{\eta_k}_{S^{m,k} T} \Big) \tag{12}
\]

Forecasting: For the stochastic process G_k, we output the mean vector
p_k = E[(g_k)_T] ∈ R^{|T|} obtained from Equation (12) as the prediction for the
whole process.

Classification: To classify a series y with timestamps T, we first calculate
p_k = E[(g_k)_T] ∈ R^{|T|} on T from Equation (12) for each k ∈ 1, L. Motion Code
outputs a prediction based on the label of the closest p_k:

\[
k_{\text{predicted}} = \operatorname*{arg\,min}_{k} \lVert y - p_k \rVert_{2, \mathbb{R}^{|T|}} \tag{13}
\]

Here we use the simple Euclidean distance ∥·∥_{2, R^{|T|}}.
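Once the mean predictions p_k are available, Eq. (13) reduces to a nearest-mean rule. A minimal sketch (the p_k below are toy stand-ins, not fitted predictions):

```python
import numpy as np

def classify(y, mean_predictions):
    """Return the index k of the closest predicted mean p_k (Eq. 13)."""
    dists = [np.linalg.norm(y - p) for p in mean_predictions]
    return int(np.argmin(dists))

t = np.linspace(0.0, 1.0, 50)
p = [np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)]  # toy stand-ins for p_1, p_2
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(50)
label = classify(y, p)  # -> 0: the noisy sine is far closer to p_1 than to p_2
```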

Time complexity: The matrix multiplication between matrices of size m-by-m
and size m-by-|T_{i,k}| (or |T_{i,k}|-by-m) is the most expensive individual
operation in Algorithm 1. Hence, Algorithm 1 has time complexity
O((Σ_{k=1}^{L} Σ_{i=1}^{B_k} m^2 |T_{i,k}|) × M) = O(m^2 N M), where
N = Σ_{k=1}^{L} Σ_{i=1}^{B_k} |T_{i,k}| is the total number of data
points, M is the maximum number of iterations, and m is the number of most
informative timestamps. For time series tasks, by the same argument, the cost
of predicting a single mean vector p_k defined above is O(m^2 |T|). Hence, the
cost of forecasting on timestamps T is O(m^2 |T|). For classification over L groups
of time series, classifying a time series with timestamps T costs O(m^2 |T| L). As m
is chosen relatively small, the above complexities are approximately linear in
the number of data points of the input time series.

Kernel choice: For implementation, we use a spectral kernel of the form
K^η(t, s) := Σ_{j=1}^{J} α_j exp(−0.5 β_j |t − s|^2) with parameters
η = (α_1, ..., α_J, β_1, ..., β_J).
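A direct transcription of this kernel (the parameter values below are illustrative, and the squared-distance exponent follows the RBF form implied by the text):

```python
import numpy as np

def spectral_kernel(t, s, alphas, betas):
    """K^eta(t, s) = sum_j alpha_j * exp(-0.5 * beta_j * |t - s|^2), pairwise."""
    d2 = (t[:, None] - s[None, :]) ** 2
    return sum(a * np.exp(-0.5 * b * d2) for a, b in zip(alphas, betas))

t = np.linspace(0.0, 1.0, 5)
K = spectral_kernel(t, t, alphas=[1.0], betas=[50.0])  # J = 1, as used in Section 3
```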

Hyperparameters: In Section 3, we choose the number of most informative
timestamps m = 10, the latent dimension d = 2, and the number of kernel
components J = 1. Other hyperparameters include λ = 1, ϵ = 10^{-5}, and M = 10.

3 Experiments

Table 1: Classification accuracy (in percent) for 7 time series classification
algorithms: DTW, TSF, RISE, BOSS, BOSS-E, catch22, and our Motion Code.

Data sets DTW TSF RISE BOSS BOSS-E catch22 Motion Code
Chinatown 54.23 61.22 65.6 47.81 41.69 55.39 66.47
ECGFiveDays 54.47 58.07 59.35 50.06 58.42 52.85 66.55
FreezerSmallTrain 52.42 54.28 53.79 50 50.95 53.58 70.25
GunPointOldVersusYoung 92.7 99.05 98.73 87.62 93.33 98.41 91.11
HouseTwenty 57.98 57.14 42.02 52.1 57.98 45.38 70.59
InsectEPGRegularTrain 100 100 83.13 99.2 91.97 95.98 100
ItalyPowerDemand 57.05 68.71 65.79 52.77 53.26 55.88 72.5
Lightning7 21.92 28.77 26.03 12.33 28.77 24.66 31.51
MoteStrain 56.47 61.1 61.5 53.83 53.51 57.19 72.68
PowerCons 78.33 92.22 85.56 65.56 77.22 80 92.78
SonyAIBORobotSurface2 63.27 67.68 69.78 48.06 61.91 64.43 75.97
Sound 50 87.5 62.5 68.75 62.5 50 87.5
SineCurves 100 100 100 62.71 100 100 100
UWaveGestureLibraryAll 78.25 83.67 79.79 12.23 74.87 47.38 80.18

3.1 Data sets

We acquire 12 sensor and device time series data sets from the publicly available
UCR data sets [1]. We prepare 2 additional data sets named Sound [21] and
SineCurves. Each data set consists of multiple collections of time series, and
each collection has a unique label. To each of the 14 data sets, we add Gaussian
noise with standard deviation σ = 0.3A, where A is the maximum possible
absolute value of all data points. The inherent noise in the original data sets is
not adequate for robust testing, since even simpler algorithms like DTW perform
very well (near or over 90% accuracy) on many datasets. The added noise
thus serves as an adversarial factor to stress-test the robustness of time series
algorithms. We then run the Motion Code algorithm on these 14 datasets
to extract optimal parameters for the downstream classification and
forecasting tasks. All experiments are executed on an Nvidia A100 GPU.
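The noise-injection protocol above can be sketched as follows (the helper name is ours; `series_list` holds the raw value arrays of one data set):

```python
import numpy as np

def add_adversarial_noise(series_list, ratio=0.3, seed=0):
    """Add Gaussian noise with std = ratio * A, where A = max |value| in the data set."""
    rng = np.random.default_rng(seed)
    A = max(float(np.max(np.abs(y))) for y in series_list)
    return [y + ratio * A * rng.standard_normal(len(y)) for y in series_list]

clean = [np.sin(np.linspace(0.0, 6.28, 90)), 2.0 * np.cos(np.linspace(0.0, 6.28, 85))]
noisy = add_adversarial_noise(clean)  # noise std = 0.3 * 2.0 for this toy data set
```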

Table 2: Classification accuracy (in percent) for 7 time series classification
algorithms: Shapelet, Teaser, SVC, LSTM-FCN, Rocket, Hive-Cote 2, and our Motion
Code. "Error" means that the run on the given data set failed.

Data sets Shapelet Teaser SVC LSTM-FCN Rocket Hive-Cote 2 Motion Code
Chinatown 61.22 Error 56.27 66.47 62.97 61.52 66.47
ECGFiveDays 52.61 Error 49.71 53.54 56.79 55.75 66.55
FreezerSmallTrain 50 50.11 50 50 52.67 58.18 70.25
GunPointOldVersusYoung 74.6 89.52 52.38 52.38 90.48 98.73 91.11
HouseTwenty 57.14 53.78 45.38 57.98 58.82 59.66 70.59
InsectEPGRegularTrain 44.58 100 85.94 100 46.18 100 100
ItalyPowerDemand 62.49 63.17 49.85 61.61 70.36 72.98 72.5
Lightning7 20.55 21.92 26.03 17.81 27.4 32.88 31.51
MoteStrain 47.76 Error 50.64 56.55 68.85 56.95 72.68
PowerCons 74.44 51.11 77.78 68.33 87.22 90 92.78
SonyAIBORobotSurface2 69.36 68.84 61.7 63.27 74.71 78.49 75.97
Sound 68.75 Error 62.5 56.25 75 75 87.5
SineCurves 100 96.61 67.8 100 100 100 100
UWaveGestureLibraryAll 49.5 26.52 Error 12.67 83.45 78.5 80.18

We compare Motion Code's performance on time series classification with
12 other algorithms: DTW [11], TSF [6], RISE [14], BOSS [26], BOSS-E
[26], catch22 [18], Shapelet [2], Teaser [27], SVC [17], LSTM-FCN [12],
Rocket [5], and Hive-Cote 2 [22]. Their implementations come from the
sktime library [17]. We use classification accuracy (measured in percent) for
performance comparison. Table 1 and Table 2 show that Motion Code outperforms
all other algorithms on 9 of the 14 noisy data sets and performs second-best
on another 3, behind only the ensemble model Hive-Cote 2. This
result demonstrates our method's robustness when dealing with collections of
noisy time series. Unlike the other algorithms, Motion Code can handle time
series collections with unequal sequence lengths and missing data; however, for
fair comparison, we do not consider such special data here.

3.2 Time series forecasting

For forecasting, each of the 14 data sets is divided into two parts: 80% of the data
points go to the training set, and the remaining 20% of future data points are set
aside for testing. For Motion Code, we output a single prediction for all series in
the same collection. We choose 5 algorithms as our baselines: Exponential
smoothing [8], ARIMA [20], State-space model [7], TBATS [4], and Last
seen. Last seen is a simple algorithm that uses the previous values to predict the
next time steps. Their implementations come from the statsmodels [28] and
sktime [17] libraries. We run the 5 baseline algorithms with individual predictions

Table 3: Average root mean-square error (RMSE) for 6 time series forecasting
algorithms: Exponential smoothing, ARIMA, State-space model, Last seen,
TBATS, and Motion Code on 14 data sets. Values are the RMSE between the
prediction and the ground truth, averaged over the data points.

Data sets Exp. Smoothing ARIMA State space Last seen TBATS Motion Code
Chinatown Error 1079 775.96 723.1 633.04 518.49
ECGFiveDays 0.34 0.43 1.58 0.19 0.17 0.27
FreezerSmallTrain 0.88 0.58 0.93 0.57 0.56 0.74
GunPointOldVersusYoung 60.38 128.44 59.83 41.51 20.94 417.94
HouseTwenty 1117 3386 730.51 497.3 560.88 648.27
InsectEPGRegularTrain 0.043 0.095 0.25 0.019 0.02 0.048
ItalyPowerDemand Error 2.02 2.37 1.24 0.96 0.67
Lightning7 1.7 2.85 1.7 1.35 1.35 1.08
MoteStrain 1.11 1.52 1.09 1.01 0.88 0.82
PowerCons 3.38 4.85 4.41 1.77 1.72 1.15
SonyAIBORobotSurface2 2.79 2.01 3.21 1.39 1.52 2.26
Sound 0.087 0.27 0.086 0.1 0.059 0.085
SineCurves 3.5 4.37 3.53 1.53 1.63 1.07
UWaveGestureLibraryAll 4.37 5 4.45 0.98 1.44 0.98

and provide comparison results in Table 3. Even without making individual
predictions, Motion Code still surpasses the other methods on 7 of the 14 data sets.

4 Discussion on Motion Code’s benefits

Interpretable feature: Despite several noisy time series deviating from the
common mean, the points at the most informative timestamps S^{m,k} form
a skeleton approximation of the underlying stochastic process. All the important
twists and turns are consistently observed by the corresponding points at these
important timestamps (see Figure 2). Those points create a feature that helps
visualize the underlying dynamics with explicit global behaviors such as
increasing, decreasing, or staying still, unlike the original complex time series
collections with no visible common patterns among series.

Uneven length and missing data handling: For each time series, Motion
Code processes individual data points and their timestamps one by one. Hence,
the time series fed into the algorithm can have different timestamp sets. We
include an experiment to illustrate that even with missing data, Motion Code still
learns the underlying dynamics sufficiently well. In the Sound data, we randomly
remove about 10% of the data points from each series and retrain the model
on the modified data. The time series in the new dataset have both different
Fig. 2: Forecasting with uncertainty for time series collections in SineCurves
(motions 1 and 3) and MoteStrain (humidity and temperature sensors). Mean
values on the most informative timestamps are included. Here, we train Motion
Code on the time horizon [0, 0.8] and predict on [0.8, 1].

lengths and out-of-sync timestamp values. The resulting model still gives the
same accuracy and an accurate skeleton approximation (see Figure 1), showing
Motion Code's effectiveness for uneven-length and missing data.

Learning effectiveness across multiple time series: To illustrate Motion
Code's ability to learn across several time series, we compare its performance
with an alternative approach that learns from each individual series through a
two-step feature extraction procedure on the SineCurves data. First, we apply
sparse Gaussian process regression [29] to extract the optimal inducing points
from each series and stack them into a vector X. Then, we use singular value
decomposition to perform a low-rank compression of X and obtain the
corresponding 2D features. Figure 3 shows that the features extracted by this
process yield intertwined and inseparable clusters. Additionally, the expensive
operations on each individual series are prohibitive in practice. In contrast,
Motion Code achieves 100% classification accuracy on the same data set (see
Table 1), suggesting the advantage of Motion Code's holistic approach across
series and hidden dynamics.

Fig. 3: Two-dimensional features extracted from individual time series in the data
set SineCurves via separate sparse Gaussian process models.
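The two-step baseline above can be sketched as follows. As a simplification, a uniform subsample of each series stands in for the optimal sparse-GP inducing points of [29], so this is an illustrative approximation of the procedure rather than its exact implementation:

```python
import numpy as np

def extract_2d_features(series_list, n_inducing=10):
    """Sketch of the two-step baseline: summarize each series by m 'inducing'
    values (here a uniform subsample stands in for sparse-GP inducing points),
    stack the summaries into a matrix X, then compress each row to a 2D
    feature via a rank-2 SVD projection."""
    X = []
    for y in series_list:
        idx = np.linspace(0, len(y) - 1, n_inducing).astype(int)
        X.append(np.asarray(y)[idx])
    X = np.stack(X)                 # shape (num_series, n_inducing)
    X = X - X.mean(axis=0)          # center before SVD
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :2] * S[:2]         # one 2D feature per series

# toy usage: three noisy sine-like series of equal length
t = np.linspace(0, 1, 500)
series = [np.sin(2 * np.pi * k * t)
          + 0.1 * np.random.default_rng(k).normal(size=t.size)
          for k in (1, 2, 3)]
feats = extract_2d_features(series)
print(feats.shape)  # (3, 2)
```

As the paper notes, features produced this way treat each series in isolation, so nothing forces series from the same hidden dynamics to land near each other in the 2D space.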

5 Conclusion

In this work, we employ variational inference and a sparse stochastic process
model to develop an integrated framework called Motion Code. The method
performs time series forecasting simultaneously with classification across
different collections of time series data, whereas most current methods focus
on only one task at a time. Our Motion Code model is particularly robust to
noise and delivers competitive performance against other popular time series
classification and forecasting algorithms. Moreover, as demonstrated in
Section 4, Motion Code provides an interpretable feature that effectively
captures the core information of the underlying stochastic process. Finally,
our method handles varying-length time series and missing data, which many
other methods fail to do. In future work, we plan to generalize Motion Code
with non-Gaussian priors adapted to time series from various application domains.

Author acknowledgement: This research was supported in part by NIH grant
DK129979; in part by the Peter O'Donnell Foundation, the Michael J. Fox
Foundation, and the Jim Holland-Backcountry Foundation; and in part by a grant
from the Army Research Office accomplished under Cooperative Agreement Number
W911NF-19-2-0333.

Author Contributions: Motion Code was developed by Minh Nguyen and improved
by Chandrajit Bajaj and Minh Nguyen. Author names are currently listed in
alphabetical order. The implementation was done by Minh Nguyen and is
available at [Link]

References
1. Bagnall, A., Lines, J., Bostrom, A., Large, J., Keogh, E.: The great time series
classification bake off: a review and experimental evaluation of recent algorithmic
advances. Data Min. Knowl. Discov. 31(3), 606–660 (2017)

2. Bostrom, A., Bagnall, A.: Binary shapelet transform for multiclass time series clas-
sification. In: Transactions on Large-Scale Data- and Knowledge-Centered Systems
XXXII, pp. 24–46. Springer Berlin Heidelberg, Berlin, Heidelberg (2017)
3. Cao, Y., Brubaker, M.A., Fleet, D.J., Hertzmann, A.: Efficient optimization for
sparse gaussian process regression. In: Advances in Neural Information Processing
Systems. vol. 26. Curran Associates, Inc. (2013)
4. De Livera, A.M., Hyndman, R.J., Snyder, R.D.: Forecasting time series with com-
plex seasonal patterns using exponential smoothing. J. Am. Stat. Assoc. 106(496),
1513–1527 (2011)
5. Dempster, A., Petitjean, F., Webb, G.I.: ROCKET: exceptionally fast and accurate
time series classification using random convolutional kernels. Data Min. Knowl.
Discov. 34(5), 1454–1495 (2020)
6. Deng, H., Runger, G., Tuv, E., Vladimir, M.: A time series forest for classification
and feature extraction. Inf. Sci. (Ny) 239, 142–153 (2013)
7. Durbin, J.: Time Series Analysis by State Space Methods: Second Edition. Oxford
University Press (2012)
8. Holt, C.C.: Forecasting seasonals and trends by exponentially weighted moving
averages. Int. J. Forecast. 20(1), 5–10 (2004)
9. Hu, Q., Zhang, R., Zhou, Y.: Transfer learning for short-term wind speed prediction
with deep neural networks. Renew. Energy 85, 83–95 (2016)
10. Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A.: Deep
learning for time series classification: a review. Data Min. Knowl. Discov. 33(4),
917–963 (2019)
11. Jeong, Y.S., Jeong, M.K., Omitaomu, O.A.: Weighted dynamic time warping for
time series classification. Pattern Recognit. 44(9), 2231–2240 (2011)
12. Karim, F., Majumdar, S., Darabi, H., Harford, S.: Multivariate LSTM-FCNs for
time series classification. Neural Netw. 116, 237–245 (2019)
13. Lim, B., Zohren, S.: Time-series forecasting with deep learning: a survey. Philos.
Trans. A Math. Phys. Eng. Sci. 379(2194), 20200209 (2021)
14. Lines, J., Taylor, S., Bagnall, A.: HIVE-COTE: The hierarchical vote collective of
transformation-based ensembles for time series classification. In: 2016 IEEE 16th
International Conference on Data Mining (ICDM). IEEE (2016)
15. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale opti-
mization. Math. Program. 45(1-3), 503–528 (1989)
16. Liu, Z., Zhu, Z., Gao, J., Xu, C.: Forecast methods for time series data: A survey.
IEEE Access 9, 91896–91912 (2021)
17. Löning, M., Bagnall, A., Ganesh, S., Kazakov, V., Lines, J., Király, F.J.: sktime:
A Unified Interface for Machine Learning with Time Series. arXiv e-prints
arXiv:1909.07872 (Sep 2019)
18. Lubba, C.H., Sethi, S.S., Knaute, P., Schultz, S.R., Fulcher, B.D., Jones, N.S.:
catch22: CAnonical time-series CHaracteristics: Selected through highly compara-
tive time-series analysis. Data Min. Knowl. Discov. 33(6), 1821–1852 (2019)
19. Ma, Q., Shen, L., Chen, W., Wang, J., Wei, J., Yu, Z.: Functional echo state
network for time series classification. Inf. Sci. (Ny) 373, 1–20 (2016)
20. Malki, Z., Atlam, E.S., Ewis, A., Dagnew, G., Alzighaibi, A.R., ELmarhomy, G.,
Elhosseini, M.A., Hassanien, A.E., Gad, I.: ARIMA models for predicting the end
of COVID-19 pandemic and the risk of second rebound. Neural Comput. Appl.
33(7), 2929–2948 (2021)
21. Media, F.: Forvo: The pronunciation guide. [Link] (2022), accessed:
2022-04-01

22. Middlehurst, M., Large, J., Flynn, M., Lines, J., Bostrom, A., Bagnall, A.: HIVE-
COTE 2.0: a new meta ensemble for time series classification. Mach. Learn. 110(11-
12), 3211–3243 (2021)
23. Moss, H.B., Ober, S.W., Picheny, V.: Inducing point allocation for sparse gaussian
processes in high-throughput bayesian optimisation. In: Proceedings of The 26th
International Conference on Artificial Intelligence and Statistics. Proceedings of
Machine Learning Research, vol. 206, pp. 5213–5230. PMLR (25–27 Apr 2023)
24. Qi, Y., Abdel-Gawad, A.H., Minka, T.P.: Sparse-posterior gaussian processes for
general likelihoods. In: Proceedings of the 26th conference on uncertainty in arti-
ficial intelligence. pp. 450–457. Citeseer (2010)
25. Rasmussen, C.E., Williams, C.K.: Gaussian processes for machine learning. MIT
Press (2006)
26. Schäfer, P.: The BOSS is concerned with time series classification in the presence
of noise. Data Min. Knowl. Discov. 29(6), 1505–1530 (2015)
27. Schäfer, P., Leser, U.: TEASER: early and accurate time series classification. Data
Min. Knowl. Discov. 34(5), 1336–1362 (2020)
28. Seabold, S., Perktold, J.: Statsmodels: Econometric and Statistical Modeling
with Python. In: Proceedings of the 9th Python in Science Conference. pp. 92–96
(2010)
29. Titsias, M.: Variational learning of inducing variables in sparse gaussian processes.
In: Proceedings of the Twelth International Conference on Artificial Intelligence
and Statistics. Proceedings of Machine Learning Research, vol. 5, pp. 567–574.
PMLR, Florida, USA (16–18 Apr 2009)

A Proof of Lemma 1

Lemma 1. Suppose we are given a collection $C$ of $B$ noisy time series $\{y^i\}_{i=1}^{B}$ sampled from a nice Gaussian process $G = \{g(t)\}_{t \ge 0}$ with underlying signal $g$. Further assume that the data $(y^i)_{T_i}$ with timestamps $T_i$ has the Gaussian distribution $\mathcal{N}(g_{T_i}, \sigma^2 I_{|T_i|})$ for a given $\sigma > 0$. These assumptions directly follow from Section 2.2 for a single dynamical model with constant noise variance $\sigma^2$. Further assume that we are given a fixed $m$-element timestamp set $S^m$. For $i \in \overline{1, B}$, recall from the notations that $K_{T_i T_i}$, $K_{S^m T_i}$, and $K_{T_i S^m}$ are kernel matrices between the timestamps of $y^i$ and $S^m$. Also define the $|T_i|$-by-$|T_i|$ matrix $Q_{T_i T_i} := K_{T_i S^m} (K_{S^m S^m})^{-1} K_{S^m T_i}$ for $i \in \overline{1, B}$. Hence, across all time series, we have the following data vector $Y$ and joint block-diagonal matrix $Q_{C,G}$:

$$Y = \begin{pmatrix} y^1 \\ \vdots \\ y^B \end{pmatrix}, \qquad Q_{C,G} = \begin{pmatrix} Q_{T_1 T_1} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & Q_{T_B T_B} \end{pmatrix} \quad (5)$$

Then the function $\mathcal{L}_{\max}$ defined in Definition 4 has the following closed form:

$$\mathcal{L}_{\max}(C, G, S^m) = \log p_N\!\left(Y \mid 0,\; B\sigma^2 I + Q_{C,G}\right) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}\!\left(K_{T_i T_i} - Q_{T_i T_i}\right) \quad (6)$$

where $p_N(X \mid \mu, \Sigma)$ denotes the density function of a Gaussian random variable $X$ with mean $\mu$ and covariance matrix $\Sigma$. Furthermore, the optimal variational distribution $\phi^* = \arg\max_{\phi} \mathcal{L}(C, G, S^m, \phi)$ is a Gaussian distribution of the form:

$$\phi^*(g_{S^m}) = \mathcal{N}\!\left(\sigma^{-2} K_{S^m S^m} \Sigma \left(\frac{1}{B} \sum_{i=1}^{B} K_{S^m T_i} y^i\right),\; K_{S^m S^m} \Sigma K_{S^m S^m}\right) \quad (7)$$

where $\Sigma = \Lambda^{-1}$ with $\Lambda := K_{S^m S^m} + \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} K_{S^m T_i} K_{T_i S^m}$.
Proof. Define the conditional mean signal vector $\alpha_i = \mathbb{E}[g_{T_i} \mid g_{S^m}]$. From Equation (2), $\alpha_i = K_{T_i S^m} (K_{S^m S^m})^{-1} g_{S^m}$. Let $A := (\alpha_1^\top, \ldots, \alpha_B^\top)^\top$ be the combined mean signal vector. With these notations, following the derivation from [29], the individual terms in Equation (3) can be simplified as follows:

$$\begin{aligned}
&\int p(g_{T_i} \mid g_{S^m})\, \phi(g_{S^m}) \log \frac{p(y^i \mid g_{T_i})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{T_i}\, dg_{S^m} \\
&= \int \phi(g_{S^m}) \left( \int p(g_{T_i} \mid g_{S^m}) \log p(y^i \mid g_{T_i})\, dg_{T_i} + \log \frac{p(g_{S^m})}{\phi(g_{S^m})} \right) dg_{S^m} \\
&= \int \phi(g_{S^m}) \left( \log p_N(y^i \mid \alpha_i, \sigma^2 I_{|T_i|}) - \frac{1}{2\sigma^2} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) + \log \frac{p(g_{S^m})}{\phi(g_{S^m})} \right) dg_{S^m} \\
&= \int \phi(g_{S^m}) \log \frac{p_N(y^i \mid \alpha_i, \sigma^2 I_{|T_i|})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{S^m} - \frac{1}{2\sigma^2} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})
\end{aligned}$$

Using this equation for the individual terms, we upper-bound the function $\mathcal{L}(C, G, S^m, \phi)$ (see Equation (3)) as follows:

$$\begin{aligned}
\mathcal{L}(C, G, S^m, \phi)
&= \sum_{i=1}^{B} \frac{1}{B} \int \phi(g_{S^m}) \log \frac{p_N(y^i \mid \alpha_i, \sigma^2 I)\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) \\
&= \int \phi(g_{S^m}) \log \frac{\left(\prod_{i=1}^{B} p_N(y^i \mid \alpha_i, \sigma^2 I)\right)^{1/B} p(g_{S^m})}{\phi(g_{S^m})}\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) \\
&\le \log \int \left(\prod_{i=1}^{B} p_N(y^i \mid \alpha_i, \sigma^2 I)\right)^{1/B} p(g_{S^m})\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) \\
&= \log \int p_N(Y \mid A, B\sigma^2 I)\, p(g_{S^m})\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) \\
&= \log p_N(Y \mid 0, B\sigma^2 I + Q_{C,G}) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})
\end{aligned}$$

The only inequality in this bound is due to Jensen's inequality. The upper
bound no longer depends on the variational distribution $\phi$ and depends only
on the timestamps in $S^m$. As a result, by the definition of $\mathcal{L}_{\max}$, we obtain
Equation (6). Moreover, equality holds when:

$$\phi^*(g_{S^m}) \propto \prod_{i=1}^{B} p_N(y^i \mid \alpha_i, \sigma^2 I)^{1/B}\, p(g_{S^m})$$

$$\propto \exp\!\left( \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} (y^i)^\top K_{T_i S^m} (K_{S^m S^m})^{-1} g_{S^m} - \frac{1}{2} (g_{S^m})^\top \left[ \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} (K_{S^m S^m})^{-1} K_{S^m T_i} K_{T_i S^m} (K_{S^m S^m})^{-1} + (K_{S^m S^m})^{-1} \right] g_{S^m} \right)$$

Hence, $\phi^*$ is a Gaussian distribution with the mean and covariance stated in Equation (7):

$$\phi^*(g_{S^m}) = \mathcal{N}\!\left( \sigma^{-2} K_{S^m S^m} \Sigma \left( \frac{1}{B} \sum_{i=1}^{B} K_{S^m T_i} y^i \right),\; K_{S^m S^m} \Sigma K_{S^m S^m} \right)$$
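As a numerical sanity check, the closed form of Equation (6) can be evaluated directly. The sketch below assumes an RBF kernel (the paper's kernel choice may differ), adds a small jitter for invertibility, and uses illustrative function names:

```python
import numpy as np

def rbf(a, b, ls=0.2):
    """Illustrative RBF kernel; not necessarily the paper's choice."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def l_max(y_list, t_list, s, sigma):
    """Evaluate Equation (6): log p_N(Y | 0, B*sigma^2*I + Q_{C,G})
    minus (1/(2*sigma^2*B)) * sum_i Tr(K_{TiTi} - Q_{TiTi})."""
    B = len(y_list)
    k_ss = rbf(s, s) + 1e-8 * np.eye(len(s))      # jitter for stability
    k_ss_inv = np.linalg.inv(k_ss)
    y = np.concatenate(y_list)
    n = len(y)
    q = np.zeros((n, n))                          # block-diagonal Q_{C,G}
    trace_pen, off = 0.0, 0
    for yi, ti in zip(y_list, t_list):
        k_ts, k_tt = rbf(ti, s), rbf(ti, ti)
        qi = k_ts @ k_ss_inv @ k_ts.T             # Q_{TiTi}
        m = len(ti)
        q[off:off + m, off:off + m] = qi
        trace_pen += np.trace(k_tt - qi)
        off += m
    cov = B * sigma**2 * np.eye(n) + q
    sign, logdet = np.linalg.slogdet(cov)
    logpdf = -0.5 * (y @ np.linalg.solve(cov, y) + logdet + n * np.log(2 * np.pi))
    return logpdf - trace_pen / (2 * sigma**2 * B)

# toy usage: two short noisy series and m = 3 candidate informative timestamps
t1, t2 = np.linspace(0, 1, 8), np.linspace(0, 1, 10)
y1, y2 = np.sin(2 * np.pi * t1), np.sin(2 * np.pi * t2) + 0.05
s = np.array([0.1, 0.5, 0.9])
print(l_max([y1, y2], [t1, t2], s, sigma=0.1))
```

In practice one would maximize this quantity over the timestamp set $S^m$ (and kernel hyperparameters) rather than evaluate it once; the block-diagonal structure of $Q_{C,G}$ mirrors Equation (5).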




B Further explanations on the generalized evidence lower bound function in
Definition 3

For each $i \in \overline{1, B}$, for a single time series $y^i$, we have the following lower-bound estimate for the log-likelihood $\log p(y^i)$:

$$\log p(y^i) = \log \int p(y^i \mid g_{T_i})\, p(g_{T_i} \mid g_{S^m})\, p(g_{S^m})\, dg_{T_i}\, dg_{S^m}$$
$$\ge \int p(g_{T_i} \mid g_{S^m})\, \phi(g_{S^m}) \log \frac{p(y^i \mid g_{T_i})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{T_i}\, dg_{S^m}$$

Adding up these inequalities for all time series $\{y^i\}_{i=1}^{B}$ in the collection $C$ and then taking the mean, we obtain:

$$\frac{1}{B} \log p(C) = \frac{1}{B} \log \prod_{i=1}^{B} p(y^i) = \frac{1}{B} \sum_{i=1}^{B} \log p(y^i)$$
$$\ge \frac{1}{B} \sum_{i=1}^{B} \int p(g_{T_i} \mid g_{S^m})\, \phi(g_{S^m}) \log \frac{p(y^i \mid g_{T_i})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{T_i}\, dg_{S^m}$$

This lower bound on the log-likelihood of the collection $C$ coincides with Equation (3).

C Other figures and tables

Dataset      Train  Test  Length  Description
Sound         18     18    100    Amplitude values of the audio datasets for the
                                  pronunciations of 2 words with different accents.
SineCurves    30     20    500    Time series generated from 3 different functions
                                  with sine and cosine components and Gaussian
                                  noise of variance 0.1.

Table 4: Descriptions of the 2 data sets created by the authors.
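Data in the spirit of SineCurves can be generated as follows. The three sine/cosine-based signals below are hypothetical stand-ins, since the exact generating functions are not specified in the table; only the class count, length, and noise variance match the description:

```python
import numpy as np

def make_sinecurves_like(n_per_class=10, length=500, noise_var=0.1, seed=0):
    """Generate data in the spirit of SineCurves (Table 4): three illustrative
    sine/cosine-based signals plus Gaussian noise of variance 0.1."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, length)
    signals = [
        np.sin(2 * np.pi * t),
        np.cos(4 * np.pi * t),
        np.sin(2 * np.pi * t) + 0.5 * np.cos(6 * np.pi * t),
    ]
    X, labels = [], []
    for c, g in enumerate(signals):
        for _ in range(n_per_class):
            X.append(g + rng.normal(scale=np.sqrt(noise_var), size=length))
            labels.append(c)
    return np.stack(X), np.array(labels)

X, y = make_sinecurves_like()
print(X.shape, np.bincount(y))  # (30, 500) [10 10 10]
```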



Fig. 4: Forecasting with uncertainty for time series collections in Chinatown
and ItalyPowerDemand: (a) weekend traffic in Chinatown; (b) weekday traffic
in Chinatown; (c) October-to-March period in ItalyPowerDemand; (d) April-to-
September period in ItalyPowerDemand. Mean values at the most informative
timestamps are included. Here, we train Motion Code on the time horizon
[0, 0.8] and predict on [0.8, 1].

Fig. 5: Forecasting with uncertainty on the PowerCons data set from 5
forecasting algorithms: ARIMA, state-space model, Last seen, TBATS, and
Motion Code: (a) warm season; (b) cold season. We train the algorithms on the
time horizon [0, 0.8] and test on [0.8, 1]. PowerCons includes two collections
of time series: power consumption over the warm and cold seasons. Exponential
Smoothing is not included since it does not output uncertainty.
