Motion Code for Time Series Analysis
1 Introduction
Noisy time series analysis is a fundamental problem that attracts substantial
effort from the research community. However, unlike images, videos, text, or
tabular data, finding a suitable mathematical concept to represent and study
time series is a complicated task. For instance, consider data consisting of
2 different groups of time series, which have unequal lengths between 80 and 95
data points and receive 2 distinct colors (see Figure 1). Each group corresponds
to the audio data of a particular word's pronunciation, spoken by different
people with distinct accents. To capture the underlying shape of these time
2 Minh, Bajaj
Fig. 1: (a): 2 collections of time series based on pronunciation audio data. Each
collection shows the pronunciation of the word Absorptivity or Anything; (b)
and (c): the most informative timestamps for the audio data of the word
Absorptivity (red) and the word Anything (blue).
series, traditional methods view them as ordered vectors and leverage
distance-based [11], feature-based [18], or shapelet-based [2] techniques. However,
such methods are inadequate, particularly in this example where both groups of
time series are highly mixed and the shapes of many red series resemble the
blue ones (see Figure 1a). As alternative solutions, several frameworks propose
treating time series as sequential data and then utilizing recurrent neural
networks like LSTM-FCN [12] or advanced convolutional networks on time series
like ROCKET [5]. Unfortunately, with limited data size (about 90 data points
in this case), it is particularly challenging for these techniques to learn
higher-order correlations between timestamps and capture the exact underlying
signals of the time series. Other options, such as dictionary-based [26] or
interval-based [6] methods that rely on empirical statistics gathered along the
entire data set, are generally unreliable: the collected statistics can deviate
significantly from the true values due to noise, which limits their ability to
separate noise from the underlying signals. Such separation ability is crucial for
time series data partially corrupted by broken recording devices, distorted
transmissions, or privacy policies. These challenges motivate us to model time
series collections and their underlying signals via abstract stochastic processes.
The stochastic viewpoint alone, however, is not sufficient for data composed of
multiple underlying dynamical models. Many research works [24,7,3,23] use
different types of stochastic processes to model time series. Nonetheless, their
approaches are limited to a single dynamical model and often focus solely on a
single series. Our framework, in contrast, is capable of handling multiple time
series by introducing a variational definition of the most informative
timestamps (see Definition 4). Building upon the stochastic modeling viewpoint, to
handle multiple dynamics at once, we assign a signature vector called a motion
code to each dynamical model and take a joint learning approach to optimize all
motion codes together. Augmented by sparse learning techniques, our approach
prevents overfitting to a particular stochastic process and provides enough
discrimination to yield distinctive features for individual dynamics.
The final model, called Motion Code, is therefore able to learn across many
underlying processes in a holistic and robust manner. Additionally, thanks to
stochastic process modeling, Motion Code provides interpretable dynamics
features and handles missing data and varying-length time series. To summarize,
our contributions include:
1. A learning model called Motion Code that jointly learns across different
collections of noisy time series data. Motion Code explicitly models the
underlying stochastic process and is capable of capturing and separating
noise from the core signals.
2. Motion Code solves time series classification and forecasting simultaneously,
without the need for separate models for each task.
3. An interpretable feature for a collection of noisy time series, called the most
informative timestamps, that utilizes variational inference to learn the
underlying stochastic process from the given data collection.
Time series classification. For time series classification, there is a variety of
classical techniques, including distance-based [11], interval-based [6],
dictionary-based [26], shapelet-based [2], and feature-based [18] methods, as
well as ensemble models [14,22] that combine individual classifiers. In terms of
deep learning [10], there are convolutional neural networks [12,5], modified
versions of residual networks [19], and auto-encoders [9].
Input: Training time series data consist of samples that belong to exactly $L$
($L \in \mathbb{N}$, $L \ge 2$) underlying stochastic processes $\{G_k\}_{k=1}^{L}$.
More specifically, for each $k \in \overline{1, L}$, let $C_k = \{y^{i,k}\}_{i=1}^{B_k}$
be the sample set consisting of $B_k$ time series $y^{i,k}$, all of which are
samples from the $k$-th stochastic process $G_k$. Here each time series
$y^{i,k} = (y_t^{i,k})_{t \in T_{i,k}}$ has the timestamp set $T_{i,k} \subset \mathbb{R}_{+}$.
Each real variable $y_t^{i,k} \in \mathbb{R}$ at time $t \in T_{i,k}$ is called a
data point of $y^{i,k}$. We also associate the time series $y^{i,k}$ with its data
point vector $(y_t^{i,k})_{t \in T_{i,k}} \in \mathbb{R}^{|T_{i,k}|}$. The training
data include $L$ collections of time series $\{C_k\}_{k=1}^{L}$ corresponding to
the processes $\{G_k\}_{k=1}^{L}$.
Tasks and required outputs: The main task is to produce a model $M$ that
jointly learns the processes $\{G_k\}_{k=1}^{L}$ from the given data
$\{C_k\}_{k=1}^{L}$. Moreover, the parameters of $M$ must be learning-transferable
to the following tasks:
1. Classification: At test time, given a new time series $y = \{y_t\}_{t \in T}$ with
timestamps $T$, classify $y$ into the closest group among the $L$ possible groups.
2. Forecasting: Suppose a time series $y$ has the underlying stochastic process
$G_k$ for a particular $k \in \overline{1, L}$. Given future timestamps $T$, predict $\{y_t\}_{t \in T}$.
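The input format above can be sketched concretely. The snippet below is an illustrative assumption, not the paper's code: `make_collection` and the sine/cosine stand-in signals are ours, used only to show $L = 2$ collections of unequal-length, irregularly-timestamped series.

```python
import numpy as np

def make_collection(signal, n_series, length_range, noise_std, rng):
    """Sample n_series noisy, unequal-length series from one underlying signal."""
    series = []
    for _ in range(n_series):
        n = int(rng.integers(length_range[0], length_range[1] + 1))
        t = np.sort(rng.uniform(0.0, 1.0, n))     # timestamp set T_{i,k} in R+
        y = signal(t) + noise_std * rng.standard_normal(n)
        series.append((t, y))                     # (timestamps, data points)
    return series

rng = np.random.default_rng(0)
signals = [np.sin, np.cos]                        # two underlying dynamics (L = 2)
collections = [make_collection(g, 5, (80, 95), 0.3, rng) for g in signals]
lengths = [len(y) for coll in collections for _, y in coll]
print(len(collections), min(lengths) >= 80, max(lengths) <= 95)
```

Each series carries its own timestamp set, which is what later allows uneven lengths and missing data to be handled without padding.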
modeling of this specific stochastic process for Motion Code is expressed through
the three assumptions below, which are later used to derive a practical algorithm
(Algorithm 1) for noisy datasets.
Assumption 2: For $k \in \overline{1, L}$, assume that the data $(y^{i,k})_{T_{i,k}}$
has the Gaussian distribution with mean
$(g_k)_{T_{i,k}} = ((g_k)_t)_{t \in T_{i,k}} \in \mathbb{R}^{|T_{i,k}|}$ and covariance
matrix $\sigma^2 I_{|T_{i,k}|}$. Here, $I_n$ is the $n \times n$ identity matrix, and
$g_k$ is the underlying signal of the stochastic process $G_k$. Moreover, the
constant $\sigma^2 \in \mathbb{R}_{+}$ is the noise variance of the sample data
around the underlying signals.
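Assumption 2 can be checked numerically on a toy example. This is a sketch with a stand-in signal (the sine function and all constants are our illustrative assumptions): data on timestamps $T$ is drawn from $\mathcal{N}(g_T, \sigma^2 I)$, so the empirical mean should recover $g_T$ and the per-point variance should approach $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda t: np.sin(2 * np.pi * t)   # stand-in for the underlying signal g_k
T = np.linspace(0.0, 1.0, 90)         # timestamp set
sigma = 0.3                           # noise standard deviation (variance 0.09)

# Draw many independent noisy series around the signal, per Assumption 2.
samples = g(T) + sigma * rng.standard_normal((5000, T.size))
emp_mean_err = np.abs(samples.mean(axis=0) - g(T)).max()
emp_var = samples.var(axis=0).mean()  # pooled empirical variance
print(emp_mean_err < 0.05, abs(emp_var - sigma ** 2) < 0.01)
```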
reconstructed using only this subset. The visualization of the most informative
timestamps is provided in Figure 2 and further discussed in Section 3.
To concretely define the most informative timestamps, we first introduce the
generalized evidence lower bound function (GELB) in Definition 3. We then
define the most informative timestamps as the maximizers of this GELB
function in Definition 4. Finally, the assumptions from Section 2.2 allow
simplifying the calculation of the most informative timestamps with a
concrete formula shown in Lemma 1.
$$\mathcal{L}(C, G, S^m, \phi) := \frac{1}{B} \sum_{i=1}^{B} \int p(g_{T_i} \mid g_{S^m})\, \phi(g_{S^m}) \log \frac{p(y^i \mid g_{T_i})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{T_i}\, dg_{S^m} \tag{3}$$

Again, the vectors $g_{T_i}$ and $g_{S^m}$ are the signal vectors
$(g(t))_{t \in T_i} \in \mathbb{R}^{|T_i|}$ and $(g(t))_{t \in S^m} \in \mathbb{R}^{|S^m|}$
on the timestamps $T_i$ of $y^i$ and on $S^m$, respectively.
Also define the function $\mathcal{L}^{\max}$ such that
$\mathcal{L}^{\max}(C, G, S^m) := \max_{\phi} \mathcal{L}(C, G, S^m, \phi)$.
Hence, $(S^m)^*$ can be found by maximizing $\mathcal{L}^{\max}$ over all possible $S^m$.
Our Motion Code's training loss function depends heavily on $\mathcal{L}^{\max}$.
Lemma 1 below provides a closed-form formula for $\mathcal{L}^{\max}$ that helps
derive Motion Code's Algorithm 1.
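The quality of a candidate set $S^m$ is governed by how well the inducing timestamps explain the full kernel. A small numerical sketch of the residual trace $\mathrm{Tr}(K_{TT} - Q_{TT})$ that appears in Lemma 1 below (the RBF kernel, lengthscale, and uniform timestamp grids are our illustrative assumptions):

```python
import numpy as np

def rbf(a, b, ls=0.1):
    """Illustrative RBF kernel matrix between two timestamp vectors."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

T = np.linspace(0.0, 1.0, 80)   # timestamps of one series

def residual_trace(S):
    """Tr(K_TT - Q_TT) with Q_TT = K_TS (K_SS)^{-1} K_ST: unexplained signal."""
    K_TT, K_TS, K_SS = rbf(T, T), rbf(T, S), rbf(S, S)
    Q_TT = K_TS @ np.linalg.solve(K_SS + 1e-8 * np.eye(len(S)), K_TS.T)
    return np.trace(K_TT - Q_TT)

few = residual_trace(np.linspace(0.0, 1.0, 3))    # 3 inducing timestamps
many = residual_trace(np.linspace(0.0, 1.0, 12))  # 12 inducing timestamps
print(few > many > -1e-6)   # more (well-placed) timestamps explain more
```

The residual trace is nonnegative and shrinks as the inducing timestamps cover the series better, which is exactly what maximizing the GELB over $S^m$ exploits.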
Lemma 1. Suppose we are given a collection $C$ of $B$ noisy time series
$\{y^i\}_{i=1}^{B}$ sampled from a nice Gaussian process $G = \{g(t)\}_{t \ge 0}$
with the underlying signal $g$. Further assume that the data $(y^i)_{T_i}$ with
timestamps $T_i$ has the Gaussian distribution $\mathcal{N}(g_{T_i}, \sigma^2 I_{|T_i|})$
for a given $\sigma > 0$. These assumptions directly follow from Section 2.2 for a
single dynamical model with noise variance constant $\sigma^2$. Further assume
that we are given a fixed $m$-element timestamp set $S^m$. For $i \in \overline{1, B}$,
recall from the notations that $K_{T_i T_i}$, $K_{S^m T_i}$, and $K_{T_i S^m}$ are
kernel matrices between the timestamps of $y^i$ and $S^m$. Also define the
$|T_i| \times |T_i|$ matrix $Q_{T_i T_i} := K_{T_i S^m} (K_{S^m S^m})^{-1} K_{S^m T_i}$
for $i \in \overline{1, B}$. Hence, across all time series, we have the following data
vector $Y$ and joint matrix $Q_{C,G}$:

$$Y = \begin{pmatrix} y^1 \\ \vdots \\ y^B \end{pmatrix}, \quad Q_{C,G} = \begin{pmatrix} Q_{T_1 T_1} & & \\ & \ddots & \\ & & Q_{T_B T_B} \end{pmatrix} \tag{5}$$

Then the function $\mathcal{L}^{\max}$ defined in Definition 4 has the following closed form:

$$\mathcal{L}^{\max}(C, G, S^m) = \log p_{\mathcal{N}}(Y \mid 0,\, B\sigma^2 I + Q_{C,G}) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) \tag{6}$$

where $\Sigma = \Lambda^{-1}$ with $\Lambda := K_{S^m S^m} + \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} K_{S^m T_i} K_{T_i S^m}$.

Proof. The proof is given in the Supplementary Materials.
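Under the lemma's assumptions, the closed form (6) can be transcribed numerically. The sketch below uses an illustrative RBF kernel and synthetic series; the helper names (`L_max`, `log_gauss`) and all constants are ours, not the paper's implementation.

```python
import numpy as np

def rbf(a, b, ls=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def log_gauss(y, cov):
    """Log density of a zero-mean multivariate Gaussian N(0, cov) at y."""
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (y.size * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

def L_max(series, S, sigma):
    """Closed form (6). series: list of (timestamps T_i, values y_i)."""
    B = len(series)
    K_SS = rbf(S, S) + 1e-6 * np.eye(len(S))      # jitter for stability
    blocks, trace = [], 0.0
    for T_i, _ in series:
        K_TS = rbf(T_i, S)
        Q = K_TS @ np.linalg.solve(K_SS, K_TS.T)  # Q_{T_i T_i}
        blocks.append(Q)
        trace += np.trace(rbf(T_i, T_i) - Q)
    Y = np.concatenate([y for _, y in series])
    Q_CG = np.zeros((Y.size, Y.size))             # block-diagonal Q_{C,G}
    o = 0
    for Q in blocks:
        m = Q.shape[0]
        Q_CG[o:o + m, o:o + m] = Q
        o += m
    cov = B * sigma ** 2 * np.eye(Y.size) + Q_CG
    return log_gauss(Y, cov) - trace / (2.0 * sigma ** 2 * B)

rng = np.random.default_rng(2)
series = []
for _ in range(3):
    T_i = np.sort(rng.uniform(0.0, 1.0, int(rng.integers(20, 25))))
    series.append((T_i, np.sin(2 * np.pi * T_i) + 0.3 * rng.standard_normal(T_i.size)))

S_good = np.linspace(0.0, 1.0, 8)        # spread-out candidate timestamps
S_bad = np.array([0.0, 0.05, 0.1])       # clustered, uninformative candidates
print(L_max(series, S_good, 0.3) > L_max(series, S_bad, 0.3))
```

Spread-out inducing timestamps explain more of the kernel (smaller trace penalty, better fit), so they score higher, which matches the intuition behind the most informative timestamps.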
With the core concept of the most informative timestamps outlined in
Section 2.3, we can now describe the Motion Code learning framework in detail:
$$\widehat{S^{m,k}} := \sigma(G(z_k)) \in \mathbb{R}^m, \text{ where } \sigma \text{ is the standard sigmoid function.} \tag{8}$$

Training loss function: The goal is to make $\widehat{S^{m,k}}$ approximate the
true $S^{m,k}$, which is the maximizer of $\mathcal{L}^{\max}$ with an explicit
formula given in Equation (6). As a result, we want to maximize
$\mathcal{L}^{\max}(C_k, G_k, \widehat{S^{m,k}})$ for all $k$, leading to the
training loss. Here the last term is the regularization term for the motion codes
$z_k$ with hyperparameter $\lambda$. An explicit training algorithm is given below
(see Algorithm 1).
Input: $L$ collections of time series data $C_k = \{y^{i,k}\}_{i=1}^{B_k}$, where
the series $y^{i,k}$ has
Kernel choice: For implementation, we use a spectral kernel of the form
$K^{\eta}(t, s) := \sum_{j=1}^{J} \alpha_j \exp(-0.5\, \beta_j |t - s|^2)$ for
parameters $\eta = (\alpha_1, \cdots, \alpha_J, \beta_1, \cdots, \beta_J)$.
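A direct transcription of this kernel as a sketch. Note one assumption on our part: the extracted text leaves it ambiguous whether the mixture weight is $\alpha_j$ or $\alpha_j^2$; we use $\alpha_j$ with nonnegative values, which keeps the resulting kernel matrix positive semidefinite.

```python
import numpy as np

def spectral_kernel(t, s, alphas, betas):
    """K_eta(t, s) = sum_j alpha_j * exp(-0.5 * beta_j * |t - s|^2)."""
    d2 = (np.asarray(t)[:, None] - np.asarray(s)[None, :]) ** 2
    return sum(a * np.exp(-0.5 * b * d2) for a, b in zip(alphas, betas))

t = np.linspace(0.0, 1.0, 5)
# Illustrative parameters: one slow and one fast frequency component.
K = spectral_kernel(t, t, alphas=[1.0, 0.5], betas=[10.0, 200.0])
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-10)  # symmetric, PSD
```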
3 Experiments
Table 1: Classification accuracy (in percentage) for 7 time series classification
algorithms: DTW, TSF, RISE, BOSS, BOSS-E, catch22, and our Motion Code.

Data sets  DTW  TSF  RISE  BOSS  BOSS-E  catch22  Motion Code
Chinatown 54.23 61.22 65.6 47.81 41.69 55.39 66.47
ECGFiveDays 54.47 58.07 59.35 50.06 58.42 52.85 66.55
FreezerSmallTrain 52.42 54.28 53.79 50 50.95 53.58 70.25
GunPointOldVersusYoung 92.7 99.05 98.73 87.62 93.33 98.41 91.11
HouseTwenty 57.98 57.14 42.02 52.1 57.98 45.38 70.59
InsectEPGRegularTrain 100 100 83.13 99.2 91.97 95.98 100
ItalyPowerDemand 57.05 68.71 65.79 52.77 53.26 55.88 72.5
Lightning7 21.92 28.77 26.03 12.33 28.77 24.66 31.51
MoteStrain 56.47 61.1 61.5 53.83 53.51 57.19 72.68
PowerCons 78.33 92.22 85.56 65.56 77.22 80 92.78
SonyAIBORobotSurface2 63.27 67.68 69.78 48.06 61.91 64.43 75.97
Sound 50 87.5 62.5 68.75 62.5 50 87.5
SineCurves 100 100 100 62.71 100 100 100
UWaveGestureLibraryAll 78.25 83.67 79.79 12.23 74.87 47.38 80.18
We acquire 12 sensor and device time series data sets from the publicly available
UCR data sets [1]. We prepare 2 additional data sets named Sound [21] and
SineCurves. Each data set consists of multiple collections of time series, and
each collection has a unique label. For each of the 14 data sets, we add Gaussian
noise with standard deviation σ = 0.3 · A, where A is the maximum possible
absolute value of all data points. The inherent noise in the original data sets is
not adequate for robust testing, since even simpler algorithms like DTW perform
very well (near or over 90% accuracy) on many data sets. The added noise
thus serves as an adversarial factor to stress-test the robustness of time series
algorithms. We then run the Motion Code algorithm on these 14 data sets
to extract optimal parameters for downstream tasks, including classification and
forecasting. All experiments are executed on an Nvidia A100 GPU.
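The noise-injection protocol above can be sketched as follows (synthetic series stand in for the UCR data; the construction of `dataset` is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in data set: 10 series of 80-95 points each.
dataset = [rng.uniform(-2.0, 2.0, int(rng.integers(80, 96))) for _ in range(10)]

# A = maximum absolute value over all data points; noise std = 0.3 * A.
A = max(np.abs(s).max() for s in dataset)
noisy = [s + rng.normal(0.0, 0.3 * A, s.size) for s in dataset]

added = np.concatenate([n - s for n, s in zip(noisy, dataset)])
print(abs(added.std() - 0.3 * A) < 0.1 * A)  # pooled noise std matches 0.3 * A
```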
Table 2: Classification accuracy (in percentage) for 7 time series classification
algorithms: Shapelet, Teaser, SVC, LSTM-FCN, Rocket, Hive-Cote 2, and
our Motion Code. Error means the run on the given data set failed.

Data sets  Shapelet  Teaser  SVC  LSTM-FCN  Rocket  Hive-Cote 2  Motion Code
Chinatown 61.22 Error 56.27 66.47 62.97 61.52 66.47
ECGFiveDays 52.61 Error 49.71 53.54 56.79 55.75 66.55
FreezerSmallTrain 50 50.11 50 50 52.67 58.18 70.25
GunPointOldVersusYoung 74.6 89.52 52.38 52.38 90.48 98.73 91.11
HouseTwenty 57.14 53.78 45.38 57.98 58.82 59.66 70.59
InsectEPGRegularTrain 44.58 100 85.94 100 46.18 100 100
ItalyPowerDemand 62.49 63.17 49.85 61.61 70.36 72.98 72.5
Lightning7 20.55 21.92 26.03 17.81 27.4 32.88 31.51
MoteStrain 47.76 Error 50.64 56.55 68.85 56.95 72.68
PowerCons 74.44 51.11 77.78 68.33 87.22 90 92.78
SonyAIBORobotSurface2 69.36 68.84 61.7 63.27 74.71 78.49 75.97
Sound 68.75 Error 62.5 56.25 75 75 87.5
SineCurves 100 96.61 67.8 100 100 100 100
UWaveGestureLibraryAll 49.5 26.52 Error 12.67 83.45 78.5 80.18
For forecasting, each of the 14 data sets is divided into two parts: 80% of the data
points form the training set, and the remaining 20% future data points are set
aside for testing. For Motion Code, we output a single prediction for all series in
the same collection. We choose 5 algorithms as our baselines: Exponential
smoothing [8], ARIMA [20], State space model [7], TBATS [4], and Last
seen. Last seen is a simple algorithm that uses the previous values to predict the
next time steps. Their implementations are included in the statsmodels library
[28] and the sktime library [17]. We run the 5 baseline algorithms with individual
predictions
Table 3: Average root mean-square error (RMSE) for 6 time series forecasting
algorithms: Exponential smoothing, ARIMA, State-space model, Last seen,
TBATS, and Motion Code on 14 data sets. Values shown are the RMSE between
prediction and ground truth, averaged over the data points.

Data sets  Exp. Smoothing  ARIMA  State space  Last seen  TBATS  Motion Code
Chinatown Error 1079 775.96 723.1 633.04 518.49
ECGFiveDays 0.34 0.43 1.58 0.19 0.17 0.27
FreezerSmallTrain 0.88 0.58 0.93 0.57 0.56 0.74
GunPointOldVersusYoung 60.38 128.44 59.83 41.51 20.94 417.94
HouseTwenty 1117 3386 730.51 497.3 560.88 648.27
InsectEPGRegularTrain 0.043 0.095 0.25 0.019 0.02 0.048
ItalyPowerDemand Error 2.02 2.37 1.24 0.96 0.67
Lightning7 1.7 2.85 1.7 1.35 1.35 1.08
MoteStrain 1.11 1.52 1.09 1.01 0.88 0.82
PowerCons 3.38 4.85 4.41 1.77 1.72 1.15
SonyAIBORobotSurface2 2.79 2.01 3.21 1.39 1.52 2.26
Sound 0.087 0.27 0.086 0.1 0.059 0.085
SineCurves 3.5 4.37 3.53 1.53 1.63 1.07
UWaveGestureLibraryAll 4.37 5 4.45 0.98 1.44 0.98
and provide the comparison results in Table 3. Even without making individual
predictions, Motion Code still surpasses the other methods on 7 of the 14 data sets.
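The Last seen baseline and the 80/20 split described above can be sketched as follows (a synthetic series is used; this is an illustration of the protocol, not the paper's evaluation code):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(t.size)

cut = int(0.8 * len(y))                  # 80% train, 20% held-out future
train, test = y[:cut], y[cut:]
pred = np.full_like(test, train[-1])     # "Last seen": repeat the last value
rmse = np.sqrt(np.mean((pred - test) ** 2))
print(pred.shape == test.shape, rmse > 0.0)
```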
Interpretable feature: Despite several noisy time series deviating from the
common mean, the points at the most informative timestamps $S^{m,k}$ form
a skeleton approximation of the underlying stochastic process. All the important
twists and turns are consistently captured by the corresponding points at these
timestamps (see Figure 2). Those points create a feature that helps visualize the
underlying dynamics with explicit global behaviors, such as increasing,
decreasing, or staying still, unlike the original complex time series collections
with no visible common patterns among series.
Uneven length and missing data handling: For each time series, Motion
Code processes individual data points and their timestamps one by one. Hence,
the time series fed into the algorithm can have different timestamp sets. We
include an experiment to illustrate that even with missing data, Motion Code
still learns the underlying dynamics sufficiently well. In the Sound data, we
randomly remove about 10% of the data points from each series and retrain the
model on the modified data. The time series in the new data set have both
different lengths and out-of-sync timestamp values. The resulting model still
gives the same accuracy and an accurate skeleton approximation (see Figure 1),
showing Motion Code's effectiveness for uneven-length and missing data.
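The missing-data perturbation can be sketched as follows (synthetic series; the drop rate is made to vary slightly per series, which is our illustrative choice to produce both unequal lengths and out-of-sync timestamps):

```python
import numpy as np

rng = np.random.default_rng(5)
T = np.linspace(0.0, 1.0, 90)
series = [np.sin(2 * np.pi * T) + 0.1 * rng.standard_normal(T.size)
          for _ in range(4)]

subsampled = []
for y in series:
    keep_n = int(rng.integers(78, 84))             # ~10% of 90 points removed
    keep = np.sort(rng.choice(T.size, size=keep_n, replace=False))
    subsampled.append((T[keep], y[keep]))          # per-series timestamp set

lengths = [len(t) for t, _ in subsampled]
print(all(78 <= n <= 83 for n in lengths))
```

Because each series keeps its own timestamp array, no padding or alignment step is needed downstream.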
Fig. 3: Two-dimensional features extracted from individual time series in the data
set SineCurves via separate sparse Gaussian process models.
5 Conclusion
In this work, we employ variational inference and sparse stochastic process
modeling to develop an integrated framework called Motion Code. The method
can perform time series forecasting simultaneously with classification across
different collections of time series data, while most other current methods only
focus on one task at a time. Our Motion Code model is particularly robust to
noise and achieves competitive performance against other popular time series
classification and forecasting algorithms. Moreover, as demonstrated in Section 4,
Motion Code provides an interpretable feature that effectively captures the
core information of the underlying stochastic process. Finally, our method can
handle varying-length time series and missing data, while many other methods
fail to do so. In future work, we plan to generalize Motion Code with
non-Gaussian priors adapted to time series from various application domains.
References
1. Bagnall, A., Lines, J., Bostrom, A., Large, J., Keogh, E.: The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov. 31(3), 606–660 (2017)
2. Bostrom, A., Bagnall, A.: Binary shapelet transform for multiclass time series classification. In: Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXII, pp. 24–46. Springer Berlin Heidelberg, Berlin, Heidelberg (2017)
3. Cao, Y., Brubaker, M.A., Fleet, D.J., Hertzmann, A.: Efficient optimization for sparse gaussian process regression. In: Advances in Neural Information Processing Systems, vol. 26. Curran Associates, Inc. (2013)
4. De Livera, A.M., Hyndman, R.J., Snyder, R.D.: Forecasting time series with complex seasonal patterns using exponential smoothing. J. Am. Stat. Assoc. 106(496), 1513–1527 (2011)
5. Dempster, A., Petitjean, F., Webb, G.I.: ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels. Data Min. Knowl. Discov. 34(5), 1454–1495 (2020)
6. Deng, H., Runger, G., Tuv, E., Vladimir, M.: A time series forest for classification and feature extraction. Inf. Sci. 239, 142–153 (2013)
7. Durbin, J.: Time Series Analysis by State Space Methods: Second Edition. Oxford University Press (2012)
8. Holt, C.C.: Forecasting seasonals and trends by exponentially weighted moving averages. Int. J. Forecast. 20(1), 5–10 (2004)
9. Hu, Q., Zhang, R., Zhou, Y.: Transfer learning for short-term wind speed prediction with deep neural networks. Renew. Energy 85, 83–95 (2016)
10. Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A.: Deep learning for time series classification: a review. Data Min. Knowl. Discov. 33(4), 917–963 (2019)
11. Jeong, Y.S., Jeong, M.K., Omitaomu, O.A.: Weighted dynamic time warping for time series classification. Pattern Recognit. 44(9), 2231–2240 (2011)
12. Karim, F., Majumdar, S., Darabi, H., Harford, S.: Multivariate LSTM-FCNs for time series classification. Neural Netw. 116, 237–245 (2019)
13. Lim, B., Zohren, S.: Time-series forecasting with deep learning: a survey. Philos. Trans. A Math. Phys. Eng. Sci. 379(2194), 20200209 (2021)
14. Lines, J., Taylor, S., Bagnall, A.: HIVE-COTE: The hierarchical vote collective of transformation-based ensembles for time series classification. In: 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE (2016)
15. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1-3), 503–528 (1989)
16. Liu, Z., Zhu, Z., Gao, J., Xu, C.: Forecast methods for time series data: A survey. IEEE Access 9, 91896–91912 (2021)
17. Löning, M., Bagnall, A., Ganesh, S., Kazakov, V., Lines, J., Király, F.J.: sktime: A Unified Interface for Machine Learning with Time Series. arXiv e-prints arXiv:1909.07872 (Sep 2019)
18. Lubba, C.H., Sethi, S.S., Knaute, P., Schultz, S.R., Fulcher, B.D., Jones, N.S.: catch22: CAnonical Time-series CHaracteristics: Selected through highly comparative time-series analysis. Data Min. Knowl. Discov. 33(6), 1821–1852 (2019)
19. Ma, Q., Shen, L., Chen, W., Wang, J., Wei, J., Yu, Z.: Functional echo state network for time series classification. Inf. Sci. 373, 1–20 (2016)
20. Malki, Z., Atlam, E.S., Ewis, A., Dagnew, G., Alzighaibi, A.R., ELmarhomy, G., Elhosseini, M.A., Hassanien, A.E., Gad, I.: ARIMA models for predicting the end of COVID-19 pandemic and the risk of second rebound. Neural Comput. Appl. 33(7), 2929–2948 (2021)
21. Media, F.: Forvo: The pronunciation guide. [Link] (2022), accessed: 2022-04-01
22. Middlehurst, M., Large, J., Flynn, M., Lines, J., Bostrom, A., Bagnall, A.: HIVE-COTE 2.0: a new meta ensemble for time series classification. Mach. Learn. 110(11-12), 3211–3243 (2021)
23. Moss, H.B., Ober, S.W., Picheny, V.: Inducing point allocation for sparse gaussian processes in high-throughput bayesian optimisation. In: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 206, pp. 5213–5230. PMLR (2023)
24. Qi, Y., Abdel-Gawad, A.H., Minka, T.P.: Sparse-posterior gaussian processes for general likelihoods. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pp. 450–457 (2010)
25. Rasmussen, C.E., Williams, C.K.: Gaussian processes for machine learning. MIT Press (2006)
26. Schäfer, P.: The BOSS is concerned with time series classification in the presence of noise. Data Min. Knowl. Discov. 29(6), 1505–1530 (2015)
27. Schäfer, P., Leser, U.: TEASER: early and accurate time series classification. Data Min. Knowl. Discov. 34(5), 1336–1362 (2020)
28. Seabold, S., Perktold, J.: Statsmodels: Econometric and statistical modeling with Python. In: Proceedings of the 9th Python in Science Conference, pp. 92–96 (2010)
29. Titsias, M.: Variational learning of inducing variables in sparse gaussian processes. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 5, pp. 567–574. PMLR (2009)
A Proof of Lemma 1
Lemma 1. Suppose we are given a collection $C$ of $B$ noisy time series
$\{y^i\}_{i=1}^{B}$ sampled from a nice Gaussian process $G = \{g(t)\}_{t \ge 0}$
with the underlying signal $g$. Further assume that the data $(y^i)_{T_i}$ with
timestamps $T_i$ has the Gaussian distribution $\mathcal{N}(g_{T_i}, \sigma^2 I_{|T_i|})$
for a given $\sigma > 0$. These assumptions directly follow from Section 2.2 for a
single dynamical model with noise variance constant $\sigma^2$. Further assume
that we are given a fixed $m$-element timestamp set $S^m$. For $i \in \overline{1, B}$,
recall from the notations that $K_{T_i T_i}$, $K_{S^m T_i}$, and $K_{T_i S^m}$ are
kernel matrices between the timestamps of $y^i$ and $S^m$. Also define the
$|T_i| \times |T_i|$ matrix $Q_{T_i T_i} := K_{T_i S^m} (K_{S^m S^m})^{-1} K_{S^m T_i}$
for $i \in \overline{1, B}$. Hence, across all time series, we have the following data
vector $Y$ and joint matrix $Q_{C,G}$:

$$Y = \begin{pmatrix} y^1 \\ \vdots \\ y^B \end{pmatrix}, \quad Q_{C,G} = \begin{pmatrix} Q_{T_1 T_1} & & \\ & \ddots & \\ & & Q_{T_B T_B} \end{pmatrix} \tag{5}$$

Then the function $\mathcal{L}^{\max}$ defined in Definition 4 has the following closed form:

$$\mathcal{L}^{\max}(C, G, S^m) = \log p_{\mathcal{N}}(Y \mid 0,\, B\sigma^2 I + Q_{C,G}) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) \tag{6}$$

where $\Sigma = \Lambda^{-1}$ with $\Lambda := K_{S^m S^m} + \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} K_{S^m T_i} K_{T_i S^m}$.
Proof. Define the conditional mean signal vector $\alpha_i = \mathbb{E}[g_{T_i} \mid g_{S^m}]$.
From Equation (2), $\alpha_i = K_{T_i S^m}(K_{S^m S^m})^{-1} g_{S^m}$. Let
$A := (\alpha_1; \cdots; \alpha_B)$ be the combined mean signal vector. With these
notations, following the derivation from [29], the individual terms in Equation (3)
can be simplified as follows:

$$\int p(g_{T_i} \mid g_{S^m})\,\phi(g_{S^m}) \log \frac{p(y^i \mid g_{T_i})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{T_i}\, dg_{S^m}$$

$$= \int \phi(g_{S^m}) \left[ \int p(g_{T_i} \mid g_{S^m}) \log p(y^i \mid g_{T_i})\, dg_{T_i} + \log \frac{p(g_{S^m})}{\phi(g_{S^m})} \right] dg_{S^m}$$

$$= \int \phi(g_{S^m}) \left[ \log p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I_{|T_i|}) - \frac{1}{2\sigma^2} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i}) + \log \frac{p(g_{S^m})}{\phi(g_{S^m})} \right] dg_{S^m}$$

$$= \int \phi(g_{S^m}) \log \frac{p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I_{|T_i|})\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{S^m} - \frac{1}{2\sigma^2} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$
Using this equation for the individual terms, we upper-bound the function
$\mathcal{L}(C, G, S^m, \phi)$ from Equation (3) as follows:

$$\mathcal{L}(C, G, S^m, \phi) = \frac{1}{B} \sum_{i=1}^{B} \int \phi(g_{S^m}) \log \frac{p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I)\, p(g_{S^m})}{\phi(g_{S^m})}\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$

$$= \int \phi(g_{S^m}) \log \left( \left( \prod_{i=1}^{B} p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I) \right)^{1/B} \frac{p(g_{S^m})}{\phi(g_{S^m})} \right) dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$

$$\le \log \int \left( \prod_{i=1}^{B} p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I) \right)^{1/B} p(g_{S^m})\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$

$$= \log \int p_{\mathcal{N}}(Y \mid A, B\sigma^2 I)\, p(g_{S^m})\, dg_{S^m} - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$

$$= \log p_{\mathcal{N}}(Y \mid 0,\, B\sigma^2 I + Q_{C,G}) - \frac{1}{2\sigma^2 B} \sum_{i=1}^{B} \mathrm{Tr}(K_{T_i T_i} - Q_{T_i T_i})$$
The only inequality in this bound is due to Jensen's inequality. The upper
bound no longer depends on the variational distribution $\phi$ and depends only
on the timestamps in $S^m$. As a result, by the definition of $\mathcal{L}^{\max}$,
we obtain Equation (6). Moreover, equality holds when:

$$\phi^*(g_{S^m}) \propto \prod_{i=1}^{B} p_{\mathcal{N}}(y^i \mid \alpha_i, \sigma^2 I)^{1/B}\, p(g_{S^m})$$

$$\propto \exp\left( \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} (y^i)^T K_{T_i S^m} (K_{S^m S^m})^{-1} g_{S^m} - \frac{1}{2} (g_{S^m})^T \left( \frac{\sigma^{-2}}{B} \sum_{i=1}^{B} (K_{S^m S^m})^{-1} K_{S^m T_i} K_{T_i S^m} (K_{S^m S^m})^{-1} + (K_{S^m S^m})^{-1} \right) g_{S^m} \right)$$

⊓⊔
Fig. 5: Forecasting with uncertainty on the PowerCons data set for 5 forecasting
algorithms: ARIMA, State-space model, Last seen, TBATS, and Motion Code.
We train the algorithms on the time horizon [0, 0.8] and test on [0.8, 1].
PowerCons includes two collections of time series: power consumption over the
warm and cold seasons. Exponential Smoothing is not included as it does not
output uncertainty estimates.