Introduction to Deep Learning

Recurrent Neural Networks for Sequential Data (Time Series)

MATH 370: Machine Learning


Tanujit Chakraborty
@ Sorbonne
ctanujit@[Link]
Learning from Time-Series Data
▪ The input is a sequence of (non-i.i.d.) examples 𝑦1 , 𝑦2 , … , 𝑦𝑡 .
▪ The problem may be supervised or unsupervised, e.g.,
▪ Forecasting: Predict 𝑦𝑡+1 using 𝑦1 , 𝑦2 , … , 𝑦𝑡
▪ Clustering the examples, dimensionality reduction, or anomaly detection
▪ The evolution of time-series data can be attributed to several factors of variation (e.g., trend, seasonality, cycles, and noise)

▪ Teasing apart these factors of variation is also an important problem.


Auto-regressive Models
▪ Auto-regressive (AR): Regress each example on 𝑝 previous lagged values - AR(𝑝) model

▪ Moving Average (MA): Regress each example on 𝑞 previous stochastic errors - MA(𝑞) model

▪ Auto-regressive Integrated Moving Average (ARIMA): Regress each example on 𝑝 previous lagged values
and 𝑞 previous stochastic errors of the differenced series 𝑦𝑡′ (used if the data is nonstationary and
differencing is applied). We call this an ARIMA(𝑝, 𝑑, 𝑞) model.
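For reference, these standard model equations (with white-noise errors $\varepsilon_t$) can be written as:

$$
\begin{aligned}
\text{AR}(p):\quad & y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t \\
\text{MA}(q):\quad & y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} \\
\text{ARIMA}(p,d,q):\quad & y_t' = c + \phi_1 y_{t-1}' + \cdots + \phi_p y_{t-p}' + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t,
\end{aligned}
$$

where $y_t' = (1-B)^d y_t$ is the $d$-times differenced series.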
State-Space Models
▪ Assume that each observation 𝑦𝑡 in the time-series is generated by a low-dimensional latent factor 𝑥𝑡
(one-hot or continuous)

▪ Basically, a generative latent factor model: 𝑦𝑡 = 𝑔(𝑥𝑡 ) and 𝑥𝑡 = 𝑓(𝑥𝑡−1 ), where 𝑔 and 𝑓 are (possibly
stochastic) mappings, i.e., they define the conditional distributions 𝑝(𝑦𝑡 |𝑥𝑡 ) and 𝑝(𝑥𝑡 |𝑥𝑡−1 ).

▪ Some popular SSMs: Hidden Markov Models (one-hot latent factor 𝑥𝑡 ), Kalman Filters (real-valued latent
factor 𝑥𝑡 )

▪ Note: Models like RNN/LSTM are also similar, except that these are not generative (but can be made
generative)
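As a concrete example, the linear-Gaussian SSM underlying the Kalman filter takes

$$
x_t = A x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q), \qquad
y_t = C x_t + v_t, \quad v_t \sim \mathcal{N}(0, R),
$$

so that $f$ and $g$ correspond to the conditional distributions $\mathcal{N}(A x_{t-1}, Q)$ and $\mathcal{N}(C x_t, R)$.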
Long Memory and ARFIMA Process
▪ Definition (Long Memory). Let $\{X_t\}_{t \in \mathbb{Z}}$ be a weakly stationary univariate process with auto-covariance
function $\gamma_X(k)$ and spectral density function
$f_X(\lambda) = (2\pi)^{-1} \sum_{k=-\infty}^{\infty} \gamma_X(k)\exp(-ik\lambda)$ for $\lambda \in [-\pi, \pi]$.
Then, $X_t$ has long memory if $\sum_{k=-\infty}^{\infty} |\gamma_X(k)| = \infty$, and short memory otherwise.
Equivalently, $X_t$ has long memory if $f_X(\lambda) \to \infty$ as $|\lambda| \to 0$.
• The most popular long-memory model for level data $y_t$ is the ARFIMA(p, d, q) model introduced by
Granger and Joyeux (1980) and Hosking (1981). Specifically, an ARFIMA(p, d, q) process $y_t$ is defined by
$(1 - B)^d y_t = x_t$,

• where $B$ is the backshift operator, $x_t$ is an ARMA(p, q) process that captures short-range
dependence, and $d$ is a fractional differencing parameter.

• Typically, $d$ is chosen such that $-1/2 < d < 1/2$ to ensure that $y_t$ is stationary and invertible.
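Since $(1-B)^d$ has the binomial expansion $\sum_{k \ge 0} \binom{d}{k}(-B)^k$, fractional differencing can be applied with a truncated set of weights. A minimal NumPy sketch (the truncation length and the toy random-walk data are illustrative choices, not part of the slide):

```python
import numpy as np

def frac_diff_weights(d, n_weights):
    """Coefficients of (1 - B)^d from its binomial expansion."""
    w = np.empty(n_weights)
    w[0] = 1.0
    for k in range(1, n_weights):
        w[k] = w[k - 1] * (k - 1 - d) / k   # ratio of consecutive binomial terms
    return w

def frac_diff(y, d, n_weights=100):
    """Truncated fractional differencing (1 - B)^d applied to series y."""
    w = frac_diff_weights(d, n_weights)
    # the convolution applies the weights to lagged values of y (zero pre-sample)
    return np.convolve(y, w, mode="full")[: len(y)]

# Example: fractionally difference a random walk with d = 0.4
rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(500))
y_d = frac_diff(y, d=0.4)
```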
Fractional Differencing

Figure: Fractional Differencing applied to DAX index

[Link]
Recurrent Connections in Deep Neural Networks
▪ Feedforward nets such as MLP assume independent observations
[Figure: a feedforward network applied to each example independently — 𝒚𝑛 depends only on 𝒉𝑛 , and 𝒉𝑛 depends only on 𝒙𝑛 .]

▪ Feedforward neural networks are not ideal when the inputs [𝒙1 , 𝒙2 , … , 𝒙𝑁 ] and/or outputs
[𝒚1 , 𝒚2 , … , 𝒚𝑁 ] represent sequential data (e.g., a sequence of words, or a video as a sequence of frames)

▪ A recurrent structure can be helpful if each input and/or output is a sequence


[Figure: the same network with recurrent connections, unrolled over 𝑁 steps, and its compact recurrent form. Each step of the input is given in the form of an embedding (e.g., word2vec if the input is a sequence of words); a single input is of length 𝑁; the corresponding output is assumed to have the same length as the input.]
RNNs
▪ RNNs are used when each input or output or both are sequences of tokens

[Figure: an unrolled RNN with an encoder part (inputs 𝒙1 , … , 𝒙𝑇 ) and a decoder part (outputs 𝒚1 , … , 𝒚𝑇 ), plus its compact recurrent form. If the input is a word sequence, then each 𝒙𝑡 represents the corresponding word’s embedding (either a pre-computed word embedding like word2vec or a learned word embedding).]
▪ Hidden state 𝒉𝑡 is supposed to remember everything up to time 𝑡 − 1. However, in practice, RNNs have
difficulties remembering the distant past
▪ Variants such as LSTM, GRU, etc., mitigate this issue to some extent

▪ Slow processing is another major issue (e.g., can’t compute 𝒉𝑡 before computing 𝒉𝑡−1 )
Recurrent Neural Networks
▪ A basic RNN’s architecture (assuming input and output sequence have same lengths)

[Figure: a basic RNN unrolled over 𝑇 steps, and its compact recurrent form. Inputs 𝒙1 , … , 𝒙𝑇 are given in the form of embeddings (e.g., word embeddings if 𝒙1 is a word).]

▪ RNN has three sets of weights: 𝑾, 𝑼, 𝒗
▪ 𝑾 and 𝑼 model how 𝒉𝑡 at step 𝑡 is computed: 𝒉𝑡 = 𝑔(𝑾𝒙𝑡 + 𝑼𝒉𝑡−1 ), where 𝑔 is some activation function like ReLU
▪ 𝒗 models the hidden layer to output mapping, e.g., 𝒚𝑡 = 𝑜(𝒗𝒉𝑡 ), where 𝑜 depends on the nature of 𝑦𝑡 (if it is categorical, then 𝑜 can be softmax)
▪ Important: Same 𝑾, 𝑼, 𝒗 are used at all steps of the sequence (weight sharing)
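To make the shared-weight recursion concrete, here is a minimal NumPy sketch of the forward pass (the dimensions and the tanh activation are illustrative choices):

```python
import numpy as np

def rnn_forward(X, W, U, v, h0):
    """Vanilla RNN: h_t = tanh(W x_t + U h_{t-1}), y_t = v h_t.

    X: (T, d_in) input embeddings; W: (d_h, d_in); U: (d_h, d_h);
    v: (d_out, d_h); h0: (d_h,). The same W, U, v are reused at every step.
    """
    h, hs, ys = h0, [], []
    for x_t in X:                      # one iteration per sequence position
        h = np.tanh(W @ x_t + U @ h)   # hidden state summarizes the past
        hs.append(h)
        ys.append(v @ h)               # linear readout (add softmax etc. as needed)
    return np.array(hs), np.array(ys)

# Tiny usage example with random weights
rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 5, 3, 4, 2
hs, ys = rnn_forward(rng.standard_normal((T, d_in)),
                     rng.standard_normal((d_h, d_in)),
                     rng.standard_normal((d_h, d_h)),
                     rng.standard_normal((d_out, d_h)),
                     np.zeros(d_h))
```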
Recurrent Neural Nets (RNN)
▪ A more “micro” view of RNN (the transition matrix U connects the hidden states across
observations, propagating information along the sequence)

Pic source: [Link]


RNN in Action
Workflow of RNN:
• The gif above reflects the magic of recurrent networks.
• It depicts 4 timesteps. The first is exclusively influenced by the input data.
• The second one is a mixture of the first and second inputs. This continues on.
• You should recognize that, in some way, network 4 is "full".
• Presumably, timestep 5 would have to choose which memories to keep and which ones to overwrite.
• This is very real. It's the notion of memory "capacity".
• As you might expect, bigger layers can hold more memories for a longer period of time.
Pic source: [Link]
Training RNN
▪ Trained using Backpropagation Through Time (forward propagate from step 1 to the end, and then backward
propagate from the end back to step 1)
▪ Think of the time-dimension as another hidden layer and then it is just like standard backpropagation for
feedforward neural nets

• They learn by fully propagating forward from 1 to 4 (through an entire sequence of arbitrary length), and then backpropagating all the derivatives from 4 back to 1.
• You can also pretend that it's just a funny-shaped normal neural network, except that we're re-using the same weights (synapses 0, 1, and h) in their respective places.
• Other than that, it's normal backpropagation.

Black: Prediction, Yellow: Error, Orange: Gradients
Pic source: [Link]
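As a sketch of what "through time" means in code, the loop below unrolls an RNN cell over a whole sequence and lets automatic differentiation carry the gradients from the last step back to the first (a hypothetical PyTorch setup; the layer sizes and squared-error loss are illustrative):

```python
import torch
import torch.nn as nn

d_in, d_h, T = 3, 8, 20
cell = nn.RNNCell(d_in, d_h)     # h_t = tanh(W x_t + U h_{t-1} + b)
readout = nn.Linear(d_h, 1)      # hidden state -> scalar prediction
opt = torch.optim.SGD(list(cell.parameters()) + list(readout.parameters()), lr=0.01)

x = torch.randn(T, 1, d_in)      # toy sequence: (time, batch, features)
y = torch.randn(T, 1, 1)         # toy targets

h = torch.zeros(1, d_h)
loss = 0.0
for t in range(T):               # forward: unroll through the entire sequence
    h = cell(x[t], h)            # the same weights are reused at every step
    loss = loss + (readout(h) - y[t]).pow(2).mean()

opt.zero_grad()
loss.backward()                  # BPTT: derivatives flow from step T back to step 1
opt.step()
```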


RNN Applications
▪ In many problems, each input, each output, or both may be in the form of sequences

[Figure: common RNN input/output configurations. The green cells are some feature representation (e.g., a hidden layer of a deep neural network) of the inputs.]

▪ Different inputs or outputs need not have the same length
▪ Some examples of prediction tasks in such problems
▪ Image captioning: Input is image (not a sequence), output is the caption (word sequence)
▪ Document classification: Input is a word sequence, output is a categorical label
▪ Machine translation: Input is a word sequence, output is a word sequence (in different language)
▪ Stock price prediction: Input is a sequence of stock prices, output is its predicted price tomorrow
▪ No input – just output (e.g., generation of random but plausible-looking text)
Recurrent Neural Networks: Some Examples
▪ Consider generating a sequence 𝑦1 , 𝑦2 , … , 𝑦𝑇 given an input 𝑥

[Figure: image captioning. An image is fed in; hidden states at each step of the sequence produce the words of the generated caption. The predicted 𝑦𝑡−1 is also fed into 𝒉𝑡 ; at test time, we can only feed the predicted 𝑦𝑡−1 . During training, if the true 𝑦𝑡−1 is fed, we call it “teacher forcing”.]

▪ Predicting the sentiment of a movie review

[Figure: sentiment prediction. Each node denotes an embedding of a word in the review; the final hidden state is supposed to contain the information about the entire review and is used to predict the sentiment. Isn’t this too much to expect?? ☺ Indeed; this can be an issue with RNNs.]
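A minimal sketch of teacher forcing in a decoding loop (hypothetical PyTorch code; the vocabulary size, GRUCell decoder, and <START> index are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab, d_emb, d_h = 100, 16, 32
embed = nn.Embedding(vocab, d_emb)
cell = nn.GRUCell(d_emb, d_h)
out = nn.Linear(d_h, vocab)

h = torch.zeros(1, d_h)                    # e.g., initialized from an image encoder
true_caption = torch.randint(vocab, (5, 1))

prev = torch.zeros(1, dtype=torch.long)    # assume index 0 is the <START> token
loss = 0.0
for t in range(5):
    h = cell(embed(prev), h)               # the previous token is fed into h_t
    logits = out(h)
    loss = loss + nn.functional.cross_entropy(logits, true_caption[t])
    prev = true_caption[t]                 # teacher forcing: feed the TRUE y_{t-1}
    # at test time: prev = logits.argmax(-1), since only predictions are available
```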
Recurrent Neural Networks: Some Examples
▪ Parts of speech tagging (or “aligned” translation; input and output have the same length)

[Figure: each node denotes an embedding of a word in a sentence; the output at each step is the part-of-speech tag for that word.]

▪ “Unaligned” translation (input and output can have different lengths)
▪ Such problems usually require a sequence-encoder / sequence-decoder architecture: encode the input
sequence (embeddings of tokens) into a single embedding vector 𝑐, and then decode this embedding one
output token at a time

▪ In the unaligned case, generation stops when an “end” token (e.g., <END>) is generated
on the output side
Recurrent Neural Networks: Some Examples
▪ Unconditional generation (no input; only an output sequence is generated, given an RNN
that was trained on some training data containing several sequences)

[Figure: a “seed” token 𝑠0 (e.g., <START>) is fed at the first step; the hidden states 𝒉1 , 𝒉2 , 𝒉3 , … produce the outputs 𝒚1 , 𝒚2 , 𝒚3 , …]

▪ Each generated word/token is fed to the next step’s hidden state

▪ Generation stops when an “end” token (e.g., <END>) is generated
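Putting these two bullets together, an unconditional sampling loop might look like this sketch (hypothetical PyTorch code; the <START>/<END> indices, GRUCell, and length cap are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab, d_emb, d_h = 100, 16, 32
START, END = 0, 1                           # assumed special-token indices
embed = nn.Embedding(vocab, d_emb)
cell = nn.GRUCell(d_emb, d_h)
out = nn.Linear(d_h, vocab)

h = torch.zeros(1, d_h)
token = torch.tensor([START])               # the "seed" token
generated = []
for _ in range(50):                         # cap the length as a safety net
    h = cell(embed(token), h)               # feed the last token into the next step
    probs = out(h).softmax(-1)              # distribution over the vocabulary
    token = torch.multinomial(probs, 1)[0]  # sample the next token
    if token.item() == END:                 # stop once <END> is generated
        break
    generated.append(token.item())
```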


For RNNs, the Distant Past is Hard to Remember
▪ The hidden layer nodes ℎ𝑡 are supposed to summarize the past up to time 𝑡 − 1

[Figure: the unrolled RNN again (weights 𝑾, 𝑼, 𝒗 shared across steps), and its compact form; 𝒉𝑡 must carry information from all earlier steps through repeated applications of 𝑼.]

▪ In theory, they should. In practice, they can’t. Some reasons:
▪ Vanishing gradients along the sequence too (due to repeated multiplications) – past knowledge gets “diluted”
▪ Hidden nodes also have limited capacity because of their finite dimensionality

▪ Various extensions of RNNs have been proposed to address forgetting
▪ Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997)
GRU and LSTM
▪ Essentially an RNN, except that the hidden states are computed differently
▪ Recall that RNN computes the hidden states as
𝒉𝑡 = 𝑡𝑎𝑛ℎ(𝑾𝒙𝑡 + 𝑼𝒉𝑡−1 )
▪ For RNN: the state update is multiplicative, which leads to weak memory and gradient issues
▪ GRU and LSTM contain specialized units and “memory” which modulate what/how much information
from the past to retain/forget
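For concreteness, the standard GRU updates (update gate $z_t$, reset gate $r_t$; bias terms omitted) are:

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1})) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

The gate $z_t$ decides how much of the old state to keep, and $r_t$ decides how much of it to use when proposing the new candidate state.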

Pic source: [Link]


Capturing Long-Range Dependencies
▪ Idea: Augment the hidden states with gates (with parameters to be learned)
▪ These gates can help us remember and forget information “selectively”

Pic source: [Link]

• The hidden states have 3 types of gates: Input (bottom), Forget (left), Output (top)
• An open gate is denoted by 'o', a closed gate by '-'
LSTM
▪ In contrast, LSTM maintains a “context” 𝑪𝑡 and computes hidden states using gated updates, as shown below
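The standard LSTM updates, with forget, input, and output gates $f_t$, $i_t$, $o_t$ (bias terms omitted), are:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1}), \qquad
i_t = \sigma(W_i x_t + U_i h_{t-1}), \qquad
o_t = \sigma(W_o x_t + U_o h_{t-1}) \\
\tilde{C}_t &= \tanh(W_C x_t + U_C h_{t-1}) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$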

• Note: ⨀ represents the elementwise vector product. Also, the state updates are now additive, not
multiplicative. Training is again via backpropagation through time.

• Many variants of LSTM exist, e.g., using 𝑪𝑡 in local computations, Gated Recurrent Units (GRU),
etc. These are mostly minor variations of the basic LSTM above.
Do LSTM really have long memory? (ICML’2020)

Figure: Autocorrelation plots of the traffic and DJI datasets (to visualize the long memory in the data)

Ref: [Link]
Memory RNN and Bidirectional RNN
▪ RNNs, GRUs, and LSTMs only remember the information from the previous tokens
▪ Memory RNN and Bidirectional RNN can remember information from the past and
future tokens

[Figure: a bidirectional RNN. Forward-direction embeddings and reverse-direction embeddings of the input tokens are combined into embeddings that take information from both directions.]

Ref: [Link]
Ref: [Link]
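As a quick illustration, a bidirectional recurrent layer in a framework like PyTorch runs the sequence in both directions and concatenates the two hidden states (the sizes here are illustrative):

```python
import torch
import torch.nn as nn

d_in, d_h, T = 16, 32, 10
birnn = nn.LSTM(d_in, d_h, bidirectional=True, batch_first=True)

x = torch.randn(1, T, d_in)   # (batch, time, features)
out, _ = birnn(x)             # out: (1, T, 2 * d_h)
# out[:, t, :d_h] is the forward-direction embedding at step t (sees x_1..x_t)
# out[:, t, d_h:] is the reverse-direction embedding at step t (sees x_t..x_T)
```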
Exercise: RNN

Pic source: [Link]


Exercise: LSTM

Initialize: Set the previous hidden state h0 to [1, 1] and the memory cells C0 to [0.3, -0.5]

Pic source: [Link]


Any question?

Readings for you:


▪ Deep Learning book
▪ Forecasting (FPP) Book using Python
▪ AI by Hand by Tom Yeh
▪ Special thanks to Piyush Rai and Jay Alammar – I adapted some of their slides available online.
