This is the final post in a four-part introduction to time series forecasting with torch. These posts have been the story of a quest for multiple-step prediction, and by now we have seen three different approaches: forecasting in a loop, incorporating a multi-layer perceptron (MLP), and sequence-to-sequence models. Here is a quick recap:
- As one should when embarking on an adventurous journey, we started with an in-depth study of the tools at our disposal: recurrent neural networks (RNNs). We trained a model to predict the very next observation in line, and then thought of a clever hack: why not use this for multi-step prediction, feeding back individual predictions in a loop? The results turned out to be quite acceptable.
- Then the adventure truly began. We built our first model "natively" for multi-step prediction, relieving the RNN of a bit of its workload and involving a second player, a small MLP. Now it was the MLP's job to project the RNN output to several points in time in the future. The results were quite satisfactory, but we didn't stop there.
- Instead, we applied a technique commonly used in natural language processing (NLP): sequence-to-sequence (seq2seq) prediction. While forecast performance was not significantly different from the previous case, we found the technique intuitively more appealing, since it reflects the causal relationship between successive forecasts.
Today we will enhance the seq2seq approach by adding a new component: the attention module. Originally introduced around 2014 (Bahdanau, Cho, and Bengio 2014), attention mechanisms have generated so much interest that the title of a recent paper even begins with "Attention is Not All You Need" (Dong, Cordonnier, and Loukas 2021).
Here's the idea:
In a classic encoder-decoder setup, the decoder is "primed" with the encoder summary just a single time: the time the prediction loop starts. From then on, it is on its own. With attention, however, the decoder gets to re-view the complete sequence of encoder outputs every time it predicts a new value. What's more, every time, it gets to zoom in on those outputs that are relevant to the current prediction step.
This is a particularly useful strategy in translation: when generating the next word, the model needs to know which part of the source sentence to focus on. How much the technique helps with numerical sequences, in contrast, will likely depend on the features of the series in question.
As before, we work with vic_elec, but this time we partly depart from the way we used it previously. With the original, bi-hourly dataset, training the current model takes a long time, longer than readers will want to wait when experimenting. So instead, we aggregate observations by day. To have enough data, we train on 2012 and 2013, reserving 2014 for validation as well as post-training inspection.
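Here, for completeness, is a sketch of the data preparation this implies. It is a reconstruction, not code shown in this post: the aggregation, the splits, and the normalization statistics are assumptions modeled on the earlier posts in this series, using the names (vic_elec_daily, elec_train, train_mean, train_sd, and friends) that appear in the code further down.
library(torch)
library(tidyverse)
library(tsibble)
library(tsibbledata)
library(lubridate)
# aggregate the half-hourly vic_elec series to daily demand
vic_elec_daily <- vic_elec %>%
  select(Time, Demand) %>%
  index_by(Date = date(Time)) %>%
  summarise(Demand = sum(Demand) / 1e3)
# train on 2012 and 2013; keep 2014 for validation and inspection
elec_train <- vic_elec_daily %>%
  filter(year(Date) %in% c(2012, 2013)) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()
elec_valid <- vic_elec_daily %>%
  filter(year(Date) == 2014) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()
elec_test <- vic_elec_daily %>%
  filter(year(Date) == 2014, month(Date) %in% 1:4) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()
# normalization statistics, computed on the training set only
train_mean <- mean(elec_train)
train_sd <- sd(elec_train)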
We will attempt to forecast demand up to fourteen days ahead. How long, then, should the input sequences be? This is a matter of experimentation, all the more so now that we are adding an attention mechanism. (I suspect it might not handle very long sequences so well.)
Below, we go with fourteen days of input length as well, but that may not necessarily be the best possible choice for this series.
n_timesteps <- 7 * 2
n_forecast <- 7 * 2
elec_dataset <- dataset(
name = "elec_dataset",
initialize = function(x, n_timesteps, sample_frac = 1) {
self$n_timesteps <- n_timesteps
self$x <- torch_tensor((x - train_mean) / train_sd)
n <- length(self$x) - self$n_timesteps - 1
self$starts <- sort(sample.int(
n = n,
size = n * sample_frac
))
},
.getitem = function(i) {
start <- self$starts[i]
end <- start + self$n_timesteps - 1
lag <- 1
list(
x = self$x[start:end],
y = self$x[(start+lag):(end+lag)]$squeeze(2)
)
},
.length = function() {
length(self$starts)
}
)
batch_size <- 32
train_ds <- elec_dataset(elec_train, n_timesteps)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)
valid_ds <- elec_dataset(elec_valid, n_timesteps)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)
test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)
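As a quick check, not part of the original code, we can pull a single batch from the training dataloader and inspect tensor shapes. (The values in the comments assume the setup above: a batch size of 32, fourteen timesteps, and a single feature.)
b <- train_dl %>% dataloader_make_iter() %>% dataloader_next()
dim(b$x) # (32, 14, 1): batch size, timesteps, features
dim(b$y) # (32, 14): batch size, timesteps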
On the model side, we again encounter the three modules familiar from the previous post: encoder, decoder, and top-level seq2seq module. There is an additional component, however: the attention module, used by the decoder to obtain attention weights.
Encoder
The encoder still works the same way: it wraps an RNN, and returns the outputs for all timesteps as well as the final state.
encoder_module <- nn_module(
initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
self$type <- type
self$rnn <- if (self$type == "gru") {
nn_gru(
input_size = input_size,
hidden_size = hidden_size,
num_layers = num_layers,
dropout = dropout,
batch_first = TRUE
)
} else {
nn_lstm(
input_size = input_size,
hidden_size = hidden_size,
num_layers = num_layers,
dropout = dropout,
batch_first = TRUE
)
}
},
forward = function(x) {
# return outputs for all timesteps, as well as last-timestep states for all layers
x %>% self$rnn()
}
)
Attention module
In basic seq2seq, whenever it had to generate a new value, the decoder took into account two things: its prior state, and the previous output generated. In the attention setting, the decoder additionally receives the complete output from the encoder. In deciding which subset of those outputs matters, it gets help from a new agent, the attention module.
This, then, is the attention module's purpose: given the current decoder state and the complete encoder outputs, obtain a weighting of those outputs that indicates how relevant each is to what the decoder is currently doing. The result of this procedure are the so-called attention weights: normalized scores, one for each timestep in the encoding, that quantify their respective importance.
Attention may be implemented in a number of different ways. Here, we show two implementation options, one additive and one multiplicative.
Additive attention
In additive attention, encoder outputs and decoder state are commonly either added or concatenated (we choose the latter below). The resulting tensor is run through a linear layer, and a softmax is applied for normalization.
attention_module_additive <- nn_module(
initialize = function(hidden_dim, attention_size) {
self$attention <- nn_linear(2 * hidden_dim, attention_size)
},
forward = function(state, encoder_outputs) {
# function argument shapes
# encoder_outputs: (bs, timesteps, hidden_dim)
# state: (1, bs, hidden_dim)
# multiplex state to allow for concatenation (dimensions 1 and 2 must agree)
seq_len <- dim(encoder_outputs)[2]
# resulting shape: (bs, timesteps, hidden_dim)
state_rep <- state$permute(c(2, 1, 3))$repeat_interleave(seq_len, 2)
# concatenate along feature dimension
concat <- torch_cat(list(state_rep, encoder_outputs), dim = 3)
# run through linear layer with tanh
# resulting shape: (bs, timesteps, attention_size)
scores <- self$attention(concat) %>%
torch_tanh()
# sum over attention dimension and normalize
# resulting shape: (bs, timesteps)
attention_weights <- scores %>%
torch_sum(dim = 3) %>%
nnf_softmax(dim = 2)
# a normalized score for every source token
attention_weights
}
)
Multiplicative attention
In multiplicative attention, scores are obtained by computing dot products between the decoder state and all of the encoder outputs. Here too, a softmax is then used for normalization.
attention_module_multiplicative <- nn_module(
initialize = function() {
NULL
},
forward = function(state, encoder_outputs) {
# function argument shapes
# encoder_outputs: (bs, timesteps, hidden_dim)
# state: (1, bs, hidden_dim)
# allow for matrix multiplication with encoder_outputs
state <- state$permute(c(2, 3, 1))
# prepare for scaling by number of features
d <- torch_tensor(dim(encoder_outputs)[3], dtype = torch_float())
# scaled dot products between state and outputs
# resulting shape: (bs, timesteps, 1)
scores <- torch_bmm(encoder_outputs, state) %>%
torch_div(torch_sqrt(d))
# normalize
# resulting shape: (bs, timesteps)
attention_weights <- scores$squeeze(3) %>%
nnf_softmax(dim = 2)
# a normalized score for every source token
attention_weights
}
)
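As a quick sanity check, not in the original post, we can call both modules on random tensors of the shapes documented in the comments, and confirm that each row of attention weights sums to one:
bs <- 4
timesteps <- 14
hidden_dim <- 32
state <- torch_randn(1, bs, hidden_dim)
encoder_outputs <- torch_randn(bs, timesteps, hidden_dim)
add_attention <- attention_module_additive(hidden_dim, attention_size = 8)
mult_attention <- attention_module_multiplicative()
w_add <- add_attention(state, encoder_outputs)
w_mult <- mult_attention(state, encoder_outputs)
dim(w_add) # (4, 14): one normalized score per source timestep
w_mult$sum(dim = 2) # a vector of ones: the softmax normalizes each row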
Decoder
Once the attention weights have been computed, their actual application is handled by the decoder. Concretely, the method in question, weighted_encoder_outputs(), computes a product of weights and encoder outputs, making sure that each output will have the impact its weight assigns to it.
The rest of the action then happens in forward(). A concatenation of the weighted encoder outputs (often called "context") and the current input is run through an RNN. Then, an ensemble of RNN output, context, and input is passed to an MLP. Finally, both the RNN state and the current prediction are returned.
decoder_module <- nn_module(
initialize = function(type, input_size, hidden_size, attention_type, attention_size = 8, num_layers = 1) {
self$type <- type
self$rnn <- if (self$type == "gru") {
nn_gru(
input_size = input_size,
hidden_size = hidden_size,
num_layers = num_layers,
batch_first = TRUE
)
} else {
nn_lstm(
input_size = input_size,
hidden_size = hidden_size,
num_layers = num_layers,
batch_first = TRUE
)
}
self$linear <- nn_linear(2 * hidden_size + 1, 1)
self$attention <- if (attention_type == "multiplicative") attention_module_multiplicative()
else attention_module_additive(hidden_size, attention_size)
},
weighted_encoder_outputs = function(state, encoder_outputs) {
# encoder_outputs is (bs, timesteps, hidden_dim)
# state is (1, bs, hidden_dim)
# resulting shape: (bs, timesteps)
attention_weights <- self$attention(state, encoder_outputs)
# resulting shape: (bs, 1, seq_len)
attention_weights <- attention_weights$unsqueeze(2)
# resulting shape: (bs, 1, hidden_size)
weighted_encoder_outputs <- torch_bmm(attention_weights, encoder_outputs)
weighted_encoder_outputs
},
forward = function(x, state, encoder_outputs) {
# encoder_outputs is (bs, timesteps, hidden_dim)
# state is (1, bs, hidden_dim)
# resulting shape: (bs, 1, hidden_size)
context <- self$weighted_encoder_outputs(state, encoder_outputs)
# concatenate input and context
# NOTE: this repeating is done to compensate for the absence of an embedding module
# that, in NLP, would give x a higher proportion in the concatenation
x_rep <- x$repeat_interleave(dim(context)[3], 3)
rnn_input <- torch_cat(list(x_rep, context), dim = 3)
# resulting shapes: (bs, 1, hidden_size) and (1, bs, hidden_size)
rnn_out <- self$rnn(rnn_input, state)
rnn_output <- rnn_out[[1]]
next_hidden <- rnn_out[[2]]
mlp_input <- torch_cat(list(rnn_output$squeeze(2), context$squeeze(2), x$squeeze(2)), dim = 2)
output <- self$linear(mlp_input)
# shapes: (bs, 1) and (1, bs, hidden_size)
list(output, next_hidden)
}
)
seq2seq module
The seq2seq module is essentially unchanged, apart from the fact that it now allows for configuration of the attention module. For a detailed explanation of what happens here, please consult the previous post.
seq2seq_module <- nn_module(
initialize = function(type, input_size, hidden_size, attention_type, attention_size, n_forecast,
num_layers = 1, encoder_dropout = 0) {
self$encoder <- encoder_module(type = type, input_size = input_size, hidden_size = hidden_size,
num_layers, encoder_dropout)
self$decoder <- decoder_module(type = type, input_size = 2 * hidden_size, hidden_size = hidden_size,
attention_type = attention_type, attention_size = attention_size, num_layers)
self$n_forecast <- n_forecast
},
forward = function(x, y, teacher_forcing_ratio) {
outputs <- torch_zeros(dim(x)[1], self$n_forecast)
encoded <- self$encoder(x)
encoder_outputs <- encoded[[1]]
hidden <- encoded[[2]]
# list of (batch_size, 1), (1, batch_size, hidden_size)
out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden, encoder_outputs)
# (batch_size, 1)
pred <- out[[1]]
# (1, batch_size, hidden_size)
state <- out[[2]]
outputs[ , 1] <- pred$squeeze(2)
for (t in 2:self$n_forecast) {
teacher_forcing <- runif(1) < teacher_forcing_ratio
input <- if (teacher_forcing == TRUE) y[ , t - 1, drop = FALSE] else pred
input <- input$unsqueeze(3)
out <- self$decoder(input, state, encoder_outputs)
pred <- out[[1]]
state <- out[[2]]
outputs[ , t] <- pred$squeeze(2)
}
outputs
}
)
When instantiating the top-level model, we now have an additional choice: that between additive and multiplicative attention. In the "accuracy" sense of performance, my tests showed no difference. However, the multiplicative variant is a lot faster.
net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, attention_type = "multiplicative",
attention_size = 8, n_forecast = n_forecast)
Just like last time, model training involves choosing the degree of teacher forcing. Below, we set the fraction to 0.0, that is, no teacher forcing at all.
optimizer <- optim_adam(net$parameters, lr = 0.001)
num_epochs <- 100
train_batch <- function(b, teacher_forcing_ratio) {
optimizer$zero_grad()
output <- net(b$x, b$y, teacher_forcing_ratio)
target <- b$y
loss <- nnf_mse_loss(output, target[ , 1:(dim(output)[2])])
loss$backward()
optimizer$step()
loss$item()
}
valid_batch <- function(b, teacher_forcing_ratio = 0) {
output <- net(b$x, b$y, teacher_forcing_ratio)
target <- b$y
loss <- nnf_mse_loss(output, target[ , 1:(dim(output)[2])])
loss$item()
}
for (epoch in 1:num_epochs) {
net$train()
train_loss <- c()
coro::loop(for (b in train_dl) {
loss <- train_batch(b, teacher_forcing_ratio = 0.0)
train_loss <- c(train_loss, loss)
})
cat(sprintf("\nEpoch %d, training: loss: %3.5f \n", epoch, mean(train_loss)))
net$eval()
valid_loss <- c()
coro::loop(for (b in valid_dl) {
loss <- valid_batch(b)
valid_loss <- c(valid_loss, loss)
})
cat(sprintf("\nEpoch %d, validation: loss: %3.5f \n", epoch, mean(valid_loss)))
}
# Epoch 1, training: loss: 0.83752
# Epoch 1, validation: loss: 0.83167
# Epoch 2, training: loss: 0.72803
# Epoch 2, validation: loss: 0.80804
# ...
# ...
# Epoch 99, training: loss: 0.10385
# Epoch 99, validation: loss: 0.21259
# Epoch 100, training: loss: 0.10396
# Epoch 100, validation: loss: 0.20975
For visual inspection, we pick a few forecasts from the test set.
net$eval()
test_preds <- vector(mode = "list", length = length(test_dl))
i <- 1
vic_elec_test <- vic_elec_daily %>%
filter(year(Date) == 2014, month(Date) %in% 1:4)
coro::loop(for (b in test_dl) {
output <- net(b$x, b$y, teacher_forcing_ratio = 0)
preds <- as.numeric(output)
test_preds[[i]] <- preds
i <<- i + 1
})
test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_test) - n_timesteps - n_forecast))
test_pred2 <- test_preds[[21]]
test_pred2 <- c(rep(NA, n_timesteps + 20), test_pred2, rep(NA, nrow(vic_elec_test) - 20 - n_timesteps - n_forecast))
test_pred3 <- test_preds[[41]]
test_pred3 <- c(rep(NA, n_timesteps + 40), test_pred3, rep(NA, nrow(vic_elec_test) - 40 - n_timesteps - n_forecast))
test_pred4 <- test_preds[[61]]
test_pred4 <- c(rep(NA, n_timesteps + 60), test_pred4, rep(NA, nrow(vic_elec_test) - 60 - n_timesteps - n_forecast))
test_pred5 <- test_preds[[81]]
test_pred5 <- c(rep(NA, n_timesteps + 80), test_pred5, rep(NA, nrow(vic_elec_test) - 80 - n_timesteps - n_forecast))
preds_ts <- vic_elec_test %>%
select(Demand, Date) %>%
add_column(
ex_1 = test_pred1 * train_sd + train_mean,
ex_2 = test_pred2 * train_sd + train_mean,
ex_3 = test_pred3 * train_sd + train_mean,
ex_4 = test_pred4 * train_sd + train_mean,
ex_5 = test_pred5 * train_sd + train_mean) %>%
pivot_longer(-Date) %>%
update_tsibble(key = name)
preds_ts %>%
autoplot() +
scale_color_hue(h = c(80, 300), l = 70) +
theme_minimal()
Figure 1: A sample of two-week-ahead predictions for the 2014 test set.
Because we pragmatically redefined the task, performance here cannot be directly compared to that of previous models in this series. The main goal, however, was to introduce the concept of attention, and how to manually implement the technique. Once you have understood the concept, you may never actually need to implement it yourself: instead, you will likely make use of the existing tools that come with torch (multi-head attention and transformer modules), tools we may introduce in future "seasons" of this series.
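To give a flavor, here is a minimal sketch using torch's built-in nn_multihead_attention module. This mirrors the PyTorch counterpart: query, key, and value tensors are expected in (seq_len, batch_size, embed_dim) order, and the call returns both the attention output and the (head-averaged) attention weights. Treat the shapes in the comments as illustrative assumptions, not output from this post's model.
embed_dim <- 32
mha <- nn_multihead_attention(embed_dim = embed_dim, num_heads = 4)
# in an encoder-decoder setting, the decoder state would act as the query,
# and the encoder outputs as keys and values
query <- torch_randn(1, 8, embed_dim) # (target_len, bs, embed_dim)
keys <- torch_randn(14, 8, embed_dim) # (source_len, bs, embed_dim)
out <- mha(query, keys, keys)
attn_output <- out[[1]] # (1, 8, 32)
attn_weights <- out[[2]] # (8, 1, 14): one weight per source timestep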
Thanks for reading!
Photo: David Clode, Unsplash
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. "Neural Machine Translation by Jointly Learning to Align and Translate." CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.
Dong, Yihe, Jean-Baptiste Cordonnier, and Andreas Loukas. 2021. "Attention Is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth." arXiv e-prints, March, arXiv:2103.03404. https://arxiv.org/abs/2103.03404.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." arXiv e-prints, June, arXiv:1706.03762. https://arxiv.org/abs/1706.03762.
Vinyals, Oriol, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2014. "Grammar as a Foreign Language." CoRR abs/1412.7449. http://arxiv.org/abs/1412.7449.
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." CoRR abs/1502.03044. http://arxiv.org/abs/1502.03044.