Today we continue our exploration of multi-step time series forecasting with `torch`. This post is the third in a series.
- First, we covered the basics of recurrent neural networks (RNNs) and trained a model to predict the very next value in a sequence. We also found that we could forecast quite a few steps ahead by feeding back individual predictions in a loop.

- Next, we built a model natively for multi-step prediction: a small multi-layer perceptron (MLP) was used to project the RNN output to several time points in the future.
Of the two approaches, the latter was the more successful. But conceptually, it has an unsatisfying touch to it: when the MLP extrapolates and generates output for, say, ten consecutive points in time, there is no causal relation between them. (Imagine a ten-day weather forecast that never gets updated.)

Now we'd like to try something more intuitively appealing. The input is a sequence; the output is a sequence. In natural language processing (NLP), this type of task is very common: it is exactly the situation we see in machine translation or summarization.

Quite fittingly, the models employed to these ends are named sequence-to-sequence models (often abbreviated `seq2seq`). In a nutshell, they split the task into two components: an encoding part and a decoding part. The former is done just once per input-target pair. The latter is done in a loop, as in our first attempt. But the decoder has more information at its disposal: at each iteration, its processing is based on the previous prediction as well as the previous state. That previous state will be the encoder's when the loop starts, and the decoder's own ever thereafter.
Before discussing the model in detail, we need to adapt our data input mechanism.

We continue working with `vic_elec`, provided by `tsibbledata`.
Again, the dataset definition in the current post looks a bit different from the previous approach; it's the shape of the target that differs. This time, `y` equals `x`, shifted to the left by one.

The reason we do this is owed to the way we are going to train the network. With `seq2seq`, people often use a technique called "teacher forcing" where, instead of feeding back its own prediction into the decoder module, you pass it the value it should have predicted. To be clear, this is done during training only, and to a configurable degree.
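To make that one-step shift concrete before we wrap it into a dataset, here is a minimal illustration on a toy vector (not part of the pipeline below):

```r
# toy illustration: the target y is the input x, shifted to the left by one
series <- c(10, 20, 30, 40, 50, 60)
n_steps <- 4

x <- series[1:n_steps]         # 10 20 30 40
y <- series[2:(n_steps + 1)]   # 20 30 40 50
```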
```r
library(torch)
library(tidyverse)
library(tsibble)
library(tsibbledata)
library(lubridate)
library(fable)
library(zeallot)

n_timesteps <- 7 * 24 * 2
n_forecast <- n_timesteps

vic_elec_get_year <- function(year, month = NULL) {
  vic_elec %>%
    filter(year(Date) == year, month(Date) == if (is.null(month)) month(Date) else month) %>%
    as_tibble() %>%
    select(Demand)
}

elec_train <- vic_elec_get_year(2012) %>% as.matrix()
elec_valid <- vic_elec_get_year(2013) %>% as.matrix()
elec_test <- vic_elec_get_year(2014, 1) %>% as.matrix()

train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

elec_dataset <- dataset(
  name = "elec_dataset",

  initialize = function(x, n_timesteps, sample_frac = 1) {
    self$n_timesteps <- n_timesteps
    self$x <- torch_tensor((x - train_mean) / train_sd)

    n <- length(self$x) - self$n_timesteps - 1

    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
  },

  .getitem = function(i) {
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1

    list(
      x = self$x[start:end],
      y = self$x[(start + lag):(end + lag)]$squeeze(2)
    )
  },

  .length = function() {
    length(self$starts)
  }
)
```
You can then proceed with instantiating the dataset and dataloader as before.
```r
batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)
```
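If you'd like to double-check shapes at this point, you can pull a single batch from the dataloader. A quick sanity check; the printed dimensions assume the values of `batch_size` and `n_timesteps` used above:

```r
# inspect one training batch to confirm tensor shapes
b <- train_dl %>% dataloader_make_iter() %>% dataloader_next()
dim(b$x)  # (batch_size, n_timesteps, 1), here 32 336 1
dim(b$y)  # (batch_size, n_timesteps), here 32 336
```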
Technically, the model consists of three modules: the encoder and decoder mentioned above, and the `seq2seq` module that orchestrates them.
Encoder
The encoder takes its input and runs it through an RNN. Of the two things a recurrent neural network returns, outputs and state, so far we've only used the outputs. This time, we do the opposite: we throw away the outputs and return only the state.

If the RNN in question is a GRU (and assuming that, of the outputs, we take the last time step only, as we've been doing throughout), there really is no difference: the final state equals the final output. If it's an LSTM, however, there is a second kind of state, the "cell state". In that case, returning the state instead of the final output carries more information.
```r
encoder_module <- nn_module(

  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    self$type <- type

    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
  },

  forward = function(x) {
    x <- self$rnn(x)

    # return last states for all layers
    # per layer, a single tensor for GRU, a list of 2 tensors for LSTM
    x <- x[[2]]
    x
  }
)
```
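To see that difference in the encoder's return value, you could instantiate both variants and inspect what comes back. A quick sketch; with `num_layers = 1`, the GRU state has shape `(1, batch_size, hidden_size)`, while the LSTM returns a list of two such tensors:

```r
# dummy batch of 2 sequences
x_dummy <- torch_randn(2, n_timesteps, 1)

enc_gru <- encoder_module(type = "gru", input_size = 1, hidden_size = 32)
dim(enc_gru(x_dummy))       # 1 2 32: a single state tensor

enc_lstm <- encoder_module(type = "lstm", input_size = 1, hidden_size = 32)
length(enc_lstm(x_dummy))   # 2: hidden state plus cell state
```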
Decoder
In the decoder, just like in the encoder, the main component is an RNN. In contrast to the architectures shown previously, though, it does not just return a prediction; it also reports back the RNN's final state.
```r
decoder_module <- nn_module(

  initialize = function(type, input_size, hidden_size, num_layers = 1) {
    self$type <- type

    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }

    self$linear <- nn_linear(hidden_size, 1)
  },

  forward = function(x, state) {
    # input to forward:
    # x is (batch_size, 1, 1)
    # state is (1, batch_size, hidden_size)
    x <- self$rnn(x, state)

    # break up RNN return values
    # output is (batch_size, 1, hidden_size)
    # next_hidden is (1, batch_size, hidden_size)
    c(output, next_hidden) %<-% x

    output <- output$squeeze(2)
    output <- self$linear(output)

    list(output, next_hidden)
  }
)
```
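As a quick sketch of how the decoder will be called, here is a single step in isolation: one time step of input plus a previous state go in, a one-step prediction and the updated state come out (GRU variant, batch size 2 and hidden size 32 assumed for illustration):

```r
dec <- decoder_module(type = "gru", input_size = 1, hidden_size = 32)

x_step <- torch_randn(2, 1, 1)   # (batch_size, 1, 1): a single time step
state <- torch_randn(1, 2, 32)   # (num_layers, batch_size, hidden_size)

c(pred, next_state) %<-% dec(x_step, state)
dim(pred)        # 2 1: one forecast per batch item
dim(next_state)  # 1 2 32: state to pass into the next call
```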
seq2seq module

`seq2seq` is where the action happens. The plan is to encode once, then call the decoder in a loop.

If you look back at the decoder's `forward()`, you see that it takes two arguments: `x` and `state`.

Depending on the context, `x` corresponds to one of three things: final input, preceding prediction, or prior ground truth.
- The very first time the decoder is called on an input sequence, `x` maps to the final input value. This is different from a task like machine translation, where you would pass in a start token. With time series, though, we'd like to continue where the actual measurements stop.

- On further calls, we want the decoder to continue from its most recent prediction. It is only logical, then, to pass back the preceding forecast.

- That said, in NLP a technique called "teacher forcing" is commonly used to speed up training. With teacher forcing, instead of its own forecast we pass the decoder the actual ground truth, the thing it should have predicted. We do that only in a configurable fraction of cases, and naturally, only while training. The rationale behind this technique is that without this form of re-calibration, consecutive prediction errors can quickly erase any remaining signal.
As for `state`, it is also polyvalent. But here, there are just two possibilities: encoder state and decoder state.

- The first time the decoder is called, it is "seeded" with the final state from the encoder. Note how this is the only time we make use of the encoding.

- From then on, the decoder's own previous state will be passed. Remember how it returns two values, forecast and state?
```r
seq2seq_module <- nn_module(

  initialize = function(type, input_size, hidden_size, n_forecast, num_layers = 1, encoder_dropout = 0) {
    self$encoder <- encoder_module(
      type = type, input_size = input_size, hidden_size = hidden_size,
      num_layers = num_layers, dropout = encoder_dropout
    )
    self$decoder <- decoder_module(
      type = type, input_size = input_size, hidden_size = hidden_size,
      num_layers = num_layers
    )
    self$n_forecast <- n_forecast
  },

  forward = function(x, y, teacher_forcing_ratio) {
    # prepare empty output
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)

    # encode current input sequence
    hidden <- self$encoder(x)

    # prime decoder with final input value and hidden state from the encoder
    out <- self$decoder(x[, n_timesteps, , drop = FALSE], hidden)

    # decompose into predictions and decoder state
    # pred is (batch_size, 1)
    # state is (1, batch_size, hidden_size)
    c(pred, state) %<-% out

    # store first prediction
    outputs[, 1] <- pred$squeeze(2)

    # iterate to generate remaining forecasts
    for (t in 2:self$n_forecast) {
      # call decoder on either ground truth or previous prediction,
      # plus previous decoder state
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing) y[, t - 1, drop = FALSE] else pred
      input <- input$unsqueeze(3)

      out <- self$decoder(input, state)

      # again, decompose decoder return values
      c(pred, state) %<-% out
      # and store current prediction
      outputs[, t] <- pred$squeeze(2)
    }

    outputs
  }
)
```
```r
net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, n_forecast = n_forecast)

# training RNNs on the GPU currently prints a warning that may clutter
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

net <- net$to(device = device)
```
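Before starting to train, it does not hurt to run a single batch through the untrained network, verifying that the output has the expected shape (a quick sanity check, drawing a batch as shown earlier):

```r
# smoke test: one forward pass through the untrained network
b <- train_dl %>% dataloader_make_iter() %>% dataloader_next()
output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0.3)
dim(output)  # (batch_size, n_forecast), here 32 336
```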
The training procedure is mostly unchanged. However, we need to decide on `teacher_forcing_ratio`, the proportion of input sequences on which we want to perform re-calibration. In `valid_batch()`, this should always be `0`, while in `train_batch()`, it's up to us (or rather, to experimentation). Here, we set it to `0.3`.
```r
optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 50

train_batch <- function(b, teacher_forcing_ratio) {
  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)

  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()

  loss$item()
}

valid_batch <- function(b, teacher_forcing_ratio = 0) {
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)

  loss <- nnf_mse_loss(output, target)
  loss$item()
}

for (epoch in 1:num_epochs) {
  net$train()
  train_loss <- c()

  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.3)
    train_loss <- c(train_loss, loss)
  })

  cat(sprintf("\nEpoch %d, training: loss: %3.5f \n", epoch, mean(train_loss)))

  net$eval()
  valid_loss <- c()

  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })

  cat(sprintf("\nEpoch %d, validation: loss: %3.5f \n", epoch, mean(valid_loss)))
}
```
```
Epoch 1, training: loss: 0.37961
Epoch 1, validation: loss: 1.10699

Epoch 2, training: loss: 0.19355
Epoch 2, validation: loss: 1.26462

# ...

Epoch 49, training: loss: 0.03233
Epoch 49, validation: loss: 0.62286

Epoch 50, training: loss: 0.03091
Epoch 50, validation: loss: 0.54457
```
It is interesting to compare performances for different settings of `teacher_forcing_ratio`. With a setting of `0.5`, training loss decreases a lot more slowly; the opposite is seen with a setting of `0`. Validation loss, however, is not affected significantly.
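If you'd like to reproduce such a comparison, one way is to wrap model creation and the training loop in a helper and call it once per candidate value. The following is a minimal sketch only; `run_comparison()` is a hypothetical helper, not part of the code above, and for brevity it trains for fewer epochs and skips validation:

```r
# hypothetical helper: re-create the model and train it with a given
# teacher forcing ratio, reporting training loss per epoch
run_comparison <- function(ratio, num_epochs = 10) {
  net <- seq2seq_module("gru", input_size = 1, hidden_size = 32,
                        n_forecast = n_forecast)$to(device = device)
  optimizer <- optim_adam(net$parameters, lr = 0.001)
  for (epoch in 1:num_epochs) {
    net$train()
    losses <- c()
    coro::loop(for (b in train_dl) {
      optimizer$zero_grad()
      output <- net(b$x$to(device = device), b$y$to(device = device), ratio)
      loss <- nnf_mse_loss(output, b$y$to(device = device))
      loss$backward()
      optimizer$step()
      losses <- c(losses, loss$item())
    })
    cat(sprintf("ratio %.1f, epoch %d, training loss: %3.5f\n", ratio, epoch, mean(losses)))
  }
}

for (ratio in c(0, 0.3, 0.5)) run_comparison(ratio)
```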
The code used to inspect test-set forecasts is unchanged.
```r
net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

coro::loop(for (b in test_dl) {
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0)
  preds <- as.numeric(output)

  test_preds[[i]] <- preds
  i <<- i + 1
})

vic_elec_jan_2014 <- vic_elec %>%
  filter(year(Date) == 2014, month(Date) == 1)

test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_jan_2014) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[408]]
test_pred2 <- c(rep(NA, n_timesteps + 407), test_pred2, rep(NA, nrow(vic_elec_jan_2014) - 407 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[817]]
test_pred3 <- c(rep(NA, nrow(vic_elec_jan_2014) - n_forecast), test_pred3)

preds_ts <- vic_elec_jan_2014 %>%
  select(Demand) %>%
  add_column(
    mlp_ex_1 = test_pred1 * train_sd + train_mean,
    mlp_ex_2 = test_pred2 * train_sd + train_mean,
    mlp_ex_3 = test_pred3 * train_sd + train_mean) %>%
  pivot_longer(-Time) %>%
  update_tsibble(key = name)

preds_ts %>%
  autoplot() +
  scale_colour_manual(values = c("#08c5d1", "#00353f", "#ffbf66", "#d46f4d")) +
  theme_minimal()
```
![One-week-ahead forecast for January, 2014.](https://blogs.rstudio.com/tensorflow/posts/images/seq2seq_preds.png)

Figure 1: One-week-ahead forecast for January, 2014.
Comparing this to the forecasts obtained from last time's RNN-MLP combo, we don't see much of a difference. Is this surprising? To me, it is. If asked to speculate about the reason, I would probably say this: in all of the architectures we've used so far, the main carrier of information has been the final hidden state of the RNN (the one and only RNN in the two previous setups, the encoder RNN in this one). It will be interesting to see what happens in the final part of this series, when we augment the encoder-decoder architecture with attention.
Thanks for reading!
Photo by Suzuha Kozuki on Unsplash