Today we continue our exploration of multi-step time series forecasting with `torch`. This post is the third in a series.
- First, we covered the basics of recurrent neural networks (RNNs) and trained a model to predict the very next value in a sequence. We also found that we could forecast quite a few steps ahead by feeding back individual predictions in a loop.

- Next, we built a model natively for multi-step prediction: a small multi-layer perceptron (MLP) was used to project the RNN output to several time points in the future.
Of the two approaches, the latter was the more successful. But conceptually, it has an unsatisfying touch to it: when the MLP extrapolates and generates output for, say, ten consecutive points in time, there is no causal relation between them. (Imagine a ten-day weather forecast that never gets updated.)

Now we'd like to try something more intuitively appealing. The input is a sequence; the output is a sequence. In natural language processing (NLP), this type of task is very common: it is exactly the situation we see in machine translation or summarization.

Quite fittingly, the models employed to these ends are named sequence-to-sequence models (often abbreviated `seq2seq`). In a nutshell, they split the task into two components: an encoding part and a decoding part. The former is done just once per input-target pair. The latter is done in a loop, as in our first attempt. But the decoder has more information at its disposal: at each iteration, its processing is based on the previous prediction as well as the previous state. That previous state will be the encoder's when the loop starts, and the decoder's own ever thereafter.
Before discussing the model in detail, we need to adapt our data input mechanism.

We continue working with `vic_elec`, provided by `tsibbledata`.
Again, the dataset definition in the current post looks a bit different from the previous approach; it's the shape of the target that differs. This time, `y` equals `x`, shifted to the left by one.

The reason we do this is owed to the way we are going to train the network. With `seq2seq`, people often use a technique called "teacher forcing" where, instead of feeding back its own prediction into the decoder module, you pass it the value it should have predicted. To be clear, this is done during training only, and to a configurable degree.
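To make that one-step shift concrete before we wrap it into a dataset, here is a minimal illustration on a toy vector (not part of the pipeline below):

```r
# toy illustration: the target y is the input x, shifted to the left by one
series <- c(10, 20, 30, 40, 50, 60)
n_steps <- 4

x <- series[1:n_steps]         # 10 20 30 40
y <- series[2:(n_steps + 1)]   # 20 30 40 50
```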
```r
library(torch)
library(tidyverse)
library(tsibble)
library(tsibbledata)
library(lubridate)
library(fable)
library(zeallot)

n_timesteps <- 7 * 24 * 2
n_forecast <- n_timesteps

vic_elec_get_year <- function(year, month = NULL) {
  vic_elec %>%
    filter(year(Date) == year, month(Date) == if (is.null(month)) month(Date) else month) %>%
    as_tibble() %>%
    select(Demand)
}

elec_train <- vic_elec_get_year(2012) %>% as.matrix()
elec_valid <- vic_elec_get_year(2013) %>% as.matrix()
elec_test <- vic_elec_get_year(2014, 1) %>% as.matrix()

train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

elec_dataset <- dataset(
  name = "elec_dataset",

  initialize = function(x, n_timesteps, sample_frac = 1) {
    self$n_timesteps <- n_timesteps
    self$x <- torch_tensor((x - train_mean) / train_sd)

    n <- length(self$x) - self$n_timesteps - 1

    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
  },

  .getitem = function(i) {
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1

    list(
      x = self$x[start:end],
      y = self$x[(start + lag):(end + lag)]$squeeze(2)
    )
  },

  .length = function() {
    length(self$starts)
  }
)
```
You can then proceed with instantiating the dataset and dataloader as before.
```r
batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)
```
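If you'd like to double-check shapes at this point, you can pull a single batch from the dataloader. A quick sanity check; the printed dimensions assume the values of `batch_size` and `n_timesteps` used above:

```r
# inspect one training batch to confirm tensor shapes
b <- train_dl %>% dataloader_make_iter() %>% dataloader_next()
dim(b$x)  # (batch_size, n_timesteps, 1), here 32 336 1
dim(b$y)  # (batch_size, n_timesteps), here 32 336
```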
Technically, the model consists of three modules: the encoder and decoder mentioned above, and the `seq2seq` module that orchestrates them.
Encoder
The encoder takes its input and runs it through an RNN. Of the two things a recurrent neural network returns, outputs and state, so far we've only used the outputs. This time, we do the opposite: we throw away the outputs and return only the state.

If the RNN in question is a GRU (and assuming that, of the outputs, we take the last time step only, as we've been doing throughout), there really is no difference: the final state equals the final output. If it's an LSTM, however, there is a second kind of state, the "cell state". In that case, returning the state instead of the final output carries more information.
```r
encoder_module <- nn_module(

  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    self$type <- type

    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
  },

  forward = function(x) {
    x <- self$rnn(x)

    # return last states for all layers
    # per layer, a single tensor for GRU, a list of 2 tensors for LSTM
    x <- x[[2]]
    x
  }
)
```
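To see that difference in the encoder's return value, you could instantiate both variants and inspect what comes back. A quick sketch; with `num_layers = 1`, the GRU state has shape `(1, batch_size, hidden_size)`, while the LSTM returns a list of two such tensors:

```r
# dummy batch of 2 sequences
x_dummy <- torch_randn(2, n_timesteps, 1)

enc_gru <- encoder_module(type = "gru", input_size = 1, hidden_size = 32)
dim(enc_gru(x_dummy))       # 1 2 32: a single state tensor

enc_lstm <- encoder_module(type = "lstm", input_size = 1, hidden_size = 32)
length(enc_lstm(x_dummy))   # 2: hidden state plus cell state
```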
Decoder
In the decoder, just like in the encoder, the main component is an RNN. In contrast to the architectures shown previously, though, it does not just return a prediction; it also reports back the RNN's final state.
```r
decoder_module <- nn_module(

  initialize = function(type, input_size, hidden_size, num_layers = 1) {
    self$type <- type

    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }

    self$linear <- nn_linear(hidden_size, 1)
  },

  forward = function(x, state) {
    # input to forward:
    # x is (batch_size, 1, 1)
    # state is (1, batch_size, hidden_size)
    x <- self$rnn(x, state)

    # break up RNN return values
    # output is (batch_size, 1, hidden_size)
    # next_hidden is (1, batch_size, hidden_size)
    c(output, next_hidden) %<-% x

    output <- output$squeeze(2)
    output <- self$linear(output)

    list(output, next_hidden)
  }
)
```
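As a quick sketch of how the decoder will be called, here is a single step in isolation: one time step of input plus a previous state go in, a one-step prediction and the updated state come out (GRU variant, batch size 2 and hidden size 32 assumed for illustration):

```r
dec <- decoder_module(type = "gru", input_size = 1, hidden_size = 32)

x_step <- torch_randn(2, 1, 1)   # (batch_size, 1, 1): a single time step
state <- torch_randn(1, 2, 32)   # (num_layers, batch_size, hidden_size)

c(pred, next_state) %<-% dec(x_step, state)
dim(pred)        # 2 1: one forecast per batch item
dim(next_state)  # 1 2 32: state to pass into the next call
```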
seq2seq module

`seq2seq` is where the action happens. The plan is to encode once, then call the decoder in a loop.

If you look back at the decoder's `forward()`, you see that it takes two arguments: `x` and `state`.

Depending on the context, `x` corresponds to one of three things: final input, preceding prediction, or prior ground truth.
- The very first time the decoder is called on an input sequence, `x` maps to the final input value. This is different from a task like machine translation, where you would pass in a start token. With time series, though, we'd like to continue where the actual measurements stop.

- On further calls, we want the decoder to continue from its most recent prediction. It is only logical, then, to pass back the preceding forecast.

- That said, in NLP a technique called "teacher forcing" is commonly used to speed up training. With teacher forcing, instead of its own forecast we pass the decoder the actual ground truth, the thing it should have predicted. We do that only in a configurable fraction of cases, and naturally, only while training. The rationale behind this technique is that without this form of re-calibration, consecutive prediction errors can quickly erase any remaining signal.
As for `state`, it is also polyvalent. But here, there are just two possibilities: encoder state and decoder state.

- The first time the decoder is called, it is "seeded" with the final state from the encoder. Note how this is the only time we make use of the encoding.

- From then on, the decoder's own previous state will be passed. Remember how it returns two values, forecast and state?
```r
seq2seq_module <- nn_module(

  initialize = function(type, input_size, hidden_size, n_forecast, num_layers = 1, encoder_dropout = 0) {
    self$encoder <- encoder_module(
      type = type, input_size = input_size, hidden_size = hidden_size,
      num_layers = num_layers, dropout = encoder_dropout
    )
    self$decoder <- decoder_module(
      type = type, input_size = input_size, hidden_size = hidden_size,
      num_layers = num_layers
    )
    self$n_forecast <- n_forecast
  },

  forward = function(x, y, teacher_forcing_ratio) {
    # prepare empty output
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)

    # encode current input sequence
    hidden <- self$encoder(x)

    # prime decoder with final input value and hidden state from the encoder
    out <- self$decoder(x[, n_timesteps, , drop = FALSE], hidden)

    # decompose into predictions and decoder state
    # pred is (batch_size, 1)
    # state is (1, batch_size, hidden_size)
    c(pred, state) %<-% out

    # store first prediction
    outputs[, 1] <- pred$squeeze(2)

    # iterate to generate remaining forecasts
    for (t in 2:self$n_forecast) {
      # call decoder on either ground truth or previous prediction,
      # plus previous decoder state
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing) y[, t - 1, drop = FALSE] else pred
      input <- input$unsqueeze(3)

      out <- self$decoder(input, state)

      # again, decompose decoder return values
      c(pred, state) %<-% out
      # and store current prediction
      outputs[, t] <- pred$squeeze(2)
    }

    outputs
  }
)
```
```r
net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, n_forecast = n_forecast)

# training RNNs on the GPU currently prints a warning that may clutter
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

net <- net$to(device = device)
```
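Before starting to train, it does not hurt to run a single batch through the untrained network, verifying that the output has the expected shape (a quick sanity check, drawing a batch as shown earlier):

```r
# smoke test: one forward pass through the untrained network
b <- train_dl %>% dataloader_make_iter() %>% dataloader_next()
output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0.3)
dim(output)  # (batch_size, n_forecast), here 32 336
```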
The training procedure is mostly unchanged. However, we need to decide on `teacher_forcing_ratio`, the proportion of input sequences on which we want to perform re-calibration. In `valid_batch()`, this should always be `0`, while in `train_batch()`, it's up to us (or rather, to experimentation). Here, we set it to `0.3`.
```r
optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 50

train_batch <- function(b, teacher_forcing_ratio) {
  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)

  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()

  loss$item()
}

valid_batch <- function(b, teacher_forcing_ratio = 0) {
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)

  loss <- nnf_mse_loss(output, target)
  loss$item()
}

for (epoch in 1:num_epochs) {
  net$train()
  train_loss <- c()

  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.3)
    train_loss <- c(train_loss, loss)
  })

  cat(sprintf("\nEpoch %d, training: loss: %3.5f \n", epoch, mean(train_loss)))

  net$eval()
  valid_loss <- c()

  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })

  cat(sprintf("\nEpoch %d, validation: loss: %3.5f \n", epoch, mean(valid_loss)))
}
```
```
Epoch 1, training: loss: 0.37961
Epoch 1, validation: loss: 1.10699

Epoch 2, training: loss: 0.19355
Epoch 2, validation: loss: 1.26462

# ...

Epoch 49, training: loss: 0.03233
Epoch 49, validation: loss: 0.62286

Epoch 50, training: loss: 0.03091
Epoch 50, validation: loss: 0.54457
```
It is interesting to compare performances for different settings of `teacher_forcing_ratio`. With a setting of `0.5`, training loss decreases a lot more slowly; the opposite is seen with a setting of `0`. Validation loss, however, is not affected significantly.
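If you'd like to reproduce such a comparison, one way is to wrap model creation and the training loop in a helper and call it once per candidate value. The following is a minimal sketch only; `run_comparison()` is a hypothetical helper, not part of the code above, and for brevity it trains for fewer epochs and skips validation:

```r
# hypothetical helper: re-create the model and train it with a given
# teacher forcing ratio, reporting training loss per epoch
run_comparison <- function(ratio, num_epochs = 10) {
  net <- seq2seq_module("gru", input_size = 1, hidden_size = 32,
                        n_forecast = n_forecast)$to(device = device)
  optimizer <- optim_adam(net$parameters, lr = 0.001)
  for (epoch in 1:num_epochs) {
    net$train()
    losses <- c()
    coro::loop(for (b in train_dl) {
      optimizer$zero_grad()
      output <- net(b$x$to(device = device), b$y$to(device = device), ratio)
      loss <- nnf_mse_loss(output, b$y$to(device = device))
      loss$backward()
      optimizer$step()
      losses <- c(losses, loss$item())
    })
    cat(sprintf("ratio %.1f, epoch %d, training loss: %3.5f\n", ratio, epoch, mean(losses)))
  }
}

for (ratio in c(0, 0.3, 0.5)) run_comparison(ratio)
```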
The code used to inspect test-set forecasts is unchanged.
```r
net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

coro::loop(for (b in test_dl) {
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0)
  preds <- as.numeric(output)

  test_preds[[i]] <- preds
  i <<- i + 1
})

vic_elec_jan_2014 <- vic_elec %>%
  filter(year(Date) == 2014, month(Date) == 1)

test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_jan_2014) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[408]]
test_pred2 <- c(rep(NA, n_timesteps + 407), test_pred2, rep(NA, nrow(vic_elec_jan_2014) - 407 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[817]]
test_pred3 <- c(rep(NA, nrow(vic_elec_jan_2014) - n_forecast), test_pred3)

preds_ts <- vic_elec_jan_2014 %>%
  select(Demand) %>%
  add_column(
    mlp_ex_1 = test_pred1 * train_sd + train_mean,
    mlp_ex_2 = test_pred2 * train_sd + train_mean,
    mlp_ex_3 = test_pred3 * train_sd + train_mean) %>%
  pivot_longer(-Time) %>%
  update_tsibble(key = name)

preds_ts %>%
  autoplot() +
  scale_colour_manual(values = c("#08c5d1", "#00353f", "#ffbf66", "#d46f4d")) +
  theme_minimal()
```
![One-week-ahead forecast for January, 2014.](https://blogs.rstudio.com/tensorflow/posts/images/seq2seq_preds.png)

Figure 1: One-week-ahead forecast for January, 2014.
Comparing this to the forecasts obtained from last time's RNN-MLP combo, we don't see much of a difference. Is this surprising? To me, it is. If asked to speculate about the reason, I would probably say this: in all of the architectures we've used so far, the main carrier of information has been the final hidden state of the RNN (the one and only RNN in the two previous setups, the encoder RNN in this one). It will be interesting to see what happens in the final part of this series, when we augment the encoder-decoder architecture with attention.
Thanks for reading!
Photo by Suzuha Kozuki on Unsplash