How to make a PyTorch Transformer for time series forecasting

This post will show you how to transform a time series Transformer architecture diagram into PyTorch code step by step.

May 12, 2022

A transformer station. Image by WikimediaImages.

Transformer models have shown state-of-the-art performance in a number of time series forecasting problems [1][2][3].

In this post, you will learn how to code a transformer architecture for time series forecasting in PyTorch. Specifically, we’ll code the architecture used in the paper "Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case" [2] and we will use their architecture diagram as the point of departure.

So step by step, I will show how to code each of the components of the diagram. This way, you will learn the generalizable skill of interpreting a transformer architecture diagram and converting it to code.

I will explain the process as if you have never before implemented a transformer model. I do, however, assume that you have basic knowledge of PyTorch and machine learning in general. The final result will be a class that we will call TimeSeriesTransformer where everything comes together.

I will also explain what the inputs to the model’s forward() method must be and how to create them.

It is important to note that there is not one transformer model architecture. Several different transformer architectures exist. It naturally follows from this that when I say, for instance, that the encoder consists of x,y,z, I am referring specifically to the encoder of the transformer architecture we are implementing in this post – not to some universal transformer encoder.

Here is the architecture diagram we will implement in this post:

Figure 1. Image by Wu, Green, Ben & O’Banion, 2020 [2]

Note that although the diagram depicts only two encoder layers and two decoder layers, the authors actually use four layers in each [2].

Table 1 below gives an overview of all the components needed to build the time series transformer architecture from Figure 1 as well as what class to use to make each component. As you can see, we will only need to implement one custom class. Everything else is available in PyTorch. Yay!

Table 1. Overview of time series transformer components. Image by Kasper Groes Albin Ludvigsen.

Something that confused me at first was that in Figure 1, the input layer and positional encoding layer are depicted as being part of the encoder, and on the decoder side the input and linear mapping layers are depicted as being part of the decoder.

This is not the case in the original transformer paper [4], where the input layers, positional encoding layers and linear layer are depicted as being separate from the encoder and decoder (see Figure 2 below).

In keeping with the original transformer paper, I will say in this post that the encoder and decoder consist merely of n stacked encoder or decoder layers and I will consider the other layers as being separate layers outside of the encoder and decoder. In addition, as per the original transformer paper [4], I will not refer to the "Add and normalize" operation as a layer.

Figure 2. In the original transformer paper, the input layers and the positional encoding layers are depicted as being separate from the encoder and decoder which is contrary to [2]. Image by Vaswani et al 2017 [4]

The remainder of this post is structured as follows:

First, we will see how to make each of the components of the transformer and how to put it all together in a class called TimeSeriesTransformer.
Then, I will show how to create the inputs provided to the model.

I will not provide a detailed description of the inner workings of the components as these are explained well elsewhere (e.g. [5][6]).

When you have read this post, you may want to learn how to use the time series Transformer during inference:

How to run inference with a PyTorch time series Transformer

1. Decomposing the transformer architecture

Let’s decompose the transformer architecture shown in the diagram into its component parts.

1.1. The encoder input layer

Image by Wu, Green, Ben & O’Banion, 2020 [2] (my emphasis)

The encoder input layer is simply implemented as an nn.Linear() layer. The in_features argument must be equal to the number of variables you’re using as input to the model. In a univariate time series forecasting problem, in_features = 1. The out_features argument must be d_model which is a hyperparameter that has the value 512 in [4].

Here’s what the code will look like inside the TimeSeriesTransformer class:
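As a minimal sketch, assuming a univariate series and batch-first tensors, the encoder input layer could look like this. The names input_size, dim_val and encoder_input_layer are illustrative; inside the class the layer would be stored as an attribute such as self.encoder_input_layer.

```python
import torch
from torch import nn

input_size = 1   # number of input variables (1 for a univariate series)
dim_val = 512    # d_model; 512 is the value used in [4]

# Maps each time step from input_size features to dim_val features
encoder_input_layer = nn.Linear(in_features=input_size, out_features=dim_val)

src = torch.rand(32, 48, input_size)   # [batch, sequence length, variables]
out = encoder_input_layer(src)
print(out.shape)  # torch.Size([32, 48, 512])
```

The linear layer is applied to the last dimension only, so the batch and sequence dimensions pass through unchanged.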

1.2. The positional encoding layer

Image by Wu, Green, Ben & O’Banion, 2020 [2] (my emphasis)

The authors of the original transformer paper describe very succinctly what the positional encoding layer does and why it is needed:

"Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence."

Here is one way of implementing the positional encoder as a class.
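The sketch below shows what such a class might look like, assuming batch-first input of shape [batch, sequence length, d_model], an even d_model, and the sinusoidal encoding described in [4]. The parameter names max_seq_len and dropout and all default values are assumptions made for illustration.

```python
import math
import torch
from torch import nn


class PositionalEncoder(nn.Module):
    """Sinusoidal positional encoding for input of shape [batch, seq_len, d_model]."""

    def __init__(self, d_model: int = 512, max_seq_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_seq_len).unsqueeze(1)   # [max_seq_len, 1]
        # 10000^(-2i / d_model) for each pair of dimensions
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(1, max_seq_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)     # even dimensions
        pe[0, :, 1::2] = torch.cos(position * div_term)     # odd dimensions
        # A buffer moves with the module (e.g. to GPU) but is not trained
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Add the encoding for the first x.size(1) positions to every batch item
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)


dim_val = 512
positional_encoding_layer = PositionalEncoder(d_model=dim_val, max_seq_len=100, dropout=0.0)
x = torch.zeros(2, 48, dim_val)
out = positional_encoding_layer(x)
print(out.shape)  # torch.Size([2, 48, 512])
```

Because the input here is all zeros and dropout is disabled, the output is the positional encoding itself, which makes it easy to inspect.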

And here is how to use the PositionalEncoder class inside the TimeSeriesTransformer class:

As you can see, dim_val is provided as the d_model argument. This is important, because the encoder input layer produces an output of size dim_val.

1.3. Encoder layers

Image by Wu, Green, Ben & O’Banion, 2020 [2] (my emphasis)

Note that although the diagram depicts only two encoder layers, the authors actually use four encoder layers [2].

The encoder layers used by [2] are identical to those used by [4] on which the PyTorch Transformer library is based, so we can simply use PyTorch to create the encoder layers.

The way to do this is by first making an object, which we can call encoder_layer, with torch.nn.TransformerEncoderLayer like this:

encoder_layer = torch.nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

By using torch.nn.TransformerEncoderLayer, the layer will automatically have the self-attention layer and the feed forward layer depicted above as well as the "Add & Normalize" in-between. Note that it is not necessary to make encoder_layer an instance attribute of the TimeSeriesTransformer class because it is simply passed as an argument to nn.TransformerEncoder.

The encoder_layer object is then passed as an argument to torch.nn.TransformerEncoder like this in order to stack 4 identical encoder layers:

self.encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=4, norm=None)

Note that norm is an optional parameter in nn.TransformerEncoder and that it is redundant to pass a normalization object when using the standard nn.TransformerEncoderLayer class because nn.TransformerEncoderLayer already normalizes after each layer. The optional parameter is intended for custom encoder layers which do not include normalization [7].

Here’s a snippet of what the encoder code will look like in the TimeSeriesTransformer class.
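A minimal, standalone sketch of the encoder, assuming d_model = 512 and 8 heads as in [4] and batch-first input. The variable names dim_val, n_heads and n_encoder_layers are illustrative; inside the class the stacked encoder would be stored as self.encoder.

```python
import torch
from torch import nn

dim_val = 512          # d_model
n_heads = 8            # number of attention heads, as in [4]
n_encoder_layers = 4   # number of stacked encoder layers, as in [2]

# A single encoder layer: self-attention + feed forward, each followed by add & normalize
encoder_layer = nn.TransformerEncoderLayer(
    d_model=dim_val, nhead=n_heads, batch_first=True
)

# Stack identical copies of the layer; norm=None because each layer already normalizes
encoder = nn.TransformerEncoder(
    encoder_layer=encoder_layer, num_layers=n_encoder_layers, norm=None
)

x = torch.rand(2, 48, dim_val)   # [batch, sequence length, d_model]
memory = encoder(x)
print(memory.shape)  # torch.Size([2, 48, 512])
```

The encoder keeps the shape of its input, so its output can be passed directly to the decoder as the memory argument.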

1.4. The decoder input layer

Image by Wu, Green, Ben & O’Banion, 2020 [2] (my emphasis)

The decoder input layer is simply a linear layer, just like the encoder input layer. The in_features argument must be equal to the number of variables you’re using as input to the model. In a univariate time series forecasting problem, in_features = 1. The out_features argument must be d_model which is a hyperparameter that has the value 512 in [4]. We will use this value as [2] does not specify it.

Here’s what the code will look like inside the TimeSeriesTransformer class:
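A minimal sketch of the decoder input layer, assuming a univariate target and batch-first tensors. The names num_predicted_features and decoder_input_layer are illustrative, not taken from [2].

```python
import torch
from torch import nn

num_predicted_features = 1   # 1 for a univariate forecasting problem
dim_val = 512                # d_model, as in [4]

# Maps each decoder input step to a dim_val-dimensional vector
decoder_input_layer = nn.Linear(
    in_features=num_predicted_features, out_features=dim_val
)

trg = torch.rand(32, 24, num_predicted_features)   # [batch, target length, variables]
decoder_input = decoder_input_layer(trg)
print(decoder_input.shape)  # torch.Size([32, 24, 512])
```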

1.5. Decoder layers

Image by Wu, Green, Ben & O’Banion, 2020 [2] (my emphasis)

Note that although the diagram depicts only two decoder layers, the authors actually use four decoder layers [2].

The decoder is made the exact same way as the encoder where num_layers is four as per [2].

[2] does not specify how many heads they use, so we will use nhead=8 as per [4].

Here’s a snippet showing what the decoder code will look like when put together in the TimeSeriesTransformer class.
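A minimal, standalone sketch of the decoder, mirroring the encoder construction with torch.nn.TransformerDecoderLayer and torch.nn.TransformerDecoder. The variable names and the random tensors standing in for the encoder output and the embedded decoder input are assumptions for illustration.

```python
import torch
from torch import nn

dim_val = 512          # d_model
n_heads = 8            # as per [4]
n_decoder_layers = 4   # as per [2]

decoder_layer = nn.TransformerDecoderLayer(
    d_model=dim_val, nhead=n_heads, batch_first=True
)
decoder = nn.TransformerDecoder(
    decoder_layer=decoder_layer, num_layers=n_decoder_layers, norm=None
)

memory = torch.rand(2, 48, dim_val)   # stand-in for the encoder output
tgt = torch.rand(2, 24, dim_val)      # stand-in for the embedded decoder input
decoder_output = decoder(tgt=tgt, memory=memory)
print(decoder_output.shape)  # torch.Size([2, 24, 512])
```

The decoder output has the sequence length of the decoder input, not of the encoder output.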

1.6. Linear mapping layer

Image by Wu, Green, Ben & O’Banion, 2020 [2] (my emphasis)

Although this layer is called a "linear mapping layer", it is in fact identical to the encoder and decoder input layers except for the values of the arguments:

in_features must be equal to the output sequence length multiplied by d_model to accommodate the output from the decoder.

out_features must be equal to the target sequence length because the linear mapping layer is the final layer of the transformer model. So if your time series dataset consists of hourly datapoints, and you want to predict 24 hours ahead, out_features must be 24.

Here’s what the code will look like inside the TimeSeriesTransformer class:
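A minimal sketch of a linear mapping layer with the argument values described above, assuming a 24-step univariate forecast. Flattening the decoder output before the layer is my assumption about how to make the shapes line up with in_features = output sequence length × d_model; the names out_seq_len and linear_mapping are illustrative.

```python
import torch
from torch import nn

out_seq_len = 24   # target sequence length, e.g. 24 hours ahead
dim_val = 512      # d_model

linear_mapping = nn.Linear(
    in_features=out_seq_len * dim_val, out_features=out_seq_len
)

decoder_output = torch.rand(32, out_seq_len, dim_val)   # [batch, target length, d_model]

# Flatten [batch, target length, d_model] to [batch, target length * d_model]
flat = decoder_output.flatten(start_dim=1)
forecast = linear_mapping(flat)
print(forecast.shape)  # torch.Size([32, 24])
```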

Putting together the transformer model

Now that we have seen how to code each of the components that make up the transformer model illustrated in the diagram, we will put it all together in a class. The default values of parameters are those used in [2].
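The sketch below is one way the pieces from sections 1.1 to 1.6 could be assembled. It is an illustration under assumptions, not a reproduction of the code in [2]: it assumes batch-first tensors, a univariate target, a flattened decoder output in front of the linear mapping layer, and that dec_seq_len is both the decoder input length and the forecast length. All default values other than the four layers from [2] follow [4] or are chosen for illustration.

```python
import math
import torch
from torch import nn


class PositionalEncoder(nn.Module):
    """Sinusoidal positional encoding for [batch, seq_len, d_model] input."""

    def __init__(self, d_model, max_seq_len, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_seq_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(1, max_seq_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        return self.dropout(x + self.pe[:, : x.size(1)])


class TimeSeriesTransformer(nn.Module):
    def __init__(self, input_size, dec_seq_len, max_seq_len, dim_val=512,
                 n_encoder_layers=4, n_decoder_layers=4, n_heads=8, dropout=0.1):
        super().__init__()
        # 1.1 and 1.4: input layers
        self.encoder_input_layer = nn.Linear(input_size, dim_val)
        self.decoder_input_layer = nn.Linear(1, dim_val)  # univariate target assumed
        # 1.2: positional encoding
        self.positional_encoding_layer = PositionalEncoder(dim_val, max_seq_len, dropout)
        # 1.3: encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim_val, nhead=n_heads, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_encoder_layers)
        # 1.5: decoder
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim_val, nhead=n_heads, dropout=dropout, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_decoder_layers)
        # 1.6: linear mapping from the flattened decoder output to the forecast
        self.linear_mapping = nn.Linear(dec_seq_len * dim_val, dec_seq_len)

    def forward(self, src, trg, src_mask=None, trg_mask=None):
        src = self.positional_encoding_layer(self.encoder_input_layer(src))
        memory = self.encoder(src)
        trg = self.decoder_input_layer(trg)
        out = self.decoder(tgt=trg, memory=memory,
                           tgt_mask=trg_mask, memory_mask=src_mask)
        return self.linear_mapping(out.flatten(start_dim=1))


# Small dimensions keep this smoke test fast; they are not the values from [2] or [4]
model = TimeSeriesTransformer(input_size=1, dec_seq_len=4, max_seq_len=10,
                              dim_val=64, n_heads=4)
src = torch.rand(2, 10, 1)   # [batch, encoder sequence length, variables]
trg = torch.rand(2, 4, 1)    # [batch, decoder sequence length, 1]
forecast = model(src, trg)
print(forecast.shape)  # torch.Size([2, 4])
```

Only input_size, dec_seq_len and max_seq_len have to be supplied in this sketch; everything else falls back to a default.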

Initializing the transformer model

Now that we’ve seen how to code the TimeSeriesTransformer class, I also want to quickly show how to initialize the model and what values to pass as arguments. The way the model is implemented, only input_size, dec_seq_len, and max_seq_len are required, as the remainder have default values.

2. How to create the inputs for a transformer model

As seen in the TimeSeriesTransformer class, our model’s forward() method takes 4 arguments as input. In this section, I will explain how to create these four objects.

The inputs are:

src
trg
src_mask
trg_mask

2.1. How to create src and trg for a time series transformer model

Let’s first take a closer look at how src and trg are made for a time series transformer model.

src is the encoder input and is short for "source". src is simply a subset of consecutive data points from your entire sequence. The length of src determines how many past data points your model is considering when making its forecasts. If your dataset has an hourly resolution, there are 24 data points in a day, and if you want your model to base its forecasts on the past two days of data, the length of src should be 48.

trg is the decoder input. trg is short for "target", but this is a little misleading, as it is not the actual target sequence but a sequence that consists of the last data point of src and all the data points of the actual target sequence except the last one. This is why people sometimes refer to the trg sequence as being "shifted right". The length of trg must be equal to the length of the actual target sequence [2]. You’ll sometimes see the term tgt used synonymously.

Here’s a neat explanation of what data points src and trg must contain:

In a typical training setup, we train the model to predict 4 future weekly ILI ratios from 10 trailing weekly datapoints. That is, given the encoder input (x1, x2, …, x10) and the decoder input (x10, …, x13), the decoder aims to output (x11, …, x14). ([2] page 5)

Here’s a function to produce src and trg as well as the actual target sequence, trg_y, given a sequence. The src and trg objects are input to the model, and trg_y is the target sequence against which the output of the model is compared when computing the loss. The sequence given to the get_src_trg() function must be a sub-sequence of your entire dataset and have the length input_sequence_length + target_sequence_length.

Here’s the function to create the src, trg and trg_y:
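A minimal sketch of get_src_trg(), assuming a one-dimensional tensor as input. The parameter names enc_seq_len and target_seq_len are illustrative. The example reuses the setup quoted from [2], with the integers 1 to 14 standing in for (x1, …, x14).

```python
import torch


def get_src_trg(sequence, enc_seq_len, target_seq_len):
    """Split a sub-sequence into encoder input, decoder input and target."""
    if len(sequence) != enc_seq_len + target_seq_len:
        raise ValueError("sequence must have length enc_seq_len + target_seq_len")

    # Encoder input: the first enc_seq_len data points
    src = sequence[:enc_seq_len]

    # Decoder input: last data point of src plus all but the last target data point
    trg = sequence[enc_seq_len - 1 : len(sequence) - 1]

    # Actual target: the last target_seq_len data points
    trg_y = sequence[-target_seq_len:]

    return src, trg, trg_y


sequence = torch.arange(1, 15)   # stands in for (x1, ..., x14)
src, trg, trg_y = get_src_trg(sequence, enc_seq_len=10, target_seq_len=4)

print(src.tolist())    # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(trg.tolist())    # [10, 11, 12, 13]
print(trg_y.tolist())  # [11, 12, 13, 14]
```

Note that trg and trg_y have the same length and that trg_y is trg shifted one step into the future.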

2.2. Masking the decoder input in transformers

We have now seen how to generate the first two inputs that our model’s forward() method requires. Let’s now consider the last two: src_mask and trg_mask.

But first you should know that there are two types of masking in the context of transformers:

Padding masking. When using sequences of different lengths (sentences would normally be of different lengths), sequences shorter than the selected maximum sequence length (this is a hyperparameter that can have any value, e.g. 50) will be padded with a padding token. The padding tokens must be masked to prevent the model from attending to these tokens.
Decoder input masking (aka "look ahead masking"). This type of masking prevents the decoder from attending to future tokens when it "considers" what "meaning" token t has.

In this post, we will not pad our sequences, because we will implement our custom dataset class in such a way that all sequences will have the same length. For this reason, padding masking is not needed in our case [8], and it is not necessary to mask the encoder input [9].

We will, however, need to use decoder input masking. Without it, the decoder could attend to later positions of trg during training, and those positions contain the very values it is supposed to predict.

Recall that the decoder receives two inputs:

The encoder output
The decoder input

Both of these need to be masked.

In order to mask these inputs, we will supply the model’s forward() method with two masking tensors:

src_mask which will mask the encoder output
trg_mask which will mask the decoder input

In our case, the src_mask will need to have the size:

[target sequence length, encoder sequence length]

And the trg_mask will need to have size:

[target sequence length, target sequence length]

Here’s how to generate the masks:
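A minimal sketch of a mask generator, assuming the additive float-mask convention used by PyTorch attention layers, where 0 means "may attend" and -inf means "may not attend". The function name generate_mask is hypothetical, and applying the same upper-triangular pattern to src_mask is an assumption based on the sizes given above.

```python
import torch


def generate_mask(dim1, dim2):
    """Return a [dim1, dim2] mask with -inf above the main diagonal and 0 elsewhere."""
    return torch.triu(torch.full((dim1, dim2), float("-inf")), diagonal=1)


enc_seq_len = 10      # length of the encoder sequence
target_seq_len = 4    # length of the target sequence

# Masks the decoder input: [target sequence length, target sequence length]
trg_mask = generate_mask(target_seq_len, target_seq_len)

# Masks the encoder output: [target sequence length, encoder sequence length]
src_mask = generate_mask(target_seq_len, enc_seq_len)

print(trg_mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
print(src_mask.shape)  # torch.Size([4, 10])
```

Row t of trg_mask lists which decoder input positions step t may attend to: itself and everything before it.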

And here’s how to use the masks as input to the model:
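As a self-contained sketch, the example below passes both masks to a small torch.nn.TransformerDecoder that stands in for the full model. In nn.TransformerDecoder, the mask for the decoder input is called tgt_mask and the mask for the encoder output is called memory_mask. The small dimensions, the random tensors and the helper generate_mask are assumptions for illustration.

```python
import torch
from torch import nn

enc_seq_len, target_seq_len, dim_val = 10, 4, 64


def generate_mask(dim1, dim2):
    # -inf above the main diagonal (blocked), 0 elsewhere (allowed)
    return torch.triu(torch.full((dim1, dim2), float("-inf")), diagonal=1)


decoder_layer = nn.TransformerDecoderLayer(d_model=dim_val, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
decoder.eval()   # disable dropout so the comparison below is deterministic

memory = torch.rand(2, enc_seq_len, dim_val)   # stand-in for the encoder output
tgt = torch.rand(2, target_seq_len, dim_val)   # stand-in for the embedded decoder input

src_mask = generate_mask(target_seq_len, enc_seq_len)
trg_mask = generate_mask(target_seq_len, target_seq_len)

with torch.no_grad():
    out = decoder(tgt=tgt, memory=memory, tgt_mask=trg_mask, memory_mask=src_mask)

    # Change only the last decoder input step and run the decoder again
    tgt_changed = tgt.clone()
    tgt_changed[:, -1] += 1.0
    out_changed = decoder(tgt=tgt_changed, memory=memory,
                          tgt_mask=trg_mask, memory_mask=src_mask)

# The first output step cannot see the last input step, so it is unchanged
print(torch.allclose(out[:, 0], out_changed[:, 0], atol=1e-5))  # True
```

With a model whose forward() method takes src, trg, src_mask and trg_mask, the call would follow the same pattern, with the masks forwarded to the decoder.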

Complete example of Transformer for time series

I’ve created this repo which contains a complete example with some time series data. The repo also contains code for running inference with the time series Transformer model, and the code is described in my article "How to run inference with a PyTorch time series Transformer."

That’s it! I hope you enjoyed this post 🤞

Please leave a comment letting me know what you think. I’d be very happy to hear from you if you have questions or suggestions 🙌

Follow for more posts related to time series forecasting. I also write about green software engineering and the environmental impact of data science like [here](https://towardsdatascience.com/8-podcast-episodes-on-the-climate-impact-of-machine-learning-54f1c19f52d) and here 🍀

And feel free to connect with me on LinkedIn.