Transformers in Action: Attention Is All You Need

A brief survey, illustration, and implementation

Sep 8, 2022


Fig. 1. AI-generated artwork. Prompt: Street View Of A Home In The Style Of Storybook Cottage. Photo generated by Stable diffusion. Link to the full prompt.

1. Introduction
2. A quick recap of attention
3. Transformer architecture
3.1. Encoder and decoder components
3.1.1. Encoder
3.1.2. Decoder
3.2. Modules in the transformer
3.3. Attention modules
3.3.1. Scaled dot-product attention
3.3.2. Multi-head attention
3.4. Attention variants in the Transformer
3.4.1. Self-attention
3.4.2. Masked Self-attention (autoregressive or causal attention)
3.4.3. Cross-attention
3.5. Position-wise FFN
3.6. Residual connection and normalization
3.7. Positional encoding
3.7.1. Absolute positional information
3.7.2. Relative positional information
4. Motivations behind using self-attention
5. Research frontiers
6. Questions
7. Summary
8. TransformerX library
9. References

1. Introduction

Transformers are deep feed-forward neural network architectures that leverage self-attention mechanisms and can handle long-range correlations between the items of an input sequence. Since their introduction in 2017 by Vaswani et al. [3], their success in industry and academic research has led to a large number of variants, a.k.a. X-formers. Initially proposed for natural language processing (NLP), they have since been adopted in domains such as computer vision (CV), audio and speech processing, chemistry, and life sciences, where they can achieve SOTA performance. In this article, I explain the transformer architecture through the underlying math, Python code, and visualizations of the different layers. End-to-end examples are available in the TransformerX library repository on GitHub.

2. A quick recap of attention

Lower-level concepts such as attention mechanisms and terminologies related to encoder-decoder models are the underlying ideas of Transformers. Therefore, I have provided a brief summary of these approaches.

Attention is a scheme for allocating cognitive resources with limited processing power [1].

The general idea behind attention, as proposed by Bahdanau et al. [2], is that when translating a word at each step, the model searches for the most relevant information located at different positions in the input sequence. It then generates the translation of the source token (word) with respect to 1) the context vector of these relevant positions and 2) the previously generated words.

Attention mechanisms can be classified into various categories based on several criteria, such as:

The softness of attention:
1. Soft 2. Hard 3. Local 4. Global
Forms of input feature:
1. Item-wise 2. Location-wise
Input representation:
1. Co-attention 2. Self-attention 3. Distinctive attention 4. Hierarchical attention
Output representation:
1. Multi-head 2. Single output 3. Multi-dimensional

If you feel attention mechanisms are in uncharted territory, I recommend reading the following article:

Rethinking Thinking: How Do Attention Mechanisms Actually Work?

3. Transformer architecture

The base transformer architecture [3] consists of two main building blocks: an encoder and a decoder. The encoder maps an input representation sequence (𝒙₁, …, 𝒙ₙ) to a sequence of continuous representations 𝒁 = (𝒛₁, …, 𝒛ₙ) and passes it to the decoder, which generates the output sequence (𝒚₁, …, 𝒚ₘ) one element at a time. At each step, the decoder consumes 𝒁 together with the previously generated outputs, which is what makes the model auto-regressive.

Fig. 2. The Transformer architecture. Photo by Author.

3.1. Encoder and decoder components

Similar to sequence-to-sequence models, the Transformer uses an encoder-decoder architecture.

3.1.1. Encoder

The encoder is simply a stack of 𝑵 identical layers (𝑵 is 6 in the original paper), each of which consists of two sub-layers: a multi-head self-attention block and a simple, position-wise fully connected feed-forward network (FFN). To enable a deeper model, a residual connection wraps each of the two sub-layers, followed by layer normalization. Therefore, the output of each sub-layer is LayerNorm(𝒙 + Sublayer(𝒙)), where Sublayer(𝒙) is the function implemented by the sub-layer itself. The output dimension of all sub-layers, as well as of the embeddings, is 𝒅_model = 512.

Implementation of a Transformer encoder block:
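As a rough illustration of the description above, here is a minimal NumPy sketch of an encoder layer and a stack of 𝑵 = 6 of them. It is not the TransformerX implementation. The helper names (`init_params`, `encoder_block`, and so on) are hypothetical, the layer normalization omits the learned gain and bias, and the demo uses smaller dimensions than the 𝒅_model = 512 of the original paper to keep it light.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-6):
    """Layer normalization without the learned gain and bias, for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def multi_head_self_attention(x, p, h):
    """Multi-head self-attention: queries, keys, and values all come from x."""
    n = x.shape[0]
    # Project, then split the feature axis into h heads: (h, n, d_model / h).
    q, k, v = ((x @ p[name]).reshape(n, h, -1).transpose(1, 0, 2)
               for name in ("w_q", "w_k", "w_v"))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (h, n, n)
    heads = softmax(scores) @ v                                # (h, n, d_model / h)
    return heads.transpose(1, 0, 2).reshape(n, -1) @ p["w_o"]  # (n, d_model)


def encoder_block(x, p, h):
    """One encoder layer: each sub-layer is wrapped as LayerNorm(x + Sublayer(x))."""
    x = layer_norm(x + multi_head_self_attention(x, p, h))
    ffn = np.maximum(0.0, x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]
    return layer_norm(x + ffn)


def init_params(rng, d_model, d_f):
    """Hypothetical helper: small random weights and zero biases."""
    shapes = {"w_q": (d_model, d_model), "w_k": (d_model, d_model),
              "w_v": (d_model, d_model), "w_o": (d_model, d_model),
              "w1": (d_model, d_f), "w2": (d_f, d_model)}
    p = {name: rng.normal(scale=0.02, size=shape) for name, shape in shapes.items()}
    p["b1"], p["b2"] = np.zeros(d_f), np.zeros(d_model)
    return p


rng = np.random.default_rng(0)
d_model, d_f, h, num_layers = 64, 256, 8, 6   # small demo sizes (assumption)
layers = [init_params(rng, d_model, d_f) for _ in range(num_layers)]

x = rng.normal(size=(10, d_model))   # 10 token embeddings
for p in layers:                     # stack of N = 6 encoder layers
    x = encoder_block(x, p, h)
print(x.shape)  # (10, 64)
```

Because every sub-layer keeps the 𝒅_model output dimension, the layers can be stacked without any reshaping in between.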

3.1.2. Decoder

In addition to the two sub-layers used in each encoder layer, each decoder layer contains a third sub-layer that applies multi-head attention over the outputs of the encoder. As in the encoder, residual connections wrap the sub-layers, followed by layer normalization. To guarantee that the predictions for position 𝒊 can depend only on the known outputs at earlier positions, the self-attention sub-layer is masked to prevent positions from attending to subsequent positions, and the output embeddings are offset by one position.

Implementation of a Transformer decoder block:
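The following is a minimal NumPy sketch of one decoder layer, written under the same assumptions as a toy example: hypothetical helper names, layer normalization without learned gain and bias, and small demo dimensions. It is not the TransformerX implementation. The final check illustrates the masking property: changing the last target position leaves the outputs at earlier positions untouched.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-6):
    """Layer normalization without the learned gain and bias, for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)


def multi_head_attention(q_in, kv_in, p, h, mask=None):
    """Queries come from q_in, keys and values from kv_in; mask is True where attention is forbidden."""
    n, m = q_in.shape[0], kv_in.shape[0]
    q = (q_in @ p["w_q"]).reshape(n, h, -1).transpose(1, 0, 2)    # (h, n, d_k)
    k = (kv_in @ p["w_k"]).reshape(m, h, -1).transpose(1, 0, 2)   # (h, m, d_k)
    v = (kv_in @ p["w_v"]).reshape(m, h, -1).transpose(1, 0, 2)   # (h, m, d_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])      # (h, n, m)
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    return (softmax(scores) @ v).transpose(1, 0, 2).reshape(n, -1) @ p["w_o"]


def decoder_block(y, memory, p, h):
    """Masked self-attention, cross-attention over the encoder output, then the FFN."""
    n = y.shape[0]
    causal = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where i < j
    y = layer_norm(y + multi_head_attention(y, y, p["self"], h, mask=causal))
    y = layer_norm(y + multi_head_attention(y, memory, p["cross"], h))
    ffn = np.maximum(0.0, y @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]
    return layer_norm(y + ffn)


def init_attention(rng, d_model):
    """Hypothetical helper: random projection matrices for one attention sub-layer."""
    return {name: rng.normal(scale=0.1, size=(d_model, d_model))
            for name in ("w_q", "w_k", "w_v", "w_o")}


rng = np.random.default_rng(0)
d_model, d_f, h = 64, 256, 8   # small demo sizes (assumption)
p = {"self": init_attention(rng, d_model), "cross": init_attention(rng, d_model),
     "w1": rng.normal(scale=0.1, size=(d_model, d_f)), "b1": np.zeros(d_f),
     "w2": rng.normal(scale=0.1, size=(d_f, d_model)), "b2": np.zeros(d_model)}

memory = rng.normal(size=(7, d_model))   # stand-in for the encoder output
y = rng.normal(size=(5, d_model))        # stand-in for shifted target embeddings
out = decoder_block(y, memory, p, h)

# Replace the last target position; earlier output positions must not change.
y_changed = y.copy()
y_changed[-1] = rng.normal(size=d_model)
out_changed = decoder_block(y_changed, memory, p, h)
print(out.shape, np.allclose(out[:-1], out_changed[:-1]))  # (5, 64) True
```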

3.2. Modules in the transformer

Next, I will discuss the elemental components that comprise the original transformer architecture.

Attention modules
Position-wise feed-forward networks
Residual Connection and Normalization
Positional encoding

3.3. Attention modules

The transformer integrates the Query-Key-Value (QKV) concept from information retrieval with attention mechanisms:

Scaled dot-product attention
Multi-head attention

3.3.1. Scaled dot-product attention

Fig. 3. Scaled Dot-Product Attention. Photo by author.

The scaled dot-product attention is formulated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D_k}}\right) V = A V$$

Eq. 1

where 𝑸 ∈ ℝ^(𝑵×𝐷𝑘), 𝑲 ∈ ℝ^(𝑴×𝐷𝑘), and 𝑽 ∈ ℝ^(𝑴×𝐷𝑣) are representation matrices. The number of queries is denoted by 𝑵 and the number of keys (or values) by 𝑴; 𝐷𝑘 is the dimension of the queries and keys, and 𝐷𝑣 is the dimension of the values. The matrix 𝑨 = softmax(𝑸𝑲ᵀ/√𝐷𝑘) in eq. 1 is usually called the attention matrix. The authors chose dot-product attention over additive attention, which computes the compatibility function using a feed-forward network with a single hidden layer, because it is faster and more space-efficient in practice thanks to highly optimized matrix multiplication routines. Nonetheless, the dot-product has a substantial drawback: for large values of 𝐷𝑘, the dot-products grow large in magnitude and push the softmax function into regions where its gradients are extremely small. To counteract this vanishing gradient issue, the dot-products of the keys and queries are divided by the square root of 𝐷𝑘, which is why it is called scaled dot-product attention.

Implementation of a dot-product attention block:
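A minimal NumPy sketch of eq. 1, assuming a single unbatched sequence. The function names and the optional boolean `mask` argument are my own choices for illustration, not the TransformerX API.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v, mask=None):
    """Eq. 1: softmax(Q K^T / sqrt(D_k)) V.

    q: (N, D_k), k: (M, D_k), v: (M, D_v).
    mask: optional boolean (N, M) array; True marks positions to hide.
    Returns the output (N, D_v) and the attention matrix A (N, M).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)          # (N, M) scaled compatibility scores
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn @ v, attn


rng = np.random.default_rng(0)
N, M, D_k, D_v = 3, 5, 8, 4
q = rng.normal(size=(N, D_k))
k = rng.normal(size=(M, D_k))
v = rng.normal(size=(M, D_v))
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # (3, 4) (3, 5)
```

Each row of the attention matrix is a probability distribution over the 𝑴 keys, so every output row is a weighted average of the value vectors.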

3.3.2. Multi-head attention

Fig. 4. a. Multi-Head Attention. b. The end-to-end flow of tensor operations in multi-head attention. Photo by author.

Instead of applying a single attention function, the Transformer linearly projects the original 𝐷𝑚-dimensional queries, keys, and values to 𝐷𝑘, 𝐷𝑘, and 𝐷𝑣 dimensions, respectively, h times with different learned linear projections. The attention function (eq. 1) is then computed on each of these projections in parallel, yielding h outputs of dimension 𝐷𝑣. The model concatenates them and projects the result back to a 𝐷𝑚-dimensional representation.

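$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$

Eq. 2

where

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})$$

Eq. 3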

The projections are the parameter matrices 𝑾𝑸ᵢ ∈ ℝ^(d_model × dk), 𝑾𝑲ᵢ ∈ ℝ^(d_model × dk), 𝑾𝑽ᵢ ∈ ℝ^(d_model × dv), and 𝑾𝒐 ∈ ℝ^(h·dv × d_model).

This process enables the Transformer to jointly attend to different representation subspaces and positions. To make it more tangible, for a specific adjective, one head might capture the intensity of the adjective, while another one might attend to its negativity and positivity.

Implementation of Multi-head attention
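A minimal NumPy sketch of multi-head attention for a single unbatched sequence, looping over the heads for readability rather than speed. The function signature is an assumption of this sketch, not the TransformerX API. The sizes h = 8 and 𝐷𝑘 = 𝐷𝑣 = 64 are those of the base model in [3].

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o):
    """Concatenate h attention heads and project back to d_model.

    q: (N, d_model); k, v: (M, d_model).
    w_q, w_k: (h, d_model, d_k); w_v: (h, d_model, d_v); w_o: (h * d_v, d_model).
    """
    heads = []
    for wq_i, wk_i, wv_i in zip(w_q, w_k, w_v):
        q_i, k_i, v_i = q @ wq_i, k @ wk_i, v @ wv_i    # per-head projections
        scores = q_i @ k_i.T / np.sqrt(q_i.shape[-1])   # (N, M)
        heads.append(softmax(scores) @ v_i)             # (N, d_v)
    return np.concatenate(heads, axis=-1) @ w_o         # (N, d_model)


rng = np.random.default_rng(0)
h, d_model = 8, 512
d_k = d_v = d_model // h   # 64 per head

w_q = rng.normal(scale=0.02, size=(h, d_model, d_k))
w_k = rng.normal(scale=0.02, size=(h, d_model, d_k))
w_v = rng.normal(scale=0.02, size=(h, d_model, d_v))
w_o = rng.normal(scale=0.02, size=(h * d_v, d_model))

x = rng.normal(size=(10, d_model))   # 10 tokens
y = multi_head_attention(x, x, x, w_q, w_k, w_v, w_o)
print(y.shape)  # (10, 512)
```

Practical implementations usually fuse the per-head projections into single matrices and compute all heads with one batched matrix multiplication; the loop here only mirrors the per-head formulation.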

Multi-head attention has three hyperparameters that determine the tensor dimensions:

The number of attention heads
Model size (embedding size): the length of the embedding vector.
Query, key, and value size: Query, key, and value weight sizes used by linear layers which output queries, keys, and values matrices

3.4. Attention variants in the Transformer

Three different ways to use attention have been employed in the original Transformer paper which are distinct in terms of the way the keys, queries, and values are fed into the attention function.

Self-attention
Masked Self-attention (autoregressive or causal attention)
Cross-attention

3.4.1. Self-attention

All key, query, and value vectors come from the same sequence. In the Transformer encoder, this sequence is the output of the previous layer, which allows each position in the encoder to simultaneously attend to all positions in the previous layer, i.e. 𝑸 = 𝑲 = 𝑽 = 𝑿 (the previous encoder outputs).

Fig. 5. Self-attention tensor operations. Photo by author.

3.4.2. Masked Self-attention (autoregressive or causal attention)

Unlike in the encoder, in the self-attention of the decoder each query is restricted to the key-value pairs at its own and preceding positions in order to maintain the auto-regressive property. This can be implemented by masking the invalid positions, i.e. setting their scores to negative infinity before the softmax: 𝑨𝒊𝒋 = −∞ if 𝒊 < 𝒋.
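The masking rule can be sketched in a few lines of NumPy. The helper name `causal_mask` is hypothetical, and uniform zero scores are used only so that the effect of the mask is easy to read off.

```python
import numpy as np


def causal_mask(n):
    """Boolean (n, n) mask; True where query i must not see key j (i < j)."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


n = 4
scores = np.zeros((n, n))                           # uniform raw scores, for illustration
masked = np.where(causal_mask(n), -np.inf, scores)  # future positions get -inf
attn = softmax(masked)
print(attn)
# Row i spreads its weight evenly over positions 0..i and gives 0 to the rest:
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]   (approximately)
#  [0.25 0.25 0.25 0.25]]
```

Since exp(−∞) = 0, masked positions receive exactly zero attention weight and the remaining weights still sum to 1.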

3.4.3. Cross-attention

This type of attention obtains its queries from the previous decoder layer, whereas the keys and values come from the encoder outputs. This is essentially the attention used in the encoder-decoder attention mechanisms of sequence-to-sequence models. In other words, cross-attention combines two different embedding sequences of the same embedding dimension, deriving its queries from one sequence and its keys and values from the other. Assume S1 and S2 are two embedding sequences: cross-attention obtains its keys and values from S1 and its queries from S2, then calculates the attention scores and produces a result sequence with the length of S2. In the case of the Transformer, the keys and values are derived from the encoder and the queries from the outputs of the previous decoder layer.

Fig. 6. Cross-attention tensor operations. Photo by author.

It is worth mentioning that the two input embedding sequences can be of different modalities (e.g. text, image, audio, etc.).
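To make the shapes concrete, here is a small single-head NumPy sketch of cross-attention. The function name, the projection matrices, and the sequence lengths are assumptions chosen for illustration. Note that the output has the length of S2 (the query sequence), while the attention matrix has one column per element of S1.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def cross_attention(s2, s1, w_q, w_k, w_v):
    """Single-head cross-attention: queries from s2, keys and values from s1."""
    q = s2 @ w_q                                    # (len(s2), d)
    k = s1 @ w_k                                    # (len(s1), d)
    v = s1 @ w_v                                    # (len(s1), d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (len(s2), len(s1))
    return attn @ v, attn


rng = np.random.default_rng(0)
d = 16
s1 = rng.normal(size=(7, d))   # e.g. encoder outputs, source length 7
s2 = rng.normal(size=(3, d))   # e.g. decoder states, target length 3
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, attn = cross_attention(s2, s1, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (3, 16) (3, 7)
```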

3.5. Position-wise FFN

Within each encoder and decoder layer, the attention sub-layers are followed by a position-wise fully connected feed-forward network. It is applied to each position separately and identically; however, its parameters differ from layer to layer. It consists of two linear layers with a ReLU activation in between, which is equivalent to two convolutions with kernel size 1.

$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2$$

Eq. 4

where 𝒙 is the previous layer’s output, 𝑾₁ ∈ ℝ^(𝐷_model × 𝐷𝑓), 𝑾₂ ∈ ℝ^(𝐷𝑓 × 𝐷_model), 𝒃₁ ∈ ℝ^𝐷𝑓, and 𝒃₂ ∈ ℝ^𝐷_model are trainable parameters, and the inner-layer dimension 𝐷𝑓 is generally set to be larger than 𝐷_model (in the case of the Transformer, 𝐷_model = 512 and 𝐷𝑓 = 2048).

Implementation of position-wise FFN:
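A minimal NumPy sketch of the position-wise FFN with the dimensions quoted above (𝐷_model = 512, 𝐷𝑓 = 2048). The random initialization is an arbitrary choice for the demo. The last lines check the "position-wise" property: feeding a single position through the network gives the same result as the corresponding row of the full output.

```python
import numpy as np


def position_wise_ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position (row) of x."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2


rng = np.random.default_rng(0)
d_model, d_f = 512, 2048
w1 = rng.normal(scale=0.02, size=(d_model, d_f))
b1 = np.zeros(d_f)
w2 = rng.normal(scale=0.02, size=(d_f, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))   # 10 positions
y = position_wise_ffn(x, w1, b1, w2, b2)
print(y.shape)  # (10, 512)

# Position 3 processed on its own matches row 3 of the full output.
y_single = position_wise_ffn(x[3:4], w1, b1, w2, b2)
print(np.allclose(y_single, y[3:4]))  # True
```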

3.6. Residual connection and normalization

Wrapping each module with residual connections enables deeper architectures while avoiding gradient vanishing/explosion. Therefore, the Transformer employs residual connections around modules followed by a layer normalization. It can be formulated as follows:

𝑿′ = LayerNorm(SelfAttention(𝑿) + 𝑿)
𝑿″ = LayerNorm(FFN(𝑿′) + 𝑿′)

Fig. 7. Residual connections and layer normalization. Photo by author.

Implementation of residual connection and normalization:
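A minimal NumPy sketch of the "add and norm" wrapper. The name `add_and_norm` is hypothetical, and `np.tanh` is used purely as a stand-in for a real sub-layer such as self-attention or the FFN.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each position over its feature axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def add_and_norm(x, sublayer, gamma, beta):
    """LayerNorm(x + Sublayer(x)): residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x), gamma, beta)


rng = np.random.default_rng(0)
d_model = 16
x = rng.normal(size=(4, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)   # identity scale and shift

y = add_and_norm(x, np.tanh, gamma, beta)           # np.tanh stands in for a sub-layer
print(y.shape)  # (4, 16)
```

With the identity scale and shift used here, every output position has approximately zero mean and unit variance across its features.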

3.7. Positional encoding

The Transformer uses an interesting idea to inject a sense of ordering into the input tokens, since it has no recurrence or convolution. Absolute or relative positional information can be used to convey the order of the inputs, and the encodings can be either learned or fixed. Because they are summed with the input embeddings, the positional encodings have the same dimensions as the embeddings. They are added to the input embeddings at the bottoms of the encoder and decoder stacks. Vaswani et al. [3] used fixed positional encodings based on sine and cosine functions; they also experimented with learned positional embeddings [4] and found that the two versions produced nearly identical results. Let 𝑿 be the input representation that contains n tokens of d-dimensional embeddings. The positional encoding produces 𝑿 + 𝑷, where 𝑷 is a positional embedding matrix of the same size. The element in the 𝒊th row and the (2𝒋)th or (2𝒋+1)th column is:

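$$p_{i,\,2j} = \sin\!\left(\frac{i}{10000^{2j/d}}\right)$$

Eq. 5

and

$$p_{i,\,2j+1} = \cos\!\left(\frac{i}{10000^{2j/d}}\right)$$

Eq. 6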

In the positional embedding matrix P, the rows represent the tokens’ positions in the sequence and the columns denote the different positional encoding dimensions.

I have depicted the differences between 4 columns in the matrix 𝑷 in the following visualization. Notice the distinct frequencies for different columns.

Fig. 8. Positional encoding. Photo by author.
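The matrix 𝑷 can be built in a few lines of NumPy. This is a minimal sketch that assumes an even embedding dimension d; the function name and the sizes n = 60 and d = 32 are arbitrary choices for the demo.

```python
import numpy as np


def positional_encoding(n, d):
    """P[i, 2j] = sin(i / 10000^(2j/d)) and P[i, 2j+1] = cos(i / 10000^(2j/d))."""
    assert d % 2 == 0, "this sketch assumes an even embedding dimension"
    pos = np.arange(n)[:, None]                   # (n, 1) token positions i
    two_j = np.arange(0, d, 2)[None, :]           # (1, d/2) even column indices 2j
    angles = pos / np.power(10000.0, two_j / d)   # (n, d/2)
    p = np.zeros((n, d))
    p[:, 0::2] = np.sin(angles)                   # even columns
    p[:, 1::2] = np.cos(angles)                   # odd columns
    return p


n, d = 60, 32
P = positional_encoding(n, d)
X = np.zeros((n, d))     # stand-in for the token embeddings
X_with_pos = X + P       # same shape, so a plain element-wise sum works
print(P.shape)  # (60, 32)
```

Lower column indices oscillate quickly with the position, while higher column indices change slowly.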

3.7.1. Absolute positional information

In this type of positional encoding, the frequency of oscillation differs across the encoding dimensions. By way of example, look at the following binary encodings: the least significant bits (right side) alternate most frequently from one number to the next, while the more significant bits change less often, i.e. the most significant bit is the most stable.

0 -> 000
1 -> 001
2 -> 010
3 -> 011
4 -> 100
5 -> 101
6 -> 110
7 -> 111
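The analogy can be checked by counting how often each bit changes between consecutive numbers. This is a small pure-Python sketch; `bit_flips` is a hypothetical helper written for this illustration.

```python
def bit_flips(numbers, width):
    """Count how often each bit (most significant first) changes between consecutive numbers."""
    bits = [format(n, f"0{width}b") for n in numbers]
    return [sum(a[i] != b[i] for a, b in zip(bits, bits[1:])) for i in range(width)]


flips = bit_flips(range(8), 3)
print(flips)  # [1, 3, 7]
```

The most significant bit changes once, the middle bit three times, and the least significant bit at every step, which mirrors the slow and fast sinusoids in the columns of 𝑷.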

3.7.2. Relative positional information

In addition to absolute positions, the sinusoidal encoding above also makes it easy for the model to learn to attend by relative positions. For any fixed position offset 𝛿, the positional encoding at position 𝒊+𝛿 can be obtained by a linear projection of the encoding at position 𝒊. Let Ψ = 1/10000^(2𝒋/d); then any pair of sine and cosine encodings (columns 2𝒋 and 2𝒋+1 of 𝑷) at position 𝒊 can be linearly projected to position 𝒊+𝛿 for any fixed offset 𝛿:

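$$\begin{bmatrix} \cos(\delta\Psi) & \sin(\delta\Psi) \\ -\sin(\delta\Psi) & \cos(\delta\Psi) \end{bmatrix} \begin{bmatrix} p_{i,\,2j} \\ p_{i,\,2j+1} \end{bmatrix} = \begin{bmatrix} p_{i+\delta,\,2j} \\ p_{i+\delta,\,2j+1} \end{bmatrix}$$

Eq. 7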
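The linear relation can be verified numerically. In this short NumPy check, the values of d, 𝒋, 𝒊, and 𝛿 are arbitrary; the 2×2 projection matrix depends only on the offset 𝛿 and the column index 𝒋, not on the position 𝒊.

```python
import numpy as np

d, j, i, delta = 32, 3, 5, 4
psi = 1.0 / 10000 ** (2 * j / d)

# Sine/cosine pair of the positional encoding at position i (columns 2j and 2j+1).
p_i = np.array([np.sin(i * psi), np.cos(i * psi)])

# Projection matrix for offset delta; note that it does not involve i.
projection = np.array([[np.cos(delta * psi), np.sin(delta * psi)],
                       [-np.sin(delta * psi), np.cos(delta * psi)]])

# The same pair at position i + delta, computed directly.
p_shifted = np.array([np.sin((i + delta) * psi), np.cos((i + delta) * psi)])

print(np.allclose(projection @ p_i, p_shifted))  # True
```

The identity follows from the angle-addition formulas for sine and cosine.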

4. Motivations behind using self-attention

The authors of the "Attention is All You Need" [3] paper considered multiple criteria when comparing self-attention to convolutional and recurrent layers. These desiderata can be grouped into the four classes listed below the table:

Table 1. Computational complexity per layer, minimum sequential operations taking place in each layer, and maximum path length. Where n denotes the sequence length, d represents the dimension of the representation, k stands for convolutions’ kernel size, and r indicates the neighborhood size in the restricted self-attention. Table from [3].

Computational complexity: the total amount of computational complexity per layer
Parallelization: to what extent can the computations be parallelized
Learning long-range dependencies: the capability of handling long-range dependencies in the network
Interpretability: the capability of inspecting the learned distributions and the ability to attend to semantic and syntactic features of the inputs

Table 1 shows that self-attention is cheaper than recurrent layers in terms of per-layer computational complexity when the sequence length n is smaller than the representation dimension d, which is the common case for the sentence representations used by SOTA translation models, such as word-piece [5] and byte-pair [6] representations. For very long input sequences, restricted self-attention reduces the computational complexity by letting each output position attend only to a neighborhood of size r in the input sequence around that position. A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions; doing so requires a stack of 𝑶(n/k) convolutional layers when using contiguous kernels, or 𝑶(log_k(n)) layers for dilated convolutions [7]. This stacking of layers in turn lengthens the longest paths between any two positions in the network. Convolutional layers are typically k times more computationally expensive than their recurrent counterparts. It is worth mentioning that separable convolutions are considerably less complex; yet, even in the best case, their complexity is comparable to that of a self-attention layer and a feed-forward layer combined.
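To get a feeling for these trade-offs, the leading-order per-layer costs reported in Table 1 of [3] (n²·d for self-attention, n·d² for recurrent layers, k·n·d² for convolutional layers, and r·n·d for restricted self-attention) can be plugged into a few lines of Python. This is only a rough sketch that ignores constant factors; the kernel size k = 3, the neighborhood size r = 64, and the sequence lengths are illustrative assumptions.

```python
def per_layer_cost(n, d, k=3, r=64):
    """Leading-order operation counts per layer, following Table 1 of [3].

    The defaults for k (kernel size) and r (neighborhood size) are illustrative assumptions.
    """
    return {
        "self-attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
        "restricted self-attention": r * n * d,
    }


short_seq = per_layer_cost(n=50, d=512)     # n < d, typical sentence length
long_seq = per_layer_cost(n=5000, d=512)    # n > d, very long sequence

print(short_seq["self-attention"] < short_seq["recurrent"])                 # True
print(long_seq["self-attention"] < long_seq["recurrent"])                   # False
print(long_seq["restricted self-attention"] < long_seq["self-attention"])   # True
```

The comparison between self-attention and recurrent layers flips once n exceeds d, which is where restricted self-attention becomes attractive.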

5. Research frontiers

The recent variants have tried to improve the performance of the original work by further exploring different lines of improving the architecture i.e.:

Efficiency: The computation and memory cost of self-attention grows quadratically with the sequence length, which becomes a bottleneck when processing long sequences. This has impelled researchers to introduce lightweight attention solutions (e.g. sparse attention variants) and divide-and-conquer methods (e.g. recurrent and hierarchical mechanisms).
Generalization: Transformers need immense amounts of training data because they make few assumptions about the structure of the input data. Hence, efforts such as introducing structural bias or regularization, pre-training on large-scale unlabeled data, etc. have been made to cope with this hurdle.
Adaptation: Because transformers can be applied in various areas, researchers have attempted to adapt them to specific downstream tasks.

6. Questions

In this section, I invite you to ponder the following questions and share your answers in the comments.

What happens if you replace scaled dot-product attention with additive attention in the transformer?
If we want to use a transformer for language modeling, should we use an encoder, a decoder, or both?
What happens if the inputs of a transformer are too long? How can we deal with it?
What can we do to improve the computational and memory efficiency of the transformers?

Speaking of which, leave a comment describing which parts you find confusing or ambiguous, how this article made a difference for you, and what other topics about writing on Medium you’d like answers to.

We can discuss them further on the Discord server.

7. Summary

In this article, you learned about the Transformer architecture and its implementation, and saw the breakthroughs it has brought to domains such as machine translation and computer vision, while reducing model complexity and making models more interpretable. Another key property of the Transformer is that its attention heads can be computed in parallel, since it relies purely on multi-head self-attention instead of recurrent or convolutional layers. Now you are familiar with the main components of the Transformer.

I hope you’ve found this article helpful. If you have, please share it on your favorite social media channel so others can find it, too.

I write about state-of-the-art research in machine learning and other tech topics. If any of those are of interest to you, check them out and follow me.

8. TransformerX library

TransformerX is a Python library that provides researchers, students, and professionals with the building blocks needed for developing, training, and evaluating transformers, and it integrates smoothly into TensorFlow (we will add support for PyTorch and JAX soon). We are actively working on adding more awesome features. (We’ve recently released its first version, and I would greatly appreciate you giving us a 🌟 on GitHub 🌹)

I would like to kindly ask you to follow me on GitHub and if you feel like contributing to a cutting-edge deep learning library (TransformerX), feel free to reach out to me, we are looking forward to hearing from you. We will guide you through every step to make your first contribution.

You can also join us on the TransformerX Discord server and Twitter, so we will be in touch.


9. References

[1] J. R. Anderson. Cognitive Psychology and Its Implications. Worth Publishers, 2005.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In: ICLR.
[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
[4] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[6] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017.
[7] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2, 2017.

Written By

Soran Ghaderi


URL: https://towardsdatascience.com/transformers-in-action-attention-is-all-you-need-ac10338a023a/