Transformer XL is short for Transformer Extra Long. The Transformer-XL model was introduced in the paper titled "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context," authored by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Natural language processing has progressed significantly in recent years, and Transformer XL has been influential in sequence modeling because it lets a language model use context beyond a single fixed-length segment.
This article explores the key features of the Transformer XL model, namely the segment-level recurrence mechanism and relative positional encoding.
The Transformer was originally developed to solve the problem of sequence-to-sequence tasks, such as machine translation, but has since become a foundational model for various natural language processing (NLP) tasks. The key features of the Transformer are discussed below:
Language modelling is a fundamental task in natural language processing (NLP) and machine learning. It estimates the likelihood of observing a particular sequence in a given language. Language models take into account the context of a word within a sequence. The probability of a word depends on the preceding words, capturing the dependencies and structure of the language. Language models are evaluated based on perplexity metrics, which measure how well the model predicts a given sequence. Lower perplexity indicates better performance.
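The perplexity described above can be illustrated with a small sketch. It assumes the model has already produced a probability for each token given its preceding tokens; the function name `perplexity` and the example probabilities are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given per-token model probabilities.

    token_probs[t] is assumed to be P(x_t | x_<t), each in (0, 1].
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that assigns higher probability to the observed tokens
# gets a lower perplexity.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity([0.1, 0.1, 0.1, 0.1])
```

With a constant probability p per token, the perplexity reduces to 1/p, so the first model behaves as if it were choosing among 2 equally likely tokens at each step and the second among 10.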
The utilization of transformers for language modelling has emerged as a critical element in the field of natural language processing, empowering models to comprehend and produce text that closely resembles human language.
For language modelling, Transformers are currently implemented with a fixed-length context, i.e. a long text sequence is truncated into fixed-length segments of a few hundred tokens, and each segment is processed separately. In the vanilla transformer architecture, there is no information flow across segments. Each segment is processed independently.
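The fixed-length segmentation described above can be sketched as follows. This is a toy illustration with whitespace tokens and a segment length of 4 (real models use a few hundred tokens); `split_into_segments` and `visible_context` are hypothetical helper names, not part of any library.

```python
def split_into_segments(tokens, segment_len):
    """Cut a token sequence into consecutive fixed-length segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [tokens[i:i + segment_len]
            for i in range(0, len(tokens), segment_len)]

def visible_context(segments, seg_idx, pos):
    """Tokens a vanilla Transformer can attend to when predicting
    segments[seg_idx][pos]: only earlier tokens of the SAME segment."""
    return segments[seg_idx][:pos]

tokens = "the cat sat on the mat because it was tired".split()
segments = split_into_segments(tokens, 4)

# "it" is the last token of the second segment, but "cat" lives in the
# first segment and is therefore invisible to the model.
context_for_it = visible_context(segments, 1, 3)

# "was" opens the third segment, so it is predicted with no context at all.
context_for_was = visible_context(segments, 2, 0)
```

The segment boundaries fall wherever the token count dictates, regardless of sentence structure, which is why the first tokens of each segment are predicted with little or no context.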
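Two critical limitations of the vanilla Transformer architecture for language modelling follow from this. First, the longest dependency the model can capture is bounded by the segment length, because no token can attend beyond its own segment. Second, segments are cut at fixed lengths without regard to sentence or semantic boundaries, so the first tokens of each segment are predicted with little context. This problem is known as context fragmentation.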
Transformer XL is an extension of the vanilla Transformer architecture designed to address these challenges in language modelling. It introduces two key features: a segment-level recurrence mechanism and relative positional encoding.
In a standard Transformer, the hidden state at a given position is a vector that encodes information about the token at that position and its relationships with other tokens in the sequence. The hidden state is updated through self-attention mechanisms and feedforward layers in each layer of the Transformer.
The segment-level recurrent mechanism involves updating the hidden states not only within the current segment but also by attending to the hidden states from previous segments. This enables the model to extend its context window beyond the current segment. Let us understand this mathematically,
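Let two consecutive segments of length $L$ be $s_\tau = [x_{\tau,1}, \dots, x_{\tau,L}]$ and $s_{\tau+1} = [x_{\tau+1,1}, \dots, x_{\tau+1,L}]$, and let $h_\tau^{n} \in \mathbb{R}^{L \times d}$ denote the hidden state sequence produced by the $n$-th layer for segment $s_\tau$, where $d$ is the hidden dimension.
The hidden state fed into the $n$-th layer for segment $s_{\tau+1}$ depends not only on the layer $n-1$ hidden state of $s_{\tau+1}$ but also on the layer $n-1$ hidden state of $s_\tau$. The two hidden state sequences are concatenated along the length dimension. This is expressed as
$$\tilde{h}_{\tau+1}^{n-1} = \left[\mathrm{SG}\left(h_{\tau}^{n-1}\right) \circ h_{\tau+1}^{n-1}\right]$$
Here we take the hidden state from the previous layer of the same segment and the hidden state from the previous layer of the preceding segment and concatenate them (denoted by $\circ$). SG denotes stop-gradient: the gradient is not backpropagated into the hidden state of the preceding segment.
This modified hidden state is then used to compute the key and value vectors, while the query is computed from the current segment alone:
$$q_{\tau+1}^{n} = h_{\tau+1}^{n-1} W_q^{\top}, \qquad k_{\tau+1}^{n} = \tilde{h}_{\tau+1}^{n-1} W_k^{\top}, \qquad v_{\tau+1}^{n} = \tilde{h}_{\tau+1}^{n-1} W_v^{\top}$$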
Note that the modified (concatenated) hidden state is used only for K and V. The query is still computed only from the previous layer's hidden state of the current segment. The gradient remains within a segment, but the additional history allows the network to model long-term dependencies and avoid context fragmentation.
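The recurrence for a single layer can be sketched in NumPy as below. This is a minimal single-head illustration under several assumptions: the function name `recurrent_attention` and all shapes are chosen for the example, standard scaled dot-product attention is used, and relative positional encoding is left out. NumPy has no automatic differentiation, so the cached memory is simply a constant here. In a framework with autograd, the stop-gradient would be applied explicitly (for example `.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def recurrent_attention(h_curr, memory, W_q, W_k, W_v):
    """One attention step with segment-level recurrence (single head).

    h_curr: (L, d) layer n-1 hidden states of the current segment
    memory: (M, d) cached layer n-1 hidden states of the previous segment
    """
    L, d = h_curr.shape
    M = memory.shape[0]

    # Concatenate along the length dimension: memory first, then current.
    h_ext = np.concatenate([memory, h_curr], axis=0)   # (M + L, d)

    q = h_curr @ W_q.T   # queries: current segment only, (L, d)
    k = h_ext @ W_k.T    # keys: memory + current, (M + L, d)
    v = h_ext @ W_v.T    # values: memory + current, (M + L, d)

    scores = q @ k.T / np.sqrt(d)                      # (L, M + L)

    # Causal mask: current position i sees all of memory plus
    # current positions 0..i.
    causal = np.tril(np.ones((L, M + L), dtype=bool), k=M)
    scores = np.where(causal, scores, -np.inf)

    return softmax(scores) @ v, causal

rng = np.random.default_rng(0)
L, M, d = 3, 3, 4
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
prev_hidden = rng.normal(size=(M, d))   # cached from the previous segment
h_curr = rng.normal(size=(L, d))
out, mask = recurrent_attention(h_curr, prev_hidden, W_q, W_k, W_v)
```

The output still has one row per token of the current segment, but each row is a weighted sum over up to M + L positions, so the first token of the segment is no longer predicted without context.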
Applying this recurrence mechanism to every two consecutive segments of a corpus essentially creates a segment-level recurrence in the hidden states. Notice that the recurrent dependency between $h_{\tau+1}^{n}$ and $h_{\tau}^{n-1}$ shifts one layer downwards per segment. This can be visualized as below:
Relative Positional Encoding
In the original Transformer paper, the positional encoding vector (U) is added to the token embedding vector (E). The sum is multiplied by the weight matrices $W_q$ and $W_k$ to obtain the query (Q) and key (K) vectors.
The attention score between the tokens at positions i and j is obtained by taking the dot product of the query vector at position i with the key vector at position j.
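Expanding the product, the attention score between two tokens at positions i and j in the original Transformer architecture decomposes into four terms in E and U:
$$A_{i,j}^{\mathrm{abs}} = \underbrace{E_{x_i}^{\top} W_q^{\top} W_k E_{x_j}}_{(a)} + \underbrace{E_{x_i}^{\top} W_q^{\top} W_k U_j}_{(b)} + \underbrace{U_i^{\top} W_q^{\top} W_k E_{x_j}}_{(c)} + \underbrace{U_i^{\top} W_q^{\top} W_k U_j}_{(d)}$$
Here, $E_{x_i}$ is the embedding of the token at position i, $U_i$ is the absolute positional encoding of position i, and $W_q$ and $W_k$ are the query and key weight matrices.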
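The decomposition of the absolute-position attention score can be checked numerically. The sketch below uses random vectors and matrices as stand-ins for the embeddings E, the positional encodings U and the weights $W_q$, $W_k$. The dimension and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

E_i, E_j = rng.normal(size=d), rng.normal(size=d)   # token embeddings
U_i, U_j = rng.normal(size=d), rng.normal(size=d)   # absolute positional encodings
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Direct computation: query and key are built from (embedding + position).
q_i = W_q @ (E_i + U_i)
k_j = W_k @ (E_j + U_j)
score = q_i @ k_j

# The same score written as four separate terms.
term_a = E_i @ W_q.T @ W_k @ E_j   # content of i vs. content of j
term_b = E_i @ W_q.T @ W_k @ U_j   # content of i vs. position of j
term_c = U_i @ W_q.T @ W_k @ E_j   # position of i vs. content of j
term_d = U_i @ W_q.T @ W_k @ U_j   # position of i vs. position of j

matches = bool(np.isclose(score, term_a + term_b + term_c + term_d))
```

Writing the score this way makes it visible which terms depend on the absolute positions i and j, and these are the terms that Transformer XL modifies.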
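The attention score in the Transformer XL architecture replaces every absolute position with a relative one and can be formulated as below:
$$A_{i,j}^{\mathrm{rel}} = \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j}}_{(a)} + \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j}}_{(b)} + \underbrace{u^{\top} W_{k,E} E_{x_j}}_{(c)} + \underbrace{v^{\top} W_{k,R} R_{i-j}}_{(d)}$$
Compared with the absolute formulation, the key-side positional encoding $U_j$ is replaced by a sinusoidal encoding $R_{i-j}$ of the relative distance, the query-side positional terms $U_i^{\top} W_q^{\top}$ are replaced by the trainable vectors $u$ and $v$, and the key weight matrix is split into $W_{k,E}$ for content and $W_{k,R}$ for position.
The four terms can be intuitively understood as: (a) content-based addressing, (b) a content-dependent positional bias, (c) a global content bias, and (d) a global positional bias.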
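The four-term relative attention score can be sketched for a single token pair as follows. The names `sinusoid` and `rel_attention_score`, the dimension, and the random parameters are illustrative assumptions. In particular, the layout of the sinusoidal encoding (sines followed by cosines) is one common choice, not a detail stated in this article.

```python
import numpy as np

def sinusoid(rel_pos, d):
    """Sinusoidal encoding of a relative distance (d assumed even)."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = rel_pos * inv_freq
    return np.concatenate([np.sin(angles), np.cos(angles)])

def rel_attention_score(E_i, E_j, i, j, W_q, W_kE, W_kR, u, v):
    """Unnormalized attention score between positions i and j."""
    R = sinusoid(i - j, E_i.shape[0])      # depends only on the distance i - j
    term_a = E_i @ W_q.T @ W_kE @ E_j      # content-based addressing
    term_b = E_i @ W_q.T @ W_kR @ R        # content-dependent positional bias
    term_c = u @ W_kE @ E_j                # global content bias
    term_d = v @ W_kR @ R                  # global positional bias
    return term_a + term_b + term_c + term_d

rng = np.random.default_rng(0)
d = 8
E_i, E_j = rng.normal(size=d), rng.normal(size=d)
W_q, W_kE, W_kR = (rng.normal(size=(d, d)) for _ in range(3))
u, v = rng.normal(size=d), rng.normal(size=d)

# Same tokens, same distance (3), very different absolute positions.
near_start = rel_attention_score(E_i, E_j, 5, 2, W_q, W_kE, W_kR, u, v)
far_along = rel_attention_score(E_i, E_j, 105, 102, W_q, W_kE, W_kR, u, v)
```

Because no term refers to i or j on its own, the two scores are identical. This is what allows hidden states cached from a previous segment to be reused without any ambiguity about their positions.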
As per the paper:
The original Transformer model uses fixed-length segments and absolute positional encoding, assigning a fixed vector to each token based on its position in the sequence. However, this approach has limitations: it restricts the model's effectiveness on longer sequences and overlooks the relative distances between tokens.
To overcome these challenges, Transformer XL introduced a segment-level recurrence mechanism and relative positional encoding. The segment-level recurrence mechanism reuses the hidden state of the previous layer from the previous segment. The relative encoding method uses a vector for each token pair that is determined by their relative distance. These vectors are incorporated into the attention score, which measures how much each token attends to the others. This enhancement enables the model to capture the context of each token irrespective of its absolute position, and to handle longer sequences more effectively without losing information at segment boundaries.