URL: https://www.geeksforgeeks.org/deep-learning/working-of-decoders-in-transformers/

Working of Decoders in Transformers

Last Updated : 24 Jun, 2025

A decoder in deep learning, especially in Transformer architectures, is the part of the model responsible for generating output sequences from encoded representations. In sequence-to-sequence tasks like machine translation, text summarization, or image captioning, the decoder takes the output from the encoder and converts it into a target language or format. It does this step-by-step, attending to both the encoded input and the already generated outputs.

[Figure: Encoder-Decoder Architecture in Transformers]

Decoders in Transformers

  • Autoregressive generation: Predicts one token at a time, using previously generated tokens.
  • Masked self-attention: Applies a causal mask so that each position attends only to earlier positions, preventing information leakage from future tokens.
  • Encoder-decoder attention: Aligns output tokens with relevant parts of the input. During training, all target positions can be processed in parallel because the mask hides future tokens.
  • Stacked architecture: Typically has multiple identical layers (e.g., 6 in the original Transformer).
  • Positional encoding: Adds order information to input embeddings.
  • Flexible output: Can be used for both classification and generation tasks.
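
The autoregressive behaviour listed above can be illustrated with a toy greedy decoding loop. This is a minimal sketch in plain Python, not the article's code: `greedy_decode`, `toy_scores`, and the `BOS`/`EOS` ids are hypothetical names, and `toy_scores` stands in for a real decoder that would return one score per vocabulary entry.

```python
from typing import Callable, List

BOS, EOS = 0, 1  # hypothetical start-of-sequence and end-of-sequence token ids


def greedy_decode(next_token_scores: Callable[[List[int]], List[float]],
                  max_len: int = 10) -> List[int]:
    """Generate one token at a time, feeding each prediction back as input."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = next_token_scores(tokens)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id == EOS:  # stop once the end token is produced
            break
    return tokens


def toy_scores(prefix: List[int]) -> List[float]:
    """Hypothetical stand-in for a decoder: emits 2, 3, 4 and then EOS."""
    vocab_size = 6
    last = prefix[-1]
    if last == BOS:
        target = 2
    elif 2 <= last < 4:
        target = last + 1
    else:
        target = EOS
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(greedy_decode(toy_scores))  # [0, 2, 3, 4, 1]
```

Each iteration sees only the tokens generated so far, which is the same constraint the causal mask enforces during training.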

Role of Decoders

  • The encoder transforms the input sequence into a vector representation.
  • The decoder takes this representation and produces the output sequence, attending to both its own previously generated tokens (self-attention) and the encoder's output (encoder-decoder attention).

Working Principle

[Figure: Architecture and Working of Decoders in Transformers]
  1. Input Embeddings are passed into the decoder with positional encodings.
  2. Masked Self-Attention Layer ensures the model can't "see" future tokens.
  3. Encoder-Decoder Attention allows the decoder to focus on relevant input tokens.
  4. Feedforward Layers refine representations.
  5. A linear layer maps the final output to the vocabulary space.
  6. Softmax provides a probability distribution over tokens for generation.
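
The last two steps can be sketched in PyTorch as follows. The sizes are illustrative assumptions, and `decoder_states` is a random stand-in for the output of the decoder stack rather than the result of a real model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab_size = 16, 50   # illustrative sizes, not taken from the article
seq_len, batch_size = 4, 2

# Random stand-in for the decoder stack output: (seq_len, batch_size, d_model)
decoder_states = torch.randn(seq_len, batch_size, d_model)

to_vocab = nn.Linear(d_model, vocab_size)   # step 5: map to vocabulary space
logits = to_vocab(decoder_states)           # (seq_len, batch_size, vocab_size)
probs = torch.softmax(logits, dim=-1)       # step 6: distribution over tokens

# Greedy choice of the next token from the last position of each sequence
next_token = probs[-1].argmax(dim=-1)       # (batch_size,)

print(logits.shape, next_token.shape)
```

Taking the argmax is only one decoding strategy; sampling or beam search would read from the same probability distribution.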

Components of Transformer Decoder

Each decoder layer contains:

  1. Masked Multi-Head Self-Attention: Computes attention on previously generated tokens. Uses a causal mask to prevent future information leakage.
  2. Multi-Head Encoder-Decoder Attention: Attends to encoder outputs.
  3. Feedforward Network: Applies two linear transformations with a ReLU in between.
  4. Layer Normalization and Residual Connections: Stabilize training and speed up convergence.
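  5. Positional Encoding: Adds token position information. Unlike the components above, it is applied once to the input embeddings before the first decoder layer rather than inside every layer.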
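
To make the causal mask in the first component concrete, the sketch below builds one for a 4-token sequence and applies it to uniform attention scores. The variable names are illustrative assumptions; only standard PyTorch tensor operations are used.

```python
import torch

seq_len = 4

# True above the diagonal marks future positions a query must not attend to.
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = torch.zeros(seq_len, seq_len)               # uniform raw attention scores
masked = scores.masked_fill(future, float("-inf"))   # block future positions
weights = torch.softmax(masked, dim=-1)              # rows sum to 1 over visible tokens

print(weights)
```

Row i of `weights` spreads attention evenly over positions 0 to i and assigns exactly zero to every later position.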

Example Use case

[Figure: Machine translation using Transformers]

In English-to-French translation, the encoder processes the English sentence, and the decoder generates the French sentence one word at a time, using previously generated words and attention to the encoded sentence.

Mathematical Representation

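  • Masked Self-Attention: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$, where $d_k$ is the key dimension and the causal mask $M$ has $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ otherwise.
  • Encoder-Decoder Attention: $\text{Attention}(Q_d, K_e, V_e) = \text{softmax}\left(\frac{Q_d K_e^\top}{\sqrt{d_k}}\right)V_e$, with queries $Q_d$ computed from the decoder and keys and values $K_e$, $V_e$ computed from the encoder output.
  • Feedforward Network: $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$

Each decoder layer can be defined as (following the post-norm arrangement of the original Transformer):

$$Z_1 = \text{LayerNorm}(X + \text{MaskedSelfAttention}(X))$$
$$Z_2 = \text{LayerNorm}(Z_1 + \text{EncDecAttention}(Z_1, E))$$
$$\text{Output} = \text{LayerNorm}(Z_2 + \text{FFN}(Z_2))$$
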
Where,

  • X: Input to the decoder
  • E: Encoder output
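
The two attention computations can be sketched as one small function. This is a simplified illustration under stated assumptions: it uses a single head, omits the learned query, key, and value projections, and the function name `attention` is hypothetical.

```python
import math
import torch


def attention(q, k, v, mask=None):
    """Scaled dot-product attention with an optional additive mask."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores + mask  # -inf entries receive zero weight after softmax
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
tgt_len, src_len, d_k = 3, 5, 8
X = torch.randn(tgt_len, d_k)  # decoder-side input
E = torch.randn(src_len, d_k)  # encoder output

# Causal mask: 0 on and below the diagonal, -inf above it
M = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

self_out = attention(X, X, X, M)        # masked self-attention over X
cross_out = attention(self_out, E, E)   # queries from decoder, keys/values from E

print(self_out.shape, cross_out.shape)
```

Because the first position can attend only to itself, the first row of `self_out` equals the first row of `X`. The cross-attention output keeps the target length even though the source has a different length.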

Transformer Decoder Implementation

1. Imports

PyTorch and Python's built-in math module are imported for model building and numerical operations.

2. PositionalEncoding class

  • Adds sinusoidal positional information to token embeddings.
  • Helps the model understand token positions since transformers lack recurrence.
  • Values are added to embeddings before input to the attention layers.
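
A minimal sketch of such a class is shown below. It follows the widely used sinusoidal formulation and is not necessarily identical to the article's code; it assumes an even `d_model`, sequence-first inputs of shape (seq_len, batch_size, d_model), and a hypothetical default `max_len` of 5000.

```python
import math
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to token embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 0, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe)  # saved with the model, not trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, batch_size, d_model); the table broadcasts over the batch
        return x + self.pe[: x.size(0)]


pos_enc = PositionalEncoding(d_model=8)
out = pos_enc(torch.zeros(5, 2, 8))
print(out.shape)  # torch.Size([5, 2, 8])
```

Registering the table as a buffer moves it to the right device with the module without treating it as a learnable parameter.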

3. TransformerDecoderLayer class

Defines one decoder layer containing:

  • Masked multi-head self-attention to attend to previous tokens.
  • Multi-head encoder-decoder attention to focus on encoder output.
  • Feedforward network for non-linear transformation.
  • Layer normalization and dropout for training stability.
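
One way to write such a layer with PyTorch building blocks is sketched below. It is an assumption-laden reconstruction rather than the article's exact code: it uses the post-norm ordering, a shared dropout module, and argument names (`d_model`, `num_heads`, `d_ff`) chosen for illustration.

```python
import torch
import torch.nn as nn


class TransformerDecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, cross-attention, feedforward."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt, memory, tgt_mask=None):
        # Masked self-attention over previously generated tokens
        attn, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        x = self.norm1(tgt + self.dropout(attn))
        # Encoder-decoder attention: queries from decoder, keys/values from encoder
        attn, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + self.dropout(attn))
        # Position-wise feedforward network
        return self.norm3(x + self.dropout(self.ffn(x)))


torch.manual_seed(0)
layer = TransformerDecoderLayer(d_model=32, num_heads=4, d_ff=64)
layer.eval()  # disable dropout so the output is deterministic
tgt = torch.randn(5, 2, 32)      # (tgt_len, batch_size, d_model)
memory = torch.randn(7, 2, 32)   # (src_len, batch_size, d_model)
mask = torch.triu(torch.full((5, 5), float("-inf")), diagonal=1)
out = layer(tgt, memory, tgt_mask=mask)
print(out.shape)  # torch.Size([5, 2, 32])
```

Each sub-layer is wrapped in a residual connection followed by layer normalization, so the output keeps the shape of `tgt`.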

4. TransformerDecoder class

  • Builds the complete decoder by stacking multiple decoder layers.
  • Converts token indices to embeddings.
  • Adds positional encodings.
  • Applies a sequence of decoder layers.
  • Uses a final linear layer to map outputs to vocabulary logits.
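
The sketch below assembles these steps into a single module. To keep the block self-contained it substitutes PyTorch's built-in `nn.TransformerDecoderLayer` for a hand-written layer and inlines the sinusoidal table; the constructor arguments and the `max_len` default are assumptions for illustration, not the article's exact signature.

```python
import math
import torch
import torch.nn as nn


class TransformerDecoder(nn.Module):
    """Embedding + positional encoding + stacked decoder layers + vocab projection."""

    def __init__(self, vocab_size, d_model, num_heads, d_ff, num_layers, max_len=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Fixed sinusoidal position table (assumes an even d_model)
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(pos * div)
        pe[:, 0, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # Built-in layer used as a stand-in for a custom decoder layer
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, tgt, memory):
        # tgt: (tgt_len, batch_size) token ids; memory: (src_len, batch_size, d_model)
        seq_len = tgt.size(0)
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tgt.device), diagonal=1
        )
        x = self.embedding(tgt) + self.pe[:seq_len]
        for layer in self.layers:
            x = layer(x, memory, tgt_mask=mask)
        return self.fc_out(x)  # (tgt_len, batch_size, vocab_size) logits


model = TransformerDecoder(vocab_size=100, d_model=32, num_heads=4, d_ff=64, num_layers=2)
demo_logits = model(torch.randint(0, 100, (5, 3)), torch.randn(7, 3, 32))
print(demo_logits.shape)  # torch.Size([5, 3, 100])
```

Building the causal mask inside `forward` means callers only supply token ids and encoder output, at the cost of recreating the mask on every call.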

5. Hyperparameter setup

This step sets the embedding dimension, the number of attention heads, the feedforward hidden size, the number of decoder layers, the vocabulary size (number of output tokens), and the input shape used for a dummy test.
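
As an illustration, such a setup might look like the following. The values are assumptions: the first four match the base configuration of the original Transformer, while the vocabulary size and dummy input shape are arbitrary and not taken from the article.

```python
# Illustrative hyperparameters (assumed values, not the article's settings)
d_model = 512        # embedding dimension
num_heads = 8        # attention heads
d_ff = 2048          # feedforward hidden size
num_layers = 6       # decoder layers
vocab_size = 10000   # number of output tokens
seq_len, batch_size = 20, 32   # shape of the dummy input

# Multi-head attention splits d_model evenly across heads
assert d_model % num_heads == 0
head_dim = d_model // num_heads
print(head_dim)  # 64
```

The divisibility check matters because each head operates on a `d_model // num_heads` slice of the embedding.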

6. Model instantiation

An instance of the TransformerDecoder is created using defined hyperparameters.

Sample input:

  • tgt: random integers simulating target token indices.
  • memory: random tensor simulating encoder output.

Forward pass:

  • Inputs are passed through the decoder to get output logits.
  • Output shape is (seq_len, batch_size, vocab_size), suitable for classification of each token position over the vocabulary.
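
The shape contract described above can be checked with a short test. This sketch assumes small illustrative sizes and uses a stand-in model assembled from PyTorch built-ins (`nn.Embedding`, `nn.TransformerDecoder`, `nn.Linear`) rather than the article's own class; positional encoding is omitted because it does not affect shapes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads, d_ff, num_layers = 64, 4, 128, 2        # small illustrative sizes
vocab_size, seq_len, src_len, batch_size = 100, 10, 12, 3

# Stand-in decoder built from PyTorch modules
embed = nn.Embedding(vocab_size, d_model)
stack = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, num_heads, d_ff), num_layers
)
to_vocab = nn.Linear(d_model, vocab_size)

# Sample input
tgt = torch.randint(0, vocab_size, (seq_len, batch_size))   # target token indices
memory = torch.randn(src_len, batch_size, d_model)          # simulated encoder output
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Forward pass
logits = to_vocab(stack(embed(tgt), memory, tgt_mask=causal))
print(logits.shape)  # torch.Size([10, 3, 100])
```

The source length (12) differs from the target length (10) on purpose: the output length always follows the target sequence.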

Output

[Figure: Sample Output]

You can download the source code from .

Applications

  • Machine Translation
  • Text Summarization
  • Speech-to-Text systems
  • Code generation by LLMs