URL: https://www.geeksforgeeks.org/deep-learning/working-of-encoders-in-transformers/

Working of Encoders in Transformers

Last Updated : 27 Jun, 2025

An encoder is a neural network component that transforms an input sequence (such as text) into numerical representations, often called contextual embeddings. In transformers, the encoder processes the entire input sequence at once to capture relationships between all positions. It maps a variable-length input sequence to an output sequence of the same length in which every token is represented by a vector of fixed dimension. A common use case is encoding a sentence for classification or question answering.

Figure: Encoder-Decoder Architecture in Transformers

Encoders in Transformers

The encoder is the first half of the transformer model and builds the internal representation of the input tokens. It does not merely compress the input into a vector space; it encodes dependencies between tokens using self-attention, which processes all positions in parallel and can relate any two positions directly, however far apart they are. The encoder learns context-dependent, position-aware features without relying on recurrence or convolution. Its key characteristics are:

  • Captures global context while retaining order information
  • Layer normalization and residual connections for stability
  • Stacking of multiple identical layers for deeper understanding
  • Can attend to both past and future tokens simultaneously
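The last point can be illustrated with a small sketch. It assumes PyTorch and uses random queries and keys (the names `encoder_weights` and `causal_weights` are illustrative only): without a mask, every token places some attention weight on every other token, whereas a decoder-style causal mask zeroes out the weights on future positions.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_k = 4, 8
q = torch.randn(seq_len, d_k)  # queries, one row per token
k = torch.randn(seq_len, d_k)  # keys, one row per token

# Scaled dot-product scores between every pair of positions.
scores = q @ k.T / math.sqrt(d_k)

# Encoder-style attention: no mask, so each row covers all positions.
encoder_weights = torch.softmax(scores, dim=-1)

# Decoder-style (causal) attention for contrast: position i cannot see j > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
causal_weights = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print(encoder_weights)  # every entry is positive
print(causal_weights)   # entries above the diagonal are zero
```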

Role of Encoders

Within the transformer architecture, the encoder plays the following role:

  • Acts as the first major block in the transformer model
  • Takes input embeddings and generates representations
  • Each encoder layer applies multi-head self-attention and feed-forward networks
  • In machine translation, the encoder processes the source language sentence (e.g., "Hello world") and creates rich representations that capture the meaning, context, and relationships between words, which can then be used by a decoder to generate the target language translation.

Working Principle of Encoders

The encoder turns input tokens into output representations in the following steps:

  1. Input embedding: convert tokens to vector representations
  2. Positional encoding: add position information to the input embeddings
  3. Multi-layer processing: apply N identical encoder layers sequentially
  4. Feed-forward network: within each layer, apply a non-linear transformation after self-attention
  5. Output representation: the output of the final layer is the encoded sequence
Figure: Architecture of Encoders in Transformers

Working of Encoders in Transformer

1. Installing Dependencies

The implementation relies on PyTorch (torch and its torch.nn module) and Python's built-in math module.

2. Positional Encoding

Transformers don't have recurrence or convolution, so they need positional information to understand the order of tokens.

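A positional encoding module adds sinusoidal positional encodings to the token embeddings. These encodings are deterministic (not learned) and let the model differentiate between positions: each pair of embedding dimensions holds the sine and cosine of the position at a different frequency.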
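A minimal sketch of such a module is shown below. It assumes PyTorch, an even `d_model`, and inputs of shape (batch, seq_len, d_model); the class name `PositionalEncoding` and the `max_len` default are illustrative choices rather than details confirmed by the article.

```python
import math
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position signals to token embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        # One frequency per pair of dimensions: 1 / 10000^(2i / d_model).
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # A buffer is stored with the module but is not a trainable parameter.
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the first seq_len position rows.
        return x + self.pe[:, : x.size(1)]


pos_enc = PositionalEncoding(d_model=16)
embeddings = torch.zeros(2, 5, 16)  # stand-in for real token embeddings
encoded = pos_enc(embeddings)
print(encoded.shape)  # torch.Size([2, 5, 16])
```

Because the table is precomputed once and registered as a buffer, it moves with the module across devices without being updated by the optimizer.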

3. Multi-Head Self-Attention

This module allows the model to attend to different parts of the sequence simultaneously. It splits the input into multiple "heads", computes scaled dot-product attention for each, and then concatenates the results. This helps capture diverse relationships between tokens more effectively than single-head attention.

  • Linear projections for Q, K, V
  • Scaled Dot-Product Attention
  • Softmax to get attention weights
  • Concatenate heads
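The four steps above can be sketched as follows. This is a minimal version assuming PyTorch; the class name `MultiHeadSelfAttention`, the projection names, and the mask convention (a boolean tensor where True means "may attend") are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Linear projections for Q, K, V, plus the output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        # mask: bool, broadcastable to (batch, heads, seq_len, seq_len)
        batch, seq_len, _ = x.shape

        def split_heads(t):
            # (batch, seq_len, d_model) -> (batch, heads, seq_len, d_k)
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention scores.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))

        # Softmax over the key dimension gives the attention weights.
        weights = torch.softmax(scores, dim=-1)
        context = weights @ v  # (batch, heads, seq_len, d_k)

        # Concatenate heads back into a single d_model-sized vector per token.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(context), weights


torch.manual_seed(0)
attention = MultiHeadSelfAttention(d_model=32, num_heads=4)
x = torch.randn(2, 5, 32)
output, weights = attention(x)
print(output.shape, weights.shape)
```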

4. Position-wise Feed-Forward Network

Each token's representation is passed through a two-layer MLP with ReLU activation, applied independently. This enhances the model's ability to transform and abstract the attended features, enabling richer representations beyond just attention-based mixing.
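A minimal sketch of this sub-layer, assuming PyTorch, is given below. The class name `PositionwiseFeedForward`, the hidden size `d_ff`, and the placement of dropout between the two linear layers are illustrative assumptions. The final check demonstrates the "applied independently" property: feeding one token alone gives the same result as feeding it inside the full sequence.

```python
import torch
import torch.nn as nn


class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # expand to the hidden size
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),  # project back to the model size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # nn.Linear acts on the last dimension, so each position is
        # transformed with the same weights and without mixing positions.
        return self.net(x)


torch.manual_seed(0)
ffn = PositionwiseFeedForward(d_model=32, d_ff=128)
ffn.eval()  # disable dropout so the comparison below is deterministic

x = torch.randn(2, 5, 32)
full = ffn(x)              # all five positions at once
single = ffn(x[:, 2:3])    # only the third position
print(torch.allclose(full[:, 2:3], single, atol=1e-5))
```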

5. Encoder Layer

This is a single layer of the Transformer encoder. It combines multi-head self-attention and feed-forward sub-layers, each followed by residual connections and layer normalization.

This setup helps the model learn stable and expressive representations of sequences.
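One way to assemble such a layer is sketched below, assuming PyTorch. To keep the example short it uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for a hand-written attention module, and it applies layer normalization after each residual addition (the "post-norm" arrangement). The class name `EncoderLayer` and its parameter names are illustrative.

```python
import torch
import torch.nn as nn


class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        # Built-in attention used here as a stand-in for a custom module.
        self.self_attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # key_padding_mask: (batch, seq_len) bool, True marks padded tokens.
        attn_out, _ = self.self_attn(
            x, x, x, key_padding_mask=key_padding_mask, need_weights=False
        )
        x = self.norm1(x + self.dropout(attn_out))  # residual + layer norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))   # residual + layer norm
        return x


torch.manual_seed(0)
layer = EncoderLayer(d_model=32, num_heads=4, d_ff=64)
layer.eval()
x = torch.randn(2, 6, 32)
out = layer(x)
print(out.shape)  # torch.Size([2, 6, 32])
```

The residual connections let each sub-layer learn a correction to its input rather than a full replacement, and the input and output shapes match so that layers can be stacked.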

6. Full Encoder (Stack of Encoder Layers)

This stacks multiple Encoder Layer modules to form the full encoder block. It starts with token and positional embeddings, applies dropout, and passes the result through each encoder layer.

  • Token Embedding
  • Add Positional Encoding
  • Pass through N encoder layers

The output is a context-rich representation of the input sequence suitable for downstream tasks like translation or classification.
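The stack described above might look like the following sketch, assuming PyTorch and an even `d_model`. The helper `sinusoidal_table`, the classes `EncoderLayer` and `Encoder`, and all parameter names are illustrative, and the layer again uses the built-in `nn.MultiheadAttention` as a stand-in for a custom attention module.

```python
import math
import torch
import torch.nn as nn


def sinusoidal_table(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos position table of shape (max_len, d_model)."""
    position = torch.arange(max_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table


class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        attn_out, _ = self.self_attn(
            x, x, x, key_padding_mask=key_padding_mask, need_weights=False
        )
        x = self.norm1(x + self.dropout(attn_out))
        return self.norm2(x + self.dropout(self.ffn(x)))


class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff,
                 max_len=512, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.register_buffer("pos_table", sinusoidal_table(max_len, d_model))
        self.dropout = nn.Dropout(dropout)
        self.layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )

    def forward(self, tokens, key_padding_mask=None):
        # tokens: (batch, seq_len) integer ids
        x = self.embed(tokens) + self.pos_table[: tokens.size(1)]  # add positions
        x = self.dropout(x)
        for layer in self.layers:  # pass through N encoder layers
            x = layer(x, key_padding_mask)
        return x


torch.manual_seed(0)
encoder = Encoder(vocab_size=100, d_model=32, num_layers=2, num_heads=4, d_ff=64)
tokens = torch.randint(0, 100, (2, 7))
encoded = encoder(tokens)
print(encoded.shape)  # torch.Size([2, 7, 32])
```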

7. Example Usage

In this example, the encoder is initialized with hyperparameters (embedding size, number of layers/heads, etc.). A random batch of token sequences is passed through, along with a mask to ignore padded tokens during attention. The output holds the encoded features, one vector per token, and printing its shape confirms the expected dimensions (batch size, sequence length, embedding size).
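The usage described above can be sketched as follows. To stay self-contained, this version swaps in PyTorch's built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder` for the hand-written classes, and it omits positional encoding for brevity; the hyperparameter values and the choice of 0 as the padding id are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hyperparameters (arbitrary values chosen for illustration).
vocab_size, d_model, num_heads, num_layers, d_ff = 1000, 64, 4, 2, 128
pad_id = 0
batch_size, seq_len = 2, 10

embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=num_heads,
    dim_feedforward=d_ff,
    dropout=0.1,
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers, enable_nested_tensor=False)

# Random batch of token ids; the second sequence has only 6 real tokens.
tokens = torch.randint(1, vocab_size, (batch_size, seq_len))
tokens[1, 6:] = pad_id

# True marks padded positions that attention should ignore.
padding_mask = tokens == pad_id

output = encoder(embedding(tokens), src_key_padding_mask=padding_mask)
print(output.shape)  # torch.Size([2, 10, 64])
```

Note that the built-in modules expect True in the padding mask to mean "ignore this position", which is the opposite of conventions where 1 means "keep".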

Applications of Transformer Encoders

  • Sentence classification
  • Named Entity Recognition (NER)
  • Question Answering Systems
  • Document Embeddings
  • Machine translation