Working of Encoders in Transformers

Last Updated : 27 Jun, 2025

An encoder is a neural network component that transforms input sequences (like text) into meaningful numerical representations called embeddings. In transformers, the encoder processes the entire input sequence at once to capture relationships between all positions. It maps a variable-length input sequence to an output sequence of the same length in which each position is a fixed-dimensional feature vector. A common use case is encoding a sentence for classification or question answering.

Encoder-Decoder Architecture in Transformers

Encoders in Transformers

The encoder is the first half of the transformer model and builds the internal representation of the input tokens. Rather than merely compressing the input into a vector space, it encodes dependencies between tokens using self-attention, an operation that is parallel (all positions are processed at once) and non-local (any position can attend to any other). As a result, the encoder learns position-aware features without relying on recurrence or convolution. Its key characteristics are:

Captures global context while retaining order information
Uses layer normalization and residual connections for stability
Stacks multiple identical layers for deeper understanding
Attends to both past and future tokens simultaneously

Role of Encoders

Within the transformer architecture, the encoder plays the following role:

Acts as the first major block in the transformer model
Takes input embeddings and generates contextual representations, one per input token
Applies multi-head self-attention followed by a feed-forward network in each encoder layer
In machine translation, the encoder processes the source language sentence (e.g., "Hello world") and creates rich representations that capture the meaning, context, and relationships between words, which can then be used by a decoder to generate the target language translation.

Working Principle of Encoders

The encoder processes its input in the following steps:

Input Embedding: convert tokens to vector representations
Positional Encoding: add position information to the input embeddings
Multi-Layer Processing: apply N identical layers sequentially
Inside each layer: apply multi-head self-attention, then a non-linear transformation using a Feed Forward Network
Output Representation Generation: the output of the last layer is the encoded sequence

Architecture of Encoders in Transformers

Working of Encoders in Transformer

1. Installing Dependencies

The implementation uses PyTorch (the torch package and its torch.nn module) together with Python's built-in math module.
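A minimal setup sketch, assuming PyTorch is installed from PyPI under the package name torch. The math module is part of the Python standard library and needs no installation.

```python
# Install PyTorch first if needed (shell command, shown here as a comment):
#   pip install torch
import math

import torch
import torch.nn as nn

# Quick sanity check that the imports resolved.
print("PyTorch version:", torch.__version__)
```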

2. Positional Encoding

Transformers don’t have recurrence or convolution, so they need positional information to understand the order of tokens.

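A positional encoding module adds sinusoidal positional encodings to the token embeddings. The encodings are deterministic (not learned): each position is mapped to sine and cosine values whose frequency depends on the index of the embedding dimension, which lets the model tell positions apart.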
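The idea can be sketched as a small PyTorch module. This is an illustrative version, not necessarily the original code: the class name PositionalEncoding, the default max_len of 5000, and the (batch, seq_len, d_model) input layout are assumptions, and d_model is assumed to be even.

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position signals to a batch of token embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        # Positions 0..max_len-1 as a column vector: (max_len, 1).
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        # One frequency per pair of dimensions: 1 / 10000^(2i / d_model).
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32)
            * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # A buffer moves with the module (e.g. to GPU) but is not trained.
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the first seq_len encodings.
        return x + self.pe[:, : x.size(1)]


pos_enc = PositionalEncoding(d_model=8, max_len=16)
embeddings = torch.zeros(2, 5, 8)  # zeros make the added signal visible
encoded = pos_enc(embeddings)
print(encoded.shape)  # torch.Size([2, 5, 8])
```

Because the table is precomputed and stored as a buffer, the forward pass is a single slice and add, and every sequence in the batch receives the same position signal.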

3. Multi-Head Self-Attention

This module allows the model to attend to different parts of the sequence simultaneously. It splits the input into multiple "heads", computes scaled dot-product attention for each, and then concatenates the results. This helps capture diverse relationships between tokens more effectively than single-head attention.

Linear projections for Q, K, V
Scaled Dot-Product Attention
Softmax to get attention weights
Concatenate heads
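The four steps above can be sketched as follows. The class name MultiHeadSelfAttention, the attribute names, and the mask convention (a boolean tensor where True means "may be attended to") are assumptions made for this example. A query row whose keys are all masked would produce NaN values, so the sketch assumes every sequence has at least one real token.

```python
import math

import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Linear projections for Q, K, V, plus the output projection.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, t: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        b, s, _ = t.shape
        return t.view(b, s, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x, mask=None):
        q = self._split_heads(self.q_proj(x))
        k = self._split_heads(self.k_proj(x))
        v = self._split_heads(self.v_proj(x))
        # Scaled dot-product attention scores: (batch, heads, seq, seq).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            # Positions where mask is False get zero weight after softmax.
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        context = weights @ v  # (batch, heads, seq, d_head)
        # Concatenate heads back into (batch, seq, d_model).
        b, _, s, _ = context.shape
        context = context.transpose(1, 2).contiguous().view(b, s, -1)
        return self.out_proj(context), weights


torch.manual_seed(0)
attn = MultiHeadSelfAttention(d_model=16, num_heads=4)
x = torch.randn(2, 5, 16)
mask = torch.ones(2, 1, 1, 5, dtype=torch.bool)
mask[1, :, :, 3:] = False  # second sequence: last two tokens are padding
out, weights = attn(x, mask)
print(out.shape, weights.shape)
```

Returning the attention weights alongside the output is a convenience for inspection; a production module might return only the output.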

4. Position-wise Feed-Forward Network

Each token's representation is passed through the same two-layer MLP with ReLU activation, applied to every position independently. This lets the model transform and abstract the attended features, producing richer representations than attention-based mixing alone.
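A minimal sketch of such a network is shown below. The class name PositionwiseFeedForward, the hidden size d_ff, and the dropout between the two linear layers are assumptions for illustration.

```python
import torch
import torch.nn as nn


class PositionwiseFeedForward(nn.Module):
    """Two-layer MLP applied to every position of the sequence separately."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # expand to the hidden size
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),  # project back to the model size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # nn.Linear acts on the last dimension only, so tokens never mix here.
        return self.net(x)


torch.manual_seed(0)
ffn = PositionwiseFeedForward(d_model=16, d_ff=64).eval()  # eval disables dropout
x = torch.randn(2, 5, 16)
out = ffn(x)
print(out.shape)  # torch.Size([2, 5, 16])
```

Feeding a single token through the network gives the same result as slicing that token out of the full output, which is what "position-wise" means in practice.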

5. Encoder Layer

This is a single layer of the Transformer encoder. It combines two sub-layers: multi-head self-attention and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection and followed by layer normalization.

This setup helps the model learn stable and expressive representations of sequences.
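One way to sketch a single encoder layer is shown below. To keep the block self-contained, it uses PyTorch's built-in nn.MultiheadAttention in place of a hand-written attention module, which is a substitution made for this example. The class name EncoderLayer and the default sizes are likewise assumptions. Note that the built-in key_padding_mask uses True to mark positions to ignore.

```python
import torch
import torch.nn as nn


class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, num_heads=4, d_ff=256, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Sub-layer 1: self-attention, residual connection, layer norm.
        attn_out, _ = self.self_attn(
            x, x, x, key_padding_mask=key_padding_mask, need_weights=False
        )
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: feed-forward, residual connection, layer norm.
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))
        return x


torch.manual_seed(0)
layer = EncoderLayer().eval()  # eval disables dropout for a repeatable check
x = torch.randn(2, 6, 64)
pad = torch.zeros(2, 6, dtype=torch.bool)
pad[1, 4:] = True  # second sequence: last two positions are padding
out = layer(x, key_padding_mask=pad)
print(out.shape)  # torch.Size([2, 6, 64])
```

This sketch applies normalization after each residual addition (the "post-norm" arrangement described above); some implementations normalize before each sub-layer instead.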

6. Full Encoder (Stack of Encoder Layers)

This stacks multiple Encoder Layer modules to form the full encoder block. It starts with token and positional embeddings, applies dropout, and passes the result through each encoder layer.

Token Embedding
Add Positional Encoding
Pass through N encoder layers

The output is a context-rich representation of the input sequence suitable for downstream tasks like translation or classification.
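The three steps above can be sketched as a single module. This is an illustrative version under stated assumptions: PyTorch's built-in nn.TransformerEncoderLayer stands in for a custom encoder layer so that the block stays self-contained, and the names Encoder, sinusoidal_table, and padding_mask, as well as the default sizes, are hypothetical. d_model is assumed to be even.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_table(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of sin/cos positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table


class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model=64, num_layers=2, num_heads=4,
                 d_ff=128, max_len=512, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.register_buffer("pos_table", sinusoidal_table(max_len, d_model))
        self.dropout = nn.Dropout(dropout)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model, num_heads, d_ff, dropout, batch_first=True
            )
            for _ in range(num_layers)
        ])

    def forward(self, tokens, padding_mask=None):
        # tokens: (batch, seq_len) integer ids; padding_mask: True = padding.
        x = self.embedding(tokens)                 # 1. token embedding
        x = x + self.pos_table[: tokens.size(1)]   # 2. add positional encoding
        x = self.dropout(x)
        for layer in self.layers:                  # 3. N encoder layers
            x = layer(x, src_key_padding_mask=padding_mask)
        return x


torch.manual_seed(0)
enc = Encoder(vocab_size=50).eval()
tokens = torch.randint(1, 50, (2, 7))
pad = torch.zeros(2, 7, dtype=torch.bool)
pad[1, 5:] = True
out = enc(tokens, padding_mask=pad)
print(out.shape)  # torch.Size([2, 7, 64])
```

The output keeps the input's batch size and sequence length; only the last dimension changes, from token ids to d_model features per token.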

7. Example Usage

In this example, the encoder is initialized with hyperparameters (embedding size, number of layers/heads, etc.). A random batch of token sequences is passed through, along with a mask that tells attention to ignore padded tokens. The output holds the encoded features, one vector per token, so its shape is (batch size, sequence length, embedding size).
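The usage described above might look like the following sketch. As an assumption made to keep the block self-contained, PyTorch's built-in nn.TransformerEncoder stands in for the hand-written encoder, and positional encoding is omitted. The hyperparameter values and pad_id = 0 are illustrative choices, not values taken from the original code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative hyperparameters (assumed values).
vocab_size, d_model, num_heads, num_layers, d_ff = 1000, 64, 4, 2, 256
batch_size, seq_len, pad_id = 2, 10, 0

embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
layer = nn.TransformerEncoderLayer(
    d_model, num_heads, d_ff, dropout=0.1, batch_first=True
)
# enable_nested_tensor=False keeps the output padded to the full seq_len.
encoder = nn.TransformerEncoder(
    layer, num_layers=num_layers, enable_nested_tensor=False
)
encoder.eval()  # disable dropout for a repeatable forward pass

# Random batch of token ids; ids start at 1 so pad_id never appears by chance.
tokens = torch.randint(1, vocab_size, (batch_size, seq_len))
tokens[1, 7:] = pad_id           # pad the tail of the second sequence
padding_mask = tokens == pad_id  # True marks padded positions to ignore

encoded = encoder(embedding(tokens), src_key_padding_mask=padding_mask)
print(encoded.shape)  # torch.Size([2, 10, 64])
```

The rows of the output at padded positions are still computed, but downstream code would normally skip them using the same mask.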

Applications of Transformer Encoders

Sentence classification
Named Entity Recognition (NER)
Question Answering Systems
Document Embeddings
Machine translation

URL: https://www.geeksforgeeks.org/deep-learning/working-of-encoders-in-transformers/