Diffusion Transformers (DiTs)

Last Updated : 8 Jan, 2026

Diffusion models like DALL-E 2 and Stable Diffusion use U-Net CNNs as their backbone. U-Nets are good at capturing local pixel details but struggle with understanding the overall structure and relationships in an image.

👁 diffusion_models_learn_to_gradually_add_noise

Diffusion Models

A Diffusion Transformer is an AI model that combines transformers with a step by step process called diffusion to gradually create or improve data. Diffusion Transformers (DiTs) replace U-Nets with transformers that work on latent image patches. This allows them to capture global context, generate high quality images efficiently and support conditional generation from text, labels or other inputs.

Need for Diffusion Transformers

U-Nets cannot capture global image relationships effectively.
Transformers can model long range dependencies across the image.
High resolution and efficient image generation is needed.
Conditional generation from text or labels is required.

Core Concepts in Image Generation Models

Convolutional U-Net Architecture: U-Nets use an encoder–decoder design with skip connections to capture both local details and global structure, making them effective for image generation tasks.
Vision Transformers (ViT): ViTs split images into patches and process them with self-attention, allowing the model to learn long-range relationships efficiently.
Classifier-Free Guidance (CFG): CFG strengthens conditioning signals (like text or labels) during sampling to make generated images more aligned with the given prompt.
Latent Diffusion Models (LDMs): Generate images by performing diffusion and denoising in a compressed latent space, making the process faster and more efficient than operating on full-resolution images.

Workflow of DiTs

Diffusion Transformers generate images or videos by adding noise to data and training a model to remove that noise. The overall workflow includes preprocessing data, adding noise, training the model and finally generating new outputs.

Data Preprocessing: The input image or video is split into fixed size patches and each patch is converted into a feature vector that the transformer can process.
Noise Introduction: Noise is gradually added to these feature vectors to create a diffusion process turning clean data into increasingly noisy versions.
Model Training: The Diffusion Transformer is trained to reverse this noise process, learning how to recover clean data from noisy inputs.
Image or Video Generation: After training, the model takes random noise (or noisy data) and denoises it step by step to produce new images or videos using what it learned.

Understanding the DiT Architecture

Diffusion Transformers follow the Latent Diffusion Model (LDM) framework but replace the traditional U-Net with a Vision Transformer (ViT). This transformer-based design helps DiTs model long range relationships scale efficiently and generate high quality images and videos.

👁 dit_block_with_adaln_zero

Diffusion Transformer

The Diffusion Transformer (DiT) architecture can be explained through two closely connected aspects:

latent diffusion workflow which describes how noisy data is processed and denoised
transformer-based building blocks, which define how the model performs this processing.

DiT Diffusion Workflow

Input and Conditioning: DiT operates in latent space where a noised latent tensor (e.g. 32 × 32 × 4) is taken as input. Conditioning information such as timestep embeddings and class labels is embedded and used to guide generation so the output aligns with the desired condition.
Patchify and Tokenization: The noisy latent is split into fixed-size patches using a Patchify operation. Each patch is converted into a token forming a sequence that can be processed by transformer layers.
Forward Diffusion (Noise Addition): During training noise is progressively added to the latent representation following a predefined noise schedule producing increasingly noisy latents at each timestep.
Reverse Diffusion (Denoising): At inference DiT predicts the noise at each step and gradually removes it transforming noisy latent tokens back into a clean latent representation that can be decoded into an image.

Transformer Based Architecture of DiT

1. Token-Based Spatial Modeling: Instead of convolutions DiT treats image patches as tokens and processes them using transformer operations enabling global context modeling across the entire image.

2. Positional Encoding: Positional embeddings are added to the patch tokens so the transformer can preserve spatial relationships between patches.

3. DiT Block with adaLN-Zero: Each DiT block replaces U Net convolutional blocks with transformer components and consists of:

Layer Normalization
Multi-Head Self-Attention to model long-range dependencies across patches
Pointwise Feedforward (MLP) for token wise feature transformation
Adaptive LayerNorm (adaLN-Zero) where timestep and label embeddings generate scale and shift parameters () that modulate both attention and MLP layers
Residual Connections scaled by learnable parameters () ensuring stable training and efficient information flow

This adaLN-Zero design allows conditioning to directly control the behavior of each transformer block.

4. Conditioning Mechanisms: DiTs incorporate conditioning through adaLN for in context embedding, cross attention to mix image and conditioning tokens and optional extra conditioning tokens that enhance semantic alignment and performance in larger models.

5. Output Projection: After multiple DiT blocks, the tokens are normalized, reshaped back into spatial form and passed through a linear layer to predict the noise residual used for denoising.

Step by Step Implementation of DiT for Image Generation

Here we load a pretrained Diffusion Transformer (DiT) model, generates image latents using diffusion sampling with classifier-free guidance, decodes them using a VAE and generated images.

Step 1: Clone Repository & Set Working Directory

Enter the repo so scripts and imports work correctly.
Update PYTHONPATH to make the DiT package importable.

Step 2: Install Required Python Dependencies

Install diffusers and timm libraries required for model, VAE and image utilities.

Step 3: Import Modules & Disable Gradients

Import PyTorch, helper utilities, DiT components and VAE loader.
Disable gradients to enable faster inference.

Step 4: Configure Model Settings & Initialize DiT

Define image size, VAE name and latent size.
Create the DiT model and move it to device.

Step 5: Load Pretrained DiT Checkpoint and VAE

Load official DiT weights for the chosen resolution.
Load Stable Diffusion VAE for decoding latent outputs.

Step 6: Set Sampling Parameters & Class Labels

Fix randomness seed.
Set sampling steps, CFG scale and class labels for conditional generation.

Step 7: Create the Diffusion Sampler

Build diffusion scheduler for the required number of steps.

Step 8: Prepare Latent Noise and Labels for CFG Sampling

Create latent noise for sampling.
Duplicate latents for classifier free guidance.
Prepare label tensors and pack into model_kwargs.

Step 9: Run Diffusion Sampling

Iteratively denoise latents into images.
Apply classifier-free guidance.
Extract the conditioned outputs.

Step 10: Decode Latents

Convert DiT latent outputs to real images using the VAE decoder.

Output:

👁 sampl

Output

The output shows images generated by DiT (Diffusion Transformer) where each image is created by gradually transforming random noise into a latent representation guided by the specified class label. The model uses a diffusion process to refine the noise step by step and then the VAE decodes these latent representations into actual pixel images, producing class-specific visual outputs.

You can download full code from here

U-Net vs LDMs vs Diffusion Transformers

Here we compare Diffusion Transformer with U-Net and LDMs

Feature	U-Net Diffusion Models	Latent Diffusion Models	Diffusion Transformers
Type	CNN-based	Depends on U-Net or Transformer	Vision Transformer
Purpose	Denoising step by step	Make diffusion faster	Replace U-Net with Transformer
Strengths	Captures fine local details	Reduces computation & memory	Models long range dependencies very effectively
Outputs	Good at local textures	Good and fast	State of art, better FID scores
Positioning	Implicit via convolutions	Inherited	Explicit via positional embeddings

Applications

Image Generation: Generate high quality images from noise or conditional inputs .
Text to Image Synthesis: Create images based on textual prompts using cross attention or embedding conditioning.
Video Generation and Editing: Extend diffusion over temporal sequences to generate or modify videos.
Super Resolution and Inpainting: Enhance image resolution or fill missing parts by denoising latent features.
Conditional Image Manipulation: Generate images under specific conditions, such as class, style or additional input tokens.
Research and AI Creativity: Useful for generative AI experiments, creative content creation and AI assisted design tools.

Advantages

Global Context Modeling: Transformers capture long range dependencies, producing coherent images with better structure.
High Quality Generation: DiTs achieve lower FID scores compared to U Net LDMs at similar compute budgets.
Scalable Architecture: Models range from 33M to 675M parameters depth, width and tokens can be scaled to improve performance.
Flexible Conditioning: Supports class labels, text embeddings, cross-attention and extra tokens for controlled generation.
Versatile Applications: Useful for image synthesis, video generation and conditional generation tasks.

Limitations

High Computational Cost: Larger DiTs require significant GPU memory and processing power.
Slow Sampling: Iterative denoising during inference is time-consuming compared to direct generation methods.
Complex Training: Requires careful tuning of noise schedules, embeddings and conditioning strategies.
Data Hungry: Performance improves with large datasets; small datasets may lead to lower-quality outputs.
Limited Real Time Use: Due to iterative denoising and large model sizes, real-time applications are challenging.

Comment

Article Tags:

Artificial Intelligence

AI-ML-DS With Python

Explore

Introduction to AI

AI Concepts

Machine Learning in AI

Robotics and AI

Generative AI

AI Practice

Courses

URL: https://www.geeksforgeeks.org/artificial-intelligence/diffusion-transformers-dits/