VOOZH about

URL: https://www.geeksforgeeks.org/artificial-intelligence/diffusion-transformers-dits/

⇱ Diffusion Transformers (DiTs) - GeeksforGeeks


  • Courses
  • Tutorials
  • Interview Prep

Diffusion Transformers (DiTs)

Last Updated : 8 Jan, 2026

Diffusion models like DALL-E 2 and Stable Diffusion use U-Net CNNs as their backbone. U-Nets are good at capturing local pixel details but struggle with understanding the overall structure and relationships in an image.

👁 diffusion_models_learn_to_gradually_add_noise
Diffusion Models

A Diffusion Transformer is an AI model that combines transformers with a step by step process called diffusion to gradually create or improve data. Diffusion Transformers (DiTs) replace U-Nets with transformers that work on latent image patches. This allows them to capture global context, generate high quality images efficiently and support conditional generation from text, labels or other inputs.

Need for Diffusion Transformers

  • U-Nets cannot capture global image relationships effectively.
  • Transformers can model long range dependencies across the image.
  • High resolution and efficient image generation is needed.
  • Conditional generation from text or labels is required.

Core Concepts in Image Generation Models

  1. Convolutional U-Net Architecture: U-Nets use an encoder–decoder design with skip connections to capture both local details and global structure, making them effective for image generation tasks.
  2. Vision Transformers (ViT): ViTs split images into patches and process them with self-attention, allowing the model to learn long-range relationships efficiently.
  3. Classifier-Free Guidance (CFG): CFG strengthens conditioning signals (like text or labels) during sampling to make generated images more aligned with the given prompt.
  4. Latent Diffusion Models (LDMs): Generate images by performing diffusion and denoising in a compressed latent space, making the process faster and more efficient than operating on full-resolution images.

Workflow of DiTs

Diffusion Transformers generate images or videos by adding noise to data and training a model to remove that noise. The overall workflow includes preprocessing data, adding noise, training the model and finally generating new outputs.

  1. Data Preprocessing: The input image or video is split into fixed size patches and each patch is converted into a feature vector that the transformer can process.
  2. Noise Introduction: Noise is gradually added to these feature vectors to create a diffusion process turning clean data into increasingly noisy versions.
  3. Model Training: The Diffusion Transformer is trained to reverse this noise process, learning how to recover clean data from noisy inputs.
  4. Image or Video Generation: After training, the model takes random noise (or noisy data) and denoises it step by step to produce new images or videos using what it learned.

Understanding the DiT Architecture

Diffusion Transformers follow the Latent Diffusion Model (LDM) framework but replace the traditional U-Net with a Vision Transformer (ViT). This transformer-based design helps DiTs model long range relationships scale efficiently and generate high quality images and videos.

👁 dit_block_with_adaln_zero
Diffusion Transformer

The Diffusion Transformer (DiT) architecture can be explained through two closely connected aspects:

  • latent diffusion workflow which describes how noisy data is processed and denoised
  • transformer-based building blocks, which define how the model performs this processing.

DiT Diffusion Workflow

  1. Input and Conditioning: DiT operates in latent space where a noised latent tensor (e.g. 32 × 32 × 4) is taken as input. Conditioning information such as timestep embeddings and class labels is embedded and used to guide generation so the output aligns with the desired condition.
  2. Patchify and Tokenization: The noisy latent is split into fixed-size patches using a Patchify operation. Each patch is converted into a token forming a sequence that can be processed by transformer layers.
  3. Forward Diffusion (Noise Addition): During training noise is progressively added to the latent representation following a predefined noise schedule producing increasingly noisy latents at each timestep.
  4. Reverse Diffusion (Denoising): At inference DiT predicts the noise at each step and gradually removes it transforming noisy latent tokens back into a clean latent representation that can be decoded into an image.

Transformer Based Architecture of DiT

1. Token-Based Spatial Modeling: Instead of convolutions DiT treats image patches as tokens and processes them using transformer operations enabling global context modeling across the entire image.

2. Positional Encoding: Positional embeddings are added to the patch tokens so the transformer can preserve spatial relationships between patches.

3. DiT Block with adaLN-Zero: Each DiT block replaces U Net convolutional blocks with transformer components and consists of:

  • Layer Normalization
  • Multi-Head Self-Attention to model long-range dependencies across patches
  • Pointwise Feedforward (MLP) for token wise feature transformation
  • Adaptive LayerNorm (adaLN-Zero) where timestep and label embeddings generate scale and shift parameters () that modulate both attention and MLP layers
  • Residual Connections scaled by learnable parameters () ensuring stable training and efficient information flow

This adaLN-Zero design allows conditioning to directly control the behavior of each transformer block.

4. Conditioning Mechanisms: DiTs incorporate conditioning through adaLN for in context embedding, cross attention to mix image and conditioning tokens and optional extra conditioning tokens that enhance semantic alignment and performance in larger models.

5. Output Projection: After multiple DiT blocks, the tokens are normalized, reshaped back into spatial form and passed through a linear layer to predict the noise residual used for denoising.

Step by Step Implementation of DiT for Image Generation

Here we load a pretrained Diffusion Transformer (DiT) model, generates image latents using diffusion sampling with classifier-free guidance, decodes them using a VAE and generated images.

Step 1: Clone Repository & Set Working Directory

  • Enter the repo so scripts and imports work correctly.
  • Update PYTHONPATH to make the DiT package importable.

Step 2: Install Required Python Dependencies

Install diffusers and timm libraries required for model, VAE and image utilities.

Step 3: Import Modules & Disable Gradients

  • Import PyTorch, helper utilities, DiT components and VAE loader.
  • Disable gradients to enable faster inference.

Step 4: Configure Model Settings & Initialize DiT

  • Define image size, VAE name and latent size.
  • Create the DiT model and move it to device.

Step 5: Load Pretrained DiT Checkpoint and VAE

  • Load official DiT weights for the chosen resolution.
  • Load Stable Diffusion VAE for decoding latent outputs.

Step 6: Set Sampling Parameters & Class Labels

  • Fix randomness seed.
  • Set sampling steps, CFG scale and class labels for conditional generation.

Step 7: Create the Diffusion Sampler

Build diffusion scheduler for the required number of steps.

Step 8: Prepare Latent Noise and Labels for CFG Sampling

  • Create latent noise for sampling.
  • Duplicate latents for classifier free guidance.
  • Prepare label tensors and pack into model_kwargs.

Step 9: Run Diffusion Sampling

  • Iteratively denoise latents into images.
  • Apply classifier-free guidance.
  • Extract the conditioned outputs.

Step 10: Decode Latents

Convert DiT latent outputs to real images using the VAE decoder.

Output:

👁 sampl
Output

The output shows images generated by DiT (Diffusion Transformer) where each image is created by gradually transforming random noise into a latent representation guided by the specified class label. The model uses a diffusion process to refine the noise step by step and then the VAE decodes these latent representations into actual pixel images, producing class-specific visual outputs.

You can download full code from here

U-Net vs LDMs vs Diffusion Transformers

Here we compare Diffusion Transformer with U-Net and LDMs

Feature

U-Net Diffusion Models

Latent Diffusion Models

Diffusion Transformers

Type

CNN-based

Depends on U-Net or Transformer

Vision Transformer

Purpose

Denoising step by step

Make diffusion faster

Replace U-Net with Transformer

Strengths

Captures fine local details

Reduces computation & memory

Models long range dependencies very effectively

Outputs

Good at local textures

Good and fast

State of art, better FID scores

Positioning

Implicit via convolutions

Inherited

Explicit via positional embeddings

Applications

  • Image Generation: Generate high quality images from noise or conditional inputs .
  • Text to Image Synthesis: Create images based on textual prompts using cross attention or embedding conditioning.
  • Video Generation and Editing: Extend diffusion over temporal sequences to generate or modify videos.
  • Super Resolution and Inpainting: Enhance image resolution or fill missing parts by denoising latent features.
  • Conditional Image Manipulation: Generate images under specific conditions, such as class, style or additional input tokens.
  • Research and AI Creativity: Useful for generative AI experiments, creative content creation and AI assisted design tools.

Advantages

  • Global Context Modeling: Transformers capture long range dependencies, producing coherent images with better structure.
  • High Quality Generation: DiTs achieve lower FID scores compared to U Net LDMs at similar compute budgets.
  • Scalable Architecture: Models range from 33M to 675M parameters depth, width and tokens can be scaled to improve performance.
  • Flexible Conditioning: Supports class labels, text embeddings, cross-attention and extra tokens for controlled generation.
  • Versatile Applications: Useful for image synthesis, video generation and conditional generation tasks.

Limitations

  • High Computational Cost: Larger DiTs require significant GPU memory and processing power.
  • Slow Sampling: Iterative denoising during inference is time-consuming compared to direct generation methods.
  • Complex Training: Requires careful tuning of noise schedules, embeddings and conditioning strategies.
  • Data Hungry: Performance improves with large datasets; small datasets may lead to lower-quality outputs.
  • Limited Real Time Use: Due to iterative denoising and large model sizes, real-time applications are challenging.
Comment

Explore