URL: https://huggingface.co/nvidia/PixelDiT-1300M-1024px

PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu1,2   Wei Xiong1†   Weili Nie1   Yichen Sheng1   Shiqiu Liu1   Jiebo Luo2

1NVIDIA   2University of Rochester
†Project Lead and Main Advising

๐Ÿ‘ Image
  ๐Ÿ‘ Image
  ๐Ÿ‘ Image

Key Features

  • VAE-free
  • Dual-level architecture: Patch-level DiT + Pixel-level DiT
  • MM-DiT text-image fusion: Joint attention between text and image tokens
  • Text encoder: Gemma-2-2B-IT
  • Multi-aspect-ratio: Supports various aspect ratios at 1024px
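
As a rough illustration of the multi-aspect-ratio feature, the sketch below picks a height and width for a requested aspect ratio while keeping the pixel count near 1024 x 1024 and both sides divisible by the patch size of 16. The helper name `dims_for_aspect_ratio`, the fixed pixel budget, and the round-to-nearest-multiple rule are all assumptions made for this example; the released code may instead use a fixed set of resolution buckets.

```python
import math
from typing import Tuple

PATCH_SIZE = 16             # patch size listed under Model Architecture
TARGET_AREA = 1024 * 1024   # assumed pixel budget for the 1024px setting


def dims_for_aspect_ratio(ratio: float,
                          patch: int = PATCH_SIZE,
                          area: int = TARGET_AREA) -> Tuple[int, int]:
    """Hypothetical helper: return (height, width) with width/height ~= ratio.

    Both sides are snapped to the nearest multiple of `patch` so the image
    divides evenly into patches.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    height = math.sqrt(area / ratio)
    width = height * ratio

    def snap(x: float) -> int:
        # Round to the nearest multiple of the patch size, never below one patch.
        return max(patch, round(x / patch) * patch)

    return snap(height), snap(width)


if __name__ == "__main__":
    for name, ratio in [("1:1", 1.0), ("16:9", 16 / 9), ("3:4", 3 / 4)]:
        h, w = dims_for_aspect_ratio(ratio)
        print(f"{name}: --custom_height {h} --custom_width {w}")
```

The resulting values could be passed to `--custom_height` and `--custom_width`, provided the checkpoint accepts those sizes.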

Usage

Installation

pip install -r requirements.txt

Inference

# See the full inference script at: https://github.com/NVlabs/PixelDiT
cd t2i/
python inference.py \
 --config configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml \
 --model_path PixelDiT-T2I-v1.pth \
 --txt_file prompts.txt \
 --custom_height 1024 --custom_width 1024 \
 --cfg_scale 2.75 --seed 2025 \
 --negative_prompt "low quality, worst quality, over-saturated, blurry, deformed, watermark" \
 --work_dir "."
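
For scripted runs, the same command can be assembled from Python. The sketch below is a convenience wrapper, not part of the repository: `write_prompts` and `build_inference_command` are hypothetical helpers, and the one-prompt-per-line layout of `prompts.txt` is an assumption. Only flags that appear in the command above are used.

```python
import shlex
from pathlib import Path

NEGATIVE_PROMPT = ("low quality, worst quality, over-saturated, blurry, "
                   "deformed, watermark")


def write_prompts(path, prompts):
    """Write one prompt per line (assumed format for --txt_file)."""
    text = "\n".join(p.strip() for p in prompts) + "\n"
    Path(path).write_text(text, encoding="utf-8")


def build_inference_command(txt_file, height=1024, width=1024,
                            cfg_scale=2.75, seed=2025, work_dir="."):
    """Return the argv list mirroring the shell command shown above."""
    return [
        "python", "inference.py",
        "--config", "configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml",
        "--model_path", "PixelDiT-T2I-v1.pth",
        "--txt_file", str(txt_file),
        "--custom_height", str(height),
        "--custom_width", str(width),
        "--cfg_scale", str(cfg_scale),
        "--seed", str(seed),
        "--negative_prompt", NEGATIVE_PROMPT,
        "--work_dir", work_dir,
    ]


if __name__ == "__main__":
    cmd = build_inference_command("prompts.txt")
    # Print a copy-pasteable shell line; nothing is executed here.
    print(shlex.join(cmd))
```

To actually launch the run, the list can be handed to `subprocess.run(cmd, cwd="t2i", check=True)` from the repository root, which avoids shell quoting problems with the negative prompt.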

Inference Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --cfg_scale | 3.5 | Classifier-free guidance scale |
| --step | 50 | Number of sampling steps (25 for fast, 50 for quality) |
| --seed | 0 | Random seed |
| --negative_prompt | "" | Negative prompt for CFG |
| --interval_guidance | [0, 1] | CFG application interval |
| --sampling_algo | flow_dpm-solver | Sampling algorithm |
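
To make the interaction between `--cfg_scale` and `--interval_guidance` concrete, here is a minimal sketch of the standard classifier-free guidance rule restricted to a time interval. It is not taken from the repository: the function name `guided_prediction`, the use of a normalized time `t` in [0, 1], and the choice to fall back to the conditional prediction outside the interval are assumptions that follow the common guidance-interval convention.

```python
def guided_prediction(cond, uncond, cfg_scale, t, interval=(0.0, 1.0)):
    """Combine conditional and unconditional model outputs with CFG.

    cond, uncond: equal-length sequences of floats (model predictions for the
                  prompt and for the negative/empty prompt).
    t:            normalized sampling time in [0, 1] (assumed convention).
    interval:     (low, high) range of t in which guidance is applied.
    """
    low, high = interval
    if not (low <= t <= high):
        # Outside the interval: no guidance, use the conditional output as is.
        return list(cond)
    # Standard CFG: move from the unconditional output toward the conditional one.
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond, uncond = [1.0, 2.0], [0.0, 1.0]
    print(guided_prediction(cond, uncond, cfg_scale=3.5, t=0.5))
    print(guided_prediction(cond, uncond, cfg_scale=3.5, t=0.9, interval=(0.0, 0.5)))
```

With the default interval of [0, 1] guidance is active at every step, and a scale of 1.0 reduces to the plain conditional prediction.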

Model Architecture

| Component | Value |
| --- | --- |
| Parameters | 1.3B |
| Patch size | 16 |
| Hidden size | 1536 |
| Attention heads | 24 |
| Patch-level depth | 14 |
| Pixel-level depth | 2 |
| Pixel hidden size | 16 |
| Pixel attention hidden size | 1152 |
| Text embedding dim | 2304 |
| Text max length | 300 |
| Text encoder | Gemma-2-2B-IT |
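
A few quantities follow directly from the values above and help when estimating sequence lengths. The short calculation below assumes a square 1024 x 1024 RGB input and the usual convention that the per-head dimension is the hidden size divided by the number of heads; neither assumption is stated in the table itself.

```python
IMAGE_SIZE = 1024    # square 1024px input (assumed for this example)
PATCH_SIZE = 16      # from the table
HIDDEN_SIZE = 1536   # from the table
NUM_HEADS = 24       # from the table
RGB_CHANNELS = 3     # assumption: 3-channel RGB pixels

# Number of patch tokens seen by the patch-level DiT.
patches_per_side = IMAGE_SIZE // PATCH_SIZE
num_patch_tokens = patches_per_side ** 2

# Raw pixel content covered by one patch token.
pixels_per_patch = PATCH_SIZE ** 2
values_per_patch = pixels_per_patch * RGB_CHANNELS

# Per-head width, assuming hidden size is split evenly across heads.
head_dim = HIDDEN_SIZE // NUM_HEADS

if __name__ == "__main__":
    print(f"patches per side:   {patches_per_side}")
    print(f"patch tokens:       {num_patch_tokens}")
    print(f"pixels per patch:   {pixels_per_patch}")
    print(f"values per patch:   {values_per_patch}")
    print(f"attention head dim: {head_dim}")
```

At 1024 x 1024 this gives 4096 patch tokens, each covering 256 pixels, which is why the pixel-level branch is kept shallow (depth 2) and narrow (hidden size 16) relative to the patch-level branch.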

Citation

@inproceedings{yu2026pixeldit,
 title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
 author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
 booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
 year={2026},
}

License

This model is released under the NSCLv1 License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.