The Decoder Was Never Supposed to Be Creative — Now It Has To Be

Published on June 11, 2026

AI/ML Technical Content Strategist

TL;DR

VAE decoders are trained to reconstruct pixels from latents, but high-end image generation increasingly needs decoders that can generate detail the latent never stored — especially at 4K and with semantic (RAE-style) latents.
PiD (NVIDIA, May 2026) keeps the latent space but replaces the VAE decoder with a conditional pixel diffusion model, unifying decoding and super-resolution: 512→2048 decoding in ~210 ms on a GB200, 13 GB peak memory on an RTX 5090.
L2P (Tencent Youtu Lab / Nanjing University, May 2026) removes the VAE entirely, transferring a pretrained latent model’s priors into a pure pixel-space model on just 8 GPUs — and unlocking native 4K generation with ~98% lower single-step latency.
For builders: less memory, less latency, no separate upsampler stage, simpler serving — at the cost of a new QA discipline, because a generative decoder can invent detail.
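The headline numbers above are easier to compare as ratios. Here is a minimal sketch of that arithmetic, assuming "512→2048" refers to the side lengths of square images; the function names are illustrative and do not come from either paper.

```python
def pixel_ratio(src_side: int, dst_side: int) -> int:
    """Number of output pixels produced per input pixel, for square images."""
    return (dst_side * dst_side) // (src_side * src_side)


def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0.98 means 98% lower) to a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# 512 -> 2048 per side is 4x per side, so 16x the pixels to produce.
print(pixel_ratio(512, 2048))
# A ~98% latency reduction leaves ~2% of the original time, roughly a 50x speedup.
print(round(speedup_from_reduction(0.98), 1))
```

Read this way, PiD's decoder produces 16 output pixels for every pixel of a 512-sided image, and L2P's reported reduction corresponds to a single step taking about one fiftieth of the original time.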

For the past several years, many of the most influential high-end image generation systems have rested on a quiet architectural assumption. Latent diffusion models, and the autoregressive image generators that followed them, all generate in a compressed latent space and then hand the result to a Variational Autoencoder (VAE) decoder, which maps it back to pixels. The diffusion backbone got the research attention, the scaling laws, the billion-parameter budgets. The decoder was treated as solved plumbing: a trusted, fixed inverse function bolted onto the end of the pipeline.
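To make concrete how much the latent compresses, here is a small sketch of the shape arithmetic. It assumes the 8x spatial downsampling and 4 latent channels used by the original Stable Diffusion VAE; other models use different values (newer ones often use 16 channels), and `LatentSpec` and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LatentSpec:
    downsample: int = 8  # assumed spatial downsampling factor per side
    channels: int = 4    # assumed number of latent channels


def latent_shape(height: int, width: int, spec: LatentSpec = LatentSpec()) -> Tuple[int, int, int]:
    """Return the (channels, height, width) of the latent for an RGB image."""
    if height % spec.downsample or width % spec.downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (spec.channels, height // spec.downsample, width // spec.downsample)


def compression_ratio(height: int, width: int, spec: LatentSpec = LatentSpec()) -> float:
    """Ratio of RGB pixel values to latent values."""
    c, h, w = latent_shape(height, width, spec)
    return (3 * height * width) / (c * h * w)


print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions the decoder has to produce 48 output values for every latent value it receives, which is why treating it as a simple inverse function puts a ceiling on recoverable detail.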

That assumption is now breaking, and two releases from May 2026 mark the break clearly. NVIDIA’s PiD (“Pixel Diffusion Decoder”) keeps the latent space but replaces the VAE decoder with a generative pixel-diffusion model, reducing the VAE to one interchangeable latent source among several. L2P (“Latent-to-Pixel”), from researchers at Tencent Youtu Lab and Nanjing University, goes further and removes the VAE entirely: it transfers a pretrained latent model’s knowledge into a pure pixel-space architecture using only eight GPUs and, for the base-resolution transfer, no real training data.

These are two different surgical procedures, but they respond to the same diagnosis. The VAE has historically done three jobs at once: it is the compressor that makes diffusion computationally tractable, the representation that the generator learns to target, and the renderer that turns latents back into images. High-end generation is now pulling those three jobs apart — and the renderer, in particular, is being rebuilt from a reconstruction machine into a generative one. The thesis of this piece is simple: frontier image generation no longer needs a decoder that can merely reconstruct pixels. It needs one that can generate them.
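The difference between a renderer that reconstructs and one that generates can be shown with a toy contrast. The sketch below is not PiD, L2P, or a real diffusion model: it works on 1-D lists, and the `Renderer` protocol and both classes are hypothetical names invented for illustration. The deterministic decoder is a pure function of the latent, while the toy generative decoder starts from noise and refines it toward the upsampled latent, so fine detail depends on the random seed.

```python
import random
from typing import List, Protocol


class Renderer(Protocol):
    def render(self, latent: List[float], seed: int = 0) -> List[float]: ...


def upsample(latent: List[float], factor: int) -> List[float]:
    """Nearest-neighbour upsampling: repeat each latent value `factor` times."""
    return [v for v in latent for _ in range(factor)]


class ReconstructionDecoder:
    """Deterministic renderer: the output depends only on the latent."""

    def __init__(self, factor: int = 4) -> None:
        self.factor = factor

    def render(self, latent: List[float], seed: int = 0) -> List[float]:
        return upsample(latent, self.factor)  # seed is ignored


class ToyGenerativeDecoder:
    """Stochastic renderer: noise is iteratively pulled toward the latent condition."""

    def __init__(self, factor: int = 4, steps: int = 8, detail: float = 0.05) -> None:
        self.factor, self.steps, self.detail = factor, steps, detail

    def render(self, latent: List[float], seed: int = 0) -> List[float]:
        rng = random.Random(seed)
        cond = upsample(latent, self.factor)
        x = [rng.gauss(0.0, 1.0) for _ in cond]  # start from pure noise
        for step in range(self.steps):
            # Injected noise shrinks to zero by the final step.
            scale = self.detail * (1 - (step + 1) / self.steps)
            x = [xi + 0.5 * (ci - xi) + rng.gauss(0.0, scale) for xi, ci in zip(x, cond)]
        return x


latent = [0.0, 1.0, 0.5]
recon, gen = ReconstructionDecoder(), ToyGenerativeDecoder()
print(recon.render(latent, seed=1) == recon.render(latent, seed=2))  # True
print(gen.render(latent, seed=1) == gen.render(latent, seed=2))      # False
```

Because both classes satisfy the same interface, either one can sit at the end of the pipeline. The seed dependence of the second is the property that makes a generative renderer able to add detail, and also what makes its output something that has to be checked rather than assumed.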

System	Keeps latent model?	Uses VAE decoder?	Pixel-space role	Main benefit
Traditional latent diffusion	Yes	Yes	Final reconstruction only	Efficient generation
PiD	Yes	Replaced/demoted	Generative decoder + upsampler	Better high-res decoding
L2P	Transfers from a pretrained latent model	Removed from target model	Native pixel generation	4K generation, lower VAE bottleneck

About the author

James Skelton

Category: Tutorial

Tags: AI/ML

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

URL: https://www.digitalocean.com/community/tutorials/why-diffusion-models-are-replacing-vae-decoders