
URL: https://thenewstack.io/how-diffusion-based-llm-ai-speeds-up-reasoning/

How Diffusion-Based LLM AI Speeds Up Reasoning - The New Stack



LLaDA, a large language model developed at China's Renmin University, uses dynamic masking to accelerate text generation.
May 2nd, 2025 7:00am by Kimberley Mok
Featured image by Engin Akyurt, via Pexels.

Many of today’s best-known large language models (LLMs) are autoregressive models, which generate text sequentially, typically from left to right.

But newer LLM contenders, potentially faster and more efficient, are now opting for diffusion-based techniques to generate text instead of tried-and-true autoregressive methods.

Diffusion is better known as the technique behind image generators like Stable Diffusion and Midjourney, but diffusion-based models for text generation are now gaining attention, thanks to their comparative efficiency and speed.

One of the latest to emerge is LLaDA (Large Language Diffusion with mAsking), an LLM developed by the ML Group at China’s Renmin University.

Dynamic Masking

LLaDA uses a dynamic masking approach that allows the model to predict multiple tokens simultaneously and, most notably, in a bidirectional fashion.

This technique sets LLaDA apart from its autoregressive cousins. Autoregression generally works quite well for short sequences of text, but autoregressive models (ARMs) run into limits on computational efficiency and bidirectional reasoning when generating longer, more complex sequences.

Generally, ARMs predict tokens one at a time, so as context windows grow, more computation is needed for each prediction, resulting in significant bottlenecks and latency.
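To make the sequential bottleneck concrete, here is a minimal sketch that counts model calls. The functions `fake_next_token` and `fake_fill_all` are hypothetical stand-ins for a model's forward pass, not real LLM APIs, and the number of refinement steps in the parallel decoder is an assumed tunable setting rather than a figure from the LLaDA paper.

```python
MASK = "[MASK]"

def fake_next_token(tokens):
    """Hypothetical forward pass of an autoregressive model: one new token."""
    return f"w{len(tokens)}"

def fake_fill_all(tokens):
    """Hypothetical forward pass of a parallel model: a guess for every masked slot."""
    return [f"w{i}" if t == MASK else t for i, t in enumerate(tokens)]

def autoregressive_decode(prompt, n_new):
    """One model call per generated token; each call waits on the previous one."""
    tokens, calls = list(prompt), 0
    for _ in range(n_new):
        tokens.append(fake_next_token(tokens))
        calls += 1
    return tokens, calls

def parallel_decode(prompt, n_new, n_steps):
    """One model call per refinement step; each step commits several tokens."""
    tokens, calls = list(prompt) + [MASK] * n_new, 0
    for step in range(n_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        steps_left = n_steps - step
        k = -(-len(masked) // steps_left)  # ceiling division: spread work evenly
        preds = fake_fill_all(tokens)
        calls += 1
        for i in masked[:k]:
            tokens[i] = preds[i]
    return tokens, calls

ar_tokens, ar_calls = autoregressive_decode(["a"], n_new=8)
par_tokens, par_calls = parallel_decode(["a"], n_new=8, n_steps=4)
print("autoregressive calls:", ar_calls)
print("parallel calls:", par_calls)
```

With these toy stand-ins, the autoregressive loop needs as many calls as there are new tokens, while the parallel loop needs only as many calls as there are steps. In a real system each call has a different cost, so the call count alone does not settle which approach is faster.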

Additionally, conventional autoregressive models are plagued by what is known as the reversal curse, or the inability of autoregressive LLMs to reason backward on causal statements they were previously trained on. In other words, these models learn that A is B, but will struggle to deduce that B is also A, due to their sequential nature.

How LLaDA Works

LLaDA’s main advantage is that it uses a multistage procedure that works in both forward and backward directions.

“In contrast to traditional autoregressive models, LLaDA leverages a masked diffusion model (MDM), which incorporates a discrete random masking process and trains a mask predictor to approximate its reverse process,” the team wrote in its research paper.

LLaDA first applies a forward process that gradually masks tokens in a sequence, followed by a reverse process in which a vanilla transformer simultaneously “de-masks” the masked tokens by predicting them. It’s similar to the diffusion process for image generation, where a noised input is gradually de-noised to generate a final image.
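The forward masking process can be sketched in a few lines. This is an illustration only: it assumes each token is masked independently with probability `t`, and the names `forward_mask` and `MASK` are made up for this example rather than taken from the LLaDA code.

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng):
    """Mask each token independently with probability t, where 0 <= t <= 1.

    t = 0 leaves the sequence untouched; t = 1 masks every token.
    """
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)  # fixed seed so repeated runs give the same masks
sentence = "the quick brown fox jumps over the lazy dog".split()
for t in (0.0, 0.5, 1.0):
    print(t, forward_mask(sentence, t, rng))
```

The reverse process is the learned part: a mask predictor sees the partially masked sequence and tries to recover the original tokens at the masked positions.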

  • Pre-training phase: The model learns to de-noise and reconstruct text in which tokens have been randomly masked, training on 2.3 trillion tokens. This allows it to learn general patterns in language by predicting the masked tokens via self-supervised learning.
  • Supervised fine-tuning phase: The model is then further refined using instruction-response pairs where the response portion is masked. This helps to boost its ability to respond to instructions and generate coherent outputs that may be specific to a certain domain of knowledge, while also helping to maintain bidirectional understanding.
  • Text generation: The model begins with a fully masked output, and then refines its predictions through an iterative re-masking process. At each diffusion step, the model predicts all masked tokens at the same time, and predictions that don’t have a high level of confidence are re-masked, so that the model can reassess them. This de-masking and re-masking cycle is repeated until a coherent output is generated.
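The generation loop described above can be sketched roughly as follows. This is a toy under stated assumptions, not the authors’ implementation: `toy_predictor` is a hypothetical stand-in that always guesses the right word and assigns a random confidence, and the linear schedule for how many tokens stay masked after each step is an assumption made for illustration.

```python
import random

MASK = "[MASK]"
TARGET = "diffusion models fill in masked tokens in parallel".split()

def toy_predictor(tokens, rng):
    """Hypothetical mask predictor: (guess, confidence) for each masked position.

    A real model would be a transformer; here the guess is read from TARGET
    and the confidence is random.
    """
    return {i: (TARGET[i], rng.random())
            for i, tok in enumerate(tokens) if tok == MASK}

def generate(length, n_steps, rng):
    """Start fully masked; each step commits confident guesses, re-masks the rest."""
    tokens = [MASK] * length
    for step in range(1, n_steps + 1):
        preds = toy_predictor(tokens, rng)  # predict all masked positions at once
        # Assumed linear schedule: how many positions stay masked after this step.
        n_keep_masked = length * (n_steps - step) // n_steps
        by_confidence = sorted(preds, key=lambda i: preds[i][1])  # lowest first
        remask = set(by_confidence[:n_keep_masked])
        for i, (guess, _) in preds.items():
            if i not in remask:
                tokens[i] = guess  # commit the confident prediction
        print(f"step {step}: {' '.join(tokens)}")
    return tokens

result = generate(len(TARGET), n_steps=4, rng=random.Random(0))
```

Because the toy predictor always guesses correctly, the final step reproduces `TARGET`. With a real model, the value of re-masking is that low-confidence guesses get another chance once more of the surrounding context has been filled in.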

As the research team wrote: “LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens.”

To boost the model’s accuracy, a likelihood evaluation algorithm was used, noted the team: “By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference.”
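For readers who want the math, the likelihood bound can be written roughly as follows. The notation is an assumption made for this sketch and should be checked against the paper: x_0 is the original sequence of length L, x_t is the same sequence with each token masked with probability t, M is the mask token, and p_θ is the mask predictor.

```latex
% Training loss: cross-entropy on masked positions only, weighted by 1/t
\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[ \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}\!\left[x_t^i = \mathrm{M}\right] \log p_\theta\!\left(x_0^i \mid x_t\right) \right]

% The loss upper-bounds the negative log-likelihood of the data
-\,\mathbb{E}_{x_0}\!\left[\log p_\theta(x_0)\right] \;\le\; \mathcal{L}(\theta)
```

Minimizing the loss therefore pushes down an upper bound on the negative log-likelihood, which is what makes the approach a principled generative model rather than only a fill-in-the-blank objective.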

LLaDA demonstrates competitive performance when compared to LLaMA, along with strong scalability. (Source: “Large Language Diffusion Models,” ML Group at China’s Renmin University.)

In evaluating an 8-billion-parameter model, the researchers found that LLaDA had relatively impressive results in bidirectional reasoning tests.

For example, in a test for completing either the next or previous line of a well-known poem, LLaDA was on par with GPT-4 on text generation in a forward direction, while achieving 42% on backward text generation (reversal), compared to 32% for GPT-4.

Similar results were seen on code generation, math and science tasks, where LLaDA fared better on a range of benchmarks than autoregressive models of about the same size. Additionally, LLaDA shows performance similar to that of its autoregressive cousins of the same model size, but uses far fewer tokens.

Ultimately, diffusion-based large language models like LLaDA and Inception Labs’ Mercury could herald a new direction for LLMs, with diffusion-based alternatives, or even hybrid models, challenging the dominance of current ARMs.

That could mean significant leaps forward for conversational AI, code generation and complex bidirectional reasoning tasks, particularly as these diffusion-based systems scale up, along with gains in efficiency, speed and context understanding.

Find out more in the team’s paper, project page, and on GitHub.

Kimberley Mok is a tech and design reporter who covers artificial intelligence, robotics, quantum computing, tech culture and science stories for The New Stack. Trained as an architect, she is also an illustrator and multidisciplinary designer who has been passionate...