
Most of the small reasoning models that have shipped in the past year are variations on a theme: a familiar transformer backbone, a Mixture-of-Experts wrapper, grouped-query attention (or, in Qwen's case, something like Gated DeltaNet) for a smaller KV cache, and a heavy reinforcement learning stage at the end. Performance improves year on year, but the architecture that's actually running has much the same shape it had when DeepSeek R1 arrived.

Zaya1-8B is the first small model in a while that doesn't look like that. Zyphra's 8.4-billion-parameter Mixture-of-Experts, with only around 760 million parameters active per token, is built on three things: an attention variant that compresses queries, keys, and values into a shared latent space, an inference-time reasoning method that's co-trained into the weights rather than bolted on after, and a router that uses a multilayer perceptron with a proportional-integral-derivative (PID) controller-style bias balancer instead of the usual linear gate. Each one of those is a real research contribution, and put together, they explain how a model with under a billion active parameters can approach much larger models on difficult math and coding benchmarks.

There are some pretty big caveats, though. The headline benchmarks are all reported by Zyphra, and the post-training recipe is specialised enough that it's significantly better at math and code than it is in generalist contexts. With that said, the technical content here is possibly the most interesting advancement I've seen in this space in a long time, and all of it is thanks to the unusual architecture and training stack Zyphra has built.

Compressed Convolutional Attention rewrites how attention works

Everything gets compressed

Credit: Zyphra

The KV cache is the silent killer for any local model. Active parameter count and weight size are easy to understand and work out, but the moment your context window opens up, your VRAM can be totally eaten up by keys and values that are several times larger than the model you're actually running. Multi-head attention is the worst offender. Grouped-query attention (GQA) shares keys and values across head groups to cut the cache. Multi-head latent attention (MLA) pushes the cache into a learned latent space. Both help, but both also have ceilings.
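To put rough numbers on that, here is a minimal sketch of the standard KV-cache size calculation. The layer count, head count, head size, and context length are a hypothetical 8B-class dense configuration chosen only for illustration; they are not Zaya1's real dimensions.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration (assumption, not taken from any real model card).
LAYERS, HEADS, HEAD_DIM = 32, 32, 128
CONTEXT = 131_072  # 128K tokens, 16-bit cache entries

mha = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, CONTEXT)  # every head keeps its own K/V
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, CONTEXT)      # 8 shared K/V head groups

GIB = 1024 ** 3
print(f"MHA: {mha / GIB:.0f} GiB, GQA: {gqa / GIB:.0f} GiB")
```

Under those assumptions the multi-head cache comes to 64 GiB at full context, roughly four times the 16 GB or so that 8 billion 16-bit weights occupy, and sharing keys and values across eight groups brings it down to 16 GiB.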

Zyphra's Compressed Convolutional Attention (CCA) takes a different angle. Queries, keys, and values all get down-projected into a single shared latent space, and the entire attention computation runs inside that compressed space. On top of that, convolutional sequence and channel mixing gets applied to the compressed queries and keys. The convolution is what stops the quality from collapsing when you compress this aggressively, because it lets neighbouring positions exchange information inside the latent space before the attention scores are computed.
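The idea can be sketched in a few lines of plain Python. This is a toy, single-head illustration under my own assumptions, not Zyphra's implementation: the function names, the two-tap kernel, and the dimensions are all hypothetical, and it only does the sequence-mixing half of the convolution (the channel mixing is left out).

```python
import math
import random

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def causal_conv(seq, kernel):
    """Per-channel causal convolution over the sequence axis:
    position t mixes itself with the previous len(kernel) - 1 positions."""
    out = []
    for t in range(len(seq)):
        row = [0.0] * len(seq[0])
        for lag, w in enumerate(kernel):
            if t - lag >= 0:
                row = [r + w * v for r, v in zip(row, seq[t - lag])]
        out.append(row)
    return out

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def compressed_attention(x, w_q, w_k, w_v, w_up, kernel):
    # Down-project queries, keys, and values into the same narrow latent width.
    q = causal_conv(matmul(x, w_q), kernel)  # neighbours mix before scoring
    k = causal_conv(matmul(x, w_k), kernel)
    v = matmul(x, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    mixed = []
    for t in range(len(x)):
        # Causal attention computed entirely inside the latent space.
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s])) for s in range(t + 1)]
        weights = softmax(scores)
        mixed.append([sum(w * v[s][c] for s, w in enumerate(weights))
                      for c in range(len(v[0]))])
    # Only the final result is projected back up to the model width.
    return matmul(mixed, w_up), k, v

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
SEQ_LEN, D_MODEL, D_LATENT = 6, 16, 2  # cached rows are 16 / 2 = 8x narrower
x = rand_matrix(rng, SEQ_LEN, D_MODEL)
w_q, w_k, w_v = (rand_matrix(rng, D_MODEL, D_LATENT) for _ in range(3))
w_up = rand_matrix(rng, D_LATENT, D_MODEL)
out, k_cache, v_cache = compressed_attention(x, w_q, w_k, w_v, w_up, kernel=[0.6, 0.4])
print(len(out), len(out[0]), len(k_cache[0]))
```

The thing to notice is that the cached keys and values are only D_LATENT wide, and the score computation never leaves that width, which is why cache and compute shrink together in this kind of design.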

The numbers from the published CCA whitepaper are striking. The team measured an eight-fold KV-cache compression compared to standard multi-head attention, with no measurable drop in quality. On top of that, prefill was 1.7 times faster at a 16,000-token sequence length on an H100, and the backward pass was around 1.3 times faster on the same hardware. Plus, because CCA compresses parameters, cache, and FLOPs together by the same factor, the user can dial the compression toward either memory or compute, depending on what their hardware is short on.

The variant Zyphra actually ships in Zaya1, called CCGQA, layers grouped-query head sharing on top of the latent-space compression. The paper claims it consistently outperforms both GQA and MLA at equal KV-cache compression in Mixture-of-Experts settings, with four times fewer FLOPs at the same cache budget. This is the part of the design that's most portable to other models, with pretty big implications for long-context conversations if it proves itself to be better than standard GQA or Gated DeltaNet (GDN).

Markovian RSA is co-trained, not bolted on

Combining multiple reasoning traces at once

Credit: Zyphra

Test-time compute has been a big deal for models for quite a while. If you generate more tokens, you get better answers, and you pay the inference bill to account for that. The catch is that better reasoning usually means longer chains of thought, and longer chains of thought eat your context window until the model loses track of what it was doing. Markovian RSA is Zyphra's answer.

The first half is Recursive Self-Aggregation, which is the "RSA" in Markovian RSA. The model generates several reasoning traces in parallel for the same prompt, then extracts the tail tokens of each and feeds those tails into an aggregation prompt that asks the model to reconcile them into a single, better answer. This isn't the first time we've seen RSA used in LLMs, but it is the first publicly released model built specifically to facilitate it.

The second half is the Markovian Thinker idea: instead of one long sequential chain, reason in fixed-duration chunks and pass only the tail of each chunk forward. Combine the two and you get reasoning that can run as long as you want, on a context window that stays bounded the entire time.
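A minimal sketch of how the two halves fit together, assuming a simple loop of my own devising: the function names, the order of the aggregation step, and the stub model are all hypothetical, and the 40-token chunks with 4-token tails are just a scaled-down stand-in for real budgets.

```python
def markovian_rsa(question, generate, num_traces=4, chunk_budget=40,
                  tail_len=4, num_chunks=3):
    """Run chunked, parallel reasoning while forwarding only short tails."""
    carry = []            # the only state that survives from one chunk to the next
    peak_context = 0      # largest prompt-plus-generation window ever needed
    total_generated = 0
    for _ in range(num_chunks):
        # 1. Several traces reason in parallel from the same bounded prompt.
        prompt = question + carry
        traces = [generate(prompt, chunk_budget) for _ in range(num_traces)]
        # 2. Keep only the tail of each trace and ask for a reconciliation.
        tails = [trace[-tail_len:] for trace in traces]
        agg_prompt = question + [tok for t in tails for tok in t]
        merged = generate(agg_prompt, chunk_budget)
        # 3. Forward only the tail of the merged result to the next chunk.
        carry = merged[-tail_len:]
        peak_context = max(peak_context,
                           len(prompt) + chunk_budget,
                           len(agg_prompt) + chunk_budget)
        total_generated += (num_traces + 1) * chunk_budget
    return carry, peak_context, total_generated

def fake_generate(prompt, max_tokens):
    """Stand-in for a model call: always emits exactly max_tokens tokens."""
    return [f"tok{len(prompt)}_{i}" for i in range(max_tokens)]

question = [f"q{i}" for i in range(10)]
_, peak_short, total_short = markovian_rsa(question, fake_generate, num_chunks=3)
_, peak_long, total_long = markovian_rsa(question, fake_generate, num_chunks=30)
print(peak_short, total_short)
print(peak_long, total_long)
```

Going from 3 chunks to 30 multiplies the tokens generated by ten, but the largest context the stub ever sees stays the same, because nothing except the question and a handful of tails is ever carried forward.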

Zyphra co-trained the model on this aggregation format. The prompts were synthetically injected into the supervised fine-tuning data, and they continued through the reasoning warmup, the reinforcement-learning-from-verifiable-environments stage, and the math-and-code RL stages. The model wasn't trained on normal data and then asked to follow the aggregation format at inference time. Instead, it was taught to understand the format throughout post-training, so the parallel-trace-and-merge behaviour is something the weights expect, not something the prompt has to coax out of them.

That design is the reason Zyphra can claim performance approaching frontier models with RSA enabled. Specifically, Zyphra reports that with RSA enabled the model reaches 91.9% on AIME 2025 and 89.6% on HMMT 2025 Feb, putting it near much larger reasoning systems and slightly above the GPT-5-High comparison number on HMMT 2025, though not on AIME 2025. With a 40,000-token per-rollout reasoning budget and 4,000 tokens forwarded between chunks, Zyphra reports the model approaches DeepSeek-V3.2 and Qwen3-A22B levels on hard math. Applied naively to another model that hasn't been trained on the format, the same scaffold loses most of its benefit.

The routing layer underneath is the third architectural piece. MoE routers fail in predictable ways. A handful of experts get over-subscribed, the rest under-train, the gating signal collapses, and you end up with a model that's nominally sparse but practically dense across a few hot experts. Zyphra replaced the linear router with a small MLP and added a PID-inspired bias-balancing update, implemented with AdamW over the routing-bias terms. In other words, if a given expert is being over-selected, the controller pushes its bias term down, and if it's under-selected, the bias goes up. The proportional, integral, and derivative terms together stabilise routing without needing a heavy auxiliary load-balancing loss. A learned residual scaling layer on top of that controls how the residual norm grows through depth, at what Zyphra describes as negligible parameter and FLOP cost.
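To make the bias-balancing idea concrete, here is a toy sketch under stated assumptions. It uses a hand-rolled PID update and top-1 routing over fixed, synthetic router scores; Zyphra's version is described as PID-inspired and implemented with AdamW, so this is an illustration of the control idea rather than their method. The gains, the score distribution, and every name here are hypothetical.

```python
import random

def route_top1(scores, biases):
    """Fraction of tokens each expert wins when biases are added to its scores."""
    counts = [0] * len(biases)
    for token_scores in scores:
        biased = [s + b for s, b in zip(token_scores, biases)]
        counts[biased.index(max(biased))] += 1
    return [c / len(scores) for c in counts]

def balance(scores, steps=200, kp=0.2, ki=0.3, kd=0.05):
    """Nudge per-expert biases until every expert sees a similar share of tokens."""
    n = len(scores[0])
    target = 1.0 / n
    biases = [0.0] * n
    integral = [0.0] * n
    prev_err = [0.0] * n
    max_load_history = []
    for _ in range(steps):
        loads = route_top1(scores, biases)
        max_load_history.append(max(loads))
        for e in range(n):
            err = target - loads[e]   # negative when the expert is over-selected
            integral[e] += err
            # Proportional + integral + derivative terms set the new bias.
            biases[e] = kp * err + ki * integral[e] + kd * (err - prev_err[e])
            prev_err[e] = err
    return biases, max_load_history

# Synthetic router scores: expert 0 is strongly favoured before balancing.
rng = random.Random(0)
base = [2.0, 0.5, 0.0, -0.5]
scores = [[b + rng.uniform(-1, 1) for b in base] for _ in range(1000)]

biases, history = balance(scores)
print(f"busiest expert share: {history[0]:.2f} before, {history[-1]:.2f} after")
print("biases:", [round(b, 2) for b in biases])
```

The favoured expert ends up with a negative bias and the neglected ones with positive biases, and no auxiliary loss term is involved at any point.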

Running it locally took two tries

I just used a regular quant instead

The natural first attempt was the 7900 XTX. Zaya1 needs Zyphra's vLLM fork on the zaya1-pr branch, built from source. That part was straightforward. After that, nothing was.

The first failure was an LDS overflow in the sampler kernel. topKPerRowDecode in csrc/sampler.cu asks for 66 KB of shared memory per block. RDNA3 on gfx1100 only has 64 KB of LDS, so the kernel won't launch, whereas CDNA3 on MI300 has 160 KB. The mismatch exists because Zyphra trained on MI300X, validated on MI300X, and sized the kernel for it. My first patch was meant to take the single-block radix-sort path on ROCm and bypass the 1024-thread merge variant entirely... though it turned out to be a rather naive attempt on my part.

Because I capped merge threads at 512 instead of bypassing the merge path, it corrupted the top-k indices in a way that wasn't obvious until generation started. It compiled fine and let the model load, and it even generated right... for the first token. When I gave it "The capital of France is ", it came back with "Paris", correctly, before it got locked into "transition transition transition transition." Top-1 was right, top-K was broken.

The second version restructured the patch to skip the merge kernel path entirely on ROCm. That didn't work either, and at that point I'd been deep enough for long enough that something else was almost certainly broken downstream, with no obvious next thread to pull. The xgrammar problem in the same build didn't help. It pulled in a CUDA-linked torch_c_dlpack_ext that didn't exist on ROCm, so the import blew up at the start. The workaround was to add "from __future__ import annotations" in vllm/v1/structured_output/backend_xgrammar.py so the type annotation isn't evaluated at import, plus a reinstall of xgrammar with no dependencies in order to satisfy the import without dragging the CUDA extension along. Between the sampler and xgrammar fixes, I was three layers deep into a stack of patches I wasn't confident were correct.
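For anyone unfamiliar with why that one-line import helps, here is a small self-contained demonstration of the mechanism. The annotated name MissingExtension is hypothetical and stands in for a type that can't be resolved; this is not the actual code in backend_xgrammar.py.

```python
SOURCE = '''
def compile_grammar(tensor: MissingExtension.Tensor) -> None:
    """Annotated with a name that does not exist in this environment."""
'''

def defines_cleanly(source):
    """Execute the source and report whether defining the function raised NameError."""
    namespace = {}
    try:
        exec(compile(source, "<demo>", "exec"), namespace)
    except NameError:
        return False, namespace
    return True, namespace

# Without the future import, older interpreters evaluate the annotation
# as soon as the def statement runs, so the missing name is fatal.
ok_plain, _ = defines_cleanly(SOURCE)

# With it, annotations are stored as strings and never evaluated at definition time.
ok_deferred, ns = defines_cleanly("from __future__ import annotations\n" + SOURCE)

print("plain definition succeeded:", ok_plain)
print("deferred definition succeeded:", ok_deferred)
print(ns["compile_grammar"].__annotations__)
```

The first result depends on the interpreter: before Python 3.14 the plain definition fails, while 3.14 and later defer annotation evaluation by default. The deferred version succeeds either way, which is all the workaround needs.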

At this point, I just threw in the towel and moved to the Mac. With an M4 Pro MacBook, I was able to run the full BF16 weights through Zyphra's custom transformers code at around 7 tokens a second. That's a rough experience as-is, but for a model built specifically around long reasoning traces, it's unusable. I tried dropping to FP16 to see what would happen, and... yeah, it didn't really work.

Switching to vMLX, an MLX-native inference server for Apple Silicon with an OpenAI-compatible API, and loading an MXFP4 quant pushed throughput to about 42 tokens a second on the same hardware. Outputs on the prompts I tested were indistinguishable between the two runs, so the speedup didn't come at the cost of answer quality. It came at the cost of weight precision that the model apparently didn't need to handle these problems.

The cleanest test I ran was a math problem I wrote to make sure Zaya1 wasn't just recalling its training set. I modified an AIME 2024 question, with three logarithmic equations in three variables:

Let x, y, z be positive real numbers satisfying:

log_3(x / (y^2 z)) = 2/5

log_3(y / (x z^2)) = 3/7

log_3(z / (x^2 y)) = 1/6

If |log_3(x^5 y^2 z^3)| = m/n, where gcd(m, n) = 1,

find m + n.

The answer to that question is 272. I gave the same problem to Zaya1-8B, to GPT-5.5, and to Claude. Claude failed it, GPT-5.5 got it sometimes and failed at other times, but Zaya1 got it repeatedly. Zaya1 also did something I wasn't expecting from an 8B model. Instead of solving for a, b, c, the logs of each variable, and then plugging back in, it noticed that 5a + 2b + 3c was the only quantity it actually needed. It set up a small linear system to find coefficients p, q, r such that p(a-2b-c) + q(-a+b-2c) + r(-2a-b+c) would collapse straight to 5a + 2b + 3c. Solving that gave p=-1, q=-2, r=-2, which it then applied to the right-hand sides: -2/5 - 6/7 - 1/3 = -167/105. Absolute value 167/105, gcd(167,105)=1, m+n=272.
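Both routes are easy to check with exact rational arithmetic. The sketch below is my own verification, not the model's output: it solves the 3x3 systems with Cramer's rule using Python's fractions module, and the helper names are arbitrary.

```python
from fractions import Fraction as F

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, rhs):
    """Solve a 3x3 linear system exactly with Cramer's rule."""
    d = det3(a)
    solution = []
    for col in range(3):
        replaced = [row[:] for row in a]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(F(det3(replaced)) / d)
    return solution

# With a = log_3 x, b = log_3 y, c = log_3 z, each equation is linear in a, b, c.
A = [[1, -2, -1],    # a - 2b - c  = 2/5
     [-1, 1, -2],    # -a + b - 2c = 3/7
     [-2, -1, 1]]    # -2a - b + c = 1/6
rhs = [F(2, 5), F(3, 7), F(1, 6)]
target = [5, 2, 3]   # we want 5a + 2b + 3c

# Shortcut: find p, q, r so that p*row1 + q*row2 + r*row3 equals the target.
transposed = [list(col) for col in zip(*A)]
p, q, r = solve3(transposed, target)
shortcut = p * rhs[0] + q * rhs[1] + r * rhs[2]

# Cross-check: solve for a, b, c directly and plug in.
a, b, c = solve3(A, rhs)
direct = 5 * a + 2 * b + 3 * c

answer = abs(shortcut)
print(p, q, r, shortcut, direct)
print(answer.numerator + answer.denominator)
```

Both routes give -167/105, and since Fraction always reduces to lowest terms, the numerator and denominator sum directly to 272.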

It then verified the answer by separately solving for a, b, c numerically and recomputing. After about 7,400 reasoning tokens, just under three minutes at 42 tok/s, it had the right answer by two different methods. The AIME 2024 problem in the same family, with base 2 and a slightly different system of exponents, gives m+n=33. Zaya1 worked through that one just as cleanly in around 5,400 reasoning tokens.

This actually shocked me: an 8B reasoning model running locally on a laptop produced a clean, audited derivation on a question that Opus 4.7 managed to get wrong twice, stating 2273 and 9. Another GPT-5.5 instance got it wrong three times, even when corrected. It first said 33, then it said 2563, then it said "It's not 272, the correct answer is 198." To be clear, that is a single problem, not a benchmark. But it's still a shocking result to watch happen on a Mac, and it at least suggests that the benchmarks shared by the team are directionally correct.

You can't run it with all the bells and whistles

The RSA part can't be run locally yet

The one thing the local deployment doesn't get is Markovian RSA at inference time. The model is co-trained on the format and the weights expect it, but the parallel-trace-and-merge scaffold only runs in Zyphra's cloud deployment for now. There's no local implementation to point at the MXFP4 quant.

To see what the gap looks like in practice, I gave the same prompt to both deployments: a Python "find_meeting_time" function over participants in different timezones, using zoneinfo, with work-hour windows in each participant's local time, busy-period exclusion, and a 14-day search horizon from now. The cloud version, with RSA active, worked through it for 27,961 reasoning tokens at roughly 58 tokens a second. That's around 482 seconds of inference, call it eight minutes, and it produced a complete working solution at the end. Long, but it finished, and the code it produced was structurally sound, with only minor bugs.
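The wall-clock figures in this piece all come from the same back-of-the-envelope division, sketched below. It ignores prefill and any scheduling overhead, and the BF16 row is a hypothetical projection of the math problem's token count onto the 7 tokens-a-second rate, not something I timed.

```python
def decode_seconds(tokens, tokens_per_second):
    """Time spent generating, ignoring prefill and overhead."""
    return tokens / tokens_per_second

runs = {
    "local MXFP4, math problem": (7_400, 42),
    "local BF16, same token count (hypothetical)": (7_400, 7),
    "cloud with RSA, find_meeting_time": (27_961, 58),
}

for label, (tokens, rate) in runs.items():
    secs = decode_seconds(tokens, rate)
    print(f"{label}: {secs:.0f} s ({secs / 60:.1f} min)")
```

The middle row is the reason 7 tokens a second isn't workable for a reasoning model: the same trace that takes under three minutes on the quant would take well over a quarter of an hour.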

The local MXFP4 run on the same prompt hit its 12k-token reasoning cap without ever producing the final code. Reading the trace afterwards, the model was doing the right work the whole way. It was converting busy intervals to UTC, handling the work-hours boundary, sketching the minute-by-minute search loop, and debating inclusive versus exclusive end-of-day. It just couldn't compress that into a finished function inside the budget I gave it, which isn't a fault of the model. The problem is that the technique that keeps long reasoning within a bounded context, parallel traces aggregated by their tails, isn't deployable locally yet. With one long chain and a finite buffer, you're hostage to the cap. With RSA splitting the reasoning into chunks and forwarding only the tail of each, you aren't.

If you want to run this locally, that's the most important thing to be aware of. The weights know how to reason their way to a multi-part answer, but the current ways of running them locally can't yet give them the room to do it.

It's AMD trained, but that's the least impressive part

The design is the bigger deal

There's a separate story attached to Zaya1 that most coverage has led with: it was pretrained end-to-end on a custom IBM cluster of 1,024 AMD Instinct MI300X GPUs. There was no Nvidia in the loop, which makes it a good proof-of-concept for AMD's software stack at this point. However, it's not the reason this model is interesting. Oh, and as my 7900 XTX attempt makes clear, "trained on AMD" doesn't automatically mean "runs on AMD" when it comes to consumer cards. I'm sure it will eventually, but given how new all of it is, even getting it running in the first place is more of a technical demonstration than anything else.

For what it's worth, Zyphra has already been using its Zaya1 model for something even more interesting. In a separate announcement, the company unveiled Zaya1-8B-Diffusion-Preview, which builds on the autoregressive Zaya1-8B base checkpoint. This is a discrete diffusion language model that drafts 16 tokens at once, with a reported 4.6x speedup using a lossless sampler and 7.7x using a mixed-logits sampler. That isn't the model I tested here, and it's still in preview, but it's a pretty big deal as well.

For a small, open-weight, Apache-2.0 reasoning model with three major technical foundations underpinning it, Zaya1-8B is a big deal. The benchmark numbers still need to be independently verified, and the model is narrow enough that it won't replace a generalist one. Still, the architectural advancements are the true story here. It's a win for AMD, but the cards used to train the model are honestly a footnote.

URL: https://www.xda-developers.com/tried-new-8b-local-llm-deepseek-r1-design/

I tried a new 8B local LLM, and its design might be the biggest shift since DeepSeek R1
