DeepSeek V4, Qwen 3.5, and GLM 5: The Next Open Models for Local Inference
February is shaping up to be an interesting month for people who run LLMs locally. New versions of three of the most widely used open model families are expected to land soon: DeepSeek V4, Qwen 3.5, and GLM 5. These models sit at the center of the local LLM community, especially for users who care about quantization, VRAM limits, and performance per dollar.
Details like exact sizes and architectures are not public yet. Still, based on the current generations, we can make some reasonable assumptions about what is coming and how practical these models will be on local hardware.
DeepSeek V4 and the Possibility of Engram Memory
Right now, DeepSeek’s flagship models are DeepSeek V3 and DeepSeek R1, with the checkpoints discussed here dating from March and May 2025. Both are massive 671B MoE models with 128K context. Even at 4-bit quantization, you are looking at roughly 400GB of memory, which makes them unrealistic for most home setups. The exception is unified memory systems like a Mac Studio with an M3 Ultra and 512GB of memory.
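A quick back-of-the-envelope check shows where the roughly 400GB figure comes from. The sketch below only counts weights, and the bits-per-weight values are assumptions: practical 4-bit formats store scaling factors and keep some tensors at higher precision, so their effective rate lands above a clean 4.0. The 4.8 figure is an illustrative guess for a mixed 4-bit quant, not a measured value.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    # (billions of params) * bits / 8 = billions of bytes = GB
    return total_params_billions * bits_per_weight / 8


# 671B comes from the text; the bits-per-weight values are assumptions.
for label, bpw in [
    ("ideal 4-bit", 4.0),
    ("mixed 4-bit quant (assumed 4.8 bpw)", 4.8),
    ("8-bit", 8.0),
]:
    print(f"{label}: {weight_memory_gb(671, bpw):.1f} GB")
```

KV cache and runtime buffers come on top of this, which is why even the optimistic 335GB case is far outside consumer GPU territory.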
DeepSeek V4 is expected to continue this trend. Most signs point to another large MoE model, likely with an even larger context window. From a raw hardware perspective, that alone would not change much for local users. The interesting part is the rumored adoption of Engram-style memory.
The Engram architecture effectively splits the model into two parts. There is a smaller, dense backbone that focuses on reasoning, math, and context handling. Separately, there is a very large memory table that stores world knowledge. The key idea is that the main network no longer needs to memorize massive amounts of static information, which lets it spend its capacity on reasoning depth without increasing compute cost.
What matters for local inference is how Engram handles offloading. Unlike traditional CPU offloading, where the GPU stalls while waiting for data over PCIe, Engram memory access is deterministic. The system knows in advance which memory entries it will need. That allows the runtime to prefetch data asynchronously from system RAM while the GPU is still busy processing the previous token.
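To make the prefetching idea concrete, here is a toy sketch of the overlap pattern. It is not Engram's actual implementation: `memory_key`, `fetch_from_host`, and `compute_step` are hypothetical stand-ins, and the key function (a hash over the last few token ids) is an assumption chosen only because it depends on nothing but the input tokens. The sketch walks over a known token sequence, as during prompt processing, and starts the next lookup before the current step's compute runs.

```python
from concurrent.futures import ThreadPoolExecutor

TABLE_SIZE = 1024  # toy size; a real memory table would be far larger


def memory_key(token_ids, pos, ngram=2):
    """Hypothetical key: a hash of the last `ngram` token ids ending at `pos`."""
    window = token_ids[max(0, pos - ngram + 1): pos + 1]
    key = 0
    for t in window:
        key = (key * 1_000_003 + t) % TABLE_SIZE
    return key


def fetch_from_host(table, key):
    """Stand-in for reading an entry from system RAM and copying it to the GPU."""
    return table[key]


def compute_step(token_id, memory_value):
    """Stand-in for the GPU work done for one token."""
    return token_id + memory_value


def run_with_prefetch(token_ids, table):
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The first fetch is issued before any compute starts.
        pending = pool.submit(fetch_from_host, table, memory_key(token_ids, 0))
        for pos, tok in enumerate(token_ids):
            value = pending.result()  # ideally finished while the previous step computed
            if pos + 1 < len(token_ids):
                # The next key needs only token ids, so its fetch can start now
                # and overlap with the compute below.
                next_key = memory_key(token_ids, pos + 1)
                pending = pool.submit(fetch_from_host, table, next_key)
            outputs.append(compute_step(tok, value))
    return outputs


def run_sequential(token_ids, table):
    """Reference version: fetch, then compute, with no overlap."""
    return [
        compute_step(tok, fetch_from_host(table, memory_key(token_ids, pos)))
        for pos, tok in enumerate(token_ids)
    ]
```

Both functions produce the same outputs; the only difference is when the fetch is issued. In a real runtime the background thread would be replaced by an asynchronous host-to-device copy, and the gain depends on the copy finishing before the GPU needs the data.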
According to the published Engram benchmarks, offloading a 100B-parameter memory table entirely to host RAM resulted in less than a 3 percent throughput penalty. For local builders, this is a big deal. In practical terms, it suggests a future where the reasoning core runs fully in GPU VRAM on a card like an RTX 3090 or 4090, while the bulk of the model’s knowledge lives in 64GB or 128GB of system RAM. Given the price gap between DDR5 memory and high-VRAM GPUs, this could significantly improve performance per dollar for local inference.
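A rough placement check shows how such a split might map onto consumer hardware. Everything here except the 100B table size is an assumption: the 30B backbone is a hypothetical figure picked for illustration, the reserve values are guesses for KV cache and operating system overhead, and nothing is known yet about how DeepSeek V4 would actually be sized or quantized.

```python
def size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight size in decimal GB for a given parameter count and precision."""
    return params_billions * bits_per_weight / 8


def placement_plan(backbone_b, table_b, backbone_bits, table_bits,
                   vram_gb, ram_gb, vram_reserve_gb=4.0, ram_reserve_gb=8.0):
    """Check whether the backbone fits in VRAM and the memory table in system RAM.

    The reserves are assumed headroom for KV cache (VRAM) and the OS (RAM).
    """
    backbone_gb = size_gb(backbone_b, backbone_bits)
    table_gb = size_gb(table_b, table_bits)
    return {
        "backbone_gb": backbone_gb,
        "table_gb": table_gb,
        "backbone_fits_vram": backbone_gb <= vram_gb - vram_reserve_gb,
        "table_fits_ram": table_gb <= ram_gb - ram_reserve_gb,
    }


# Hypothetical 30B backbone on a 24GB card, 100B table in 64GB or 128GB of RAM.
print(placement_plan(30, 100, 4, 4, vram_gb=24, ram_gb=64))
print(placement_plan(30, 100, 4, 8, vram_gb=24, ram_gb=64))
print(placement_plan(30, 100, 4, 8, vram_gb=24, ram_gb=128))
```

Under these assumptions, a 4-bit table of that size takes about 50GB and fits in a 64GB machine, while an 8-bit table takes about 100GB and needs the 128GB configuration.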
That said, even if DeepSeek V4 uses this architecture, it will take time before local inference engines like llama.cpp fully support it. Unless DeepSeek provides early tooling or reference implementations, adoption will not be instant. A multimodal DeepSeek model would also be a welcome addition, especially if vision support can be exposed cleanly to local runtimes.
Qwen 3.5 as a Broad and Practical Update
Qwen 3.5 looks more like a full refresh of an already well-structured model lineup. The expectation is that Alibaba will update the entire range, from small 0.6B models all the way up to large models like Qwen3 Coder 480B.
For local LLM users, the most important updates are likely to be in the mid-range. New versions of Qwen 3 30B A3B and Qwen 3 32B are particularly interesting. These models are already among the most commonly run local LLMs, alongside models like GPT-OSS 20B. They fit comfortably on a single consumer GPU with 24GB of VRAM at 4-bit quantization, while still leaving enough headroom for a useful context size.
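How much context fits in the leftover VRAM depends mostly on the KV cache. The sketch below estimates it for a 32B-class dense model on a 24GB card. The architecture numbers (64 layers, 8 KV heads, head dimension 128) and the weight and overhead sizes are assumptions for illustration; check the actual model config and your runtime's reported memory before relying on them.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The factor of 2 covers keys and values; bytes_per_value=2 assumes fp16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_context_tokens(vram_gb: float, weights_gb: float, overhead_gb: float,
                       per_token_bytes: int) -> int:
    """Tokens of context that fit in whatever VRAM is left after weights and overhead."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))


# Assumed values for a 32B-class dense model with grouped-query attention.
per_token = kv_cache_bytes_per_token(layers=64, kv_heads=8, head_dim=128)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")

# Assumed: ~19.2 GB of 4-bit weights and 1.5 GB of runtime overhead on a 24 GB card.
print("Context that fits:", max_context_tokens(24, 19.2, 1.5, per_token), "tokens")
```

Quantizing the KV cache to 8 bits (bytes_per_value=1) halves the per-token cost and roughly doubles the context that fits, which is one reason these mid-range models stay usable on a single card.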
If Qwen 3.5 brings incremental quality improvements, better instruction following, or more efficient attention, these models will likely remain default choices for people who want strong general-purpose performance without moving to multi-GPU setups.
GLM 5 and the Hope for a New Air Model
GLM 5 has been confirmed in a post from one of its developers, so this release feels more concrete. The most relevant current local model in this family is GLM 4.5 Air. It is a 106B MoE model with 12B active parameters. For local use, it requires around 68GB of memory at practical quantization levels.
In real-world setups, users tend to run it on unified memory systems like AMD Strix Halo or Apple silicon machines with 128GB of memory. At full 131K context, it can reach around 91GB of memory usage, which also makes it viable on workstation GPUs like the RTX Pro 6000 Blackwell.
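The two memory figures quoted for GLM 4.5 Air can be turned into a rough estimate for other context lengths. This sketch assumes memory grows linearly with context, treats the 68GB figure as a zero-context baseline, and reads "131K" as 131,072 tokens. The 96GB capacity used for the RTX Pro 6000 Blackwell is also an assumption to verify, and on unified memory systems not all of the 128GB is available to the GPU.

```python
BASE_GB = 68.0                 # weights at a practical quant, from the text
FULL_CONTEXT_GB = 91.0         # total at full context, from the text
FULL_CONTEXT_TOKENS = 131_072  # assumes "131K" means 131,072 tokens

PER_TOKEN_GB = (FULL_CONTEXT_GB - BASE_GB) / FULL_CONTEXT_TOKENS


def estimated_memory_gb(context_tokens: int) -> float:
    """Linear interpolation between the two figures quoted in the text."""
    return BASE_GB + PER_TOKEN_GB * context_tokens


def max_context_for(capacity_gb: float) -> int:
    """Largest context (capped at the full window) that fits in capacity_gb."""
    tokens = int((capacity_gb - BASE_GB) / PER_TOKEN_GB)
    return max(0, min(tokens, FULL_CONTEXT_TOKENS))


print(f"Per-token cost: {PER_TOKEN_GB * 1e6:.0f} KB")
print(f"At 32K context: {estimated_memory_gb(32_768):.2f} GB")
for name, capacity in [("96GB workstation GPU (assumed)", 96), ("80GB budget", 80)]:
    print(f"{name}: up to {max_context_for(capacity)} tokens")
```

Under these assumptions the full window fits within 96GB with a few gigabytes to spare, while an 80GB budget would cap context at roughly half the window.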
The biggest open question is whether GLM 5 will introduce a new Air model. The Air line has not been updated since 4.5, and it fills an important niche between smaller dense models and extremely large MoE systems. There is also an expectation that the current 355B flagship model, last updated as GLM 4.7, will receive a refresh.
If GLM 5 improves efficiency or reduces active parameter counts while maintaining quality, it could remain one of the more attractive options for users with large unified memory systems.
Other Rumors: Meta’s Avocado
There are also rumors about a new Meta model internally referred to as Avocado. At this point, it is unclear whether it will be open, what size it will be, or whether it will be relevant for local inference at all. Until more details are available, it remains speculative.
Conclusion
February could bring meaningful updates across all three major open model families. Qwen 3.5 looks like the safest near-term win for single-GPU users, GLM 5 may strengthen the high-memory niche, and DeepSeek V4 has the potential to reshape local inference if Engram-style memory becomes real and usable.
