If you talk about AI in 2026 with someone who isn't knee-deep in it, the conversation usually comes back to three names. Claude, ChatGPT, and Gemini. For most people, those are "AI," and everything else is background noise. That was probably fair a year ago. It isn't anymore.
The open-weight space has had one of the wildest stretches I can remember, with major releases landing almost every week through the first half of 2026. Many of these models run on hardware you can actually own, and some of them match the big three on benchmarks that matter. A few of them are also doing things the closed labs aren't willing to try. Open-weight models also come from a very different place than closed ones: most of the serious releases this year have come out of Chinese labs. Aside from OpenAI's gpt-oss models and Google's Gemma 4 models, there hasn't really been an answer from Western labs.
Will any of these replace Claude for your daily work? It depends, and I'm not going to pretend they're universally better models, or even competitive across the board. But if your mental map of AI is made up of only the big three, you're missing out on the most interesting development in the space right now.
The top models aren't the ones you think they are
At least going by benchmarks... and no, it's also not Meta
On April 7, Z.ai (the lab formerly known as Zhipu AI) released GLM-5.1, and it posted a 58.4 on SWE-Bench Pro. That put it above GPT-5.4 at 57.7 and Claude Opus 4.6 at 57.3 on the benchmark that's become the de facto measure of agentic coding. Fourteen days later, Moonshot AI released Kimi K2.6, and it scraped past GLM-5.1 with a 58.6. Both of those models sit higher on that particular leaderboard than anything you can buy from OpenAI, Google, or Anthropic. It's possible that these models were trained on data chosen to game the benchmark, but gaming a benchmark like this isn't as simple as it sounds.
What's interesting about both Kimi K2.6 and GLM-5.1 is that neither is under a closed license. GLM-5.1 is under the MIT license, which means you can download it, fine-tune it, and ship products built on top of it with no royalty fees or usage restrictions, and Kimi K2.6 ships under a modified MIT license. The best open models are topping the charts when it comes to coding (if we remove Claude's Mythos from the mix, anyway), and, technically, anyone can just download them and use them.
Even if these models aren't "benchmaxed" (as the community refers to it), benchmarks still don't translate perfectly to how a model feels in practice. Other benchmarks matter too, such as Terminal-Bench 2.0 and NL2Repo, where Opus 4.7 still leads the pack. Opus 4.7 is probably still the top of the food chain when it comes to coding models, but the gap between these models is small on most tasks, and it's the smallest it has ever been. I'm not saying these models are better than Opus 4.7, but I'm pointing out just how close they are.
Meta, notably, just isn't in this conversation. Llama hasn't had a public release in over six months, and the top of the open-weight benchmark list this year belongs to Chinese labs. Meta recently launched its Muse Spark model, but it's closed weights, even if there are plans for open models "in the future".
GLM-5.1 is huge, powerful, and out of reach for most people
And it also doesn't use Nvidia
GLM-5.1 is a mixture-of-experts model, packing somewhere between 700B and 800B parameters, and the most interesting detail about it has nothing to do with its architecture. It was trained on 100,000 Huawei Ascend 910B chips, with no known Nvidia hardware in the pipeline at all. This is especially interesting given that one of the working assumptions of the last three years has been that serious model training only really happens on Nvidia, and GLM-5.1 is a direct counterexample.
Z.ai, the lab behind it, completed a Hong Kong IPO on January 8, 2026, raising roughly HKD 4.35 billion and making it one of the first publicly traded foundation-model companies. GLM-5.1 dropped three months later, demonstrating a pretty big step up in performance compared to its predecessor. The company has since raised its prices rather dramatically, told legacy users they will be moved to more modern plans, and has even been banning users deemed to "abuse" the platform. The company is clearly struggling for compute, as the groundbreaking nature of GLM-5.1 saw swathes of users switching to it, especially given Claude's usage limits over the past few months.
Unfortunately, running it locally is, for almost everyone, not happening. You're not fitting a 754B MoE on a single workstation even with aggressive quantization, and the people who can run it are the same handful of enthusiasts with multiple A100s stacked in a basement rack. For the rest of us, the practical way to touch GLM-5.1 is through an API, which is fine, but it isn't really a self-hostable option.
What it does give us is a reference point. GLM-5.1 is proof that the open-weight line is still moving up, and you can expect smaller distilled variants to follow. Z.ai has a habit of releasing smaller companions to its flagship models (GLM-4.5-Air came a few weeks after GLM-4.5 last year), and I'd be surprised if the same pattern doesn't play out here.
MiniMax M2.7 wants to evolve itself
And it seems like it has been
MiniMax released M2.7 on March 18, and it sits in a much more interesting spot for anyone thinking about running a model at home. It's a 230B mixture-of-experts with only 10B active parameters per token, eight of 256 experts routed per token, a 205K context window, and a near-GPT-5.3-Codex score of 56.22% on SWE-Bench Pro. On Terminal-Bench 2.0 it scores 57.0%, which is in the same ballpark as GPT-5.3-Codex. For a model you can download for free, that's a pretty remarkable place to be.
The 10B active parameter count is what determines how fast the model runs, because only that slice of the weights does work on any given token. The full 230B still has to sit in memory (or be offloaded), but a model with 230B total and 10B active will feel much closer to a smaller model in day-to-day speed, since each token only passes through a small subset of the overall parameters. The trend in these open-weight models has so far been sparsity, with labs packing in more total parameters while keeping fewer of them active, which lets you ship models that run quickly at the expense of some intelligence.
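To put the sparsity trend in numbers, here's a quick sketch that works out what share of each model's weights is active per token. The parameter counts are the rounded figures quoted in this article, and the function name is my own, so treat the output as ballpark rather than exact.

```python
# Total and active parameter counts (in billions), as quoted in this article.
MODELS = {
    "MiniMax M2.7": (230, 10),
    "Step 3.5 Flash": (196, 11),
    "Qwen3 Coder Next": (80, 3),
    "MiMo-V2-Flash": (309, 15),
    "Kimi K2.6": (1000, 32),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the weights that take part in each token's forward pass."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    pct = 100 * active_fraction(total_b, active_b)
    print(f"{name}: {active_b}B of {total_b}B active ({pct:.1f}%)")
```

Every model on that list keeps well under a tenth of its weights active per token, which is the whole trick behind why they feel faster than their headline size.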
MiniMax has claimed that the model is "self-evolving," describing M2.7 as an early demonstration of models that improve themselves through a closed-loop reinforcement-learning process using the model itself in the loop. The specifics are, predictably, vague in the marketing, but the training reports describe a system that iterates on its own outputs in agentic settings. It's one of the most interesting approaches to training I've seen a lab talk about so far.
The catch is licensing, though. MiniMax initially published M2.7 under a permissive license, then revised the Hugging Face repo to a modified MIT-style non-commercial license shortly after release. Personal use, research, and tinkering are all fine, but company members have given mixed messages about it on other platforms. All of this has caused a lot of confusion in the open-source community, which tends to be unforgiving about license rug-pulls.
If you can live with the license, though, M2.7 is one of the best value-for-size models right now. With a 3-bit quant, you can even run it on a system with 128GB of RAM, though the outputs will, understandably, be worse than those at higher quantizations.
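The 128GB figure is easy to sanity-check with some back-of-the-envelope arithmetic. This is a rough sketch that only counts the weights: it assumes every weight is stored at exactly the stated bit width, and it ignores the KV cache, runtime overhead, and the per-block scales that real quant formats carry, so actual files come out somewhat larger. The function name is mine.

```python
def weight_footprint_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = total_params_b * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# MiniMax M2.7 is quoted at 230B total parameters.
for bits in (3, 4, 8, 16):
    size = weight_footprint_gb(230, bits)
    print(f"{bits}-bit: about {size:.1f} GB of weights")
```

At 3 bits the weights land comfortably under 128GB with room left for context, while 4 bits is already a tight squeeze once you add everything else the runtime needs.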
The models that actually run on your hardware
Step 3.5 Flash and Qwen3 Coder Next
Benchmarks are fun to talk about, but running the thing is always a better test. There are two models that I'd put in front of anyone who wants a local AI experience that doesn't feel like a compromise, and the hardware requirements aren't that out of reach.
The first is Step 3.5 Flash from Stepfun, and it's an incredible model that I don't see talked about enough. It's a 196B MoE with 11B active parameters, a 262K context window, and a speed trick most models don't bother with. Step 3.5 Flash uses 3-way Multi-Token Prediction, where the model predicts three tokens ahead in parallel rather than one at a time. That lets it generate tokens much faster in typical use.
Stepfun has also put serious effort into hardware accessibility. The M4 Max Mac Studio and the DGX Spark are both listed as first-class targets, and both will run the model comfortably in production-grade quantizations. The combination of the MTP speed gains and the unified-memory architectures on those machines is why Step 3.5 Flash feels closer to a cloud model than anything else you can run locally today. If you've only ever used local models on a consumer GPU and bounced off because they felt slow, this is the one that might change your mind. You can also offload experts to system RAM, so you don't strictly need 128GB of unified memory, but it'll be slower that way.
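How much faster depends on how often the extra predicted tokens turn out to be right. Here's a toy model of that, not Stepfun's actual implementation: I'm assuming each forward pass proposes three tokens, the first is always kept, and each later one is accepted independently with some probability and only counts if everything before it was accepted. The acceptance rates below are made-up values for illustration, not measured figures.

```python
def expected_tokens_per_pass(accept_prob: float, tokens_per_pass: int = 3) -> float:
    """Expected tokens kept per forward pass under the toy model above.

    The first token is always kept. Each later token only counts if every
    token before it was accepted, so the k-th token survives with
    probability accept_prob ** (k - 1).
    """
    return sum(accept_prob ** k for k in range(tokens_per_pass))


# Hypothetical acceptance rates, purely for illustration.
for p in (0.5, 0.7, 0.9):
    print(f"acceptance {p:.0%}: {expected_tokens_per_pass(p):.2f} tokens per pass")
```

The takeaway is that the speedup is bounded by the number of tokens predicted per pass, and you only get close to that ceiling when the extra predictions are usually correct.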
Qwen3 Coder Next is the other half of the pair, and a personal favorite that I've talked about quite a bit in the past. It's 80B total parameters, 3B active, with a 256K native context window and a hybrid attention design that makes that context usable on local hardware. The architecture uses Gated DeltaNet linear attention for 75% of its layers and full attention for the remaining 25%, which means only a quarter of the layers need a KV cache that grows with context length, rather than every layer as in a conventional full-attention model.
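To see why that matters at 256K tokens, here's a rough KV-cache estimate. The layer count, KV head count, and head dimension are hypothetical placeholders I picked to make the arithmetic clean, not Qwen3 Coder Next's real configuration. The sketch also leaves out the fixed-size state that the linear-attention layers keep, since that doesn't grow with context.

```python
def kv_cache_bytes(context_len: int, attn_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for the layers that use full attention."""
    # Each full-attention layer stores one key and one value vector per token.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_value
    return context_len * attn_layers * per_token_per_layer


CONTEXT = 262_144  # 256K tokens
LAYERS = 48        # hypothetical
KV_HEADS = 8       # hypothetical
HEAD_DIM = 128     # hypothetical
GIB = 2 ** 30

all_full = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM)
hybrid = kv_cache_bytes(CONTEXT, LAYERS // 4, KV_HEADS, HEAD_DIM)

print(f"full attention in every layer: {all_full / GIB:.0f} GiB")
print(f"full attention in 25% of layers: {hybrid / GIB:.0f} GiB")
```

Whatever the real dimensions are, the ratio holds: a model with full attention in a quarter of its layers carries roughly a quarter of the growing cache.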
You can pair it with something like an RTX 4080, RTX 5090, or a GB10-based machine and get usable outputs on all three. It still requires high-end hardware, but you don't need a research lab's worth for it to be usable.
The reason I keep going back to Qwen3 Coder Next is that it sits in the sweet spot between capability and practicality. It's built for agentic coding from the ground up. It works with Claude Code as a coding harness, or you can use OpenCode, Pi, or anything else that you want to use. It's one of the only local models where I stop caring about the fact that it's local.
Xiaomi's MiMo is powerful
But proprietary
Xiaomi released MiMo-V2-Pro on March 18 as well, and it's a big deal. It packs a trillion total parameters, 42 billion active parameters, a 1-million-token context window, and it was briefly available on OpenRouter under the codename Hunter Alpha before Xiaomi officially unveiled it. At one point, it was rumored to be a DeepSeek V4 preview, but it turned out to be Xiaomi's.
Here's the thing: MiMo-V2-Pro is proprietary. It's API-only. Luo Fuli (the lead researcher on the project) said Xiaomi plans to open-source a variant of the family "when the models are stable enough to deserve it," but as of right now, you can't download it.
The one you can run is MiMo-V2-Flash, which was released back in December. It's also, arguably, the more interesting release from an engineering standpoint. It's a 309B MoE with 15B active parameters, pre-trained on 27 trillion tokens with Multi-Token Prediction, with a native 32K context that extends to 256K. MiMo-V2-Flash interleaves Sliding Window Attention and Global Attention at a 5:1 ratio with an aggressive 128-token window, which cuts KV-cache storage by almost six times while keeping long-context performance intact via a learnable attention-sink bias.
On benchmarks, MiMo-V2-Flash surpasses dense models significantly larger than it, though Xiaomi has not made quite the same splash with Flash as it has with Pro. That's partly a marketing choice (Flash was released rather quietly) and partly because Flash isn't trying to be the frontier model. It's the smart, fast, self-hostable sibling.
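The "almost six times" figure falls straight out of the 5:1 ratio. In this sketch I'm assuming every layer has the same per-token KV size, so I can count cached token slots instead of bytes. Out of every six layers, five cache at most 128 tokens and one caches the full context. The function and parameter names are mine.

```python
def kv_reduction(context_len: int, window: int = 128, swa_per_global: int = 5) -> float:
    """How many times smaller the cache is versus global attention everywhere."""
    group = swa_per_global + 1
    # Baseline: every layer in the group caches the whole context.
    all_global = group * context_len
    # Hybrid: sliding-window layers cap out at `window` tokens each.
    hybrid = swa_per_global * min(window, context_len) + context_len
    return all_global / hybrid


for context_len in (32_768, 262_144):
    print(f"{context_len} tokens: {kv_reduction(context_len):.2f}x smaller")
```

The saving approaches 6x as the context gets longer, because the five windowed layers contribute a fixed 640 slots no matter how long the conversation runs.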
Qwen 3.6 just dropped, and it's a different kind of release
It has a new approach to reasoning
If I had to pick the most exciting release of the last two months, it would have to be Qwen 3.6. Alibaba's Qwen team recently dropped two open-weight models, Qwen3.6-27B and Qwen3.6-35B-A3B, and they're doing something different from everyone else on this list. Starting with the license, both models are Apache 2.0, which is unusually permissive. You can use them commercially, modify them, and ship them without a separate agreement.
Getting into the specifics, Qwen3.6-27B is a 27B dense model (not MoE), with a 262,144-token native context window that extends to 1,010,000 tokens via YaRN RoPE scaling. It combines Gated DeltaNet with Gated Attention in a hybrid design, and it's trained with Multi-Token Prediction, which is one of the reasons it's faster than its parameter count would suggest. To make it even better, it's a multimodal model, shipping with a vision encoder that can handle both images and video.
The 35B sibling, Qwen3.6-35B-A3B, is a different shape. It's a 35B-total, 3B-active mixture-of-experts model, and it's also multimodal. The small active-parameter count means it's significantly cheaper to run than its total size implies. For agentic work where you're pushing a lot of tokens and speed is a concern, the 35B is probably the better pick.
The big new feature is that both models ship with thinking preservation, which lets the model retain reasoning context from prior turns of a conversation. Most models either throw away the chain-of-thought between messages or rehash it from scratch, and both cost you tokens. Preservation lets the model pick up a long-running agentic task where it left off, with the reasoning state intact.
Sitting above the open-weight pair is Qwen3.6-Max-Preview, which dropped on April 20 and is the Qwen family's current flagship. It's not open-weight (the Max tier of Qwen has never been), and it's available through Alibaba Cloud Model Studio and Qwen Studio. The context window is 260K, and the same thinking-preservation mechanic from the open-weight releases is wired through to the API. At $1.30 per million input tokens and $7.80 per million output tokens, it's also a fraction of the cost of Claude Opus 4.6 or GPT-5.4.
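If you want to see what those rates mean for a real workload, the arithmetic is simple. This sketch uses only the two Max-Preview prices quoted above. The token counts in the example are a made-up agentic session, and the function name is mine.

```python
INPUT_USD_PER_M = 1.30   # per million input tokens, as quoted above
OUTPUT_USD_PER_M = 7.80  # per million output tokens, as quoted above


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a given number of input and output tokens."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical long agentic session: 2M tokens read, 200K tokens written.
print(f"${request_cost_usd(2_000_000, 200_000):.2f}")
```

Agentic work tends to be input-heavy, since the model rereads files and tool output constantly, so the input rate usually dominates the bill.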
Like Xiaomi's MiMo-V2-Pro, it isn't self-hostable, but the Qwen 3.6 family is interesting precisely because it spans both worlds. The open-weight 27B and 35B-A3B models are a serious release at the community end, and Max-Preview is the frontier-tier model that the same lab built on top of the same research.
Kimi K2.6 wants to be your agentic worker
Built for long-horizon tasks
Moonshot's Kimi K2.6 is the most recent release here, dropping on April 21. It's a 1T parameter MoE with 32B active, 384 experts (eight routed plus one shared), MLA attention, a 256K context window, a modified MIT license, and native INT4 quantization so you're not doing the quant work yourself. Moonshot has also done the day-one integration work, with the model available on its official site, the developer API, Kimi Code, and Ollama on the day it dropped.
The big feature is long-horizon execution, as K2.6 can dynamically scale to 300 sub-agents working in parallel across 4,000 coordinated steps. It's running a hierarchical planner internally, farming out sub-problems to parallel instances of itself, and coordinating the results. If you do the kind of work where you would want an agent to go away and solve something complex while you do something else, K2.6 is the first open model that really seems to nail that kind of workflow.
This is one of the more exciting releases, not because it's the best at everything, but because it's the first open model built explicitly for sustained autonomous work that doesn't need to be babysat throughout.
The open-weight world doesn't need to replace Claude
It just needs to exist
The models I've covered here are the big releases, and they're the ones most people will come across first. But there's something even more interesting about what they enable: fine-tunes. Most of them are small, most of them are specialized, and a lot of them are better than a frontier model at the one thing they were trained to do. I've fine-tuned a 7B model myself to write Home Assistant automations, and it works phenomenally well. It's not smarter overall than other, much bigger models, but because it was trained on the exact shape of the problem, it gains a very narrow, exceptional capability at one specific task that others don't have the specialized training to match.
That pattern repeats all over Hugging Face. There are fine-tunes for document summarization, specific programming languages, and so much more. There's even a Pokémon fine-tune of Qwen3 Coder Next, which specializes in team building and strategy. Most of these projects are built by one or two people working on something they care about.
The open-weight world isn't going to replace Claude for most people, but it doesn't need to. Instead, it gives people a choice, and the choice has never been harder to make.
