Qwen3.5-9B has been making waves in the AI enthusiast community, especially given that Alibaba's compact reasoning model outscored OpenAI's gpt-oss-120b on GPQA Diamond, MMLU-Pro, and MMMLU, all while being able to run on rather modest hardware. A 9-billion-parameter model beating one with over 120 billion parameters sounds like it should be the story of the year, and across social media and tech blogs, it feels like it has been. After all, it's a classic "David beats Goliath" story; open source wins again, and you don't need a data center to run something competitive. It's a narrative that writes itself.
Here's my problem with that narrative, though. I've spent a lot of time with local LLMs over the past year, including Qwen's own models, and what I keep running into is a disconnect between what the benchmarks promise and what actually happens when you sit down and try to use these things for real work. Qwen3.5-9B's numbers are genuinely impressive, and I don't want to take that away from it. But the way people interpret those numbers, treating them like a definitive ranking of which model is "best," tells you more about our obsession with leaderboards than it does about what you should actually be running on your machine.
I should be clear: I'm not saying Qwen3.5-9B is bad. I'm saying that benchmarks, as they exist right now, are a terrible way to decide what model to use. And the hype around this particular set of scores is a good example of why.
The benchmarking ecosystem has a credibility problem
And it's been building for a while
If you've been following AI model releases over the past year, you've probably noticed a pattern: every new model seems to top the leaderboard. Every release comes with a chart showing it beating everything that came before. And yet, when you actually download the thing and start using it, the experience rarely matches the chart. That gap isn't a coincidence. It's the result of a benchmarking culture that has gradually drifted away from measuring anything useful.
The most obvious example is what happened with Meta and Llama 4. When Meta released Llama 4 last year, the company submitted a specially tuned "experimental chat version" of Maverick to Arena (formerly LMArena), one that was different from what the public could actually download. Arena called them out publicly, saying Meta's "interpretation of our policy did not match what we expect from model providers." That alone should have been a wake-up call. But then, in early 2026, Meta's outgoing AI chief went further and confirmed the results had been "fudged a little bit," with different models swapped in for different benchmarks to produce better numbers across the board.
Meta got caught, but the tactics they used aren't unusual. Companies routinely submit multiple entries per model, test privately, and cherry-pick the results that look best. SurgeAI analyzed 500 Arena votes and disagreed with 52% of them, concluding that confidence beats accuracy and formatting beats facts when it comes to how models get scored. New benchmarks pop up all the time, and within months, labs are optimizing specifically for them. It's become a game, and the score you see on a leaderboard often reflects how well a team played that game rather than how well the model actually performs at the task you care about.
This is the environment Qwen3.5-9B's headline numbers exist in. I'm not suggesting Alibaba gamed anything. The scores appear genuine on third-party evaluations, and the model's hybrid architecture, which combines Gated Delta Networks with a sparse Mixture-of-Experts system, is genuinely clever engineering. But when the broader ecosystem around benchmarking is this muddied, even legitimate results lose some of their meaning. You can't trust the scoreboard when you know half the players have been fudging their stats.
Beating gpt-oss-120b doesn't mean what most people think
The margins are real, but what do these tests actually measure?
First, we need to be specific about what Qwen3.5-9B actually achieved, because the details matter more than the headline. On GPQA Diamond, a graduate-level reasoning benchmark, it scores 81.7 compared to gpt-oss-120b's 80.1. On MMLU-Pro, it hits 82.5 versus 80.8. On the multilingual MMMLU benchmark, it pulls ahead with 81.2 to 78.2. A 9B model posting these results against something more than ten times its size is worth talking about, but this is also where the nuance gets lost. "9B model beats 120B" is what people read, understand, and share online.
A lot of people sharing these numbers have no idea what these benchmarks actually test, and that context changes the story quite a bit. GPQA Diamond is a set of just 198 multiple-choice questions in biology, physics, and chemistry, written by PhD-holding domain experts. The questions are specifically designed to be "Google-proof," meaning you can't just look up the answer. Even PhD experts in the relevant fields only get about 65% of them right, and skilled non-experts with unrestricted internet access manage just 34%. It's a test of deep academic reasoning in narrow scientific domains, not a test of whether a model can help you debug your Python script or write a decent email.
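To put the GPQA Diamond margin in perspective, it helps to convert percentage points into questions. What follows is a back-of-the-envelope sketch, not anything from Alibaba's or OpenAI's evaluation setup. It assumes each reported score is a plain accuracy over the 198 questions from a single pass (published scores are often averaged over several runs, so treat the output as approximate), and the function names are my own.

```python
import math

GPQA_DIAMOND_QUESTIONS = 198  # size of the question set, as stated above


def points_to_questions(points: float, n_questions: int) -> float:
    """Convert a gap in percentage points into an equivalent number of questions."""
    return points / 100 * n_questions


def standard_error_points(score: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points.

    Assumption: questions are independent and the score comes from one pass.
    """
    p = score / 100
    return 100 * math.sqrt(p * (1 - p) / n_questions)


qwen_score = 81.7     # Qwen3.5-9B on GPQA Diamond
gpt_oss_score = 80.1  # gpt-oss-120b on GPQA Diamond

gap_points = qwen_score - gpt_oss_score
gap_questions = points_to_questions(gap_points, GPQA_DIAMOND_QUESTIONS)
se = standard_error_points(qwen_score, GPQA_DIAMOND_QUESTIONS)

print(f"Gap: {gap_points:.1f} points, roughly {gap_questions:.1f} questions")
print(f"One standard error at this score: about {se:.1f} points")
```

Under those assumptions, the 1.6-point lead works out to about three questions out of 198, which is less than one standard error for a test this small.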
MMLU-Pro is broader, covering around 12,000 questions across 14 categories, from computer science and engineering to law, history, and philosophy. It's an upgrade over the original MMLU, with ten answer choices instead of four and a heavier emphasis on reasoning over pure knowledge recall. It's a better benchmark than most, but it's still fundamentally a multiple-choice exam. The kind of work most people do with language models, writing, summarizing, coding, brainstorming, doesn't look anything like picking option F out of ten choices on a physics problem.
Then there's MMMLU, which takes the original MMLU questions and translates them into 14 languages. It's useful for measuring whether a model handles languages beyond English, but the underlying test is still the same knowledge-and-reasoning multiple-choice format. If you're not using your model in Korean, Arabic, or German, this score tells you very little about your experience.
What doesn't travel with the headline is that gpt-oss-120b still outperforms Qwen3.5-9B on complex reasoning chains and certain code generation tasks. The 9B model wins on tasks that align with its training strengths, particularly academic knowledge and multilingual comprehension. It doesn't win everywhere, and you shouldn't expect it to.
The full benchmark picture tells a different story
Where Qwen3.5-9B wins and where it doesn't
The numbers we've talked about so far only cover the benchmarks that made headlines. When you zoom out and look at all 26 benchmarks published by Alibaba across knowledge, reasoning, coding, instruction following, long context, and multilingual performance, the picture looks a lot less one-sided.
Across every shared benchmark, you can see where each model pulls ahead. Qwen3.5-9B leads on knowledge tasks like C-Eval, long context handling, instruction following, and multilingual evaluations. gpt-oss-120b, on the other hand, dominates on reasoning and coding, particularly on HMMT (competition math), LiveCodeBench, and OJBench. It's not a clean victory for either model; it depends entirely on which category you're looking at.
The category breakdowns make this even clearer. Qwen3.5-9B sweeps the knowledge and STEM benchmarks that made headlines, but look at LiveCodeBench and OJBench, the benchmarks that test actual code generation, and gpt-oss-120b leads by a wide margin (82.7 vs 65.6 on LiveCodeBench, 41.5 vs 29.2 on OJBench). If you care about coding, and a lot of people running local models do, the "9B beats 120B" headline suddenly feels misleading.
When you tally up outright wins across all 26 benchmarks, Qwen3.5-9B takes ten and gpt-oss-120b takes eight. That's a much closer race than the headlines suggest, and gpt-oss-120b's wins are concentrated in the reasoning and coding benchmarks, arguably the ones that matter most when it comes to practical tool-calling work. You'll typically need to pair a local model with MCP servers and tool calling to get an experience anything like a cloud model's, and coding models tend to be trained well for tool calling and instruction following.
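If you want to check the tally and the coding gaps yourself, the arithmetic is simple enough to sketch. This uses only the figures quoted in this section, and the variable names are mine. The text doesn't say whether the benchmarks with no outright winner are ties or cases where only one model has a published score, so the code just counts them.

```python
TOTAL_BENCHMARKS = 26

# Outright wins per model, as counted in the text above.
outright_wins = {"Qwen3.5-9B": 10, "gpt-oss-120b": 8}

# Scores quoted above for the two coding benchmarks: (Qwen3.5-9B, gpt-oss-120b).
coding_scores = {
    "LiveCodeBench": (65.6, 82.7),
    "OJBench": (29.2, 41.5),
}

decided = sum(outright_wins.values())
undecided = TOTAL_BENCHMARKS - decided  # no outright winner in this tally

# Size of gpt-oss-120b's lead on each coding benchmark, in points.
coding_gaps = {
    name: round(gpt_oss - qwen, 1)
    for name, (qwen, gpt_oss) in coding_scores.items()
}

print(f"{decided} of {TOTAL_BENCHMARKS} benchmarks have an outright winner; {undecided} do not")
for name, gap in coding_gaps.items():
    print(f"{name}: gpt-oss-120b leads by {gap} points")
```

The coding leads come out at 17.1 and 12.3 points, an order of magnitude larger than the 1.6-point GPQA Diamond margin that made the headlines.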
With that said, none of this takes away from what Qwen3.5-9B actually achieved. Alibaba built a compact model that competes with something vastly larger on a specific set of evaluations, and it did so with an architecture that runs on consumer hardware. That's a meaningful achievement in making AI more accessible. But it's a very different claim from "this is the best model," and the internet has mostly been treating it as the latter.
I've used Qwen's models for real work, and that taught me more than any leaderboard
Qwen3-Coder-Next is a phenomenal model
I've been running local LLMs daily for a while now. I've used all sorts of models, from small quantized experiments to bigger ones that will only run on the Lenovo ThinkStation PGX. And my experience with Qwen3-Coder-Next, specifically, has been the best I've had with any local coding model. When I tested it, it blew me away because it genuinely changed what I thought was possible without paying for a cloud provider. Its tool calling just works. It handles multi-step reasoning without falling apart, and when it hits an error, it recovers and tries a different approach instead of spiraling into the same mistake over and over. For a local model, that kind of thing felt incredibly impressive.
What's interesting is that while Qwen3-Coder-Next and Qwen3.5-9B both use Alibaba's hybrid architecture of Gated Delta Networks and sparse Mixture-of-Experts, they're built for very different things. Qwen3.5-9B is a compact generalist with 9 billion parameters, designed to be well-rounded across knowledge, reasoning, and multilingual tasks. Qwen3-Coder-Next is a much larger model at 80 billion total parameters, but it only activates around 3 billion of them at any given time, making it ultra-sparse. More importantly, it was trained in a fundamentally different way: executable task synthesis and reinforcement learning from actually interacting with code environments, not just answering multiple-choice questions about code. Same architectural family, but one is a generalist built to ace a broad set of evaluations, and the other is a specialist built to write and debug code in the real world.
That difference in purpose and training matters more than most people realize. Qwen3.5-9B's benchmark wins come from being a well-rounded generalist on academic evaluations. Qwen3-Coder-Next's strengths come from being purpose-built for agentic coding work, tool calling, multi-step execution, and recovering from errors in real environments. The benchmarks where Qwen3.5-9B shines, like GPQA Diamond and MMLU-Pro, don't test any of that. Meanwhile, on SWE-Bench, which has models actually solving real GitHub issues in real codebases, Qwen3-Coder-Next performs comparably to models with 10 to 20 times more active parameters. That's a benchmark that looks a lot more like the work people use these models for, but these practical tests aren't the ones that make headlines when it comes to models like Qwen3.5-9B.
But here's what benchmarks can't measure conclusively when comparing models: the feel. The feeling of working with a model matters enormously, and you can't quantify it on a leaderboard. Like, how does it handle an ambiguous prompt? Does it ask you to clarify, or does it just barrel ahead and give you something confidently wrong? How does it behave when the context gets long and messy, the way real projects always do? These are things you learn only by sitting down and building something with a model over days and weeks. You don't learn them from a score.
And that gap between evaluation and reality isn't small. When coding agents score in the low nineties during controlled testing, they often drop far below their expected performance when you put them in front of a real codebase. They invent APIs that don't exist. They skip tools that are right there. They loop endlessly on the same failed approach, all while technically having "near-perfect" benchmark scores. I've seen it happen. It's not unique to any one model family; it's a systemic problem with how we evaluate these systems versus how we actually use them. That's why Qwen3-Coder-Next was so impressive: it didn't fall victim to those same problems when used in an appropriate harness.
So when I see that Qwen3.5-9B "beat" gpt-oss-120b, I'm not wondering by how much; I'm wondering at what, specifically. And more often than not, the "what" doesn't matter all that much for most people.
Picking a model should be about your workflow, not a leaderboard
It all depends on what you need
If you're choosing a local model to run on your own hardware, benchmarks should be one input among several, and probably not the most important one. What are you actually trying to do with it? How much VRAM do you have? Do you need tool calling that works reliably, or multilingual support, or raw code generation? Are you comfortable with higher latency if it means better reasoning? These questions matter more than any single score.
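The VRAM question, at least, is one you can rough out before downloading anything. Here's a minimal sketch that estimates the memory needed for the weights alone at a few quantization levels. The helper name is hypothetical, the parameter counts are the nominal figures mentioned in this article, and the estimate deliberately ignores the KV cache, activations, and runtime overhead, so real usage will be higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes).

    Assumption: every parameter is stored at the same bit width. Ignores the
    KV cache, activations, and runtime overhead.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Nominal total parameter counts (in billions) as given in this article.
models = {"Qwen3.5-9B": 9, "Qwen3-Coder-Next": 80, "gpt-oss-120b": 120}

for name, params in models.items():
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: ~{gb:.1f} GB for weights")
```

Note that for a sparse model like Qwen3-Coder-Next, the estimate uses total parameters rather than active ones, on the assumption that all of the weights still have to be loaded even though only a fraction are used per token.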
Qwen3.5-9B is a strong model for what it is. For multilingual tasks and academic-style reasoning, it's hard to beat at its size. If you want something that runs on a laptop and handles knowledge-heavy tasks well, it's a solid pick. Yet even though it "beats" gpt-oss-120b, I'd still rather use OpenAI's model than Qwen's if the option were on the table. For coding, I'd point you toward Qwen3-Coder-Next, which is far more practical for all kinds of work that involves tool calling. The reality is that you should use different models for different jobs, which is a lot less exciting than the one-size-fits-all approach that people assume these benchmarks point towards.
Thankfully, we're reaching the stage where the gap between benchmark scores and real-world results is getting harder for anyone to ignore, and it's only going to become more obvious as people actually deploy these models and report back on what works. I'd take every leaderboard with a grain of salt, and I'd put a lot more trust in trying a model yourself for a few days than in any chart someone posts online.
Qwen3.5-9B deserves attention, and the engineering behind it is impressive. But it doesn't deserve the crown that the benchmarks alone seem to hand it. No model does right now, because the benchmarks aren't measuring what most people think they're measuring.
