Local LLMs are one of those things that started off as a novelty and ended up being more useful than I expected. After running them for months now, I'm still kind of impressed that you can hand your hardware some open weights and have a real conversation with it offline. The two I keep coming back to are Gemma 4 (the E4B variant on desktop, E2B on mobile) and Qwen 3.5 9B. I've used both for a while now and there are noticeable differences between the two. So I thought I'd put them head-to-head to see which one is actually the better option for my use cases in particular.
Want to stay in the loop with the latest in AI? The XDA AI Insider newsletter drops weekly with deep dives, tool recommendations, and hands-on coverage you won't find anywhere else on the site. Subscribe by modifying your newsletter preferences!
I ditched LM Studio for an open-source alternative — and my local model is doing things it couldn't before
It's better in all the ways I needed local AI to be better
A quick overview of Qwen and Gemma
Why these two ended up in my rotation
Qwen 3.5 9B dropped in February as part of the Qwen 3.5 family. It's a dense 9B model with a hybrid Gated DeltaNet architecture, which just means it handles long context without VRAM climbing up the way it tends to with standard transformers. The context window is 262K natively but extendable past a million tokens with YaRN, plus it's multimodal. What got me hooked early on was the GDN side - a 9B running at 60K context length or higher on my 8GB VRAM isn't something I was able to push before.
Gemma 4 is Google's open-weight family with four sizes - the E2B and E4B edge variants made for phones and laptops, plus the bigger 26B MoE and 31B dense models. The E is for "effective" parameters since Per-Layer Embeddings keep the active footprint small, so E4B is roughly 8B total weights with 4.5B effective. The context window goes up to 128K, which is pretty decent. What surprised me about the edge variants, though, is that they actually ship with more than the bigger variants do. Audio input only exists on E2B and E4B because the encoder was built specifically for low-memory devices, so the small variants aren't just stripped-down versions of the larger ones.
Google's pitch for the edge variants is around on-device utility, so things like smart replies, summarization, and anything that needs to run locally without latency. Qwen positions 3.5 as a native multimodal agent with the reasoning side really leaned into. For my use as a general chatbot user, both of them slot really well into a workflow I already had, but if I were to put them side by side for the same tasks...
Just talking to them
Where the brains actually show up
For the chat side, I ran both at their recommended sampling settings for my general usage (which leans a tad creative). So Qwen at temp 1.0, top-p 0.95, presence penalty 1.5, and Gemma at its own standardized config of temp 1.0, top-p 0.95, top-k 64. I also kept the browser MCP disabled. The idea was to see what each one knows from training data alone without external lookup helping out.
Qwen pulls clearly ahead on reasoning. Whenever I ask it to work through something multi-step, say, structuring a study guide or breaking down a topic I'm trying to understand, the responses just feel deeper than what Gemma's giving me on the same prompts. There's a technical reason behind it too. Qwen 3.5's thinking mode is inherited and refined from Qwen 3, so it's more battle-tested. And the 9B actually beats the previous-gen Qwen3-30B on reasoning benchmarks, which is a model more than three times its size.
Gemma doesn't get embarrassed in this department, it's just not where it shines. The responses tend to be shorter and more conversational, which is sometimes exactly what I want when I'm just chatting back and forth. Latency is also faster on my hardware since the model's optimized for edge.
On the tooling side, both models have native function calling built in, so they're roughly equal there on paper. In practice though, how well MCP tool use actually works tends to come down to the runner more than the model.
Everything beyond typing
Beyond text-only chats
Documents up first. Though if I had to be precise about it, multimodal refers to models with actual encoders for processing images, audio, and visual layout directly. So it can "see" a PDF as pixels, charts and all, but the RAG ability depends more on the runner than the model here too. Regardless, documents are part of how I use AI so it was worth a quick pass with these two. Qwen does have the bigger context ceiling, which makes it more suitable for larger reports, but that's not quite what I need it for. To me, after testing three different runners, I'd pick LM Studio for doc work regardless of the model I'm running.
Images are where I started to see the difference more noticeably, and Gemma 4 E4B pulled ahead for me especially when it comes to work at the crossroads of creativity and analytics, such as a screenshot of a design system. There's actually a good reason for this. Gemma 4's documentation specifically calls out screen and UI understanding as a core image use case for the model, alongside chart comprehension and OCR. The model was deliberately trained for that profile. Qwen 3.5 9B isn't weak on vision either, but Gemma's training is just more directly tuned for the kind of work I throw at it.
Score deals on computers & work-setup gear to run LLMs
Audio is a Gemma-only thing. It's got native ASR and speech-to-translated-text on E2B and E4B, and I'm honestly more impressed by the capability than I'm actively using it - I tested it through llama.cpp and the recognition was solid, but it hasn't really worked its way into my regular flow yet.
Where each one still has a place
It's just about picking the right one for your workflow
Qwen 3.5 9B is still the model I open for stuff that's heavier on thinking, its thinking is more mature than Gemma's after all. It's also the better option for longer sessions considering my limited hardware. But Gemma has the wider footprint in my setup overall. E4B is what I reach for on desktop whenever images are involved, and the E2B variant also follows me onto mobile through PocketPal, so it's the more accessible option of the two.
The one I keep opening
For my use cases in particular, it's Gemma 4 E4B that pulls ahead. The responses feel closer to a cloud chatbot in tone, it's better with visual work, and it's more lightweight, and I can access it from pretty much anywhere. None of this means Qwen 3.5 9B isn't a seriously impressive model though - it has the edge in reasoning, making it more suitable for research and studying.
