Local LLMs have gone from a curiosity thing I poked at, to something I actually use on the regular now. Not for everything - cloud AI still wins at most tasks - but mainly personal documents I need to work through, anything where handing a cloud model the context feels like the wrong call. This use case alone made the whole thing worth setting up, and from there, I've just started paying more attention to what open models are actually capable of doing.

Which is what led me to Gemma 4, Google DeepMind's latest open-weight family. It's been sitting in my rotation for a few weeks now because I could actually run it on my hardware. And what I found was less about whether it could keep up with cloud AI, and more about what it could do that many larger models straight up don't offer.

Want to stay in the loop with the latest in AI? The XDA AI Insider newsletter drops weekly with deep dives, tool recommendations, and hands-on coverage you won't find anywhere else on the site. Subscribe by modifying your newsletter.

Google built one of the most accessible open models

What makes Gemma 4 unique

Gemma 4 is Google DeepMind's fourth-generation open-weight model family, released in April 2026 and built on the same research that powers their Gemini line. It's under Apache 2.0, which is a bigger deal than it sounds - previous Gemma releases had Google's own restrictive terms attached, so this was the first time you could actually take the weights, fine-tune them, build something commercial, and not have to read the fine print. There are four sizes in the family: E2B, E4B, 26B A4B, and 31B, all of them are multimodal and handle text and images. But only the two smallest ones, E2B and E4B, handle audio. That inversion is what made the E4B interesting to me.

The architecture is why it runs on my smaller GPU without complaining. The E4B is a dense model, not MoE like the 26B variant, so it's not activating a slice of a bigger parameter pool, it's just a smaller model engineered to behave efficiently. It uses Per-Layer Embeddings (PLE) to keep active compute low, and a hybrid attention setup that combines local sliding window attention with global attention only in the final layer, so basically, it's not holding everything in memory at once. At Q4 quantization it fits in around 3 to 6GB of VRAM. That's the edge-optimized design doing its job, it was built to run on phones and Raspberry Pis, so my 8GB PC VRAM is comfortable territory for it.

For someone like me - not a developer, just a regular user with a modest local setup - the E4B slots in as a general-purpose model that happens to handle a bit of everything. The obvious uses are the ones I already lean local for: private documents, anything health or finance adjacent, and also longer research sessions where I don't want a usage cap cutting me off, and things along those lines. But the image and audio capabilities change what's possible in that context. You can drop a screenshot in, hand it an audio clip, prompt it about what it's hearing and it does actual reasoning over that input, not just transcribing it.

My first run with Gemma 4 in LM studio

It was a mixed bag

LM Studio was the obvious starting point. It's what I already run all my other local models through because the simple install and GUI makes local LLMs more approachable. I've been seeing the hype around Gemma 4 and decided to give it a spin. My first impressions of it were a genuine mixed bag.

The text responses were weird. Not bad exactly, just hard to parse - the outputs were mixing the model's thinking/reasoning process in with the actual response, so I couldn't easily tell where one stopped and the other started. I've tried various combinations of parameters, settings, and system prompts, but it kept doing this. After doing a bit of digging, apparently it's a bug in LM Studio, which meant there was a fix. So I added {%- set enable_thinking = false %} to the model's Prompt Template, but that still didn't fix it for me. Which means I'll probably just have to live with it for now.

Its outputs are still decent though - I use parameter presets for different use cases, and it delivers on each one of them. Gemma 4 E4B has a context window of 128k tokens. On my hardware, I can comfortably keep the window around 40k, sometimes pushing it up to 70k. This easily gives me 30+ prompts per session.

Image analysis was a different story. I dropped in some screenshots and design files and the reads were accurate - it picked up on layout issues, flagged inconsistencies, gave me the kind of feedback that actually requires understanding what's on the screen rather than just describing it. That's not a given at this size. My Qwen 3.5 9B handles images well too, but Gemma actually felt more precise with design-specific context.

Then I took a detour with llama.cpp

Where Gemma 4 gets more useful

Llama.cpp is an open-source, high-performance C++ library designed for running large language models efficiently on local, consumer-grade hardware. There are many reasons to pick llama.cpp for running your models over runners like LM studio or Ollama, such as more granular control over inference settings, better stability with newer models, and more precise VRAM management. But for me, I just wanted audio support to test Gemma 4's ASR (automatic speech recognition), because LM Studio doesn't have audio support. This is the thing that sets apart the E4B variant from larger models. The audio encoder was built into the edge models by design - Google shrunk it down specifically for low-memory devices, which is why the bigger models don't have it.

I barely touch my terminal, but llama.cpp was pretty easy to get running. I downloaded the prebuilt version of llama.cpp (no compiling needed), grabbed the Gemma 4 model file plus a second file that handles audio/image input, and dropped them all in the llama.cpp folder. Then I navigated to that folder in PowerShell, and ran one command that started a local server:

.\llama-server.exe --model gemma-4-E4B-it-Q4_K_M.gguf --mmproj mmproj-F16.gguf -ngl 99

Then I just opened the GUI in my browser with the localhost address. From there, I got chatting. The first thing I noticed was the improvement in the text responses - llama.cpp has a separate collapsible box for Gemma's reasoning/thinking process, so it didn't overcrowd the chat space. The downside is that it took a bit longer to generate its responses. The samplings and penalty settings give you more to work with than LM Studio, but I only tweaked my regulars (temperature, repeat penalty, min-p, etc) because the rest of the nobs are a bit niche with a marginal impact on daily chat.

Its image analysis abilities were the same as in LM Studio, so not much difference there. The real test came with audio. Llama.cpp doesn't have a live recording button, so you'll need to upload your audio files, and they need to be in WAV format. I gave it an audio recording of me voicing one of the same text prompts I had sent it before. This was actually the first time I'd utilized audio inputs in a local LLM, so it was a pleasant surprise to see its accuracy in interpreting my input. It gave me a similar response as the first time I sent the same prompt in text - same length, depth, and structure.

While I would like to see a live record function, this still demonstrates foundational capability for audio understanding within a local, private, and customizable environment. And it could be a viable option for individuals who don't use keyboards or mice and navigate their computers with speech recognition.

Open models got better when I wasn't looking

Gemma 4 E4B wasn't what I expected from a free, local model. The image analysis holds up against cloud AI, dare I say. And the audio side - even with the extra steps involved - works well enough to be genuinely useful. The thinking process bleed in LM Studio is still annoying me, but that's a runner issue, not a model issue. Overall, Gemma 4 is proving to me that open-source models are catching up faster than we realize.