Mellum2 Thinking — GGUF (Q4_K_M)
This repository contains a GGUF Q4_K_M quantization of
JetBrains/Mellum2-12B-A2.5B-Thinking, ready to run with
llama.cpp, Ollama, LM Studio, and
other GGUF-compatible runtimes.
This quantization (Q4_K_M): 4-bit k-quant (medium). Strong quality/size trade-off (KLD 0.052 and 89.8% top-token agreement against BF16), which makes it a good default.
| File | Size |
|---|---|
| Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf | 8.1 GB |
Mellum 2 Thinking is a Mixture-of-Experts reasoning model (64 experts, 8
activated per token, 131,072-token context) that emits its chain of thought
inside <think>...</think> blocks before the final answer. For the full model
description, evaluation results, and architecture details, see the original
model card: JetBrains/Mellum2-12B-A2.5B-Thinking.
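Because the reasoning arrives inside `<think>...</think>`, client code often needs to separate it from the final answer. The following is a minimal sketch, assuming the raw output contains at most one complete think block; `split_reasoning` is a hypothetical helper name, not part of any library. If the runtime is configured to return the reasoning in a separate field, the text will contain no think block and the helper returns an empty reasoning string.

```python
import re

# Assumption: at most one complete <think>...</think> block in the output.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    If no complete think block is found, reasoning is empty and the
    whole text is treated as the answer.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is the user-facing answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>1024 = 2**10, so yes.</think>\nYes, 1024 is 2 to the 10th power."
reasoning, answer = split_reasoning(raw)
print(answer)
```

Keeping the reasoning as a separate string makes it easy to log it for debugging while showing only the answer to end users.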
Available quantizations
| Quantization | Description | Size | KLD vs BF16 ↓ | Top-token match ↑ |
|---|---|---|---|---|
| BF16 | 16-bit, no quantization (reference) | 24.3 GB | — | — |
| Q8_0 | 8-bit, effectively lossless | 12.9 GB | 0.004 | 97.4% |
| Q6_K | 6-bit k-quant, very high quality | 10.9 GB | 0.014 | 95.1% |
| Q4_K_M (this repo) | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.052 | 89.8% |
| MXFP4_MOE | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.088 | 87.3% |
KL divergence and top-token agreement are measured against the BF16 logits on
Wikitext-2 (n_ctx=512); lower KLD / higher agreement means closer to the
unquantized model.
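To make the two metrics concrete, here is a small sketch of how mean KL divergence and top-token agreement can be computed from per-position logits. It illustrates the definitions only and is not the tooling used to produce the table; the function names, the toy logits, and the use of natural logarithms are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def compare(ref_positions, quant_positions):
    """Mean KLD and top-token agreement rate over all positions."""
    pairs = list(zip(ref_positions, quant_positions))
    mean_kld = sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)
    agreement = sum(argmax(r) == argmax(q) for r, q in pairs) / len(pairs)
    return mean_kld, agreement


# Toy logits over a 3-token vocabulary at two positions (made-up values).
ref = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
quant = [[2.0, 1.0, 0.0], [2.5, 2.0, 1.0]]
mean_kld, agreement = compare(ref, quant)
print(f"mean KLD = {mean_kld:.3f}, top-token agreement = {agreement:.0%}")
```

In the toy data the first position is identical (KLD of zero, same top token) and the second position flips the top token, so the agreement rate is one half.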
Download
hf download JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf --local-dir .
Run with llama.cpp
# Pull and serve in one step (downloads the GGUF automatically)
llama-server -hf JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M \
--ctx-size 131072 \
--temp 0.6 --top-p 0.95 --top-k 20
# Or run a one-off prompt with a local file
llama-cli -m Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf \
--ctx-size 131072 \
--temp 0.6 --top-p 0.95 --top-k 20 \
-p "Is 1024 a power of 2? Explain your reasoning."
The server exposes an OpenAI-compatible API on http://localhost:8080/v1:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="llama.cpp")

chat_response = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M",
    messages=[
        {"role": "user", "content": "Is 1024 a power of 2? Explain your reasoning."},
    ],
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},
)
print(chat_response.choices[0].message.content)
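The prompt and the generated tokens share one context window, so the settings above imply a limit on prompt length. The sketch below works through that arithmetic, assuming a single request has the whole `--ctx-size` to itself (servers running several parallel slots may divide it); `prompt_budget` is a hypothetical helper.

```python
CTX_SIZE = 131072    # --ctx-size from the llama-server command above
MAX_TOKENS = 81920   # max_tokens from the request above


def prompt_budget(ctx_size: int, max_tokens: int) -> int:
    """Tokens left for the prompt if the full completion budget is used.

    Assumption: one request owns the entire context window.
    """
    if max_tokens >= ctx_size:
        raise ValueError("max_tokens must be smaller than the context size")
    return ctx_size - max_tokens


print(prompt_budget(CTX_SIZE, MAX_TOKENS))  # 49152
```

The remaining 49,152 tokens must also cover chat-template overhead, so the usable prompt length is slightly lower.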
Run with Ollama
ollama run hf.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M
License
Released under the Apache 2.0 license.
For the full model card, evaluation results, and architecture details, refer to the original model: JetBrains/Mellum2-12B-A2.5B-Thinking.