Mellum2 Thinking — GGUF (Q4_K_M)

This repository contains a GGUF Q4_K_M quantization of JetBrains/Mellum2-12B-A2.5B-Thinking, ready to run with llama.cpp, Ollama, LM Studio, and other GGUF-compatible runtimes.

This quantization (Q4_K_M): 4-bit k-quant (medium). Strong quality/size trade-off (KLD ≈ 0.052 and 89.8% top-token agreement relative to BF16), which makes it a good default.

File	Size
`Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf`	8.1 GB

Mellum 2 Thinking is a Mixture-of-Experts reasoning model (64 experts, 8 activated per token, 131,072-token context) that emits its chain of thought inside <think>...</think> blocks before the final answer. For the full model description, evaluation results, and architecture details, see the original model card: JetBrains/Mellum2-12B-A2.5B-Thinking.
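Because the reasoning arrives in the same text as the final answer, client code often needs to separate the two. The following is a minimal sketch, assuming the raw output contains at most one well-formed <think>...</think> block. The helper name `split_thinking` is hypothetical and is not part of any Mellum, llama.cpp, or Ollama API. Some runtimes may strip or relocate the reasoning before returning the message, in which case the helper simply returns an empty reasoning string.

```python
import re
from typing import Tuple

# Matches one <think>...</think> block; DOTALL lets the reasoning span several lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    Hypothetical helper: assumes at most one well-formed <think> block.
    If no block is found, reasoning is empty and the whole text is the answer.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the final answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


# Made-up sample output, used only to illustrate the split.
raw = (
    "<think>\n1024 = 2**10, so it is a power of 2.\n</think>\n\n"
    "Yes, 1024 is 2 to the 10th power."
)
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the reasoning as a separate value makes it easy to log or discard it while showing only the answer to end users.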

Available quantizations

Quantization	Description	Size	KLD vs BF16 ↓	Top-token match ↑
`BF16`	16-bit, no quantization (reference)	24.3 GB	—	—
`Q8_0`	8-bit, effectively lossless	12.9 GB	0.004	97.4%
`Q6_K`	6-bit k-quant, very high quality	10.9 GB	0.014	95.1%
`Q4_K_M` (this repo)	4-bit k-quant, balanced (recommended)	8.1 GB	0.052	89.8%
`MXFP4_MOE`	MXFP4 4-bit on MoE experts, smallest	7.0 GB	0.088	87.3%

KL divergence and top-token agreement are measured against the BF16 logits on Wikitext-2 (n_ctx=512); lower KLD / higher agreement means closer to the unquantized model.
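To make the two metrics concrete, here is a minimal sketch of how they could be computed from reference (BF16) and quantized logits at a few token positions. It is illustrative only: the function names and toy logits are made up, and the direction of the divergence (reference to quantized, averaged over positions) is an assumption rather than a description of the exact evaluation behind the table.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token position."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def kl_divergence(ref_logits: Sequence[float], quant_logits: Sequence[float]) -> float:
    """KL(P_ref || P_quant) in nats for a single position (assumed direction)."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def top_token_match(ref_logits: Sequence[float], quant_logits: Sequence[float]) -> bool:
    """True when both sets of logits rank the same token first."""
    ref_top = max(range(len(ref_logits)), key=lambda i: ref_logits[i])
    quant_top = max(range(len(quant_logits)), key=lambda i: quant_logits[i])
    return ref_top == quant_top


def compare(
    ref_positions: Sequence[Sequence[float]],
    quant_positions: Sequence[Sequence[float]],
) -> Tuple[float, float]:
    """Return (mean KLD, fraction of positions with the same top token)."""
    pairs = list(zip(ref_positions, quant_positions))
    mean_kld = sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)
    agreement = sum(top_token_match(r, q) for r, q in pairs) / len(pairs)
    return mean_kld, agreement


# Toy logits over a 3-token vocabulary at two positions (made-up values).
ref = [[2.0, 1.0, 0.1], [0.5, 0.4, 3.0]]
quant = [[1.9, 1.1, 0.1], [0.5, 3.1, 3.0]]
mean_kld, agreement = compare(ref, quant)
print(f"mean KLD: {mean_kld:.4f}, top-token agreement: {agreement:.0%}")
```

In this toy case the second position flips its top token, so agreement is one out of two positions, while the KLD also captures the smaller shift at the first position that top-token agreement alone would miss.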

Download

hf download JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf --local-dir .

Run with llama.cpp

# Pull and serve in one step (downloads the GGUF automatically)
llama-server -hf JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M \
    --ctx-size 131072 \
    --temp 0.6 --top-p 0.95 --top-k 20

# Or run a one-off prompt with a local file
llama-cli -m Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf \
    --ctx-size 131072 \
    --temp 0.6 --top-p 0.95 --top-k 20 \
    -p "Is 1024 a power of 2? Explain your reasoning."

The server exposes an OpenAI-compatible API on http://localhost:8080/v1:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="llama.cpp")

chat_response = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M",
    messages=[
        {"role": "user", "content": "Is 1024 a power of 2? Explain your reasoning."},
    ],
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},
)
print(chat_response.choices[0].message.content)

Run with Ollama

ollama run hf.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M

License

Released under the Apache 2.0 license.

For the full model card, evaluation results, and architecture details, refer to the original model: JetBrains/Mellum2-12B-A2.5B-Thinking.

URL: https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M
