
URL: https://huggingface.co/ManniX-ITA/Qwen3.6-27B-Omnimerge-v4



Qwen3.6-27B-Omnimerge-v4 (MLP-passthrough)

Same-base DARE-TIES (Omnimerge_v2 method) merge of Qwen/Qwen3.6-27B + 3 Qwen3.6 fine-tunes, with MLP-passthrough surgery applied to defend against a fragility we discovered in Qwen3.6's reasoning-tag emission policy. Successor to ManniX-ITA/Qwen3.5-27B-Omnimerge-v2 on the newer Qwen3.6 base.

GPQA Diamond: the full canonical 198-question greedy result is 78.28% pass@1 (flexible-extract), measured 2026-05-22 on pod 37268930 with the patched eval chain (lm-eval 0.4.11 + max_length=32768 + the api_models.py:545 UnboundLocalError patch + aiohttp lifecycle workaround). Settings: greedy sampler (do_sample=False, T=0.0), --reasoning-budget 8192, max_gen_toks=8192. HumanEval = 83.54% (137/164), MBPP = 73.00% (365/500). Earlier card revisions reported ≈ 84.75% from a partial 177/198 cache sampled at T=0.6 with budget=16384; that number is superseded by this canonical greedy measurement on the full bench. The roughly 6.5 pp difference is driven by methodology (sampler, budget, completeness), not by a model change.

MTP companion (2× decode speedup): the same weights with MTP head retained for llama.cpp --spec-type draft-mtp self-speculative decoding are published as ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF + ollama mannix/omnimerge-v4-mtp. Quality is statistically indistinguishable (HE 137/164 ↔ 137/164, GPQA 155/198 ↔ 154/198 — single-question delta inside the ±2.94% stderr); aggregate decode is 2.0-2.3 × faster on a 24 GB GPU. Use that release for interactive / single-request workloads.
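
The ±2.94% figure is consistent with the sample standard error of a mean over 198 binary outcomes. The check below is a minimal sketch that assumes the stderr is computed as sqrt(p(1−p)/(n−1)); that formula and the helper name `binary_stderr` are assumptions for illustration, not taken from the eval chain.

```python
import math

def binary_stderr(correct: int, total: int) -> float:
    """Sample standard error of a pass@1 mean over `total` 0/1 outcomes."""
    p = correct / total
    # Sample variance of a 0/1 variable is p(1-p) * n/(n-1); dividing by n
    # for the variance of the mean leaves p(1-p)/(n-1).
    return math.sqrt(p * (1.0 - p) / (total - 1))

stderr = binary_stderr(155, 198)   # GPQA Diamond, 155/198 correct
one_question = 1 / 198             # size of the 155 vs 154 gap
print(f"stderr = {stderr:.2%}, one question = {one_question:.2%}")
```

Under that assumption the stderr comes out at about 2.94%, while a one-question gap is about 0.5 pp, well inside it.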

Quantizations

Three release lines:

GGUF (llama.cpp / ollama / text-generation-webui)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF — 31 quants + F16, all imatrix-quantized with bartowski's calibration_datav5. imatrix.dat archived alongside the quants for reproducibility/audit.

Also published as ollama tags: mannix/omnimerge-v4.

The vision tower's mmproj projector lives in bartowski/Qwen_Qwen3.6-27B-GGUF and works unchanged with the v4 GGUFs (vision tower is preserved verbatim from the base).

MLX 4-bit — text-only (Apple Silicon)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit — text-only 4-bit MLX (group_size 64, 4.501 bits/weight), ~15 GB, loads via mlx_lm.load. Use this if you don't need vision and want a slightly smaller download.

from mlx_lm import load, generate
model, tokenizer = load("ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit")
print(generate(model, tokenizer, prompt="...", max_tokens=512, verbose=True))

MLX 4-bit — Vision-Language (Apple Silicon, multimodal)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit — full multimodal 4-bit MLX (group_size 64, 4.695 bits/weight — vision tower kept at higher precision), ~16 GB, loads via mlx_vlm.load. Use this for image + video input.

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

repo = "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit"
model, processor = load(repo)
config = load_config(repo)

prompt = apply_chat_template(processor, config,
                             "Describe the image in detail.", num_images=1)
print(generate(model, processor, prompt,
               max_tokens=512, verbose=True, image=["path/to/image.png"]))

Sources

| Source | Weight | Role |
|---|---|---|
| Qwen/Qwen3.6-27B | base | base + chat template |
| rico03/Qwen3.6-27B-rico03 | 0.40 | general capability |
| ValiantLabs/Qwen3.6-27B-Esper3.1 | 0.35 | code + reasoning |
| kai-os/Qwen3.6-Opus-Reasoning (LoRA→base anchor) | 0.25 | reasoning anchor |

Method: omnimerge_v2 (DARE-TIES base + OBIM-lite + DAREx q + EMR election). Density 0.53, DAREx q 0.75, seed 42.
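
For orientation, the DARE-TIES core of the method can be sketched on a single tensor. This is a simplified illustration of plain DARE-TIES only: it omits the OBIM-lite, DAREx q, and EMR election components (not described in this card), does not renormalize weights after sign election, and is not the code in scripts/dare_ties_merge.py. The function name and the toy tensors are assumptions.

```python
import numpy as np

def dare_ties(base, sources, weights, density, seed):
    """Plain DARE-TIES on one tensor: random drop + rescale, then sign election."""
    rng = np.random.default_rng(seed)
    weighted = []
    for src, w in zip(sources, weights):
        delta = src - base                        # task vector for this source
        keep = rng.random(delta.shape) < density  # DARE: keep each entry with prob. `density`
        weighted.append(w * np.where(keep, delta / density, 0.0))  # rescale survivors
    stacked = np.stack(weighted)
    elected = np.sign(stacked.sum(axis=0))        # TIES: sign with the larger weighted mass
    agree = (np.sign(stacked) == elected) & (elected != 0)
    merged_delta = np.where(agree, stacked, 0.0).sum(axis=0)  # drop sign-conflicting entries
    return base + merged_delta

# Toy tensors standing in for one weight matrix per model (illustrative values).
base = np.zeros((4, 4))
sources = [np.full((4, 4), 0.1), np.full((4, 4), -0.2), np.full((4, 4), 0.3)]
merged = dare_ties(base, sources, weights=[0.40, 0.35, 0.25], density=0.53, seed=42)
print(merged.shape)
```

With density 1.0 and a single source at weight 1.0 the sketch returns that source unchanged, which is a convenient sanity check for the drop-and-rescale step.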

Benchmark Results (Q6_K quantization)

All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama.cpp server with --reasoning-format deepseek --reasoning-budget 8192. Sampler greedy (do_sample=False, T=0.0, top_p=1.0, top_k=0) across all benches — this is the canonical recipe for cross-cohort comparison. Earlier revisions used T=0.6 for GPQA to match v2's published recipe; the canonical 2026-05-22 re-run on pod 37268930 uses greedy throughout and supersedes those numbers.

v4-MLP vs Qwen3.6 base + Omnimerge-v2 (head-to-head, same eval methodology)

All three columns scored under identical conditions: same llama.cpp server config (--reasoning-format deepseek --reasoning-budget 8192 --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0 -c 65536), same lm_eval invocation (local-completions raw /v1/completions, no chat template), same gen kwargs. v4-MLP columns reflect the canonical 2026-05-22 full-bench greedy re-run on pod 37268930.

| Benchmark | Qwen3.6 base Q6_K (bartowski) | Omnimerge-v2 (Qwen3.5 base) | Omnimerge-v4-MLP (Qwen3.6 base) | Δ vs base | Δ vs v2 |
|---|---|---|---|---|---|
| HumanEval pass@1 (164q) | 84.76% (139/164) | 79.27% | 83.54% (137/164) | −1.22 pp | +4.27 pp |
| MBPP pass@1 (500q), raw lm_eval | 56.20% | n/a | 68.40% | +12.20 pp | n/a |
| MBPP pass@1 (500q), corrected* | 57.60% | 74.60% | 73.00% (365/500) | +15.40 pp | −1.60 pp |
| GPQA Diamond pass@1 (flex), full greedy§ | not measured | 69.19% (full 198q, T=0.6) | 78.28% (155/198) | n/a | +9.09 pp |

Key observations:

  • HumanEval is within two questions of base (137/164 vs 139/164, −1.22 pp). With MLP-passthrough preserving base MLPs and HumanEval being mostly elementary Python function completion, the merged attn + linear_attn deltas barely move the needle. This is also a useful sanity check: it suggests the MLP-passthrough surgery did its job, since the model's "elementary coding" behavior stays close to that of the base it inherited MLPs from.
  • MBPP is where the merge value shows: +15.40 pp over Qwen3.6 base on the corrected score, and roughly tied with v2 (Qwen3.5-base merge, −1.60 pp). MBPP exercises a wider range of algorithms and control flow than HumanEval, which is where the merged reasoning + attention deltas help.
  • GPQA is the strongest reasoning lift — +9.09 pp over v2 on the full-bench greedy comparison. Note this is smaller than the previous partial-cache estimate (≈ +15.5 pp) because v2 was sampled at T=0.6 with budget 16384 (an easier configuration for verbose reasoning) while v4 is now measured under greedy at budget 8192. The marquee win is real, but the magnitude is the +9.09 pp greedy figure, not the +15.5 pp partial-sampled figure.

§ GPQA Diamond full greedy re-measurement (2026-05-22, pod 37268930). Sampler do_sample=False, T=0.0, --reasoning-budget 8192, max_gen_toks=8192. Wall time 4 h 55 min on 3090 Q6_K. Companion strict-match (rigid Answer: X template) is 7.58 % — the model emits CoT verbosely rather than the strict template, so the flexible-extract 78.28 % is the real quality signal. The earlier partial 84.75 % (177 of 198, sampled T=0.6, budget=16384) was a methodology artifact, not a model regression — re-measuring v2 under greedy at budget=8192 would also drop several points. The new 78.28 % is the canonical figure going forward.

* MBPP score correction (important): lm_eval's mbpp scorer evaluates exec(prompt + completion + tests). When a model emits <think>...</think>\n\ndef foo(): ..., the literal < character causes a Python SyntaxError even though the function code below is valid and would pass the tests. We re-scored by stripping <think>...</think> blocks (and unclosed <think>...EOF truncations) before exec.

  • v4-MLP: 68.40% → 73.40% (+5.0 pp, recovered 25/500 valid-code-but-SyntaxError generations).
  • Qwen3.6 base: 56.20% → 57.60% (+1.4 pp, recovered 7/500). Base closes its think tags more reliably than v4-MLP (0% unclosed vs 4.8%) and emits them less often, which is why the correction is smaller.
  • v2 (Qwen3.5 base) had a much lower native think-rate so the correction is negligible at that scale; the published 74.60% was the lm_eval raw score.

Re-scoring script: scripts/rescore_mbpp_strip_think.py. The corrected scores are the apples-to-apples comparison; raw lm_eval scores are kept in the table for transparency.
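
The stripping step described above can be sketched as follows. This is a minimal illustration of the idea, not the contents of scripts/rescore_mbpp_strip_think.py; the function name `strip_think` and the sample completion are assumptions, and markdown-fence handling is left out.

```python
import re

# Closed blocks: <think> ... </think> (non-greedy, spans newlines).
_CLOSED = re.compile(r"<think>.*?</think>", re.DOTALL)
# Truncated blocks: an opening <think> that never closes before end of text.
_UNCLOSED = re.compile(r"<think>.*\Z", re.DOTALL)

def strip_think(completion: str) -> str:
    """Remove closed think blocks first, then any unclosed tail."""
    text = _CLOSED.sub("", completion)
    return _UNCLOSED.sub("", text)

sample = "<think>plan the loop</think>\n\ndef foo():\n    return 1\n"

try:
    compile(sample, "<raw>", "exec")   # the literal '<' is not valid Python
    raw_parses = True
except SyntaxError:
    raw_parses = False

cleaned = strip_think(sample)
compile(cleaned, "<cleaned>", "exec")  # parses once the think block is gone
print(raw_parses, repr(cleaned))
```

Removing closed blocks before the unclosed pattern matters: applied the other way round, the unclosed pattern would swallow everything after the first `<think>`, including valid code that follows a properly closed block.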

GPQA Diamond eval history (resolved 2026-05-22). The original 2026-05-13 run hit an aiohttp lifecycle bug in lm_eval.models.api_models.amodel_call that crashed on the at-budget reasoning tail (16384-token responses outlasting the ClientSession); we produced a partial 84.75 % (150/177 matched cached responses sampled at T=0.6, budget=16384) and kept restarting until 192/198 cached. The 2026-05-22 canonical re-run on pod 37268930 ran the full 198 under greedy decoding with budget=8192 and max_length=32768, having patched lm_eval's api_models.py:545 UnboundLocalError upstream (it crashed on transient TimeoutError before outputs was assigned) — see the quantize_gguf.py chain script + the omnimergekit pod_v4_q6k_eval_chain.sh for the bit-exact recipe. The canonical headline going forward is 78.28 % flexible-extract / 7.58 % strict-match on 198/198. The earlier 84.75 % partial-sampled figure is superseded but kept here for transparency about the prior methodology drift.

Why "MLP-passthrough"

When we merged Qwen3.6 the same way we'd successfully merged Qwen3.5 (Omnimerge-v2), the resulting model emitted unclosed <think> tags 80% of the time on coding prompts — pass@1 collapsed to ~20%. Forensic per-tensor delta inspection (see scripts/inspect_v4_delta.py) localized the failure mode to the mlp.gate_proj / mlp.up_proj / mlp.down_proj tensors in mid-to-late MLP layers (peak deltas in layers 27-52, max rel-L2 ≈ 2.1%). lm_head and embed_tokens were byte-identical to base — the policy attractor lived in MLP, not in token-emission logits.
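
The rel-L2 figure quoted above is read here as ‖merged − base‖₂ / ‖base‖₂ per tensor. That definition is an assumption, and the sketch below is an illustration of the idea, not scripts/inspect_v4_delta.py: plain dicts of flat float lists stand in for safetensors shards, and the tensor names and values are made up.

```python
import math

def rel_l2(base, merged):
    """Relative L2 distance ||merged - base|| / ||base|| for flat float sequences."""
    diff = math.sqrt(math.fsum((m - b) ** 2 for b, m in zip(base, merged)))
    norm = math.sqrt(math.fsum(b * b for b in base))
    if norm == 0.0:
        return 0.0 if diff == 0.0 else math.inf
    return diff / norm

def rank_deltas(base_sd, merged_sd):
    """Return (tensor name, rel-L2) pairs, largest delta first."""
    pairs = [(name, rel_l2(base_sd[name], merged_sd[name])) for name in base_sd]
    return sorted(pairs, key=lambda item: item[1], reverse=True)

# Hypothetical toy state dicts: one perturbed MLP tensor, one untouched head.
base_sd = {
    "model.layers.30.mlp.up_proj.weight": [1.0, 2.0, 2.0],
    "lm_head.weight": [3.0, 4.0],
}
merged_sd = {
    "model.layers.30.mlp.up_proj.weight": [1.0, 2.0, 2.06],
    "lm_head.weight": [3.0, 4.0],
}
ranked = rank_deltas(base_sd, merged_sd)
for name, value in ranked:
    print(f"{name}: {value:.2%}")
```

Sorting by relative rather than absolute norm keeps large and small tensors comparable, which is what makes a localized 1-2% shift stand out.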

We rebuilt v4 with mlp.{gate,up,down}_proj copied verbatim from clean Qwen3.6 base (scripts/v4_mlp_passthrough.py) and everything else (attn, linear_attn, norms, embed/head) kept from the merge. The leak went to 0% on a 10-prompt isolation test, MBPP pass@1 jumped to 50% on the same isolation set, and full-eval scores (above) confirmed the surgery rescued the merge.
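
The passthrough rule itself is simple to state in code. The following is a minimal sketch under assumptions: dicts keyed by tensor name stand in for the real checkpoints, the function name is illustrative, and it is not the implementation in scripts/v4_mlp_passthrough.py. The three name patterns are the ones listed for --skip-patterns in the Scripts section.

```python
MLP_PATTERNS = ("mlp.gate_proj", "mlp.up_proj", "mlp.down_proj")

def mlp_passthrough(base_sd, merged_sd, patterns=MLP_PATTERNS):
    """Take MLP projections from base and every other tensor from the merge."""
    out = {}
    for name, tensor in merged_sd.items():
        if any(p in name for p in patterns):
            out[name] = base_sd[name]   # MLP projection: base weights verbatim
        else:
            out[name] = tensor          # attn, linear_attn, norms, embed/head: merged
    return out

# Hypothetical tensor names and placeholder values.
base_sd = {
    "model.layers.0.mlp.gate_proj.weight": "base-gate",
    "model.layers.0.self_attn.q_proj.weight": "base-q",
}
merged_sd = {
    "model.layers.0.mlp.gate_proj.weight": "merged-gate",
    "model.layers.0.self_attn.q_proj.weight": "merged-q",
}
rebuilt = mlp_passthrough(base_sd, merged_sd)
print(rebuilt)
```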

Key finding: Qwen3.6's think-policy is fragile to small MLP perturbations

| Test | Clean Qwen3.6 base | v4 (full merge, broken) | v4-MLP (this model) |
|---|---|---|---|
| <think> open rate (mbpp-10 isolation) | 40% | 80% | 0% |
| Unclosed </think> | 0/4 | 88% of opens | 0/10 |
| MBPP pass@1 (mbpp-10 isolation) | 40% | 20% | 50% |
| Empty response (chat-completions) | low | 80% | 0/10 |

Identical hyperparameters on Qwen3.5 base (Omnimerge-v2) produced 0.2% leak — so this is a Qwen3.6-specific fragility, not a general merge problem. Plausible cause: Qwen3.6 was post-trained later with reasoning-specific data that tightened the policy decision boundary; small (1-2% rel L2) MLP perturbations push it across.
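
Open and unclosed rates like those in the table above can be computed from raw completions with a few lines. This is a sketch under assumptions: the function name, the sample strings, and the rule that a completion counts as unclosed when no `</think>` follows its last `<think>` are all illustrative choices, not the procedure used for the isolation test.

```python
def think_stats(completions):
    """Count completions that open a think block and those that never close it."""
    opened = [c for c in completions if "<think>" in c]
    # Unclosed: nothing after the last <think> contains a closing tag.
    unclosed = [c for c in opened if "</think>" not in c.rsplit("<think>", 1)[1]]
    return {
        "open_rate": len(opened) / len(completions) if completions else 0.0,
        "opened": len(opened),
        "unclosed": len(unclosed),
    }

samples = [
    "<think>hmm</think>\ndef f(): pass",
    "def g(): pass",
    "<think>never ends",
    "<think>a</think><think>b",
]
stats = think_stats(samples)
print(stats)
```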

The cost of MLP-passthrough is that we lose the merged MLP uplift on coding tasks. Even so, the full MBPP/HumanEval results show the attn + linear_attn deltas alone are enough to lift HumanEval by about 4 pp (+4.27 pp) over Qwen3.5-Omnimerge-v2 while staying roughly tied on MBPP (−1.60 pp).

Compatibility

Architecture: qwen3_5 (unified Qwen3.5 / Qwen3.6 family). Vision tower preserved (mmproj available via the Q6_K GGUF release — multimodal works exactly like clean Qwen3.6).

Inference works under:

  • transformers (BF16) — both use_cache=True and False paths
  • llama.cpp (GGUF) — recommended args: --reasoning-format deepseek --reasoning-budget 8192
  • vLLM (untested at time of publish, expected to work)

Scripts

All merge tooling is in the scripts/ directory of this repo:

| Script | Purpose |
|---|---|
| dare_ties_merge.py | Main merger. --method omnimerge_v2 is the published method. Auto-detects Qwen3.6 base via config.output_gate_type and auto-applies --skip-patterns 'mlp.gate_proj,mlp.up_proj,mlp.down_proj' (override with --no-auto-mlp-skip). |
| v4_mlp_passthrough.py | Post-process tool: rebuild merged dir with MLP layers copied from base. Refuses to run on Qwen3.5 base (where MLP merging is safe; see v2). Use as final pre-quant step for any external merger output (mergekit, eX-LRP) targeting Qwen3.6. |
| inspect_v4_delta.py | Per-tensor delta-magnitude forensics vs base. Streams safetensors shards, no full model load. Used to localize the policy-leak weight region. |
| pod_omnimerge_v4_build.sh | Full reproducible build script (download sources, run merge, convert + quantize Q6_K). |
| pod_omnimerge_v4mlp_eval_raw.sh | Eval orchestrator: mbpp + humaneval via raw /v1/completions. Required for reasoning-tag-emitting models, because apply_chat_template + deepseek extraction strips think blocks and returns empty. |
| rescore_mbpp_strip_think.py | Re-scoring tool that strips <think> blocks and markdown fences before exec(code+tests). Recovered 25 of 158 false failures on this model's mbpp run. |
| score_gpqa_partial.py | Partial-cache GPQA scorer. Replicates lm_eval's multi_choice_regex flexible-extract filter exactly (group_select=−1, ignore_case, ignore_punctuation), looks up cached responses by lm_eval's hash_args("generate_until", [prompt, gen_kwargs]) SHA-256 key, scores against ground truth. Used for the partial 84.75% above when the lm_eval run could not complete the long-tail. |
| pod_v4mlp_gpqa.sh | Full GPQA Diamond eval runner against the v4-MLP server. T=0.6, top_p=0.95, max_gen_toks=16384 (matches v2's published methodology). |
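
The flexible-extract idea behind the partial-cache scorer can be approximated in a few lines. This is a simplified sketch, not score_gpqa_partial.py and not lm_eval's multi_choice_regex filter: it only takes the last parenthesized letter in the response (mirroring the "last match" selection), has no fallback matching, and uses hypothetical question ids in place of lm_eval's hashed cache keys.

```python
import re

# Parenthesized answer letter, e.g. "(C)"; case-insensitive.
_CHOICE = re.compile(r"\(([A-D])\)", re.IGNORECASE)

def extract_choice(response):
    """Return the last parenthesized choice letter in the response, or None."""
    matches = _CHOICE.findall(response)
    return matches[-1].upper() if matches else None

def partial_accuracy(cached, gold):
    """Score only the questions that have a cached response.

    cached: question id -> response text; gold: question id -> correct letter.
    Returns (number correct, number scored).
    """
    scored = [qid for qid in gold if qid in cached]
    correct = sum(extract_choice(cached[qid]) == gold[qid] for qid in scored)
    return correct, len(scored)

# Hypothetical cache with one question still missing.
cached = {"q1": "Reasoning... so the answer is (B).", "q2": "(A) looks right, but it is (D)."}
gold = {"q1": "B", "q2": "A", "q3": "C"}
result = partial_accuracy(cached, gold)
print(result)
```

Reporting the denominator alongside the count keeps a partial score from being mistaken for a full-bench one.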

Reproducing the merge

python scripts/dare_ties_merge.py \
 --method omnimerge_v2 \
 --base /path/to/Qwen3.6-27B \
 --source /path/to/Qwen3.6-rico03 \
 --source /path/to/Qwen3.6-Esper3.1 \
 --source /path/to/Qwen3.6-Opus-Reasoning-anchor \
 --weights 0.40,0.35,0.25 \
 --density 0.53 \
 --darex-q 0.75 \
 --output ./Qwen3.6-27B-Omnimerge-v4 \
 --seed 42
# (auto-applies MLP-skip on Qwen3.6 base; no extra flag needed)

Caveats

  • Qwen3.6 has a higher native think-rate than Qwen3.5 on coding prompts. Use raw /v1/completions for code benchmarks; chat-completions + --apply_chat_template + deepseek extraction will strip think blocks and return empty for prompts where the model thinks before answering. See pod_omnimerge_v4mlp_eval_raw.sh for the working config.
  • MBPP scoring without think-stripping under-reports pass@1 by ~5 pp on this model (see "MBPP score correction" note above).
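
A raw completions request of the kind recommended above might look like this. It is a minimal sketch using only the standard library; the host and port, the helper names, and the chosen payload fields (prompt, max_tokens, temperature, in the OpenAI-compatible completions format) are assumptions, and it is not the configuration in pod_omnimerge_v4mlp_eval_raw.sh.

```python
import json
import urllib.request

def build_completion_request(prompt, base_url="http://localhost:8080", max_tokens=8192):
    """Build a POST to the raw /v1/completions endpoint (no chat template applied)."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.0}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the generated text, think block included."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Usage, with a server running at the assumed address:
#   text = complete("def add(a, b):\n", max_tokens=512)
request = build_completion_request("def add(a, b):\n")
print(request.full_url)
```

Because the raw endpoint returns the think block inline, any code benchmark fed from it still needs think-stripping before execution.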

Acknowledgements
