ColQwen3.5 Optimization Trail
776+ MTEB evaluation results from the development of ColQwen3.5-v1, ColQwen3.5-v2, and ColQwen3.5-v3, three visual document retrieval models using ColBERT-style late interaction with Qwen3.5-VL (4.5B params).
Published benchmark scores are the output of a development process: train candidates, evaluate them, select the best. This dataset captures that entire process, not just the winners. Every seed, ablation, merge variant, and soup ratio that was tried is here, along with the scores that drove each decision. Each selection decision was evaluated against the same public ViDoRe benchmarks used for final reporting.
An associated analysis quantifying the effect of this selection on reported scores is forthcoming.
Selection Decisions
V1: Phase and Seed Selection
V1 went through four training phases, each with multiple seeds and ablations (bidirectional attention, pairwise loss, same-source sampling, extended visual tokens). At each stage, candidates were evaluated on ViDoRe V1 (10 tasks) and the best was carried forward.
The final decision: the Phase 4 merged model (ViDoRe V1 nDCG@5 = 0.9166) was selected over the best individual seed (0.9199) for broader generalization across tasks. The merged model scored 0.0033 lower on the headline metric but was more consistent.
V2: Soup Ratio Selection
V2 trained three seeds with hard negatives, merged them, then tested three soup ratios against V1:
| Soup Ratio | ViDoRe V1 nDCG@5 | ViDoRe V3 nDCG@5 | ViDoRe V3 nDCG@10 | Selected? |
|---|---|---|---|---|
| 50/50 | 0.9169 | 0.5894 | 0.6162 | No |
| 25/75 | 0.9184 | 0.5814 | 0.6106 | No |
| 55/45 | 0.9172 | 0.5913 | 0.6177 | Yes |
The 55/45 ratio was selected because it had the highest ViDoRe V3 nDCG@10 (0.6177). This is the same metric reported on the model card.
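The selection step above can be sketched roughly as follows. This is an illustration only: it assumes a soup is a plain linear interpolation of two checkpoints' parameters, and that the first number of each ratio is the weight on the first checkpoint; the card states neither. The `soup` and `parse_ratio` functions, the toy `ckpt_a`/`ckpt_b` dictionaries, and their parameter names are hypothetical. Only the ratios and the ViDoRe V3 nDCG@10 scores come from the table.

```python
def parse_ratio(label):
    """Turn a label such as '55/45' into the weight on the first checkpoint."""
    first, second = (int(part) for part in label.split("/"))
    return first / (first + second)


def soup(state_a, state_b, weight_a):
    """Linearly interpolate two checkpoints stored as {name: list of floats}."""
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must share parameter names")
    weight_b = 1.0 - weight_a
    return {
        name: [weight_a * a + weight_b * b
               for a, b in zip(state_a[name], state_b[name])]
        for name in state_a
    }


# Toy parameters for illustration, not real model weights.
ckpt_a = {"proj.weight": [1.0, 2.0], "proj.bias": [0.0]}
ckpt_b = {"proj.weight": [3.0, 4.0], "proj.bias": [1.0]}

# Soup ratios and their ViDoRe V3 nDCG@10 scores, copied from the table.
v3_ndcg_at_10 = {"50/50": 0.6162, "25/75": 0.6106, "55/45": 0.6177}

# Pick the ratio with the highest selection metric, then build that soup.
best_label = max(v3_ndcg_at_10, key=v3_ndcg_at_10.get)
souped = soup(ckpt_a, ckpt_b, parse_ratio(best_label))
print(best_label, souped)
```

In practice each candidate soup would be built first and then evaluated; the scores are hard-coded here only to show how the comparison picks 55/45.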
V3: HPO, Merge Methods, and Model Soup
V3 introduced automated HPO (16 Optuna trials optimizing V1+V3 jointly), three merge methods (full state dict averaging, correct DARE-TIES, and broken PEFT merges), and per-layer evolutionary soup optimization (11 trials, 14 parameters).
Key artifacts (V1 and V3 denote ViDoRe V1 and ViDoRe V3 nDCG@5):
- control_v2defaults: 500-step adapter with V2's default config (no HPO). V1=0.8928, V3=0.5498.
- merged_full: Full state dict averaging of 3 seeds. V1=0.9193, V3=0.5857.
- dare_ties_fixed: Correct DARE-TIES (operating on task vectors). V1=0.9166, V3=0.5855.
- merged_dare_ties: Broken PEFT DARE-TIES (operating on LoRA A/B). V1=0.6896.
- merged_linear: Broken PEFT linear merge. V1=0.5174.
- soup_best: Per-layer soup with V2 (trial 6). V1=0.9156, V3=0.5905. Selected for publication.
The PEFT bug, in which add_weighted_adapter averages the LoRA factors separately and so produces avg(A)@avg(B) instead of avg(A@B), is documented by the merged_dare_ties and merged_linear evals.
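A small numerical example may help show why averaging the factors is not the same as averaging their products. The sketch below is pure Python with made-up rank-1 matrices; it follows the A@B ordering used in the text, does not call PEFT, and makes no claim about PEFT's internals. The helper names `matmul` and `mean_matrices` are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def mean_matrices(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*mats)]


# Two toy rank-1 adapters: each A is 2x1 and each B is 1x2.
a1, b1 = [[1.0], [0.0]], [[1.0, 0.0]]
a2, b2 = [[0.0], [1.0]], [[0.0, 1.0]]

# Averaging the full updates: avg(A@B).
avg_of_products = mean_matrices([matmul(a1, b1), matmul(a2, b2)])

# Averaging the factors first: avg(A)@avg(B).
product_of_avgs = matmul(mean_matrices([a1, a2]), mean_matrices([b1, b2]))

print(avg_of_products)  # [[0.5, 0.0], [0.0, 0.5]]
print(product_of_avgs)  # [[0.25, 0.25], [0.25, 0.25]]
```

The two results differ because the product of averages contains cross terms (A1@B2 and A2@B1) that belong to neither adapter.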
Published Model Scores
| Model | ViDoRe V1 nDCG@5 | ViDoRe V3 nDCG@5 | ViDoRe V3 nDCG@10 | ViDoRe V2 nDCG@5 |
|---|---|---|---|---|
| ColQwen3.5-v3 | 0.9156 | 0.5905 | 0.6180 | 0.6350 |
| ColQwen3.5-v2 | 0.9172 | 0.5913 | 0.6177 | 0.6131 |
| ColQwen3.5-v1 | 0.9166 | 0.5830 | 0.6105 | 0.6035 |
Benchmarks
All evaluations used MTEB with ViDoRe:
- ViDoRe V1: 10 visual document retrieval tasks (nDCG@5)
- ViDoRe V2: 4 tasks, English split (nDCG@5)
- ViDoRe V3: 8 tasks (nDCG@5, nDCG@10)
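For readers unfamiliar with the metric, here is a minimal sketch of nDCG@k for a single query. It assumes linear gain with a log2 rank discount, and that the input list holds the relevance of every judged document in ranked order, so the ideal ordering can be derived by sorting. MTEB's own implementation may differ in details; `dcg_at_k` and `ndcg_at_k` are hypothetical names.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # No relevant documents for this query.
    return dcg_at_k(ranked_relevances, k) / ideal


# One relevant page, retrieved at rank 2 instead of rank 1.
print(ndcg_at_k([0, 1, 0, 0, 0], k=5))
```

A task-level score is then typically the mean of this value over all queries in the task.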
Dataset Structure
V1/ 427 files
├── phase1/ Training ablations
│ ├── bidir/ Bidirectional attention
│ ├── pairwise/ Pairwise loss
│ ├── samesource/ Same-source batch sampling
│ ├── vistokens1280/ Extended visual tokens
│ └── seed42, seed123, seed456
├── phase2/ Hard negative training
│ ├── seed42, seed123, seed456, seed789
│ └── merged/
├── phase3/ Targeted domain training
│ ├── phase3_seed42, seed123, seed456, seed789
│ └── phase3_merged/
├── phase4/ Final refinement
│ ├── seed42, seed123, seed456
│ └── merged/
└── published_evals/ V1 model on V2/V3 benchmarks
    ├── v2_run2 through v2_run6
    └── v3/
V2/ 162 files
├── phase1/ Seed training
│ ├── seed42, seed123, seed456
│ ├── merged/
│ └── v1_run2, v2_run6, v3/
└── phase2/ Soup selection
    ├── v2_merged_plus_v1/
    ├── v2_soup_25_75/
    └── v2_soup_55_45/ Selected for publication
V3/ 187 files
├── control_v2defaults/ No-HPO baseline (V2 default config, 500 steps)
├── merged_full/ Full state dict averaging (3 seeds)
├── dare_ties_fixed/ Correct DARE-TIES (task vector method)
├── merged_dare_ties/ Broken PEFT DARE-TIES (for bug documentation)
├── merged_linear/ Broken PEFT linear merge (for bug documentation)
├── seed123_sanity/ Individual seed sanity check
└── soup_best/ Per-layer soup with V2 (selected for publication)
File Format
Standard MTEB result JSON:
{
"scores": {
"test": [{"ndcg_at_5": 0.9155, "ndcg_at_10": 0.9210, ...}]
},
"task_name": "VidoreArxivQARetrieval"
}
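A result file of this shape can be read with the standard library alone. The sketch below is an assumption-laden example rather than an official loader: it takes the first entry of the `test` split, treats any JSON file that has both `task_name` and `scores` as a result, and averages tasks with an unweighted mean, which may not match how the headline numbers were aggregated. `read_ndcg` and `mean_ndcg` are hypothetical helpers.

```python
import json
from pathlib import Path
from statistics import mean


def read_ndcg(result, split="test", metric="ndcg_at_5"):
    """Return one metric from the first entry of a split in a result dict."""
    return result["scores"][split][0][metric]


def mean_ndcg(run_dir, metric="ndcg_at_5"):
    """Average a metric over every result JSON found under run_dir."""
    per_task = {}
    for path in sorted(Path(run_dir).rglob("*.json")):
        result = json.loads(path.read_text())
        # Skip JSON files that are not task results (e.g. metadata).
        if "task_name" in result and "scores" in result:
            per_task[result["task_name"]] = read_ndcg(result, metric=metric)
    if not per_task:
        raise ValueError(f"no result files found under {run_dir}")
    return mean(per_task.values()), per_task


# The example record shown above, with the elided fields left out.
sample = json.loads(
    '{"scores": {"test": [{"ndcg_at_5": 0.9155, "ndcg_at_10": 0.9210}]},'
    ' "task_name": "VidoreArxivQARetrieval"}'
)
print(sample["task_name"], read_ndcg(sample))
```

Calling `mean_ndcg` on one of the run directories listed under Dataset Structure would return the average and a per-task breakdown, assuming the files follow the layout shown.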
Citation
@misc{colqwen35_optimization_trail,
title={ColQwen3.5 Optimization Trail},
author={Athrael Soju},
year={2026},
url={https://huggingface.co/datasets/athrael-soju/colqwen-optimization-trail}
}
License
Apache 2.0
