ColQwen3.5 Optimization Trail

776+ MTEB evaluation results from the development of ColQwen3.5-v1, ColQwen3.5-v2, and ColQwen3.5-v3, three visual document retrieval models using ColBERT-style late interaction with Qwen3.5-VL (4.5B params).

Published benchmark scores are the output of a development process: train candidates, evaluate them, select the best. This dataset captures that entire process, not just the winners. Every seed, ablation, merge variant, and soup ratio that was tried is here, along with the scores that drove each decision. Each selection decision was evaluated against the same public ViDoRe benchmarks used for final reporting.

An associated analysis quantifying the effect of this selection on reported scores is forthcoming.

Selection Decisions

V1: Phase and Seed Selection

V1 went through four training phases, each with multiple seeds and ablations (bidirectional attention, pairwise loss, same-source sampling, extended visual tokens). At each stage, candidates were evaluated on ViDoRe V1 (10 tasks) and the best was carried forward.

The final decision: the Phase 4 merged model (nDCG@5 = 0.9166) was selected over the best individual seed (0.9199) for broader generalization across tasks. The merged model scored 0.0033 lower on the headline metric but was more consistent.

V2: Soup Ratio Selection

V2 trained three seeds with hard negatives, merged them, then evaluated three soup ratios of the merged V2 model with the V1 model:

Soup Ratio	ViDoRe V1 nDCG@5	ViDoRe V3 nDCG@5	ViDoRe V3 nDCG@10	Selected?
50/50	0.9169	0.5894	0.6162	No
25/75	0.9184	0.5814	0.6106	No
55/45	0.9172	0.5913	0.6177	Yes

The 55/45 ratio was selected because it had the highest ViDoRe V3 nDCG@10 (0.6177), the same metric reported on the model card.
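The selection rule described above can be sketched in a few lines. This is only an illustration built from the numbers in the table; the dictionary keys (`v1_ndcg5`, `v3_ndcg5`, `v3_ndcg10`) and the helper `select_ratio` are names invented here, not part of the dataset or of MTEB.

```python
# Scores copied from the soup-ratio table above.
# Key names are illustrative assumptions, not dataset fields.
SOUP_RESULTS = {
    "50/50": {"v1_ndcg5": 0.9169, "v3_ndcg5": 0.5894, "v3_ndcg10": 0.6162},
    "25/75": {"v1_ndcg5": 0.9184, "v3_ndcg5": 0.5814, "v3_ndcg10": 0.6106},
    "55/45": {"v1_ndcg5": 0.9172, "v3_ndcg5": 0.5913, "v3_ndcg10": 0.6177},
}


def select_ratio(results, metric):
    """Return the soup ratio with the highest value for `metric`."""
    return max(results, key=lambda ratio: results[ratio][metric])


if __name__ == "__main__":
    # Show which candidate each metric would have picked.
    for metric in ("v1_ndcg5", "v3_ndcg5", "v3_ndcg10"):
        print(metric, "->", select_ratio(SOUP_RESULTS, metric))
```

The winner depends on the metric used for selection: ranking by ViDoRe V1 nDCG@5 would pick 25/75, while ranking by either ViDoRe V3 metric picks 55/45.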

V3: HPO, Merge Methods, and Model Soup

V3 introduced automated HPO (16 Optuna trials optimizing ViDoRe V1 and V3 jointly), three merge approaches (full state dict averaging, a correct DARE-TIES operating on task vectors, and PEFT adapter merges that turned out to be broken), and per-layer evolutionary soup optimization (11 trials, 14 parameters).

Key artifacts (V1 and V3 denote nDCG@5 on ViDoRe V1 and ViDoRe V3):

control_v2defaults: 500-step adapter with V2's default config (no HPO). V1=0.8928, V3=0.5498.
merged_full: Full state dict averaging of 3 seeds. V1=0.9193, V3=0.5857.
dare_ties_fixed: Correct DARE-TIES (operating on task vectors). V1=0.9166, V3=0.5855.
merged_dare_ties: Broken PEFT DARE-TIES (operating on LoRA A/B). V1=0.6896.
merged_linear: Broken PEFT linear merge. V1=0.5174.
soup_best: Per-layer soup with V2 (trial 6). V1=0.9156, V3=0.5905. Selected for publication.

The PEFT bug (add_weighted_adapter producing avg(A)@avg(B) instead of avg(A@B)) is documented by the merged_dare_ties and merged_linear evals.
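To see why the two expressions differ, here is a toy example in pure Python. It does not call or reproduce PEFT code; it only checks the algebra in the sentence above on two hand-made rank-1 adapters, using the same A@B notation as the text. The matrices and the helpers `matmul` and `mean_matrices` are assumptions made for illustration.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def mean_matrices(mats):
    """Element-wise mean of a list of equally shaped matrices."""
    n = len(mats)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*mats)]


# Two toy rank-1 adapters (A is 2x1, B is 1x2), invented for this example.
A1, B1 = [[1.0], [0.0]], [[1.0, 0.0]]
A2, B2 = [[0.0], [1.0]], [[0.0, 1.0]]

# avg(A@B): average the full low-rank updates.
avg_of_products = mean_matrices([matmul(A1, B1), matmul(A2, B2)])

# avg(A)@avg(B): average the factors first, then multiply.
product_of_avgs = matmul(mean_matrices([A1, A2]), mean_matrices([B1, B2]))

if __name__ == "__main__":
    print("avg(A@B)       =", avg_of_products)
    print("avg(A)@avg(B)  =", product_of_avgs)
```

In this toy case avg(A@B) is half the identity matrix, while avg(A)@avg(B) has 0.25 in every entry. Averaging the factors introduces cross terms between different adapters, so the two results only agree in special cases.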

Published Model Scores

Model	ViDoRe V1 nDCG@5	ViDoRe V3 nDCG@5	ViDoRe V3 nDCG@10	ViDoRe V2 nDCG@5
ColQwen3.5-v3	0.9156	0.5905	0.6180	0.6350
ColQwen3.5-v2	0.9172	0.5913	0.6177	0.6131
ColQwen3.5-v1	0.9166	0.5830	0.6105	0.6035

Benchmarks

All evaluations used MTEB with ViDoRe:

ViDoRe V1: 10 visual document retrieval tasks (nDCG@5)
ViDoRe V2: 4 tasks, English split (nDCG@5)
ViDoRe V3: 8 tasks (nDCG@5, nDCG@10)

Dataset Structure

V1/ 427 files
├── phase1/ Training ablations
│ ├── bidir/ Bidirectional attention
│ ├── pairwise/ Pairwise loss
│ ├── samesource/ Same-source batch sampling
│ ├── vistokens1280/ Extended visual tokens
│ └── seed42, seed123, seed456
├── phase2/ Hard negative training
│ ├── seed42, seed123, seed456, seed789
│ └── merged/
├── phase3/ Targeted domain training
│ ├── phase3_seed42, seed123, seed456, seed789
│ └── phase3_merged/
├── phase4/ Final refinement
│ ├── seed42, seed123, seed456
│ └── merged/
└── published_evals/ V1 model on V2/V3 benchmarks
 ├── v2_run2 through v2_run6
 └── v3/

V2/ 162 files
├── phase1/ Seed training
│ ├── seed42, seed123, seed456
│ ├── merged/
│ └── v1_run2, v2_run6, v3/
└── phase2/ Soup selection
 ├── v2_merged_plus_v1/
 ├── v2_soup_25_75/
 └── v2_soup_55_45/ Selected for publication

V3/ 187 files
├── control_v2defaults/ No-HPO baseline (V2 default config, 500 steps)
├── merged_full/ Full state dict averaging (3 seeds)
├── dare_ties_fixed/ Correct DARE-TIES (task vector method)
├── merged_dare_ties/ Broken PEFT DARE-TIES (for bug documentation)
├── merged_linear/ Broken PEFT linear merge (for bug documentation)
├── seed123_sanity/ Individual seed sanity check
└── soup_best/ Per-layer soup with V2 (selected for publication)
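A small sketch of how the layout above could be indexed once the files are downloaded. It assumes only that result files end in `.json` and sit somewhere below a version directory (V1, V2, V3). The example paths and the helper `group_by_run` are hypothetical, and the real nesting inside each run directory may be deeper.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_run(paths):
    """Group JSON result paths by (version, run directory).

    version is the top-level directory (e.g. "V2"); run is everything
    between the version and the file name (e.g. "phase2/v2_soup_55_45").
    """
    groups = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        # Skip files that are not results or are not inside a run directory.
        if len(parts) < 3 or not path.endswith(".json"):
            continue
        version, run = parts[0], "/".join(parts[1:-1])
        groups[(version, run)].append(parts[-1])
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the function.
    example = [
        "V2/phase2/v2_soup_55_45/VidoreArxivQARetrieval.json",
        "V3/soup_best/VidoreArxivQARetrieval.json",
        "README.md",
    ]
    for key, files in group_by_run(example).items():
        print(key, files)
```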

File Format

Standard MTEB result JSON:

{
  "scores": {
    "test": [{"ndcg_at_5": 0.9155, "ndcg_at_10": 0.9210, ...}]
  },
  "task_name": "VidoreArxivQARetrieval"
}
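A minimal sketch of reading one such result with the standard library. It relies only on the fields shown in the example above. Taking the first entry of the split list is an assumption (a file may contain more than one entry), and `read_scores` and `SAMPLE` are names made up for this illustration.

```python
import json

# A complete, parseable version of the example above (elided fields omitted).
SAMPLE = (
    '{"scores": {"test": [{"ndcg_at_5": 0.9155, "ndcg_at_10": 0.9210}]},'
    ' "task_name": "VidoreArxivQARetrieval"}'
)


def read_scores(raw, split="test", metrics=("ndcg_at_5", "ndcg_at_10")):
    """Return (task_name, {metric: value}) from one result JSON string.

    Assumption: only the first entry of the split list is read.
    """
    result = json.loads(raw)
    entry = result["scores"][split][0]
    return result["task_name"], {m: entry[m] for m in metrics if m in entry}


if __name__ == "__main__":
    task, scores = read_scores(SAMPLE)
    print(task, scores)
```

For a file on disk, the same function can be applied to the file's text, for example `read_scores(pathlib.Path(p).read_text())`.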

Citation

@misc{colqwen35_optimization_trail,
 title={ColQwen3.5 Optimization Trail},
 author={Athrael Soju},
 year={2026},
 url={https://huggingface.co/datasets/athrael-soju/colqwen-optimization-trail}
}

License

Apache 2.0
