voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a 9B-scale model based on Qwen/Qwen3.5-9B.
This model is designed to improve reasoning-oriented multiple-choice performance while preserving strong general capability.
In our zero-shot evaluation, the model achieves the best overall aggregate performance among the following three models:
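- Qwen/Qwen3.5-9B
- DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
- voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

The largest gains over the base model appear on ARC-Challenge and ARC-Easy. On BoolQ, the model scores well above DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT (0.8792 vs. 0.7878) but slightly below the base model (0.8927).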
These results suggest that the model improves structured reasoning and calibrated answer selection.
Model Summary
- Base model: Qwen/Qwen3.5-9B
- Model type: Causal language model
- Primary focus: Reasoning, multiple-choice QA, and general zero-shot evaluation
- Strengths: ARC, BoolQ, aggregate benchmark performance
- Trade-offs: Slightly weaker than some baselines on HellaSwag and OpenBookQA
Evaluation Setup
We compare three models under the same zero-shot setting:
- 0-shot
- No few-shot examples
- Same benchmark suite
- Reported with standard error
Compared Models
- Qwen/Qwen3.5-9B
- DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
- voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
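This card does not name the evaluation tool. The task and metric names (`arc_challenge`, `acc_norm`, and so on) match those used by EleutherAI's lm-evaluation-harness, so the sketch below assumes that harness and its `simple_evaluate` entry point. The helper name `run_zero_shot` and the default backend settings are illustrative assumptions, not the authors' confirmed procedure.

```python
# Tasks taken from the benchmark tables in this card.
TASKS = [
    "arc_challenge",
    "arc_easy",
    "boolq",
    "hellaswag",
    "openbookqa",
    "piqa",
    "winogrande",
]


def run_zero_shot(model_id):
    """Evaluate one Hugging Face model on TASKS with no few-shot examples.

    Assumes the lm-evaluation-harness package (`lm_eval`) is installed;
    it is imported lazily so this module can be loaded without it.
    """
    import lm_eval

    return lm_eval.simple_evaluate(
        model="hf",                           # Hugging Face transformers backend
        model_args=f"pretrained={model_id}",  # which checkpoint to load
        tasks=TASKS,
        num_fewshot=0,                        # zero-shot, as stated above
    )
```

Calling `run_zero_shot("Qwen/Qwen3.5-9B")` downloads the checkpoint and runs the full suite, so it needs a GPU with enough memory for a 9B model.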
Main Results
Representative 7-task Average
We use acc_norm when available, and acc otherwise.
| Model | Avg. Score |
|---|---|
| Qwen/Qwen3.5-9B | 0.7041 |
| DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | 0.6927 |
| voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning | 0.7133 |
Macro Average over All 12 Reported Metrics
| Model | Macro Avg. |
|---|---|
| Qwen/Qwen3.5-9B | 0.6655 |
| DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | 0.6587 |
| voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning | 0.6749 |
These results indicate that voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is the strongest overall model in this comparison.
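Both aggregate scores can be recomputed from the per-task values reported in this card. The sketch below is a minimal reproduction: the 7-task average takes `acc_norm` where a task reports it and `acc` otherwise, and the macro average is the plain mean of all 12 metrics. The function and variable names are illustrative.

```python
# Per-task scores copied from the benchmark tables in this card.
RESULTS = {
    "Qwen/Qwen3.5-9B": {
        "arc_challenge": {"acc": 0.5427, "acc_norm": 0.5555},
        "arc_easy": {"acc": 0.8140, "acc_norm": 0.7433},
        "boolq": {"acc": 0.8927},
        "hellaswag": {"acc": 0.5827, "acc_norm": 0.7806},
        "openbookqa": {"acc": 0.3280, "acc_norm": 0.4280},
        "piqa": {"acc": 0.7905, "acc_norm": 0.8014},
        "winogrande": {"acc": 0.7269},
    },
    "DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT": {
        "arc_challenge": {"acc": 0.5205, "acc_norm": 0.5469},
        "arc_easy": {"acc": 0.8018, "acc_norm": 0.7348},
        "boolq": {"acc": 0.7878},
        "hellaswag": {"acc": 0.6062, "acc_norm": 0.7944},
        "openbookqa": {"acc": 0.3360, "acc_norm": 0.4520},
        "piqa": {"acc": 0.7905, "acc_norm": 0.8036},
        "winogrande": {"acc": 0.7293},
    },
    "voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning": {
        "arc_challenge": {"acc": 0.5631, "acc_norm": 0.5836},
        "arc_easy": {"acc": 0.8354, "acc_norm": 0.7950},
        "boolq": {"acc": 0.8792},
        "hellaswag": {"acc": 0.5882, "acc_norm": 0.7856},
        "openbookqa": {"acc": 0.3240, "acc_norm": 0.4260},
        "piqa": {"acc": 0.7949, "acc_norm": 0.7992},
        "winogrande": {"acc": 0.7245},
    },
}


def representative_average(tasks):
    """Mean over tasks, using acc_norm when available and acc otherwise."""
    scores = [metrics.get("acc_norm", metrics["acc"]) for metrics in tasks.values()]
    return sum(scores) / len(scores)


def macro_average(tasks):
    """Mean over every reported metric (12 per model in this card)."""
    values = [v for metrics in tasks.values() for v in metrics.values()]
    return sum(values) / len(values)


for model, tasks in RESULTS.items():
    print(f"{model}: 7-task={representative_average(tasks):.4f}, "
          f"macro={macro_average(tasks):.4f}")
```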
Benchmark Results
| Task | Metric | Qwen3.5-9B | DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning |
|---|---|---|---|---|
| arc_challenge | acc | 0.5427 | 0.5205 | 0.5631 |
| arc_challenge | acc_norm | 0.5555 | 0.5469 | 0.5836 |
| arc_easy | acc | 0.8140 | 0.8018 | 0.8354 |
| arc_easy | acc_norm | 0.7433 | 0.7348 | 0.7950 |
| boolq | acc | 0.8927 | 0.7878 | 0.8792 |
| hellaswag | acc | 0.5827 | 0.6062 | 0.5882 |
| hellaswag | acc_norm | 0.7806 | 0.7944 | 0.7856 |
| openbookqa | acc | 0.3280 | 0.3360 | 0.3240 |
| openbookqa | acc_norm | 0.4280 | 0.4520 | 0.4260 |
| piqa | acc | 0.7905 | 0.7905 | 0.7949 |
| piqa | acc_norm | 0.8014 | 0.8036 | 0.7992 |
| winogrande | acc | 0.7269 | 0.7293 | 0.7245 |
Key Observations
Strengths
- The model achieves the best overall average score across the compared models.
- The model shows clear improvements on ARC-Challenge and ARC-Easy.
- The model strongly outperforms DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on BoolQ.
- The gains are especially visible on reasoning-oriented benchmarks.
Trade-offs
- The model is not the top model on every benchmark.
- The model is slightly behind DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on HellaSwag and OpenBookQA.
- The model is close to the baselines on PIQA and Winogrande.
Overall, the model improves the reasoning profile of the base model without uniformly dominating all commonsense tasks.
Interpretation
The benchmark pattern suggests that this model improves:
- structured answer selection
- reasoning-oriented multiple-choice QA
- calibration on science and reading-style benchmarks
At the same time, the gains are smaller on tasks that rely more heavily on narrative continuation or broad commonsense completion priors.
This behavior is consistent with a model that is optimized more toward reasoning quality than pure completion fluency.
Limitations
- The evaluation here is limited to a small set of common zero-shot benchmarks.
- Some benchmark differences are small and may fall within the reported standard error.
- The model should not be described as universally better on every task.
- Additional evaluations on instruction following, long-context reasoning, coding, multilingual performance, and open-ended generation are still needed.
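To judge whether a given difference is likely to exceed noise, the reported standard errors can be combined into a rough z-score. The sketch below treats the two models' scores as independent, which is an assumption: both models are evaluated on the same test items, so a paired test would be more appropriate and this should be read only as a coarse screen. The helper name `z_score` is illustrative.

```python
import math


def z_score(value_a, stderr_a, value_b, stderr_b):
    """Difference between two scores in units of their combined standard error.

    Assumes the two estimates are independent (a simplification here).
    """
    combined_stderr = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    return (value_a - value_b) / combined_stderr


# arc_easy acc_norm: this model vs. Qwen/Qwen3.5-9B (values from Raw Results)
z_arc_easy = z_score(0.7950, 0.0083, 0.7433, 0.0090)

# winogrande acc: this model vs. Qwen/Qwen3.5-9B (values from Raw Results)
z_winogrande = z_score(0.7245, 0.0126, 0.7269, 0.0125)

print(f"arc_easy acc_norm z = {z_arc_easy:.2f}")
print(f"winogrande acc z = {z_winogrande:.2f}")
```

Under this approximation, the ARC-Easy gain is several combined standard errors wide, while the Winogrande difference is well inside one.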
Conclusion
voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a strong 9B reasoning-oriented model built on top of Qwen/Qwen3.5-9B.
In this comparison, it delivers:
- the best overall aggregate benchmark score
- the strongest ARC performance
- strong BoolQ performance
- competitive general capability on other zero-shot commonsense tasks
This makes it a good choice for users who care about reasoning-oriented zero-shot performance in a compact 9B model.
Raw Results
Qwen/Qwen3.5-9B
| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5427 | 0.0146 |
| arc_challenge | acc_norm | 0.5555 | 0.0145 |
| arc_easy | acc | 0.8140 | 0.0080 |
| arc_easy | acc_norm | 0.7433 | 0.0090 |
| boolq | acc | 0.8927 | 0.0054 |
| hellaswag | acc | 0.5827 | 0.0049 |
| hellaswag | acc_norm | 0.7806 | 0.0041 |
| openbookqa | acc | 0.3280 | 0.0210 |
| openbookqa | acc_norm | 0.4280 | 0.0221 |
| piqa | acc | 0.7905 | 0.0095 |
| piqa | acc_norm | 0.8014 | 0.0093 |
| winogrande | acc | 0.7269 | 0.0125 |
DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5205 | 0.0146 |
| arc_challenge | acc_norm | 0.5469 | 0.0145 |
| arc_easy | acc | 0.8018 | 0.0082 |
| arc_easy | acc_norm | 0.7348 | 0.0091 |
| boolq | acc | 0.7878 | 0.0072 |
| hellaswag | acc | 0.6062 | 0.0049 |
| hellaswag | acc_norm | 0.7944 | 0.0040 |
| openbookqa | acc | 0.3360 | 0.0211 |
| openbookqa | acc_norm | 0.4520 | 0.0223 |
| piqa | acc | 0.7905 | 0.0095 |
| piqa | acc_norm | 0.8036 | 0.0093 |
| winogrande | acc | 0.7293 | 0.0125 |
voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5631 | 0.0145 |
| arc_challenge | acc_norm | 0.5836 | 0.0144 |
| arc_easy | acc | 0.8354 | 0.0076 |
| arc_easy | acc_norm | 0.7950 | 0.0083 |
| boolq | acc | 0.8792 | 0.0057 |
| hellaswag | acc | 0.5882 | 0.0049 |
| hellaswag | acc_norm | 0.7856 | 0.0041 |
| openbookqa | acc | 0.3240 | 0.0210 |
| openbookqa | acc_norm | 0.4260 | 0.0221 |
| piqa | acc | 0.7949 | 0.0094 |
| piqa | acc_norm | 0.7992 | 0.0093 |
| winogrande | acc | 0.7245 | 0.0126 |
Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
This model is a fine-tuned version of Qwen/Qwen3.5-9B on the voidful/gemini-3.1-opus-4.6-reasoning-merged dataset.
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 80
- gradient_accumulation_steps: 4
- total_train_batch_size: 320
- total_eval_batch_size: 80
- optimizer: AdamW, fused PyTorch implementation (OptimizerNames.ADAMW_TORCH_FUSED), with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- training_steps: 6
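The total batch sizes follow from the per-device sizes, the device count, and gradient accumulation. The sketch below checks that arithmetic and also prints a cosine learning-rate curve over the 6 training steps. The schedule part assumes no warmup and decay to zero, since the card lists no warmup setting; the function name `cosine_lr` is illustrative.

```python
import math

# Values copied from the hyperparameter list above.
learning_rate = 2e-05
train_batch_size = 1
eval_batch_size = 1
num_devices = 80
gradient_accumulation_steps = 4
training_steps = 6

# Effective batch sizes: accumulation applies to training only.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices


def cosine_lr(step, total_steps=training_steps, base_lr=learning_rate):
    """Cosine decay from base_lr to zero, assuming no warmup steps."""
    progress = step / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


print("total_train_batch_size:", total_train_batch_size)
print("total_eval_batch_size:", total_eval_batch_size)
for step in range(training_steps + 1):
    print(f"step {step}: lr = {cosine_lr(step):.3e}")
```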
Training results
Framework versions
- Transformers 5.3.0
- Pytorch 2.10.0+cu128
- Datasets 4.5.0
- Tokenizers 0.22.2
