voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a 9B-scale model based on Qwen/Qwen3.5-9B.
This model is designed to improve reasoning-oriented multiple-choice performance while preserving strong general capability.

In our zero-shot evaluation, the model achieves the best overall aggregate performance among the following three models:

Qwen/Qwen3.5-9B
DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

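The largest gains over the base model appear on ARC-Challenge and ARC-Easy. On BoolQ, the model scores well above DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT (0.8792 vs. 0.7878) but slightly below the base model (0.8927).
These results suggest that the model improves structured reasoning and answer selection on science-style multiple-choice questions.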

Model Summary

Base model: Qwen/Qwen3.5-9B
Model type: Causal language model
Primary focus: Reasoning, multiple-choice QA, and general zero-shot evaluation
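Strengths: ARC-Challenge and ARC-Easy, aggregate benchmark performance, BoolQ relative to DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
Trade-offs: Slightly weaker than DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on HellaSwag and OpenBookQA, and slightly below the base model on BoolQ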

Evaluation Setup

We compare three models under the same zero-shot setting:

0-shot (no few-shot examples)
Same benchmark suite
Reported with standard error

Compared Models

Qwen/Qwen3.5-9B
DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
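
The card does not say which tool produced these numbers. The task and metric names (arc_challenge, acc_norm, and so on) match those used by EleutherAI's lm-evaluation-harness, so the sketch below assumes that harness and only builds the command line for each model; it does not run anything. The `hf` backend and the absence of extra arguments such as batch size or dtype are assumptions, not settings confirmed by this card.

```python
import shlex

# Task names as they appear in the results tables of this card.
TASKS = [
    "arc_challenge", "arc_easy", "boolq", "hellaswag",
    "openbookqa", "piqa", "winogrande",
]

MODELS = [
    "Qwen/Qwen3.5-9B",
    "DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT",
    "voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning",
]

def build_eval_command(model_id: str) -> list[str]:
    """Build a 0-shot lm_eval invocation for one model.

    Assumption: the evaluation was run with lm-evaluation-harness and its
    Hugging Face ("hf") backend; the card itself does not confirm this.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(TASKS),
        "--num_fewshot", "0",
    ]

# Print one shell-ready command per compared model.
for model_id in MODELS:
    print(shlex.join(build_eval_command(model_id)))
```

Using the same task list and `--num_fewshot 0` for all three models is what keeps the comparison on equal footing.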

Main Results

Representative 7-task Average

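For each of the seven tasks we use acc_norm when available and acc otherwise (BoolQ and Winogrande report only acc), then take the unweighted mean across tasks.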

Model	Avg. Score
`Qwen/Qwen3.5-9B`	0.7041
`DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT`	0.6927
`voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning`	0.7133

Macro Average over All 12 Reported Metrics

Model	Macro Avg.
`Qwen/Qwen3.5-9B`	0.6655
`DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT`	0.6587
`voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning`	0.6749

On both aggregates, voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning scores highest in this comparison, roughly 0.9 points above the base model (0.7133 vs. 0.7041 and 0.6749 vs. 0.6655).
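
Both aggregates can be recomputed from the per-task numbers reported in this card. The sketch below is one way to do it, assuming an unweighted mean in both cases; the function and variable names are illustrative and not part of any evaluation tool.

```python
# Per-task scores copied from the Benchmark Results table of this card.
RESULTS = {
    "Qwen/Qwen3.5-9B": {
        "arc_challenge": {"acc": 0.5427, "acc_norm": 0.5555},
        "arc_easy": {"acc": 0.8140, "acc_norm": 0.7433},
        "boolq": {"acc": 0.8927},
        "hellaswag": {"acc": 0.5827, "acc_norm": 0.7806},
        "openbookqa": {"acc": 0.3280, "acc_norm": 0.4280},
        "piqa": {"acc": 0.7905, "acc_norm": 0.8014},
        "winogrande": {"acc": 0.7269},
    },
    "DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT": {
        "arc_challenge": {"acc": 0.5205, "acc_norm": 0.5469},
        "arc_easy": {"acc": 0.8018, "acc_norm": 0.7348},
        "boolq": {"acc": 0.7878},
        "hellaswag": {"acc": 0.6062, "acc_norm": 0.7944},
        "openbookqa": {"acc": 0.3360, "acc_norm": 0.4520},
        "piqa": {"acc": 0.7905, "acc_norm": 0.8036},
        "winogrande": {"acc": 0.7293},
    },
    "voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning": {
        "arc_challenge": {"acc": 0.5631, "acc_norm": 0.5836},
        "arc_easy": {"acc": 0.8354, "acc_norm": 0.7950},
        "boolq": {"acc": 0.8792},
        "hellaswag": {"acc": 0.5882, "acc_norm": 0.7856},
        "openbookqa": {"acc": 0.3240, "acc_norm": 0.4260},
        "piqa": {"acc": 0.7949, "acc_norm": 0.7992},
        "winogrande": {"acc": 0.7245},
    },
}

def representative_average(tasks):
    """Mean over tasks, taking acc_norm when present and acc otherwise."""
    picked = [m.get("acc_norm", m["acc"]) for m in tasks.values()]
    return sum(picked) / len(picked)

def macro_average(tasks):
    """Mean over every reported (task, metric) value, 12 per model here."""
    values = [v for m in tasks.values() for v in m.values()]
    return sum(values) / len(values)

for name, tasks in RESULTS.items():
    rep = representative_average(tasks)
    mac = macro_average(tasks)
    print(f"{name}: 7-task avg = {rep:.4f}, macro avg = {mac:.4f}")
```

The recomputed values agree with the two tables above to within rounding in the fourth decimal place.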

Benchmark Results

Task	Metric	Qwen/Qwen3.5-9B	DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT	voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
arc_challenge	acc	0.5427	0.5205	0.5631
arc_challenge	acc_norm	0.5555	0.5469	0.5836
arc_easy	acc	0.8140	0.8018	0.8354
arc_easy	acc_norm	0.7433	0.7348	0.7950
boolq	acc	0.8927	0.7878	0.8792
hellaswag	acc	0.5827	0.6062	0.5882
hellaswag	acc_norm	0.7806	0.7944	0.7856
openbookqa	acc	0.3280	0.3360	0.3240
openbookqa	acc_norm	0.4280	0.4520	0.4260
piqa	acc	0.7905	0.7905	0.7949
piqa	acc_norm	0.8014	0.8036	0.7992
winogrande	acc	0.7269	0.7293	0.7245
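
To see where the model moves relative to its base, the table can be reduced to per-metric differences. The following is a minimal sketch that ranks those differences; the helper name `deltas_vs_base` is hypothetical, and the numbers are copied from the table above.

```python
# Rows copied from the Benchmark Results table:
# (task, metric, Qwen/Qwen3.5-9B score, voidful model score).
ROWS = [
    ("arc_challenge", "acc", 0.5427, 0.5631),
    ("arc_challenge", "acc_norm", 0.5555, 0.5836),
    ("arc_easy", "acc", 0.8140, 0.8354),
    ("arc_easy", "acc_norm", 0.7433, 0.7950),
    ("boolq", "acc", 0.8927, 0.8792),
    ("hellaswag", "acc", 0.5827, 0.5882),
    ("hellaswag", "acc_norm", 0.7806, 0.7856),
    ("openbookqa", "acc", 0.3280, 0.3240),
    ("openbookqa", "acc_norm", 0.4280, 0.4260),
    ("piqa", "acc", 0.7905, 0.7949),
    ("piqa", "acc_norm", 0.8014, 0.7992),
    ("winogrande", "acc", 0.7269, 0.7245),
]

def deltas_vs_base(rows):
    """Return (task, metric, delta) sorted from largest gain to largest drop."""
    scored = [
        (task, metric, round(tuned - base, 4))
        for task, metric, base, tuned in rows
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)

for task, metric, delta in deltas_vs_base(ROWS):
    print(f"{task:14s} {metric:9s} {delta:+.4f}")
```

On these numbers, the four ARC metrics sit at the top of the ranking and BoolQ acc sits at the bottom.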

Key Observations

Strengths

The model achieves the best overall average score across the compared models.
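The model shows clear improvements over the base model on ARC-Challenge (acc_norm 0.5836 vs. 0.5555) and ARC-Easy (acc_norm 0.7950 vs. 0.7433).
The model strongly outperforms DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on BoolQ (0.8792 vs. 0.7878), although it remains slightly below the base model there (0.8927).
The gains are concentrated on the ARC science-question benchmarks.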

Trade-offs

The model is not the top model on every benchmark.
The model is slightly behind DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on HellaSwag and OpenBookQA.
The model is close to the baselines on PIQA and Winogrande.

Overall, the model improves the reasoning profile of the base model without uniformly dominating all commonsense tasks.

Interpretation

The benchmark pattern suggests that this model improves:

structured answer selection
reasoning-oriented multiple-choice QA
accuracy on science-style benchmarks such as ARC (calibration itself is not measured here)

At the same time, the gains are small or absent on tasks that rely more heavily on narrative continuation or broad commonsense completion priors, such as HellaSwag, PIQA, and Winogrande.

This behavior is consistent with a model that is optimized more toward reasoning quality than pure completion fluency.

Limitations

The evaluation here is limited to a small set of common zero-shot benchmarks.
Some benchmark differences are small and may fall within the reported standard error.
The model should not be described as universally better on every task.
Additional evaluations on instruction following, long-context reasoning, coding, multilingual performance, and open-ended generation are still needed.
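
The point about standard error can be checked with a rough z-score on the difference between two models. The sketch below does this for Qwen/Qwen3.5-9B and DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT, the two models whose standard errors are listed under Raw Results. It assumes the two runs can be treated as independent samples, which is a simplification because both models answer the same test items, and it uses the conventional |z| > 1.96 threshold as an arbitrary cut-off.

```python
import math

# (task, metric): (value, stderr), copied from the Raw Results tables.
BASE = {
    ("arc_challenge", "acc"): (0.5427, 0.0146),
    ("arc_challenge", "acc_norm"): (0.5555, 0.0145),
    ("arc_easy", "acc"): (0.8140, 0.0080),
    ("arc_easy", "acc_norm"): (0.7433, 0.0090),
    ("boolq", "acc"): (0.8927, 0.0054),
    ("hellaswag", "acc"): (0.5827, 0.0049),
    ("hellaswag", "acc_norm"): (0.7806, 0.0041),
    ("openbookqa", "acc"): (0.3280, 0.0210),
    ("openbookqa", "acc_norm"): (0.4280, 0.0221),
    ("piqa", "acc"): (0.7905, 0.0095),
    ("piqa", "acc_norm"): (0.8014, 0.0093),
    ("winogrande", "acc"): (0.7269, 0.0125),
}
DAVIDAU = {
    ("arc_challenge", "acc"): (0.5205, 0.0146),
    ("arc_challenge", "acc_norm"): (0.5469, 0.0145),
    ("arc_easy", "acc"): (0.8018, 0.0082),
    ("arc_easy", "acc_norm"): (0.7348, 0.0091),
    ("boolq", "acc"): (0.7878, 0.0072),
    ("hellaswag", "acc"): (0.6062, 0.0049),
    ("hellaswag", "acc_norm"): (0.7944, 0.0040),
    ("openbookqa", "acc"): (0.3360, 0.0211),
    ("openbookqa", "acc_norm"): (0.4520, 0.0223),
    ("piqa", "acc"): (0.7905, 0.0095),
    ("piqa", "acc_norm"): (0.8036, 0.0093),
    ("winogrande", "acc"): (0.7293, 0.0125),
}

Z_THRESHOLD = 1.96  # conventional two-sided 95% cut-off (an assumption)

def z_score(a, b):
    """z-score of a minus b, combining the two standard errors in quadrature."""
    (value_a, se_a), (value_b, se_b) = a, b
    return (value_a - value_b) / math.sqrt(se_a ** 2 + se_b ** 2)

def beyond_noise(model, reference):
    """Return the (task, metric) keys whose difference exceeds the threshold."""
    return {
        key for key in reference
        if abs(z_score(model[key], reference[key])) > Z_THRESHOLD
    }

for key in BASE:
    z = z_score(DAVIDAU[key], BASE[key])
    label = "beyond noise" if abs(z) > Z_THRESHOLD else "within noise"
    print(f"{key[0]:14s} {key[1]:9s} z = {z:+.2f}  ({label})")
```

Under these assumptions, only the BoolQ and HellaSwag differences between those two models exceed the threshold; the remaining differences fall within the combined standard error.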

Conclusion

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a strong 9B reasoning-oriented model built on top of Qwen/Qwen3.5-9B.

In this comparison, it delivers:

the best overall aggregate benchmark score
the strongest ARC performance
strong BoolQ performance (0.8792, well above the DavidAU model and slightly below the base model)
competitive general capability on other zero-shot commonsense tasks

This makes it a good choice for users who care about reasoning-oriented zero-shot performance in a compact 9B model.

Raw Results

`Qwen/Qwen3.5-9B`

Task	Metric	Value	Stderr
arc_challenge	acc	0.5427	0.0146
arc_challenge	acc_norm	0.5555	0.0145
arc_easy	acc	0.8140	0.0080
arc_easy	acc_norm	0.7433	0.0090
boolq	acc	0.8927	0.0054
hellaswag	acc	0.5827	0.0049
hellaswag	acc_norm	0.7806	0.0041
openbookqa	acc	0.3280	0.0210
openbookqa	acc_norm	0.4280	0.0221
piqa	acc	0.7905	0.0095
piqa	acc_norm	0.8014	0.0093
winogrande	acc	0.7269	0.0125

`DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT`

Task	Metric	Value	Stderr
arc_challenge	acc	0.5205	0.0146
arc_challenge	acc_norm	0.5469	0.0145
arc_easy	acc	0.8018	0.0082
arc_easy	acc_norm	0.7348	0.0091
boolq	acc	0.7878	0.0072
hellaswag	acc	0.6062	0.0049
hellaswag	acc_norm	0.7944	0.0040
openbookqa	acc	0.3360	0.0211
openbookqa	acc_norm	0.4520	0.0223
piqa	acc	0.7905	0.0095
piqa	acc_norm	0.8036	0.0093
winogrande	acc	0.7293	0.0125