URL: https://huggingface.co/Momix-44/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a 9B-scale model based on Qwen/Qwen3.5-9B.
This model is designed to improve reasoning-oriented multiple-choice performance while preserving strong general capability.

In our zero-shot evaluation, the model achieves the best overall aggregate performance among the following three models:

  • Qwen/Qwen3.5-9B
  • DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
  • voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

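The largest gains over the base model appear on ARC-Challenge and ARC-Easy. On BoolQ the model scores 0.8792, well above DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT (0.7878) but slightly below the base model (0.8927).
These results suggest that the model improves structured reasoning and answer selection on multiple-choice tasks.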


Model Summary

  • Base model: Qwen/Qwen3.5-9B
  • Model type: Causal language model
  • Primary focus: Reasoning, multiple-choice QA, and general zero-shot evaluation
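  • Strengths: ARC-Challenge, ARC-Easy, aggregate benchmark performance, and BoolQ relative to DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
  • Trade-offs: Behind DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on HellaSwag and OpenBookQA, and slightly behind the base model on BoolQ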

Evaluation Setup

We compare three models under the same zero-shot setting:

  • 0-shot (no few-shot examples)
  • Same benchmark suite
  • Reported with standard error
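
The card does not name the evaluation tool. The task names and the acc / acc_norm / stderr columns match the conventions of EleutherAI's lm-evaluation-harness, so the sketch below assumes that tool and builds a zero-shot command line for each compared model. The batch size and dtype are illustrative assumptions, not settings reported here.

```python
import shlex

# Assumption: evaluation was run with lm-evaluation-harness (not stated in the card).
TASKS = ["arc_challenge", "arc_easy", "boolq", "hellaswag",
         "openbookqa", "piqa", "winogrande"]

MODELS = [
    "Qwen/Qwen3.5-9B",
    "DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT",
    "voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning",
]


def build_eval_command(model_id: str, batch_size: int = 1) -> list:
    """Return an lm_eval argument list for a zero-shot run on the seven tasks."""
    return [
        "lm_eval",
        "--model", "hf",
        # dtype is an assumption; the card only reports BF16 weights.
        "--model_args", f"pretrained={model_id},dtype=bfloat16",
        "--tasks", ",".join(TASKS),
        "--num_fewshot", "0",  # zero-shot: no in-context examples
        "--batch_size", str(batch_size),
    ]


# Print one shell-ready command per model; nothing is executed here.
for model_id in MODELS:
    print(shlex.join(build_eval_command(model_id)))
```

Keeping the task list and few-shot count identical across the three commands is what makes the scores comparable.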

Compared Models

  1. Qwen/Qwen3.5-9B
  2. DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
  3. voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

Main Results

Representative 7-task Average

We use acc_norm when available, and acc otherwise.

Model                                             Avg. Score
Qwen/Qwen3.5-9B                                   0.7041
DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT     0.6927
voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning  0.7133

Macro Average over All 12 Reported Metrics

Model                                             Macro Avg.
Qwen/Qwen3.5-9B                                   0.6655
DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT     0.6587
voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning  0.6749
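
Both aggregates can be recomputed from the per-task numbers. The sketch below copies the values from the Benchmark Results table in this card and applies the stated rule (acc_norm when a task reports it, acc otherwise) for the 7-task average, then takes a plain mean over all 12 metrics for the macro average. The function and variable names are illustrative.

```python
from statistics import mean

# Column order: (Qwen/Qwen3.5-9B, DavidAU HighIQ-INSTRUCT, voidful reasoning model).
RESULTS = {
    ("arc_challenge", "acc"): (0.5427, 0.5205, 0.5631),
    ("arc_challenge", "acc_norm"): (0.5555, 0.5469, 0.5836),
    ("arc_easy", "acc"): (0.8140, 0.8018, 0.8354),
    ("arc_easy", "acc_norm"): (0.7433, 0.7348, 0.7950),
    ("boolq", "acc"): (0.8927, 0.7878, 0.8792),
    ("hellaswag", "acc"): (0.5827, 0.6062, 0.5882),
    ("hellaswag", "acc_norm"): (0.7806, 0.7944, 0.7856),
    ("openbookqa", "acc"): (0.3280, 0.3360, 0.3240),
    ("openbookqa", "acc_norm"): (0.4280, 0.4520, 0.4260),
    ("piqa", "acc"): (0.7905, 0.7905, 0.7949),
    ("piqa", "acc_norm"): (0.8014, 0.8036, 0.7992),
    ("winogrande", "acc"): (0.7269, 0.7293, 0.7245),
}


def representative_average(col: int) -> float:
    """One score per task: acc_norm if the task reports it, else acc."""
    picked = []
    for task in sorted({task for task, _ in RESULTS}):
        key = (task, "acc_norm") if (task, "acc_norm") in RESULTS else (task, "acc")
        picked.append(RESULTS[key][col])
    return mean(picked)


def macro_average(col: int) -> float:
    """Unweighted mean over all 12 reported metrics."""
    return mean(row[col] for row in RESULTS.values())


for col, name in enumerate(("Qwen", "DavidAU", "voidful")):
    print(name, representative_average(col), macro_average(col))
```

Note that the macro average counts tasks with both acc and acc_norm twice, so ARC, HellaSwag, OpenBookQA, and PIQA carry more weight there than BoolQ and Winogrande.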

On both aggregates, voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning ranks first in this comparison, roughly 0.009 above the base model.


Benchmark Results

Task           Metric    Qwen/Qwen3.5-9B  DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT  voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
arc_challenge  acc       0.5427           0.5205                                         0.5631
arc_challenge  acc_norm  0.5555           0.5469                                         0.5836
arc_easy       acc       0.8140           0.8018                                         0.8354
arc_easy       acc_norm  0.7433           0.7348                                         0.7950
boolq          acc       0.8927           0.7878                                         0.8792
hellaswag      acc       0.5827           0.6062                                         0.5882
hellaswag      acc_norm  0.7806           0.7944                                         0.7856
openbookqa     acc       0.3280           0.3360                                         0.3240
openbookqa     acc_norm  0.4280           0.4520                                         0.4260
piqa           acc       0.7905           0.7905                                         0.7949
piqa           acc_norm  0.8014           0.8036                                         0.7992
winogrande     acc       0.7269           0.7293                                         0.7245

Key Observations

Strengths

  • The model achieves the best overall average score across the compared models.
  • The model shows clear improvements on ARC-Challenge and ARC-Easy.
  • The model strongly outperforms DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on BoolQ (0.8792 vs. 0.7878), although it trails the base model there (0.8927).
  • The gains over the base model are concentrated on the ARC science-question benchmarks.

Trade-offs

  • The model is not the top model on every benchmark.
  • The model is slightly behind DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on HellaSwag and OpenBookQA.
  • The model is close to the baselines on PIQA and Winogrande.

Overall, the model improves the reasoning profile of the base model without uniformly dominating all commonsense tasks.
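
The observations above can be checked by ranking per-metric differences. This is a minimal sketch that copies the values from the Benchmark Results table and sorts the differences between this model and a chosen baseline; the helper name `deltas_vs` is hypothetical.

```python
# Column order: (Qwen/Qwen3.5-9B, DavidAU HighIQ-INSTRUCT, voidful reasoning model).
RESULTS = {
    ("arc_challenge", "acc"): (0.5427, 0.5205, 0.5631),
    ("arc_challenge", "acc_norm"): (0.5555, 0.5469, 0.5836),
    ("arc_easy", "acc"): (0.8140, 0.8018, 0.8354),
    ("arc_easy", "acc_norm"): (0.7433, 0.7348, 0.7950),
    ("boolq", "acc"): (0.8927, 0.7878, 0.8792),
    ("hellaswag", "acc"): (0.5827, 0.6062, 0.5882),
    ("hellaswag", "acc_norm"): (0.7806, 0.7944, 0.7856),
    ("openbookqa", "acc"): (0.3280, 0.3360, 0.3240),
    ("openbookqa", "acc_norm"): (0.4280, 0.4520, 0.4260),
    ("piqa", "acc"): (0.7905, 0.7905, 0.7949),
    ("piqa", "acc_norm"): (0.8014, 0.8036, 0.7992),
    ("winogrande", "acc"): (0.7269, 0.7293, 0.7245),
}


def deltas_vs(baseline_col: int, model_col: int = 2):
    """Return [(metric_key, model - baseline)], largest gain first."""
    rows = [(key, vals[model_col] - vals[baseline_col])
            for key, vals in RESULTS.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Differences against the base model (column 0).
for (task, metric), delta in deltas_vs(0):
    print(f"{task:14s} {metric:9s} {delta:+.4f}")
```

Calling `deltas_vs(1)` gives the same ranking against DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT.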


Interpretation

The benchmark pattern suggests that this model improves:

  • structured answer selection
  • reasoning-oriented multiple-choice QA
  • accuracy on science-question benchmarks such as ARC

At the same time, the gains are smaller on tasks that rely more heavily on narrative continuation or broad commonsense completion priors.

This behavior is consistent with a model that is optimized more toward reasoning quality than pure completion fluency.


Limitations

  • The evaluation here is limited to a small set of common zero-shot benchmarks.
  • Some benchmark differences are small and may fall within the reported standard error.
  • The model should not be described as universally better on every task.
  • Additional evaluations on instruction following, long-context reasoning, coding, multilingual performance, and open-ended generation are still needed.
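
To make the standard-error caveat concrete, the sketch below compares this model with the base model using the values and standard errors from the Raw Results tables. It treats the two scores as independent estimates and divides their difference by the combined standard error, which is an assumption: both models are scored on the same test items, so a paired test would be more appropriate if per-item outputs were available. The 1.96 threshold and the function names are illustrative choices.

```python
import math

# ((base value, base stderr), (model value, model stderr)) per metric,
# base = Qwen/Qwen3.5-9B, model = voidful reasoning model.
RAW = {
    ("arc_challenge", "acc"): ((0.5427, 0.0146), (0.5631, 0.0145)),
    ("arc_challenge", "acc_norm"): ((0.5555, 0.0145), (0.5836, 0.0144)),
    ("arc_easy", "acc"): ((0.8140, 0.0080), (0.8354, 0.0076)),
    ("arc_easy", "acc_norm"): ((0.7433, 0.0090), (0.7950, 0.0083)),
    ("boolq", "acc"): ((0.8927, 0.0054), (0.8792, 0.0057)),
    ("hellaswag", "acc"): ((0.5827, 0.0049), (0.5882, 0.0049)),
    ("hellaswag", "acc_norm"): ((0.7806, 0.0041), (0.7856, 0.0041)),
    ("openbookqa", "acc"): ((0.3280, 0.0210), (0.3240, 0.0210)),
    ("openbookqa", "acc_norm"): ((0.4280, 0.0221), (0.4260, 0.0221)),
    ("piqa", "acc"): ((0.7905, 0.0095), (0.7949, 0.0094)),
    ("piqa", "acc_norm"): ((0.8014, 0.0093), (0.7992, 0.0093)),
    ("winogrande", "acc"): ((0.7269, 0.0125), (0.7245, 0.0126)),
}


def z_score(base, model) -> float:
    """Difference divided by the combined stderr (independence assumed)."""
    (b, se_b), (m, se_m) = base, model
    return (m - b) / math.sqrt(se_b ** 2 + se_m ** 2)


def beyond_noise(threshold: float = 1.96):
    """Metrics whose |z| exceeds the threshold."""
    return [key for key, (base, model) in RAW.items()
            if abs(z_score(base, model)) > threshold]


for key, (base, model) in RAW.items():
    print(key, round(z_score(base, model), 2))
```

Under this approximation, only the ARC-Easy acc_norm gain over the base model exceeds 1.96 combined standard errors; the remaining differences, including the BoolQ drop, fall inside that band.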

Conclusion

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a strong 9B reasoning-oriented model built on top of Qwen/Qwen3.5-9B.

In this comparison, it delivers:

  • the best overall aggregate benchmark score
  • the strongest ARC performance
  • strong BoolQ performance (0.8792, second to the base model's 0.8927)
  • competitive general capability on other zero-shot commonsense tasks

This makes it a good choice for users who care about reasoning-oriented zero-shot performance in a compact 9B model.


Raw Results

Qwen/Qwen3.5-9B

Task           Metric    Value   Stderr
arc_challenge  acc       0.5427  0.0146
arc_challenge  acc_norm  0.5555  0.0145
arc_easy       acc       0.8140  0.0080
arc_easy       acc_norm  0.7433  0.0090
boolq          acc       0.8927  0.0054
hellaswag      acc       0.5827  0.0049
hellaswag      acc_norm  0.7806  0.0041
openbookqa     acc       0.3280  0.0210
openbookqa     acc_norm  0.4280  0.0221
piqa           acc       0.7905  0.0095
piqa           acc_norm  0.8014  0.0093
winogrande     acc       0.7269  0.0125

DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT

Task           Metric    Value   Stderr
arc_challenge  acc       0.5205  0.0146
arc_challenge  acc_norm  0.5469  0.0145
arc_easy       acc       0.8018  0.0082
arc_easy       acc_norm  0.7348  0.0091
boolq          acc       0.7878  0.0072
hellaswag      acc       0.6062  0.0049
hellaswag      acc_norm  0.7944  0.0040
openbookqa     acc       0.3360  0.0211
openbookqa     acc_norm  0.4520  0.0223
piqa           acc       0.7905  0.0095
piqa           acc_norm  0.8036  0.0093
winogrande     acc       0.7293  0.0125

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

Task           Metric    Value   Stderr
arc_challenge  acc       0.5631  0.0145
arc_challenge  acc_norm  0.5836  0.0144
arc_easy       acc       0.8354  0.0076
arc_easy       acc_norm  0.7950  0.0083
boolq          acc       0.8792  0.0057
hellaswag      acc       0.5882  0.0049
hellaswag      acc_norm  0.7856  0.0041
openbookqa     acc       0.3240  0.0210
openbookqa     acc_norm  0.4260  0.0221
piqa           acc       0.7949  0.0094
piqa           acc_norm  0.7992  0.0093
winogrande     acc       0.7245  0.0126

Built with Axolotl


Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

This model is a fine-tuned version of Qwen/Qwen3.5-9B on the voidful/gemini-3.1-opus-4.6-reasoning-merged dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 80
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 320
  • total_eval_batch_size: 80
  • optimizer: ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: cosine
  • training_steps: 6
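
The batch-size entries above are related by simple arithmetic, and the cosine scheduler determines the learning rate at each of the listed steps. The sketch below reproduces both. It assumes a plain cosine decay to zero with no warmup, since the card lists no warmup setting; the function names are illustrative.

```python
import math


def total_batch_size(per_device: int, num_devices: int, grad_accum: int = 1) -> int:
    """Sequences consumed per optimizer step across all devices."""
    return per_device * num_devices * grad_accum


def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps.

    Assumption: no warmup phase (none is listed in the card).
    """
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values taken from the hyperparameter list above.
print(total_batch_size(per_device=1, num_devices=80, grad_accum=4))  # train
print(total_batch_size(per_device=1, num_devices=80))                # eval
for step in range(6):
    print(step, cosine_lr(step, total_steps=6, peak_lr=2e-05))
```

With these settings the training batch works out to 1 × 80 × 4 = 320 and the evaluation batch to 1 × 80 = 80, matching the listed totals.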

Training results

Framework versions

  • Transformers 5.3.0
  • Pytorch 2.10.0+cu128
  • Datasets 4.5.0
  • Tokenizers 0.22.2

Safetensors: model size 9B params, tensor type BF16
