Gemma-4-31B Musica v1

RP/storygen/conversational tune of Gemma-4-31B-it, the second model in the Musica series, following TQ3.5-27B-Musica-v1. It feels like a decent overall upgrade over the Qwen version, and honestly I've liked it way more than stock Gemma in its domains.

Both reasoning and non-reasoning modes work, though it might sometimes (rarely, thankfully) skip reasoning even if it's enabled; just regen in that case. Reasoning styles also seem to work, so prefilling with <|channel>thought\n Okay, let's see will make it use DeepSeek-esque reasoning most of the time.
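The prefill and regen advice above can be sketched roughly as follows. This is a minimal sketch, not the author's tooling: `generate` is a hypothetical callable standing in for whatever backend is used, `rendered_prompt` is assumed to already end at the start of the model's turn, and the exact whitespace between the channel opener and the style phrase is an assumption.

```python
from typing import Callable

# Strings taken from the card; the exact whitespace after "\n" is an assumption.
THOUGHT_OPENER = "<|channel>thought\n"
DEEPSEEK_STYLE_START = "Okay, let's see"


def with_reasoning_prefill(rendered_prompt: str,
                           style_start: str = DEEPSEEK_STYLE_START) -> str:
    """Append the reasoning prefill to a prompt that already ends
    at the start of the model's turn."""
    return rendered_prompt + THOUGHT_OPENER + style_start


def generate_with_reasoning(generate: Callable[[str], str],
                            rendered_prompt: str,
                            max_tries: int = 3) -> str:
    """Regenerate until the reply opens a thought channel,
    or return the last attempt after max_tries."""
    reply = ""
    for _ in range(max_tries):
        reply = generate(rendered_prompt)  # hypothetical backend call
        if reply.lstrip().startswith(THOUGHT_OPENER.strip()):
            return reply
    return reply  # reasoning was skipped on every try
```

With a prefill the reply continues from the prefilled text, so the retry check is only meant for the non-prefilled case where reasoning is enabled but occasionally skipped.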

Really liked the instruction following on this one: it's very steerable, the same as or better than base. Refusals are non-existent. Swipe diversity seems quite a bit better than base.

This training run was sponsored by ArliAI

Training Notes

Gemma is a MAJOR pain to train. We had to track down a working Axolotl commit (thanks to ConicCat for suggesting one). It didn't have the hybrid FA-SDPA support, so the run was purely SDPA, which is le S L O W: it took ~35 hours for one epoch, compared to 17 hours for two on Qwen. But it seemed to converge to around the same loss earlier than Qwen did, so it probably didn't need more than 1 epoch.
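For a rough sense of scale, the per-epoch cost implied by those numbers can be worked out directly. This is only back-of-the-envelope arithmetic on the approximate hours quoted above; it ignores the different sequence lengths and model sizes, and it assumes nothing about whether the two runs used the same hardware.

```python
# Hours quoted in the note above (approximate).
gemma_hours_per_epoch = 35 / 1   # ~35 h for one epoch
qwen_hours_per_epoch = 17 / 2    # 17 h for two epochs

slowdown = gemma_hours_per_epoch / qwen_hours_per_epoch

print(f"Qwen: {qwen_hours_per_epoch:.1f} h/epoch, "
      f"Gemma: {gemma_hours_per_epoch:.1f} h/epoch, "
      f"ratio: {slowdown:.1f}x")
# prints: Qwen: 8.5 h/epoch, Gemma: 35.0 h/epoch, ratio: 4.1x
```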

I've used fizzAI/Kaitan-Pretokenization to pretokenize my dataset, with 8192 seqlen (I had to use a lower seqlen than my usual 16384 because Gemma is slow and memory-hungry to train as is), and last-turn-only training, to bypass a bouquet of Gemma-specific problems with training reasoning. It seems to have worked.
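To illustrate what last-turn-only training means at the label level, here is a minimal sketch. It is not taken from Kaitan-Pretokenization: the function name, the `assistant_spans` input, and the left-truncation choice are all hypothetical. The one established convention it relies on is that labels set to -100 are skipped by PyTorch's cross-entropy loss by default.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def mask_all_but_last_turn(
    input_ids: List[int],
    assistant_spans: List[Tuple[int, int]],
    max_len: int = 8192,
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) where only the final assistant span carries loss.

    assistant_spans holds hypothetical (start, end) token offsets, end exclusive,
    one pair per assistant turn in conversation order.
    """
    if not assistant_spans:
        raise ValueError("need at least one assistant turn")
    start, end = assistant_spans[-1]
    labels = [IGNORE_INDEX] * len(input_ids)
    labels[start:end] = input_ids[start:end]
    # Assumption: truncate from the left so the trained (last) turn survives the cap.
    return input_ids[-max_len:], labels[-max_len:]


ids = [1, 2, 3, 4, 5, 6]
_, labels = mask_all_but_last_turn(ids, [(1, 2), (4, 6)])
print(labels)  # [-100, -100, -100, -100, 5, 6]
```

Earlier assistant turns still sit in the context, but they contribute no gradient, so whatever reasoning formatting they have never becomes a training target.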

r64a64 LoRA (rank 64, alpha 64), 1e-5 learning rate, 1 epoch, constant schedule with warmup. 35 hours on 2xRTX Pro 6000 Blackwell.
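A small sketch of what these hyperparameters imply, reading "r64a64" as rank 64 and alpha 64. It assumes standard LoRA scaling (alpha / r, not the rsLoRA variant) and a linear warmup; `WARMUP_STEPS` is a hypothetical value, since the warmup length is not stated here.

```python
LORA_RANK = 64     # "r64"
LORA_ALPHA = 64    # "a64"
BASE_LR = 1e-5
WARMUP_STEPS = 50  # hypothetical; the warmup length is not given in the card

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = LORA_ALPHA / LORA_RANK


def lr_at(step: int,
          base_lr: float = BASE_LR,
          warmup_steps: int = WARMUP_STEPS) -> float:
    """Constant-with-warmup: ramp linearly from 0 to base_lr, then hold it."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)


print(lora_scale)  # 1.0
```

With alpha equal to rank, the adapter update is applied at unit scale, so the learning rate is the main knob controlling how far the tune moves from the base model.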

allura-forge/musica-sft-v1-gemma4-pretok - pretokenized dataset.

CometML project - training graphs and stats.

AuriAetherwiing/G4-31B-Musica-v1-lora - LoRA adapter.

Recommended Samplers