franken-gemma-4-dense-1b: untrained
A Frankenstein-initialized Gemma 4 (dense) image/text model with ~1B parameters, assembled by weight transplant from Gemma 3 1B (text backbone) and Gemma 4 E2B-IT (vision tower, tokenizer, and processor).
Architecturally it mirrors google/gemma-4-31B-it (hybrid attention head_dim, no MoE, no PLE, no shared KV), but far smaller.
This is NOT a trained model.
It will not produce coherent text out of the box.
It is intended for testing fine-tuning frameworks and configurations (Axolotl, TRL, DeepSpeed, FSDP) at a 'pilot' scale.
It should train more easily than randomly initialized weights, though.
Architecture
| component | value |
|---|---|
| hidden_size | 1152 |
| intermediate_size | 6912 |
| num_hidden_layers | 18 (15 sliding + 3 full, pattern 5:1) |
| num_attention_heads | 4 |
| num_key_value_heads | 1 |
| head_dim (sliding) | 256 |
| head_dim (global) | 512 |
| sliding_window | 1024 |
| max_position_embeddings | 32768 |
| attention_k_eq_v | True (global layers) |
| final_logit_softcapping | 30.0 |
| vocab_size | 262148 (Gemma 4 tokenizer) |
Vision tower: hidden=768, 16 layers, head_dim=64 (copied from Gemma 4 E2B-IT)
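
The 5:1 pattern in the table can be expanded into a per-layer list. The Python sketch below assumes that each group of six layers ends with the full-attention layer, which is the arrangement Gemma 3 uses; the card states only the ratio, so the exact position is an assumption, and `layer_types` is a hypothetical helper name.

```python
from collections import Counter


def layer_types(num_layers: int = 18, period: int = 6) -> list[str]:
    """Return one attention type per decoder layer.

    Assumption: the last layer of every `period`-sized group uses full
    (global) attention; all other layers use sliding-window attention.
    """
    return [
        "full_attention" if (i + 1) % period == 0 else "sliding_attention"
        for i in range(num_layers)
    ]


types = layer_types()
counts = Counter(types)

# 15 sliding layers and 3 full layers, matching the table.
print(counts["sliding_attention"], counts["full_attention"])  # 15 3

# Zero-based indices of the full-attention layers under this assumption.
print([i for i, t in enumerate(types) if t == "full_attention"])  # [5, 11, 17]
```

Under this layout the sliding layers use head_dim 256 with a 1024-token window, and the three full layers use head_dim 512.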
Parameter counts by module:
| Layer (type) | Param # | Trainable |
|---|---|---|
| Gemma4TextScaledWordEmbedding | 301,989,888 | True |
| ModuleList | 490,237,440 | True |
| Gemma4RMSNorm | 1,152 | True |
| Gemma4TextRotaryEmbedding | -- | False |
| Gemma4TextModel | 792,228,480 | True |
| Gemma4VisionPatchEmbedder | 16,318,464 | True |
| Gemma4VisionEncoder | 151,046,144 | True |
| Gemma4VisionPooler | -- | False |
| Gemma4VisionModel | 167,364,608 | True |
| Linear | 884,736 | True |
| Gemma4RMSNorm | -- | False |
| Gemma4MultimodalEmbedder | 884,736 | True |
| Gemma4Model | 960,477,824 | True |
| Linear | 301,989,888 | True |
| Gemma4ForConditionalGeneration | 960,477,824 | True |
| Total params | 960,477,824 | |
| Trainable params | 960,477,824 | |
| Non-trainable params | -- | |
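
The decoder-stack figure (the `ModuleList` row, 490,237,440) can be reproduced from the architecture table. The sketch below is a back-of-the-envelope check, not the script used to build the model. It assumes bias-free linear layers, a gated MLP with three projections, four hidden-size RMSNorms per layer, and `q_norm`/`k_norm` weights of length head_dim; global layers are counted without a V projection because `attention_k_eq_v` is True. These per-layer details are assumptions that happen to match the reported totals.

```python
HIDDEN, INTERMEDIATE = 1152, 6912
N_HEADS, N_KV_HEADS = 4, 1


def decoder_layer_params(head_dim: int, has_v_proj: bool) -> int:
    """Parameter count of one decoder layer under the stated assumptions."""
    q = HIDDEN * N_HEADS * head_dim          # q_proj
    k = HIDDEN * N_KV_HEADS * head_dim       # k_proj
    v = k if has_v_proj else 0               # v_proj (absent when V reuses K)
    o = N_HEADS * head_dim * HIDDEN          # o_proj
    mlp = 3 * HIDDEN * INTERMEDIATE          # gate, up, down projections
    norms = 4 * HIDDEN + 2 * head_dim        # 4 layer norms + q_norm + k_norm
    return q + k + v + o + mlp + norms


sliding = decoder_layer_params(head_dim=256, has_v_proj=True)
full = decoder_layer_params(head_dim=512, has_v_proj=False)
decoder_stack = 15 * sliding + 3 * full

# Embedding and vision figures are taken from the summary, not derived here.
text_model = 301_989_888 + decoder_stack + HIDDEN   # embedding + layers + final norm
vision_model = 16_318_464 + 151_046_144             # patch embedder + encoder
projector = 768 * HIDDEN                            # vision hidden -> text hidden
total = text_model + vision_model + projector

print(f"{decoder_stack:,}")  # 490,237,440
print(f"{text_model:,}")     # 792,228,480
print(f"{total:,}")          # 960,477,824
```

The total does not count the final 301,989,888-parameter `Linear` row a second time, which is what one would expect if that layer is the output head and shares its weight with the input embedding.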
Frankenstein component inventory
| Component | Source | Method |
|---|---|---|
| Text embeddings | gemma-3-1b-it | Direct copy + 4 rows mean-resized for Gemma 4 special tokens |
| Text MLP weights | gemma-3-1b-it | Direct copy |
| Sliding-attention Q/K/V/O | gemma-3-1b-it | Direct copy |
| Global-attention Q/K | gemma-3-1b-it | Per-head tile (256 → 512) |
| Global-attention O | gemma-3-1b-it | Per-head split-halves (preserves O @ V = O_old @ V_old at init) |
| Global-attention V | --- | Dropped (attention_k_eq_v=True; V reuses K) |
| RMSNorm weights | gemma-3-1b-it | Convention-converted (1.0 + w) |
| q_norm / k_norm | gemma-3-1b-it | Rescaled by 1/√head_dim to compensate for Gemma 4's scaling=1.0 |
| Vision tower | gemma-4-e2b-it | Direct copy |
| embed_vision projection | --- | Fresh init (shape mismatch 768→1536 vs 768→1152) |
| Tokenizer + processor | gemma-4-e2b-it | Wholesale |
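
Two of the simpler rows above can be illustrated in a few lines of plain Python. This is a minimal sketch, not the actual transplant script. It assumes the source checkpoint stores an RMSNorm weight `w` and applies `1 + w`, while the target applies the stored weight directly, which is how the "Convention-converted (1.0 + w)" row reads. It also takes "mean-resized" in its simplest sense, where each new row is the mean of the existing rows. All function names are hypothetical.

```python
import math


def rms_normalize(x, eps=1e-6):
    """Divide a vector by its root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def rmsnorm_offset(x, w):
    """Assumed source convention: scale by (1 + w)."""
    return [n * (1.0 + wi) for n, wi in zip(rms_normalize(x), w)]


def rmsnorm_direct(x, w):
    """Assumed target convention: scale by w as stored."""
    return [n * wi for n, wi in zip(rms_normalize(x), w)]


def convert_norm_weight(w_old):
    """Fold the +1 offset into the stored weight."""
    return [1.0 + wi for wi in w_old]


def mean_resize(embedding, extra_rows):
    """Append `extra_rows` rows, each equal to the mean of the existing rows."""
    dim = len(embedding[0])
    mean_row = [sum(row[j] for row in embedding) / len(embedding) for j in range(dim)]
    return embedding + [list(mean_row) for _ in range(extra_rows)]


# The converted weight gives the same output under the direct convention.
x = [0.5, -1.0, 2.0, 0.25]
w_old = [0.1, -0.2, 0.0, 0.3]
old_out = rmsnorm_offset(x, w_old)
new_out = rmsnorm_direct(x, convert_norm_weight(w_old))
print(all(math.isclose(a, b) for a, b in zip(old_out, new_out)))  # True

# Toy 2-row embedding grown by 4 rows, as for the 4 added special tokens.
emb = [[1.0, 2.0], [3.0, 4.0]]
resized = mean_resize(emb, 4)
print(len(resized), resized[-1])  # 6 [2.0, 3.0]
```

Starting new token rows at the mean keeps them inside the distribution of existing embeddings, so the first fine-tuning steps do not see outlier activations from the added tokens.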
License
Gemma Terms of Use apply. This is a derivative of Gemma 3 1B and Gemma 4 E2B-IT weights. See https://ai.google.dev/gemma/terms
