franken-gemma-4-dense-1b: untrained

A frankenstein-init Gemma 4 (dense) image/text model with ~1B params, assembled by weight transplant from Gemma 3 1B (text backbone) and Gemma 4 E2B-IT (vision tower + tokenizer + processor).
Architecturally it mirrors google/gemma-4-31B-it (hybrid attention head-dim, no MoE, no PLE, no shared KV), but smol.

This is NOT a trained model.

It will not produce coherent text out of the box.
It is intended for testing fine-tuning frameworks/configurations (Axolotl, TRL, DeepSpeed, FSDP) at a 'pilot' scale.
It should train more easily than random weights, though.

Architecture

Component	Value
hidden_size	1152
intermediate_size	6912
num_hidden_layers	18 (15 sliding + 3 full, pattern 5:1)
num_attention_heads	4
num_key_value_heads	1
head_dim (sliding)	256
head_dim (global)	512
sliding_window	1024
max_position_embeddings	32768
attention_k_eq_v	True (global layers)
final_logit_softcapping	30.0
vocab_size	262148 (Gemma 4 tokenizer)

Vision tower: hidden=768, 16 layers, head_dim=64 (copied from Gemma 4 E2B-IT)
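
The 5:1 pattern in the table can be sketched as a per-layer list. This is a minimal illustration, not the model's config code: `layer_types` is a hypothetical helper, the label strings are only for readability, and placing the full-attention layer last in each group of six is an assumption carried over from how Gemma 3 arranges its layers.

```python
# Hypothetical helper for illustration; not taken from the model's config code.
def layer_types(num_layers: int = 18, period: int = 6) -> list[str]:
    """Assumed layout: every `period`-th layer is full (global) attention."""
    return [
        "full_attention" if (i + 1) % period == 0 else "sliding_attention"
        for i in range(num_layers)
    ]

# head_dim per layer type, as listed in the architecture table.
HEAD_DIM = {"sliding_attention": 256, "full_attention": 512}
NUM_HEADS = 4

types = layer_types()
# Width of the query projection output per layer: num_heads * head_dim.
q_widths = [NUM_HEADS * HEAD_DIM[t] for t in types]

print(types.count("sliding_attention"), "sliding,", types.count("full_attention"), "full")
print(sorted(set(q_widths)))
```

Under these assumptions the query projection is 1024 wide in sliding layers and 2048 wide in global layers, which is why the global-attention weights cannot be copied from Gemma 3 1B unchanged.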

Parameter counts by module:

=============================================================
Layer (type)                          Param #   Trainable
=============================================================
 Gemma4TextScaledWordEmbedding    301,989,888   True
 ModuleList                       490,237,440   True
 Gemma4RMSNorm                          1,152   True
 Gemma4TextRotaryEmbedding                 --   False
 Gemma4TextModel                  792,228,480   True
 Gemma4VisionPatchEmbedder         16,318,464   True
 Gemma4VisionEncoder              151,046,144   True
 Gemma4VisionPooler                        --   False
 Gemma4VisionModel                167,364,608   True
 Linear                               884,736   True
 Gemma4RMSNorm                             --   False
 Gemma4MultimodalEmbedder             884,736   True
 Gemma4Model                      960,477,824   True
 Linear                           301,989,888   True
Gemma4ForConditionalGeneration    960,477,824   True
=============================================================
Total params: 960,477,824
Trainable params: 960,477,824
Non-trainable params: --
=============================================================
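
The subtotals in the listing can be re-added as a quick sanity check. The sketch below only uses numbers from the listing and the architecture table; the dictionary keys are made-up labels. Two observations, neither stated in the card: the top-level total equals the Gemma4Model total even though a 301,989,888-param Linear is listed after it, which would be consistent with an output head that shares the embedding matrix, and the embedding count factors as 262,144 × 1,152 rather than 262,148 × 1,152.

```python
# Leaf-module counts copied from the listing; key names are illustrative only.
text = {"embed": 301_989_888, "layers": 490_237_440, "final_norm": 1_152}
vision = {"patch_embedder": 16_318_464, "encoder": 151_046_144}
embed_vision = 884_736  # Linear inside the multimodal embedder

text_total = sum(text.values())
vision_total = sum(vision.values())
model_total = text_total + vision_total + embed_vision

# Shapes implied by the counts (hidden_size=1152, vision hidden=768).
embed_rows = text["embed"] // 1152
proj_shape_ok = embed_vision == 768 * 1152

print(f"text={text_total:,} vision={vision_total:,} total={model_total:,}")
print(f"embedding rows implied by param count: {embed_rows:,}")
```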

Frankenstein component inventory

Component	Source	Method
Text embeddings	gemma-3-1b-it	Direct copy + 4 rows mean-resized for Gemma 4 special tokens
Text MLP weights	gemma-3-1b-it	Direct copy
Sliding-attention Q/K/V/O	gemma-3-1b-it	Direct copy
Global-attention Q/K	gemma-3-1b-it	Per-head tile (256 → 512)
Global-attention O	gemma-3-1b-it	Per-head split-halves (preserves O @ V = O_old @ V_old at init)
Global-attention V	---	Dropped (attention_k_eq_v=True; V reuses K)
RMSNorm weights	gemma-3-1b-it	Convention-converted (1.0 + w)
q_norm / k_norm	gemma-3-1b-it	Rescaled by 1/√head_dim to compensate for Gemma 4's scaling=1.0
Vision tower	gemma-4-e2b-it	Direct copy
embed_vision projection	---	Fresh init (shape mismatch 768→1536 vs 768→1152)
Tokenizer + processor	gemma-4-e2b-it	Wholesale
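
Three of the methods in the table come down to small algebraic identities, sketched here in plain Python on toy vectors. All function names are hypothetical and this is not the actual transplant script. Assumptions: the target RMSNorm multiplies by the weight directly (Gemma 3 multiplies by 1 + w), and "split-halves" is read as copying half of each O row into both halves of the widened head, which is only one possible reading. Since the card drops V on global layers, the last part only demonstrates the tile-plus-halve identity itself.

```python
import math

def mean_resize(rows, extra):
    """Append `extra` rows, each set to the column-wise mean of the existing rows."""
    mean = [sum(col) / len(rows) for col in zip(*rows)]
    return rows + [mean[:] for _ in range(extra)]

def rms_norm(x, eps=1e-6):
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]

def norm_one_plus_w(x, w):
    """Gemma 3 convention: y = rms_norm(x) * (1 + w)."""
    return [n * (1.0 + wi) for n, wi in zip(rms_norm(x), w)]

def norm_plain_w(x, w):
    """Assumed target convention: y = rms_norm(x) * w."""
    return [n * wi for n, wi in zip(rms_norm(x), w)]

def convert_norm_weight(w_old):
    """Fold the implicit +1 into the stored weight."""
    return [1.0 + wi for wi in w_old]

def tile(v):
    """Widen a per-head vector by repeating it (256 -> 512 in the real model)."""
    return v + v

def split_halves(o_rows):
    """Assumed reading: each O row becomes [row / 2, row / 2]."""
    return [[c / 2 for c in row] * 2 for row in o_rows]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# 1) Mean-resized embedding rows.
emb = mean_resize([[1.0, 2.0], [3.0, 4.0]], extra=4)

# 2) Norm conversion leaves the output unchanged.
x, w_old = [0.5, -1.0, 2.0, 0.25], [0.1, -0.2, 0.0, 0.3]
y_old = norm_one_plus_w(x, w_old)
y_new = norm_plain_w(x, convert_norm_weight(w_old))

# 3) Tiling the head vector and halving O preserves the projected output.
o_old, v_old = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]], [0.2, -0.4]
out_old = matvec(o_old, v_old)
out_new = matvec(split_halves(o_old), tile(v_old))

print(len(emb), emb[-1])
```

Zero-padding the second half of O instead of halving would satisfy the same identity; the card does not say which variant was used.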

License

Gemma Terms of Use apply. This is a derivative of Gemma 3 1B and Gemma 4 E2B-IT weights. See https://ai.google.dev/gemma/terms


URL: https://huggingface.co/pszemraj/franken-gemma-4-dense-1b-untrained
