
Cognica-PoE-v1.0-1.3B-stage-code

Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)

Not a production code model. This release is empirical validation of PoE's post-hoc specialist construction at 4-layer depth — paper §6.5 extended.

A 164 M-parameter code-domain SFT specialist stage trained directly on the frozen Cognica-PoE-v1.0-1.3B-base. The stage is a sibling of cognica/Cognica-PoE-v1.0-1.3B-stage-chat and cognica/Cognica-PoE-v1.0-1.3B-stage-math — all three branch from the same base.

What this artifact demonstrates:

Cleanest convergence of the three specialists. Val bpb descends monotonically from 2.55 at step 200 to 2.36 at step 2 000 (the last evaluation of the 2 131-step run) with no spikes. The code instruction data, mostly Magicoder-Evol-Instruct, is homogeneous and well curated, which makes the learning trajectory smoother than for chat or math.
Short training, coherent code. A 4-layer specialist (164 M trainable parameters) trained on 282 k curated code conversations reaches val bpb 2.36 in ~2 h on 4 × A100. The model emits recognizable Python idioms (def, return sum(lst), list comprehensions), though correctness on non-trivial logic is still limited by the 1.3 B backbone.
Single-node distributed training for specialist stages. This run used world_size=4 (single node) instead of the 8-GPU two-node setup used for the other stages. The delta-training flow handles this transparently, which is useful when only part of the infrastructure is free.

Caveat: this is still a 1.3 B model. Expect surface-level API correctness but regular logical bugs, wrong return types, and off-by-one errors. Do not deploy for real code generation.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "cognica/Cognica-PoE-v1.0-1.3B-stage-code"
# trust_remote_code is required: the repo ships its own model and tokenizer classes.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Special-token IDs; going by their names: BOS, user start, user end, assistant start.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762
q = "Write a Python function that returns the sum of a list."
ids = [BOS, USR_S] + tokenizer.encode(q) + [USR_E, ASS_S]
input_ids = torch.tensor([ids], device=model.device)

with torch.no_grad():
    out = model.generate(
        input_ids, max_new_tokens=160, do_sample=False,
        repetition_penalty=1.15, pad_token_id=BOS,
    )
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0, len(ids):]))

Use repetition_penalty >= 1.1 or sampling — greedy decoding loops at this model scale.
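
To make the effect of repetition_penalty concrete, here is a minimal pure-Python sketch of the usual rule (the one the transformers repetition-penalty logits processor applies): for each token already in the sequence, a positive logit is divided by the penalty and a negative one is multiplied by it. The helper names and the toy logits are illustrative assumptions, not code from this repo.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of `logits` with already-seen tokens made less likely.

    Positive logits are divided by `penalty` and negative ones multiplied,
    so the score always moves toward "less likely" when penalty > 1.
    """
    out = list(logits)
    for tok in set(seen_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def argmax(values):
    """Index of the largest value (what greedy decoding selects)."""
    return max(range(len(values)), key=values.__getitem__)


# Toy vocabulary of 4 tokens; token 2 was just generated and is still the top logit.
logits = [1.0, 1.9, 2.0, -0.5]
history = [2, 3]

greedy_choice = argmax(logits)  # token 2 again, which is how loops start
penalized = apply_repetition_penalty(logits, history, penalty=1.15)
penalized_choice = argmax(penalized)  # token 1 now outranks the repeated token
```

With a penalty of 1.15 the repeated token's logit drops from 2.0 to about 1.74, just enough for the runner-up to win. A penalty close to 1.0 would leave the loop in place.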

Architecture

Component	Detail
Parent	`cognica/Cognica-PoE-v1.0-1.3B-base` (PoE α=0.0, d24, step 26430, val bpb 0.7209)
New transformer layers	4 appended at positions 24–27 (d24 → d28)
Frozen layers	24 (all base layers)
Dual-head	Yes — additive specialist `lm_head_stage` (shape 32768 × 1536, zero-init at training start)
Final projection	`logits = lm_head_base(x) + lm_head_stage(x)`
Total params	1,597,726,798 (~1.60 B)
Trainable params at training	163,577,912 (~164 M, 10.6 %)
Shipped delta	28 tensors, 213,909,560 params, ~428 MB (bf16 safetensors)
VE pattern	Preserved from base — 12 value-embeds at layers [1, 3, …, 23]; new layers carry no VE
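
The additive final projection can be illustrated with a toy sketch. It uses plain Python lists and tiny dimensions in place of the real 32768 × 1536 heads, and the function names are assumptions for illustration; the actual forward pass lives in modeling_cognica_poe.py and may differ in detail. The point it shows is that, for a given final hidden state x, a zero-initialized lm_head_stage contributes nothing at the start of training, so the specialist head begins as a pure correction on top of lm_head_base.

```python
def linear(weight, x):
    """Matrix-vector product: weight is [vocab][hidden], x is [hidden]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def dual_head_logits(w_base, w_stage, x):
    """logits = lm_head_base(x) + lm_head_stage(x)."""
    return [b + s for b, s in zip(linear(w_base, x), linear(w_stage, x))]


# Toy sizes (vocab=3, hidden=2); the real heads are 32768 x 1536.
w_base = [[0.5, -1.0], [2.0, 0.25], [0.0, 1.0]]
w_stage_init = [[0.0, 0.0] for _ in w_base]  # zero-init specialist head
x = [1.0, 2.0]

base_logits = linear(w_base, x)
start_logits = dual_head_logits(w_base, w_stage_init, x)  # equals base_logits

# Made-up values standing in for a trained stage head: a small additive correction.
w_stage_trained = [[0.0, 0.5], [0.0, 0.0], [-1.0, 0.0]]
tuned_logits = dual_head_logits(w_base, w_stage_trained, x)
```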

Training


Setting	Value
Objective	Cross-entropy over assistant turns only (user turns and padding masked out)
Data	`sahil2801/CodeAlpaca-20k` × 3 epochs (20 k → ~60 k) + `ise-uiuc/Magicoder-Evol-Instruct-110K` × 2 epochs (111 k → ~222 k) → 282 k total
Held-out val	512 conversations (same shuffle, outside the train subset)
Case augmentation	First-user greetings duplicated with case variants → 281,920 → 283,249 conversations
Sequence length	2 048
Per-GPU batch	8 × 2 048
World size	4 (single node, 4 × A100 80 GB)
Tokens per step	65 536 (half the 8-GPU runs)
Steps	2 131 (≈ 140 M tokens)
tok/param	0.85 (164 M trainable)
Optimizer	MuonAdamW with per-group LR scaling
`lm_head_stage` LR	1.00 × 10⁻⁴, weight decay 0.1
AdamW master scaling	0.7071
Warmup / warmdown	5 % / 90 %
Eval / save cadence	every 200 / 200 steps
Best checkpoint shipped	step 2 000, val bpb 2.3610
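
The batch and schedule rows can be cross-checked with a few lines of arithmetic. The token counts follow directly from the table. The schedule function is only a sketch: the table gives the warmup and warmdown fractions but not the curve shape, so the linear ramp, flat plateau, and linear decay to zero used here are assumptions.

```python
SEQ_LEN = 2048
PER_GPU_BATCH = 8  # sequences per GPU
WORLD_SIZE = 4
STEPS = 2131
TRAINABLE_PARAMS = 163_577_912

tokens_per_step = PER_GPU_BATCH * SEQ_LEN * WORLD_SIZE  # 65,536
total_tokens = tokens_per_step * STEPS                  # 139,657,216, i.e. ~140 M
tokens_per_param = total_tokens / TRAINABLE_PARAMS      # ~0.85

WARMUP_FRAC, WARMDOWN_FRAC = 0.05, 0.90
warmup_steps = round(WARMUP_FRAC * STEPS)      # 107
warmdown_steps = round(WARMDOWN_FRAC * STEPS)  # 1918


def lr_multiplier(step):
    """ASSUMED shape: linear ramp up, flat plateau, linear decay to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step <= STEPS - warmdown_steps:
        return 1.0
    return (STEPS - step) / warmdown_steps
```

Under these assumptions the learning rate sits at its peak only between roughly step 107 and step 213; the remaining 90 % of the run is spent decaying.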

Validation bpb trajectory

Step	val bpb	Δ
200	2.5453	—
400	2.5447	-0.0006
600	2.5059	-0.0388
800	2.4594	-0.0465
1000	2.4299	-0.0295
1200	2.4055	-0.0244
1400	2.3872	-0.0183
1600	2.3749	-0.0123
1800	2.3662	-0.0087
2000	2.3610	-0.0052 ← best, shipped
2131	(final, no eval at this step)	—

Smooth monotonic descent with no spikes — the cleanest of the three sibling specialists.
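
The monotonicity claim is easy to verify from the table itself. This short check copies the values above, recomputes the Δ column, and confirms that every evaluation improved on the previous one; variable names are illustrative.

```python
val_bpb = {
    200: 2.5453, 400: 2.5447, 600: 2.5059, 800: 2.4594, 1000: 2.4299,
    1200: 2.4055, 1400: 2.3872, 1600: 2.3749, 1800: 2.3662, 2000: 2.3610,
}

steps = sorted(val_bpb)
# Difference between consecutive evaluations, rounded like the table's delta column.
deltas = [round(val_bpb[b] - val_bpb[a], 4) for a, b in zip(steps, steps[1:])]

is_monotonic = all(d < 0 for d in deltas)
best_step = min(val_bpb, key=val_bpb.get)
total_drop = round(val_bpb[steps[0]] - val_bpb[best_step], 4)
```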

Stacking further stages

base_model_name_or_path supports chaining. Point a new stage repo's config at this repo and the cascade loader will resolve base → stage-code → new stage transparently. The loader folds each ancestor's lm_head_stage into the effective lm_head_base at load time, so all specialist heads compose additively into the final projection.
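
A rough sketch of the two ideas in this paragraph, under stated assumptions: the config registry, repo keys, and helper names below are hypothetical stand-ins (only base_model_name_or_path is a field named in this README), and the weights are toy values. It shows how a parent chain can be resolved root first, and why folding an ancestor's stage head into the base head is equivalent to summing the separate heads: the projection is linear, so (W_base + W_stage) x = W_base x + W_stage x.

```python
# Hypothetical registry standing in for each repo's config.json.
CONFIGS = {
    "base": {"base_model_name_or_path": None},
    "stage-code": {"base_model_name_or_path": "base"},
    "new-stage": {"base_model_name_or_path": "stage-code"},
}


def resolve_chain(repo, configs=CONFIGS):
    """Walk parent pointers and return the load order, root first."""
    chain = []
    while repo is not None:
        if repo in chain:
            raise ValueError(f"cycle in stage chain at {repo!r}")
        chain.append(repo)
        repo = configs[repo]["base_model_name_or_path"]
    return chain[::-1]


def fold_heads(w_base, ancestor_stage_heads):
    """Add each ancestor's stage head into a copy of the base head, elementwise."""
    folded = [row[:] for row in w_base]
    for w_stage in ancestor_stage_heads:
        for r, row in enumerate(w_stage):
            for c, value in enumerate(row):
                folded[r][c] += value
    return folded


def linear(weight, x):
    """Matrix-vector product: weight is [vocab][hidden], x is [hidden]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


chain = resolve_chain("new-stage")

w_base = [[1.0, 0.0], [0.0, 1.0]]
w_code = [[0.5, 0.0], [0.0, -0.5]]  # stage-code's lm_head_stage (toy values)
w_new = [[0.0, 2.0], [1.0, 0.0]]    # the new stage's own head (toy values)
x = [2.0, 4.0]

effective_base = fold_heads(w_base, [w_code])
folded_logits = [a + b for a, b in zip(linear(effective_base, x), linear(w_new, x))]
unfolded_logits = [
    a + b + c
    for a, b, c in zip(linear(w_base, x), linear(w_code, x), linear(w_new, x))
]
```

Folding at load time means the forward pass always evaluates two heads, however long the ancestor chain is.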

Files

File	Purpose
`config.json`	Model + stage config
`delta.safetensors`	28-tensor stage delta (bf16, ~428 MB)
`modeling_cognica_poe.py`	Cascade loader + `_GPT` with dual-head forward
`configuration_cognica_poe.py`	`CognicaPoEConfig` with stage fields
`tokenization_cognica_poe.py`	Byte-level tokenizer (includes the numeric-token decode fix)
`tokenizer.pkl`, `tokenizer_config.json`, `special_tokens_map.json`, `token_bytes.pt`	Tokenizer assets
`convert_stage_delta.py`	Converts a nanochat `save_stage_delta` `.pt` file into `delta.safetensors`

Limitations — explicit list

Research preview at 1.3 B. Do not deploy for real code generation. Expect API name correctness and idiom pattern matching, but wrong logic, wrong return types, infinite loops, and silent bugs.
Greedy decoding loops. Use repetition_penalty >= 1.1 or sampling.
No RLHF / preference tuning / safety tuning.
Assistant-only loss — validation bpb is measured on assistant tokens only.
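
For readers comparing numbers across model cards, here is a minimal sketch of how a masked bits-per-byte figure can be computed. It assumes the common definition (summed cross-entropy in nats over the counted tokens, divided by ln 2 times the byte length of those tokens); this README does not spell out the exact formula, and the function name and toy values are illustrative.

```python
import math


def masked_bpb(token_nll_nats, token_num_bytes, loss_mask):
    """Bits per byte over the tokens where loss_mask is 1.

    token_nll_nats : per-token cross-entropy in nats
    token_num_bytes: byte length of each target token
    loss_mask      : 1 for assistant tokens, 0 for user tokens and padding
    """
    total_nats = sum(n * m for n, m in zip(token_nll_nats, loss_mask))
    total_bytes = sum(b * m for b, m in zip(token_num_bytes, loss_mask))
    if total_bytes == 0:
        raise ValueError("no unmasked tokens to score")
    return total_nats / (math.log(2) * total_bytes)


# Toy conversation: 2 user tokens, 3 assistant tokens, 1 padding token.
nll = [5.0, 4.0, math.log(2) * 4, math.log(2) * 2, math.log(2) * 6, 9.0]
nbytes = [3, 2, 2, 1, 3, 0]
mask = [0, 0, 1, 1, 1, 0]

bpb = masked_bpb(nll, nbytes, mask)  # 12 bits spread over 6 assistant bytes
```

Because user and padding tokens are excluded from both the numerator and the denominator, these values are not directly comparable to a bpb measured over all tokens, such as a pretraining-style figure.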

Citation

@article{jeong2026poe,
 title = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
 author = {Jeong, Jaepil},
 year = {2026},
 institution = {Cognica, Inc.},
 doi = {10.5281/zenodo.19547653},
 url = {https://doi.org/10.5281/zenodo.19547653}
}

@misc{cognica-poe-stage-code-2026,
 title = {Cognica-PoE-v1.0-1.3B-stage-code: Code-domain dual-head specialist (4-layer) over a PoE base (research preview)},
 author = {{Cognica, Inc.}},
 year = {2026},
 howpublished = {\url{https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-code}}
}

License

Apache 2.0 — see LICENSE and NOTICE. Same terms as the base model. Training datasets (CodeAlpaca-20k, Magicoder-Evol-Instruct-110K) each carry their own licenses and are acknowledged in NOTICE.
