Cognica-PoE-v1.0-1.3B-stage-code
Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)
Not a production code model. This release is empirical validation of PoE's post-hoc specialist construction at 4-layer depth — paper §6.5 extended.
A 164 M-parameter code-domain SFT specialist stage trained directly on the frozen Cognica-PoE-v1.0-1.3B-base. The stage is a sibling of cognica/Cognica-PoE-v1.0-1.3B-stage-chat and cognica/Cognica-PoE-v1.0-1.3B-stage-math — all three branch from the same base.
What this artifact demonstrates:
- Cleanest convergence of the three specialists. Val bpb descends monotonically from 2.55 → 2.36 across 2 131 steps with no spikes. Code instruction data (Magicoder-Evol-Instruct) is homogeneous and well-curated, which makes the learning trajectory smoother than chat or math.
- Short training, coherent code. 4-layer specialist (164 M trainable) + 282 k curated code conversations reaches val bpb 2.36 in ~2 h on 4 × A100. The model emits recognizable Python idioms (`def`, `return sum(lst)`, list comprehensions), though correctness on non-trivial logic is still limited by the 1.3 B backbone.
- Single-node distributed training for specialist stages. This run used `world_size=4` (single node) instead of the 8-GPU two-node setup the other stages used. The delta-training flow handles this transparently, which is useful when only partial infra is free.
Caveat: this is still a 1.3 B model. Expect surface-level API correctness but regular logical bugs, wrong return types, and off-by-one errors. Do not deploy for real code generation.
Quick start
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "cognica/Cognica-PoE-v1.0-1.3B-stage-code"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Special-token ids: BOS, user-start, user-end, assistant-start.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762

q = "Write a Python function that returns the sum of a list."
# Frame a single user turn and leave the assistant turn open for generation.
ids = [BOS, USR_S] + tokenizer.encode(q) + [USR_E, ASS_S]
input_ids = torch.tensor([ids], device=model.device)

with torch.no_grad():
    out = model.generate(
        input_ids, max_new_tokens=160, do_sample=False,
        repetition_penalty=1.15, pad_token_id=BOS,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0, len(ids):]))
Use repetition_penalty >= 1.1 or sampling — greedy decoding loops at this model scale.
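The prompt framing and the two decoding options mentioned above can be factored into small helpers. This is a minimal sketch, not part of the released code: `build_prompt_ids` and `decoding_kwargs` are hypothetical names, the special-token ids are copied from the quick start, and the `temperature` and `top_p` values are illustrative assumptions rather than recommended settings.

```python
# Special-token ids copied from the quick start:
# BOS, user-start, user-end, assistant-start.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762


def build_prompt_ids(question_ids):
    """Frame one user turn and leave the assistant turn open (hypothetical helper)."""
    return [BOS, USR_S] + list(question_ids) + [USR_E, ASS_S]


def decoding_kwargs(sample=False):
    """Return keyword arguments for model.generate() (hypothetical helper)."""
    common = {"max_new_tokens": 160, "pad_token_id": BOS}
    if sample:
        # Sampling path; temperature and top_p are illustrative assumptions.
        return {**common, "do_sample": True, "temperature": 0.7, "top_p": 0.95}
    # Greedy path with the repetition penalty used in the quick start.
    return {**common, "do_sample": False, "repetition_penalty": 1.15}


# Placeholder ids stand in for tokenizer.encode(q); they are not real tokens.
ids = build_prompt_ids([10, 11, 12])
```

With a loaded model, the result would be passed as `model.generate(input_ids, **decoding_kwargs(sample=True))`.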
Architecture
| Component | Detail |
|---|---|
| Parent | cognica/Cognica-PoE-v1.0-1.3B-base (PoE α=0.0, d24, step 26430, val bpb 0.7209) |
| New transformer layers | 4 appended at positions 24–27 (d24 → d28) |
| Frozen layers | 24 (all base layers) |
| Dual-head | Yes — additive specialist lm_head_stage (shape 32768 × 1536, zero-init at training start) |
| Final projection | logits = lm_head_base(x) + lm_head_stage(x) |
| Total params | 1,597,726,798 (~1.60 B) |
| Trainable params at training | 163,577,912 (~164 M, 10.6 %) |
| Shipped delta | 28 tensors, 213,909,560 params, ~428 MB (bf16 safetensors) |
| VE pattern | Preserved from base — 12 value-embeds at layers [1, 3, …, 23]; new layers carry no VE |
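
The dual-head projection in the table can be illustrated with a toy example. This is a sketch under stated assumptions, not the code in `modeling_cognica_poe.py`: it uses plain Python lists, a 3-token vocabulary and a hidden size of 2 instead of 32768 × 1536, and the helper names are hypothetical. It shows why a zero-initialized `lm_head_stage` leaves the base logits unchanged at the start of training.

```python
def matvec(weight, x):
    """Multiply a (vocab x hidden) weight matrix by a hidden vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def dual_head_logits(x, w_base, w_stage):
    """logits = lm_head_base(x) + lm_head_stage(x), element-wise over the vocabulary."""
    base = matvec(w_base, x)
    stage = matvec(w_stage, x)
    return [b + s for b, s in zip(base, stage)]


# Toy sizes (assumption): vocabulary of 3, hidden size of 2.
w_base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_stage_init = [[0.0, 0.0] for _ in w_base]           # zero-init at training start
w_stage_trained = [[0.5, 0.0], [0.0, 0.0], [0.0, -1.0]]  # made-up trained values

x = [2.0, 3.0]
logits_at_init = dual_head_logits(x, w_base, w_stage_init)
logits_after = dual_head_logits(x, w_base, w_stage_trained)
```

At initialization the stage head contributes nothing, so the stage starts from the frozen base model's predictions and only moves away from them as `lm_head_stage` is trained.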
Training
| Setting | Value |
|---|---|
| Objective | Cross-entropy over assistant turns only (user turns and padding masked out) |
| Data | sahil2801/CodeAlpaca-20k × 3 epochs (20 k → ~60 k) + ise-uiuc/Magicoder-Evol-Instruct-110K × 2 epochs (111 k → ~222 k) → 282 k total |
| Held-out val | 512 conversations (same shuffle, outside the train subset) |
| Case augmentation | First-user greetings duplicated with case variants → 281,920 → 283,249 conversations |
| Sequence length | 2 048 |
| Per-GPU batch | 8 × 2 048 |
| World size | 4 (single node, 4 × A100 80 GB) |
| Tokens per step | 65 536 (half the 8-GPU runs) |
| Steps | 2 131 (≈ 140 M tokens) |
| tok/param | 0.85 (164 M trainable) |
| Optimizer | MuonAdamW with per-group LR scaling |
| `lm_head_stage` LR | 1.00 × 10⁻⁴, weight decay 0.1 |
| AdamW master scaling | 0.7071 |
| Warmup / warmdown | 5 % / 90 % |
| Eval / save cadence | every 200 / 200 steps |
| Best checkpoint shipped | step 2 000, val bpb 2.3610 |
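
The token budget in the table follows from the batch settings. The arithmetic below is a quick check, assuming one optimizer step consumes exactly one per-GPU batch on each rank with no gradient accumulation (which matches the listed 65 536 tokens per step); the variable names are illustrative.

```python
# Values taken from the Training and Architecture tables.
seq_len = 2048
per_gpu_batch = 8          # sequences per GPU per step
world_size = 4             # single node, 4 GPUs
steps = 2131
trainable_params = 163_577_912

# Assumption: no gradient accumulation, one batch per rank per step.
tokens_per_step = per_gpu_batch * seq_len * world_size
total_tokens = tokens_per_step * steps
tokens_per_param = total_tokens / trainable_params

print(f"tokens/step  = {tokens_per_step:,}")
print(f"total tokens = {total_tokens:,}")
print(f"tok/param    = {tokens_per_param:.2f}")
```

Doubling `world_size` to 8 with the same per-GPU batch gives 131 072 tokens per step, which is why this run is listed as half the 8-GPU runs.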
Validation bpb trajectory
| Step | val bpb | Δ |
|---|---|---|
| 200 | 2.5453 | — |
| 400 | 2.5447 | -0.0006 |
| 600 | 2.5059 | -0.0388 |
| 800 | 2.4594 | -0.0465 |
| 1000 | 2.4299 | -0.0295 |
| 1200 | 2.4055 | -0.0244 |
| 1400 | 2.3872 | -0.0183 |
| 1600 | 2.3749 | -0.0123 |
| 1800 | 2.3662 | -0.0087 |
| 2000 | 2.3610 | -0.0052 ← best, shipped |
| 2131 | (final, no eval at this step) | — |
Smooth monotonic descent with no spikes — the cleanest of the three sibling specialists.
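
The Δ column and the monotonicity claim can be recomputed from the table. This is a small check script using only the values listed above; the variable names are illustrative.

```python
# Validation bpb at each evaluated step, copied from the table.
val_bpb = {
    200: 2.5453, 400: 2.5447, 600: 2.5059, 800: 2.4594, 1000: 2.4299,
    1200: 2.4055, 1400: 2.3872, 1600: 2.3749, 1800: 2.3662, 2000: 2.3610,
}

steps = sorted(val_bpb)
# Change relative to the previous evaluation, rounded to the table's precision.
deltas = {
    cur: round(val_bpb[cur] - val_bpb[prev], 4)
    for prev, cur in zip(steps, steps[1:])
}

# Monotonic descent means every evaluation improves on the previous one.
is_monotonic = all(d < 0 for d in deltas.values())
best_step = min(val_bpb, key=val_bpb.get)

for step in steps[1:]:
    print(f"{step:>5}  {val_bpb[step]:.4f}  {deltas[step]:+.4f}")
```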
Stacking further stages
base_model_name_or_path supports chaining. Point a new stage repo's config at this repo and the cascade loader will resolve base → stage-code → new stage transparently. The loader folds each ancestor's lm_head_stage into the effective lm_head_base at load time, so all specialist heads compose additively into the final projection.
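
The chaining behavior described above can be sketched in two steps: walk `base_model_name_or_path` back to the root, then add each ancestor's stage head into the effective base head. This is an illustration under assumptions, not the loader in `modeling_cognica_poe.py`: `resolve_chain` and `fold_heads` are hypothetical names, `example/new-stage` is a made-up repo, the base config is assumed to carry no `base_model_name_or_path`, and the weights are toy 2 × 2 lists.

```python
def resolve_chain(repo, configs):
    """Follow base_model_name_or_path links and return repos ordered root first."""
    chain = []
    while repo is not None:
        chain.append(repo)
        repo = configs[repo].get("base_model_name_or_path")
    return chain[::-1]


def fold_heads(w_base, ancestor_stage_heads):
    """Add every ancestor lm_head_stage into a copy of lm_head_base."""
    effective = [row[:] for row in w_base]
    for w_stage in ancestor_stage_heads:
        for i, row in enumerate(w_stage):
            for j, value in enumerate(row):
                effective[i][j] += value
    return effective


# Toy config registry; only the linking field is modeled.
configs = {
    "cognica/Cognica-PoE-v1.0-1.3B-base": {},
    "cognica/Cognica-PoE-v1.0-1.3B-stage-code": {
        "base_model_name_or_path": "cognica/Cognica-PoE-v1.0-1.3B-base",
    },
    "example/new-stage": {  # hypothetical new stage repo
        "base_model_name_or_path": "cognica/Cognica-PoE-v1.0-1.3B-stage-code",
    },
}

chain = resolve_chain("example/new-stage", configs)

# Toy weights: the stage-code head is folded into the base head for the new stage.
w_base = [[1.0, 2.0], [3.0, 4.0]]
w_stage_code = [[0.5, 0.0], [0.0, -1.0]]
effective_base = fold_heads(w_base, [w_stage_code])
```

Because the heads combine by addition, folding ancestors into the base head gives the same logits as evaluating every head separately and summing the results.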
Files
| File | Purpose |
|---|---|
| `config.json` | Model + stage config |
| `delta.safetensors` | 28-tensor stage delta (bf16, ~428 MB) |
| `modeling_cognica_poe.py` | Cascade loader + `_GPT` with dual-head forward |
| `configuration_cognica_poe.py` | `CognicaPoEConfig` with stage fields |
| `tokenization_cognica_poe.py` | Byte-level tokenizer (includes the numeric-token decode fix) |
| `tokenizer.pkl`, `tokenizer_config.json`, `special_tokens_map.json`, `token_bytes.pt` | Tokenizer assets |
| `convert_stage_delta.py` | Converts a nanochat `save_stage_delta` `.pt` file into `delta.safetensors` |
Limitations — explicit list
- Research preview at 1.3 B. Do not deploy for real code generation. Expect API name correctness and idiom pattern matching, but wrong logic, wrong return types, infinite loops, and silent bugs.
- Greedy decoding loops. Use `repetition_penalty >= 1.1` or sampling.
- No RLHF / preference tuning / safety tuning.
- Assistant-only loss — validation bpb is measured on assistant tokens only.
Citation
@article{jeong2026poe,
title = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
author = {Jeong, Jaepil},
year = {2026},
institution = {Cognica, Inc.},
doi = {10.5281/zenodo.19547653},
url = {https://doi.org/10.5281/zenodo.19547653}
}
@misc{cognica-poe-stage-code-2026,
title = {Cognica-PoE-v1.0-1.3B-stage-code: Code-domain dual-head specialist (4-layer) over a PoE base (research preview)},
author = {{Cognica, Inc.}},
year = {2026},
howpublished = {\url{https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-code}}
}
License
Apache 2.0 — see LICENSE and NOTICE. Same terms as the base model. Training datasets (CodeAlpaca-20k, Magicoder-Evol-Instruct-110K) each carry their own licenses and are acknowledged in NOTICE.