URL: https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-code

Cognica-PoE-v1.0-1.3B-stage-code

Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)

Not a production code model. This release is an empirical validation of PoE's post-hoc specialist construction at 4-layer depth, extending paper §6.5.

A 164 M-parameter code-domain SFT specialist stage trained directly on the frozen Cognica-PoE-v1.0-1.3B-base. The stage is a sibling of cognica/Cognica-PoE-v1.0-1.3B-stage-chat and cognica/Cognica-PoE-v1.0-1.3B-stage-math — all three branch from the same base.

What this artifact demonstrates:

  1. Cleanest convergence of the three specialists. Val bpb descends monotonically from 2.55 → 2.36 across 2 131 steps with no spikes. Code instruction data (Magicoder-Evol-Instruct) is homogeneous and well-curated, which makes the learning trajectory smoother than chat or math.
  2. Short training, coherent code. A 4-layer specialist (164 M trainable) trained on 282 k curated code conversations reaches val bpb 2.36 in ~2 h on 4 × A100. The model emits recognizable Python idioms (def, return sum(lst), list comprehensions), though correctness on non-trivial logic is still limited by the 1.3 B backbone.
  3. Single-node distributed training for specialist stages. This run used world_size=4 (single node) instead of the 8-GPU two-node setup the other stages used. The delta-training flow handles this transparently — useful when only partial infra is free.

Caveat: this is still a 1.3 B model. Expect surface-level API correctness but regular logical bugs, wrong return types, and off-by-one errors. Do not deploy for real code generation.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "cognica/Cognica-PoE-v1.0-1.3B-stage-code"
# trust_remote_code is needed: the repo ships its own modeling and tokenizer code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Special-token ids for BOS, user-turn start/end, and assistant-turn start.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762
q = "Write a Python function that returns the sum of a list."
ids = [BOS, USR_S] + tokenizer.encode(q) + [USR_E, ASS_S]
input_ids = torch.tensor([ids], device=model.device)

with torch.no_grad():
    out = model.generate(
        input_ids, max_new_tokens=160, do_sample=False,
        repetition_penalty=1.15, pad_token_id=BOS,
    )
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0, len(ids):]))

Use repetition_penalty >= 1.1 or sampling — greedy decoding loops at this model scale.
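
The single-turn prompt layout from the quick start can be wrapped in a small helper. This is a minimal sketch: `build_prompt_ids`, its `encode` parameter, and `fake_encode` are hypothetical names that are not part of the released repository, and the four special-token ids are copied from the quick-start snippet. It assumes one user turn and that the encoder adds no special tokens of its own.

```python
from typing import Callable, List

# Special-token ids as used in the quick-start snippet.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762


def build_prompt_ids(encode: Callable[[str], List[int]], question: str) -> List[int]:
    """Wrap one user question in the single-turn layout:
    BOS, user-start, <question tokens>, user-end, assistant-start.

    `encode` is any callable mapping text to token ids, e.g. `tokenizer.encode`.
    (Hypothetical helper; not part of the released repository.)
    """
    return [BOS, USR_S] + list(encode(question)) + [USR_E, ASS_S]


# Stand-in encoder for demonstration only: one id per UTF-8 byte.
def fake_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


ids = build_prompt_ids(fake_encode, "hi")
print(ids)  # [32759, 32760, 104, 105, 32761, 32762]
```

With the real tokenizer, passing `tokenizer.encode` in place of `fake_encode` should reproduce the `ids` list built in the quick start.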

Architecture

Component | Detail
--- | ---
Parent | cognica/Cognica-PoE-v1.0-1.3B-base (PoE α=0.0, d24, step 26430, val bpb 0.7209)
New transformer layers | 4 appended at positions 24–27 (d24 → d28)
Frozen layers | 24 (all base layers)
Dual-head | Yes: additive specialist lm_head_stage (shape 32768 × 1536, zero-init at training start)
Final projection | logits = lm_head_base(x) + lm_head_stage(x)
Total params | 1,597,726,798 (~1.60 B)
Trainable params at training | 163,577,912 (~164 M, 10.6 %)
Shipped delta | 28 tensors, 213,909,560 params, ~428 MB (bf16 safetensors)
VE pattern | Preserved from base: 12 value-embeds at layers [1, 3, …, 23]; new layers carry no VE
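
The delta figures in the table can be cross-checked with plain arithmetic. The sketch below uses only numbers from the table; the variable names are illustrative. It shows that the gap between the shipped delta and the trainable count equals exactly one 32768 × 1536 matrix (the shape given for lm_head_stage). Which tensor actually accounts for that gap is not stated in the card, so treat that reading as an assumption.

```python
# Figures copied from the architecture table.
VOCAB, WIDTH = 32768, 1536
trainable_params = 163_577_912
shipped_delta_params = 213_909_560
BYTES_PER_BF16 = 2

head_matrix_params = VOCAB * WIDTH  # one vocabulary-by-width matrix
extra_in_delta = shipped_delta_params - trainable_params
delta_megabytes = shipped_delta_params * BYTES_PER_BF16 / 1e6

print(head_matrix_params)      # 50331648
print(extra_in_delta)          # 50331648
print(round(delta_megabytes))  # 428
```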

Training

Setting | Value
--- | ---
Objective | Cross-entropy over assistant turns only (user turns and padding masked out)
Data | sahil2801/CodeAlpaca-20k × 3 epochs (20 k → ~60 k) + ise-uiuc/Magicoder-Evol-Instruct-110K × 2 epochs (111 k → ~222 k) → 282 k total
Held-out val | 512 conversations (same shuffle, outside the train subset)
Case augmentation | First-user greetings duplicated with case variants → 281,920 → 283,249 conversations
Sequence length | 2 048
Per-GPU batch | 8 × 2 048
World size | 4 (single node, 4 × A100 80 GB)
Tokens per step | 65 536 (half the 8-GPU runs)
Steps | 2 131 (≈ 140 M tokens)
tok/param | 0.85 (164 M trainable)
Optimizer | MuonAdamW with per-group LR scaling
lm_head_stage LR | 1.00 × 10⁻⁴, weight decay 0.1
AdamW master scaling | 0.7071
Warmup / warmdown | 5 % / 90 %
Eval / save cadence | every 200 / 200 steps
Best checkpoint shipped | step 2 000, val bpb 2.3610
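
The token budget follows from the batch settings above. The sketch below reproduces the tokens-per-step, total-token, and tok/param figures, assuming no gradient accumulation (one batch per GPU per step), which is consistent with the stated 65 536 tokens per step. The trainable-parameter count is taken from the architecture table; the variable names are illustrative.

```python
seq_len = 2048
per_gpu_batch = 8        # sequences per GPU per step
world_size = 4
steps = 2131
trainable_params = 163_577_912

tokens_per_step = per_gpu_batch * seq_len * world_size
total_tokens = tokens_per_step * steps
tokens_per_param = total_tokens / trainable_params

print(tokens_per_step)             # 65536
print(total_tokens)                # 139657216
print(round(tokens_per_param, 2))  # 0.85
```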

Validation bpb trajectory

Step | val bpb | Δ | Note
--- | --- | --- | ---
200 | 2.5453 | |
400 | 2.5447 | -0.0006 |
600 | 2.5059 | -0.0388 |
800 | 2.4594 | -0.0465 |
1000 | 2.4299 | -0.0295 |
1200 | 2.4055 | -0.0244 |
1400 | 2.3872 | -0.0183 |
1600 | 2.3749 | -0.0123 |
1800 | 2.3662 | -0.0087 |
2000 | 2.3610 | -0.0052 | best, shipped
2131 | | | final, no eval at this step

Smooth monotonic descent with no spikes — the cleanest of the three sibling specialists.
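
The monotonic-descent claim can be checked directly from the logged values. The sketch below recomputes the step-to-step deltas and picks the best checkpoint; the data are copied from the trajectory table and the variable names are illustrative.

```python
# (step, val bpb) pairs copied from the trajectory table.
trajectory = [
    (200, 2.5453), (400, 2.5447), (600, 2.5059), (800, 2.4594),
    (1000, 2.4299), (1200, 2.4055), (1400, 2.3872), (1600, 2.3749),
    (1800, 2.3662), (2000, 2.3610),
]

# Difference between each evaluation and the previous one.
deltas = [round(b - a, 4) for (_, a), (_, b) in zip(trajectory, trajectory[1:])]
is_monotonic = all(d < 0 for d in deltas)
best_step, best_bpb = min(trajectory, key=lambda p: p[1])

print(deltas)
print(is_monotonic)         # True
print(best_step, best_bpb)  # 2000 2.361
```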

Stacking further stages

base_model_name_or_path supports chaining. Point a new stage repo's config at this repo and the cascade loader will resolve base → stage-code → new stage transparently. The loader folds each ancestor's lm_head_stage into the effective lm_head_base at load time, so all specialist heads compose additively into the final projection.
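
Folding works because the final projection is a sum of linear maps: adding an ancestor's stage weights into the base weights gives the same logits as keeping the two heads separate. The toy sketch below illustrates this with 3 × 2 matrices. It assumes bias-free heads and treats the final hidden state `x` as given; all names are hypothetical and this is not the loader's actual code.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Apply a bias-free linear head: one logit per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def add_matrices(a: Matrix, b: Matrix) -> Matrix:
    """Elementwise sum of two same-shaped weight matrices."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: vocabulary of 3, hidden width of 2 (the real heads are 32768 x 1536).
w_base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_stage_code = [[0.5, 0.0], [0.0, -0.5], [0.25, 0.25]]
w_stage_new = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # zero-init, as at training start
x = [2.0, 4.0]

# Dual-head forward for a single stage: sum of two projections.
dual_head = [b + s for b, s in zip(matvec(w_base, x), matvec(w_stage_code, x))]

# Chained load: fold the ancestor's stage head into the effective base head,
# then add the new stage's own head on top.
w_effective_base = add_matrices(w_base, w_stage_code)
folded = [b + s for b, s in zip(matvec(w_effective_base, x), matvec(w_stage_new, x))]

print(dual_head)  # [3.0, 2.0, 7.5]
print(folded)     # [3.0, 2.0, 7.5]
```

Because the new stage's head starts at zero, a freshly chained stage begins with exactly the logits of its parent's composed heads for the same hidden state.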

Files

File | Purpose
--- | ---
config.json | Model + stage config
delta.safetensors | 28-tensor stage delta (bf16, ~428 MB)
modeling_cognica_poe.py | Cascade loader + _GPT with dual-head forward
configuration_cognica_poe.py | CognicaPoEConfig with stage fields
tokenization_cognica_poe.py | Byte-level tokenizer (includes the numeric-token decode fix)
tokenizer.pkl, tokenizer_config.json, special_tokens_map.json, token_bytes.pt | Tokenizer assets
convert_stage_delta.py | Converts a nanochat save_stage_delta .pt file into delta.safetensors

Limitations — explicit list

  • Research preview at 1.3 B. Do not deploy for real code generation. Expect API name correctness and idiom pattern matching, but wrong logic, wrong return types, infinite loops, and silent bugs.
  • Greedy decoding loops. Use repetition_penalty >= 1.1 or sampling.
  • No RLHF / preference tuning / safety tuning.
  • Assistant-only loss — validation bpb is measured on assistant tokens only.
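
To make the assistant-only metric concrete, here is a minimal sketch of a masked bits-per-byte computation. The card does not give the exact formula, so this assumes the common definition: total cross-entropy in nats over the scored tokens, divided by ln 2 times the total UTF-8 byte length of those tokens. `masked_bpb` and its arguments are hypothetical names.

```python
import math
from typing import Sequence


def masked_bpb(
    token_nll_nats: Sequence[float],
    token_num_bytes: Sequence[int],
    is_assistant: Sequence[bool],
) -> float:
    """Bits per byte over assistant tokens only.

    token_nll_nats[i]  : cross-entropy of token i, in nats
    token_num_bytes[i] : UTF-8 byte length of token i
    is_assistant[i]    : True if token i belongs to an assistant turn
    """
    total_nats = 0.0
    total_bytes = 0
    for nll, n_bytes, keep in zip(token_nll_nats, token_num_bytes, is_assistant):
        if keep:  # user turns and padding contribute nothing
            total_nats += nll
            total_bytes += n_bytes
    if total_bytes == 0:
        raise ValueError("no assistant bytes to score")
    return total_nats / (math.log(2) * total_bytes)


# Toy example: two user tokens (masked) followed by two assistant tokens
# that cost 4 and 2 bits over 2 and 1 bytes respectively.
nll = [5.0, 5.0, math.log(2) * 4, math.log(2) * 2]
n_bytes = [3, 2, 2, 1]
mask = [False, False, True, True]
print(masked_bpb(nll, n_bytes, mask))  # approximately 2.0
```

Since user tokens are excluded from both numerator and denominator, values computed this way are not directly comparable to a bpb measured over all tokens.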

Citation

@article{jeong2026poe,
 title = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
 author = {Jeong, Jaepil},
 year = {2026},
 institution = {Cognica, Inc.},
 doi = {10.5281/zenodo.19547653},
 url = {https://doi.org/10.5281/zenodo.19547653}
}

@misc{cognica-poe-stage-code-2026,
 title = {Cognica-PoE-v1.0-1.3B-stage-code: Code-domain dual-head specialist (4-layer) over a PoE base (research preview)},
 author = {{Cognica, Inc.}},
 year = {2026},
 howpublished = {\url{https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-code}}
}

License

Apache 2.0 — see LICENSE and NOTICE. Same terms as the base model. Training datasets (CodeAlpaca-20k, Magicoder-Evol-Instruct-110K) each carry their own licenses and are acknowledged in NOTICE.
