Cognica-PoE-v1.0-1.3B-stage-math
Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)
Not a production math model. This release is empirical validation of PoE's post-hoc specialist construction at 4-layer depth — paper §6.5 extended.
A 164 M-parameter math-domain SFT specialist stage trained directly on the frozen Cognica-PoE-v1.0-1.3B-base. The stage is a sibling of cognica/Cognica-PoE-v1.0-1.3B-stage-chat — both branch from the same base, not from each other. Use the cascade loader with this repo (or compose both stage heads additively in a custom inference script; see the paper).
At 1.3 B base + 164 M specialist, arithmetic answers are still often wrong; the specialist's contribution is in format and reasoning chain, not exact numeric computation. Emergent arithmetic is typically reported in the literature only at 7 B+ parameters, and a specialist head cannot move that scaling ceiling.
What this artifact does demonstrate:
- A deeper (4-layer) specialist trains cleanly. Val bpb falls from 2.78 to 2.41 across 2 267 steps with no divergence, improving at every eval except one transient blip at step 1 200 (2.691 → 2.716) that self-corrects by step 1 400.
- Dataset composition for math. Mixing GSM8K main/train (×20 epochs) with MathInstruct (×4 epochs) gives a balanced reasoning-style corpus at ~297 M tokens total.
- Additive composition at inference. The new cascade loader (v1.0.1+) folds ancestor stages' lm_head_stage into the effective lm_head_base when chaining, so stage-chat → stage-math composition produces correct summed logits.
Quick start
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "cognica/Cognica-PoE-v1.0-1.3B-stage-math"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()
# Chat-format prompt (the SFT data was formatted with SmolTalk special tokens):
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762
q = "Sarah has 15 candies. She gives 3 to each of her 4 friends. How many candies does she have left?"
ids = [BOS, USR_S] + tokenizer.encode(q) + [USR_E, ASS_S]
input_ids = torch.tensor([ids], device=model.device)
with torch.no_grad():
    out = model.generate(
        input_ids, max_new_tokens=120, do_sample=False,
        repetition_penalty=1.15, pad_token_id=BOS,
    )
print(tokenizer.decode(out[0, len(ids):]))
repetition_penalty=1.15 or sampling (do_sample=True, temperature=0.7, top_p=0.9) is recommended — greedy decoding loops on most prompts at this model scale.
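
The prompt framing and the sampling alternative mentioned above can be sketched as two small pieces. This is a minimal sketch, not the repo's own helper: `build_chat_ids` and `SAMPLING_KWARGS` are hypothetical names introduced here, while the token ids and sampling values are the ones given in this card.

```python
# Special-token ids as listed in the quick start.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762


def build_chat_ids(question_ids):
    """Wrap already-tokenized question ids in the single-turn chat frame."""
    return [BOS, USR_S] + list(question_ids) + [USR_E, ASS_S]


# Sampling settings recommended in this card, as keyword arguments
# for model.generate() (hypothetical constant name).
SAMPLING_KWARGS = dict(
    max_new_tokens=120,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=BOS,
)

# Usage, assuming `model`, `tokenizer`, and `q` exist as in the quick start:
#   ids = build_chat_ids(tokenizer.encode(q))
#   input_ids = torch.tensor([ids], device=model.device)
#   out = model.generate(input_ids, **SAMPLING_KWARGS)
```

Keeping the frame in one function makes it harder to drop the trailing assistant-start token, which is what cues the model to begin its answer.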
Architecture
| Component | Detail |
|---|---|
| Parent | cognica/Cognica-PoE-v1.0-1.3B-base (PoE α=0.0, d24, step 26430, val bpb 0.7209) |
| New transformer layers | 4 appended at positions 24–27 (d24 → d28) |
| Frozen layers | 24 (all base layers) |
| Dual-head | Yes — additive specialist lm_head_stage (shape 32768 × 1536, zero-init at training start) |
| Final projection | logits = lm_head_base(x) + lm_head_stage(x) |
| Total params | 1,597,726,798 (~1.60 B) |
| Trainable params at training | 163,577,912 (~164 M, 10.6 %) |
| Shipped delta | 28 tensors, 213,909,560 params, ~428 MB (bf16 safetensors) |
| VE pattern | Preserved from base — 12 value-embeds at layers [1, 3, …, 23]; new layers carry no VE |
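
The size figures in the table can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers above; the variable names are illustrative, and the on-disk file will be slightly larger than the raw tensor bytes because of the safetensors header.

```python
# Shape of lm_head_stage as listed in the table (rows x columns).
head_rows, head_cols = 32768, 1536
head_params = head_rows * head_cols
print(head_params)  # 50331648

# Shipped delta: parameter count from the table, stored as bf16 (2 bytes each).
delta_params = 213_909_560
delta_bytes = delta_params * 2
print(round(delta_bytes / 1e6, 1))  # 427.8 (decimal MB), i.e. ~428 MB
```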
Training
| Setting | Value |
|---|---|
| Objective | Cross-entropy over assistant turns only (user turns and padding masked out) |
| Data | openai/gsm8k main/train × 20 epochs (7 473 → ~149 k) + TIGER-Lab/MathInstruct × 4 epochs (262 k → ~1.05 M) → 1.20 M total |
| Held-out val | 512 conversations (same shuffle, outside the train subset) |
| Case augmentation | First-user greetings duplicated with case variants → 1,197,104 → 1,198,173 conversations |
| Sequence length | 2 048 |
| Per-GPU batch | 8 × 2 048 |
| World size | 8 (2 nodes × 4 × A100 80 GB) |
| Tokens per step | 131 072 |
| Steps | 2 267 (≈ 297 M tokens) |
| tok/param | 1.82 (164 M trainable) |
| Optimizer | MuonAdamW with per-group LR scaling |
| lm_head_stage LR | 1.00 × 10⁻⁴, weight decay 0.1 |
| AdamW master scaling | 0.7071 |
| Warmup / warmdown | 5 % / 90 % |
| Eval / save cadence | every 200 / 200 steps |
| Best checkpoint shipped | step 2 200, val bpb 2.4118 |
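
The token budget in the table follows directly from the batch geometry. The sketch below recomputes it from the listed values; the variable names are illustrative and nothing here comes from the training code itself.

```python
# Values copied from the training table.
per_gpu_sequences = 8
sequence_length = 2048
world_size = 8
steps = 2267
trainable_params = 163_577_912

tokens_per_step = per_gpu_sequences * sequence_length * world_size
total_tokens = tokens_per_step * steps
tokens_per_param = total_tokens / trainable_params

print(tokens_per_step)             # 131072
print(total_tokens)                # 297140224, i.e. ~297 M
print(round(tokens_per_param, 2))  # 1.82

# GSM8K main/train repeated for 20 epochs.
print(7473 * 20)                   # 149460, i.e. ~149 k
```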
Validation bpb trajectory
| Step | val bpb | Δ |
|---|---|---|
| 200 | 2.7796 | — |
| 400 | 2.7195 | -0.0601 |
| 600 | 2.7023 | -0.0172 |
| 800 | 2.6994 | -0.0029 |
| 1000 | 2.6910 | -0.0084 |
| 1200 | 2.7159 | +0.0249 ⚠ (transient spike, self-corrected) |
| 1400 | 2.5098 | -0.2061 |
| 1600 | 2.4468 | -0.0630 |
| 1800 | 2.4247 | -0.0221 |
| 2000 | 2.4173 | -0.0074 |
| 2200 | 2.4118 | -0.0055 ← best, shipped |
| 2267 | (final, no eval at this step) | — |
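
The Δ column and the single regression can be reproduced from the logged values. This is a small sketch over the numbers in the table above, not part of the training or eval code.

```python
# Validation bpb per eval step, copied from the table.
val_bpb = {
    200: 2.7796, 400: 2.7195, 600: 2.7023, 800: 2.6994, 1000: 2.6910,
    1200: 2.7159, 1400: 2.5098, 1600: 2.4468, 1800: 2.4247, 2000: 2.4173,
    2200: 2.4118,
}

steps = sorted(val_bpb)

# Change relative to the previous eval, rounded like the table's Δ column.
deltas = {
    cur: round(val_bpb[cur] - val_bpb[prev], 4)
    for prev, cur in zip(steps, steps[1:])
}

# Steps where validation bpb got worse instead of better.
regressions = [step for step, delta in deltas.items() if delta > 0]

# Checkpoint with the lowest validation bpb.
best_step = min(val_bpb, key=val_bpb.get)

print(regressions)  # [1200]
print(best_step)    # 2200
```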
Stacking further stages
base_model_name_or_path supports chaining. Point a new stage repo's config at this repo and the cascade loader will resolve base → stage-math → new stage transparently. The loader folds each ancestor's lm_head_stage into the effective lm_head_base at load time, so all specialist heads compose additively into the final projection (logits = lm_head_base + Σ lm_head_stage_k). See the paper for the formal account of this construction.
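
Folding works because every head is a linear map applied to the same hidden vector, so summing the weight matrices once gives the same logits as summing the per-head outputs. The toy sketch below illustrates that identity with plain Python lists and tiny made-up weights; `matvec`, `add_matrices`, and `fold_heads` are hypothetical helpers, not functions from modeling_cognica_poe.py, and it assumes all heads read the same hidden vector x.

```python
def matvec(weight, x):
    """Multiply a (vocab x hidden) weight matrix by a hidden vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def add_matrices(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fold_heads(base_head, stage_heads):
    """Fold each stage head into the base head: W_base + sum_k W_stage_k."""
    folded = base_head
    for head in stage_heads:
        folded = add_matrices(folded, head)
    return folded


# Toy sizes: vocabulary of 3, hidden width of 2 (made-up integer weights).
lm_head_base = [[1, 2], [0, 1], [3, -1]]
lm_head_stage_chat = [[0, 1], [2, 0], [-1, 1]]
lm_head_stage_math = [[1, 0], [0, -2], [1, 1]]
x = [2, 3]

# Logits computed head by head, then summed.
separate = [
    b + c + m
    for b, c, m in zip(
        matvec(lm_head_base, x),
        matvec(lm_head_stage_chat, x),
        matvec(lm_head_stage_math, x),
    )
]

# Logits computed once from the folded head.
folded = matvec(fold_heads(lm_head_base, [lm_head_stage_chat, lm_head_stage_math]), x)

print(separate)  # [13, 1, 9]
print(folded)    # [13, 1, 9]
```

Folding at load time means inference pays for a single output projection regardless of how many stages are chained.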
Files
| File | Purpose |
|---|---|
| config.json | Model + stage config (base_model_name_or_path, new_layers=4, frozen_layers=24, full stage_training block) |
| delta.safetensors | 28-tensor stage delta (bf16, ~428 MB) |
| modeling_cognica_poe.py | Cascade loader + _GPT with dual-head forward (same code as the base repo) |
| configuration_cognica_poe.py | CognicaPoEConfig with stage fields |
| tokenization_cognica_poe.py | Byte-level tokenizer (unchanged from base; includes the numeric-token decode fix) |
| tokenizer.pkl, tokenizer_config.json, special_tokens_map.json, token_bytes.pt | Tokenizer assets (unchanged from base) |
| convert_stage_delta.py | Converts a nanochat save_stage_delta .pt file into delta.safetensors |
Limitations — explicit list
- Research preview at 1.3 B. Arithmetic answers are unreliable at this parameter scale. GSM8K in training does not close this.
- The specialist learns format, not exact computation. Expect chain-of-thought style output with wrong final numbers.
- Greedy decoding loops. Use repetition_penalty >= 1.1 or sampling.
- No RLHF / preference tuning / safety tuning.
- Assistant-only loss. Validation bpb is measured on assistant tokens only; do not compare to the base's full-text train_val_bpb = 0.7209.
Citation
@article{jeong2026poe,
title = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
author = {Jeong, Jaepil},
year = {2026},
institution = {Cognica, Inc.},
doi = {10.5281/zenodo.19547653},
url = {https://doi.org/10.5281/zenodo.19547653}
}
@misc{cognica-poe-stage-math-2026,
title = {Cognica-PoE-v1.0-1.3B-stage-math: Math-domain dual-head specialist (4-layer) over a PoE base (research preview)},
author = {{Cognica, Inc.}},
year = {2026},
howpublished = {\url{https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-math}}
}
License
Apache 2.0 — see LICENSE and NOTICE. Same terms as the base model. Training datasets (GSM8K, MathInstruct) each carry their own licenses and are acknowledged in NOTICE.