Cognica-PoE-v1.0-1.3B-stage-math

Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)

Not a production math model. This release is an empirical validation of PoE's post-hoc specialist construction at 4-layer depth, extending paper §6.5.

A 164 M-parameter math-domain SFT specialist stage trained directly on the frozen Cognica-PoE-v1.0-1.3B-base. The stage is a sibling of cognica/Cognica-PoE-v1.0-1.3B-stage-chat — both branch from the same base, not from each other. Use the cascade loader with this repo (or compose both stage heads additively in a custom inference script; see the paper).

At 1.3 B base + 164 M specialist, arithmetic answers are still often wrong; the specialist's contribution is in format and reasoning chain, not exact numeric computation. The literature generally places emergent arithmetic at 7 B+ parameters, and a specialist head cannot move that scaling ceiling.

What this artifact does demonstrate:

A deeper (4-layer) specialist trains cleanly. Val bpb falls from 2.78 to 2.41 across 2 267 steps with no divergence. The only increase is a transient blip at step 1 200 (2.691 → 2.716), which self-corrects by step 1 400.
Dataset composition for math. Mixing GSM8K main/train (×20 epochs) with MathInstruct (×4 epochs) gives a balanced reasoning-style corpus at ~297 M tokens total.
Additive composition at inference. The new cascade loader (v1.0.1+) folds ancestor stages' lm_head_stage into the effective lm_head_base when chaining, so stage-chat → stage-math composition produces correct summed logits.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "cognica/Cognica-PoE-v1.0-1.3B-stage-math"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Chat-format prompt (the SFT data was formatted with SmolTalk special tokens):
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762
q = "Sarah has 15 candies. She gives 3 to each of her 4 friends. How many candies does she have left?"
ids = [BOS, USR_S] + tokenizer.encode(q) + [USR_E, ASS_S]
input_ids = torch.tensor([ids], device=model.device)

with torch.no_grad():
    out = model.generate(
        input_ids, max_new_tokens=120, do_sample=False,
        repetition_penalty=1.15, pad_token_id=BOS,
    )
print(tokenizer.decode(out[0, len(ids):]))

Either repetition_penalty=1.15 or sampling (do_sample=True, temperature=0.7, top_p=0.9) is recommended, because greedy decoding loops on most prompts at this model scale.
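The prompt layout and the sampling settings from the quick start can be collected into a small helper. This is a minimal sketch: `build_prompt_ids` and `SAMPLING_KWARGS` are hypothetical names, not part of the repo's code. The token IDs and decoding values are copied from the snippet and note above. Only the single-turn layout is covered, since those four special tokens are the only ones listed here.

```python
from typing import Callable, Dict, List

# Special-token IDs copied from the quick-start snippet above.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762


def build_prompt_ids(encode: Callable[[str], List[int]], question: str) -> List[int]:
    """Wrap one user question in the single-turn chat layout used above.

    `encode` is any callable mapping text to token IDs, e.g. `tokenizer.encode`.
    """
    return [BOS, USR_S] + list(encode(question)) + [USR_E, ASS_S]


# Sampling settings recommended above, as keyword arguments for `model.generate`.
SAMPLING_KWARGS: Dict[str, object] = {
    "max_new_tokens": 120,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "pad_token_id": BOS,
}
```

With the quick-start objects in scope, usage would look like `ids = build_prompt_ids(tokenizer.encode, q)` followed by `model.generate(torch.tensor([ids], device=model.device), **SAMPLING_KWARGS)`.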

Architecture

Component	Detail
Parent	`cognica/Cognica-PoE-v1.0-1.3B-base` (PoE α=0.0, d24, step 26430, val bpb 0.7209)
New transformer layers	4 appended at positions 24–27 (d24 → d28)
Frozen layers	24 (all base layers)
Dual-head	Yes — additive specialist `lm_head_stage` (shape 32768 × 1536, zero-init at training start)
Final projection	`logits = lm_head_base(x) + lm_head_stage(x)`
Total params	1,597,726,798 (~1.60 B)
Trainable params at training	163,577,912 (~164 M, 10.6 %)
Shipped delta	28 tensors, 213,909,560 params, ~428 MB (bf16 safetensors)
VE pattern	Preserved from base — 12 value-embeds at layers [1, 3, …, 23]; new layers carry no VE
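The layer layout and size figures in the table can be checked with a few lines of arithmetic. The sketch below makes two assumptions: layers are zero-indexed, and the value-embed pattern [1, 3, …, 23] means every odd base layer (which matches the stated count of 12). The names `describe_layer` and `layout` are illustrative and do not come from the repo.

```python
BASE_LAYERS = 24          # frozen base depth (d24)
NEW_LAYERS = 4            # appended specialist layers
TOTAL_LAYERS = BASE_LAYERS + NEW_LAYERS  # d28

# Assumption: value-embeds sit on the odd base layers 1, 3, ..., 23.
VE_LAYERS = set(range(1, BASE_LAYERS, 2))


def describe_layer(idx: int) -> dict:
    """Return the role of layer `idx` according to the table above."""
    if not 0 <= idx < TOTAL_LAYERS:
        raise IndexError(f"layer index {idx} out of range")
    return {
        "index": idx,
        "trainable": idx >= BASE_LAYERS,   # only positions 24-27 are trained
        "value_embed": idx in VE_LAYERS,   # new layers carry no VE
    }


layout = [describe_layer(i) for i in range(TOTAL_LAYERS)]

# Size checks for the stage head and the shipped delta (bf16 = 2 bytes/param).
VOCAB, D_MODEL = 32768, 1536
head_params = VOCAB * D_MODEL
delta_params = 213_909_560
delta_megabytes = delta_params * 2 / 1e6

print(sum(layer["trainable"] for layer in layout), "trainable layers")
print(len(VE_LAYERS), "value-embed layers")
print(f"lm_head_stage: {head_params:,} params")
print(f"delta: ~{delta_megabytes:.0f} MB in bf16")
```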

Training

Setting	Value
Objective	Cross-entropy over assistant turns only (user turns and padding masked out)
Data	`openai/gsm8k` main/train × 20 epochs (7 473 → ~149 k) + `TIGER-Lab/MathInstruct` × 4 epochs (262 k → ~1.05 M) → 1.20 M total
Held-out val	512 conversations (same shuffle, outside the train subset)
Case augmentation	First-user greetings duplicated with case variants (1,197,104 → 1,198,173 conversations)
Sequence length	2 048
Per-GPU batch	8 sequences × 2 048 tokens
World size	8 (2 nodes × 4 × A100 80 GB)
Tokens per step	131 072
Steps	2 267 (≈ 297 M tokens)
tok/param	1.82 (164 M trainable)
Optimizer	MuonAdamW with per-group LR scaling
`lm_head_stage` LR	1.00 × 10⁻⁴, weight decay 0.1
AdamW master scaling	0.7071
Warmup / warmdown	5 % / 90 %
Eval / save cadence	every 200 / 200 steps
Best checkpoint shipped	step 2 200, val bpb 2.4118
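The token-budget rows in the table follow from the batch geometry. As a quick worked check, using only figures from the table (the variable names are illustrative):

```python
SEQ_LEN = 2048
SEQS_PER_GPU = 8
WORLD_SIZE = 8             # 2 nodes x 4 GPUs
STEPS = 2267
TRAINABLE_PARAMS = 163_577_912

tokens_per_step = SEQ_LEN * SEQS_PER_GPU * WORLD_SIZE
total_tokens = tokens_per_step * STEPS
tokens_per_param = total_tokens / TRAINABLE_PARAMS

gsm8k_rows = 7_473 * 20    # GSM8K main/train repeated for 20 epochs

print(tokens_per_step)               # 131072
print(total_tokens)                  # 297140224
print(round(tokens_per_param, 2))    # 1.82
print(gsm8k_rows)                    # 149460
```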

Validation bpb trajectory

Step	val bpb	Δ
200	2.7796	—
400	2.7195	-0.0601
600	2.7023	-0.0172
800	2.6994	-0.0029
1000	2.6910	-0.0084
1200	2.7159	+0.0249 ⚠ (transient spike, self-corrected)
1400	2.5098	-0.2061
1600	2.4468	-0.0630
1800	2.4247	-0.0221
2000	2.4173	-0.0074
2200	2.4118	-0.0055 ← best, shipped
2267	(final, no eval at this step)	—
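The Δ column and the choice of shipped checkpoint can be reproduced from the table values. The snippet below is a small illustration, not the training code's own checkpoint-selection logic:

```python
# (step, val bpb) pairs copied from the table above.
TRAJECTORY = [
    (200, 2.7796), (400, 2.7195), (600, 2.7023), (800, 2.6994),
    (1000, 2.6910), (1200, 2.7159), (1400, 2.5098), (1600, 2.4468),
    (1800, 2.4247), (2000, 2.4173), (2200, 2.4118),
]

# Change in val bpb between consecutive evaluations.
deltas = [
    (step, round(bpb - prev_bpb, 4))
    for (_, prev_bpb), (step, bpb) in zip(TRAJECTORY, TRAJECTORY[1:])
]

# Evaluations where val bpb went up rather than down.
regressions = [(step, d) for step, d in deltas if d > 0]

# Checkpoint with the lowest val bpb.
best_step, best_bpb = min(TRAJECTORY, key=lambda pair: pair[1])

print(regressions)          # [(1200, 0.0249)]
print(best_step, best_bpb)  # 2200 2.4118
```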

Stacking further stages

base_model_name_or_path supports chaining. Point a new stage repo's config at this repo and the cascade loader will resolve base → stage-math → new stage transparently. The loader folds each ancestor's lm_head_stage into the effective lm_head_base at load time, so all specialist heads compose additively into the final projection (logits = lm_head_base + Σ lm_head_stage_k). See the paper for the formal account of this construction.
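Folding works because the heads are linear: summing the weight matrices gives the same logits as summing the per-head outputs. The NumPy sketch below illustrates this under two assumptions, namely that every head is a bias-free linear projection and that all heads are applied to the same final hidden state. `fold_heads` and the toy sizes are hypothetical; the actual loader code lives in `modeling_cognica_poe.py` and may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 16, 8   # toy sizes; the real head is 32768 x 1536

# Hypothetical weights: one base head and two ancestor stage heads.
w_base = rng.normal(size=(vocab, d_model))
w_stages = [rng.normal(size=(vocab, d_model)) for _ in range(2)]


def fold_heads(base: np.ndarray, stages: list) -> np.ndarray:
    """Sum the stage head weights into a copy of the base head weights."""
    folded = base.copy()
    for w in stages:
        folded += w
    return folded


x = rng.normal(size=(4, d_model))   # 4 final hidden states

# Explicit sum of per-head logits: lm_head_base(x) + sum_k lm_head_stage_k(x).
summed_logits = x @ w_base.T + sum(x @ w.T for w in w_stages)

# Single matmul against the folded head.
folded_logits = x @ fold_heads(w_base, w_stages).T

print(np.allclose(summed_logits, folded_logits))  # True
```

Folding once at load time keeps inference at a single output projection regardless of how many ancestor stages are chained.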

Files

File	Purpose
`config.json`	Model + stage config (`base_model_name_or_path`, `new_layers=4`, `frozen_layers=24`, full `stage_training` block)
`delta.safetensors`	28-tensor stage delta (bf16, ~428 MB)
`modeling_cognica_poe.py`	Cascade loader + `_GPT` with dual-head forward (same code as the base repo)
`configuration_cognica_poe.py`	`CognicaPoEConfig` with stage fields
`tokenization_cognica_poe.py`	Byte-level tokenizer (unchanged from base; includes the numeric-token decode fix)
`tokenizer.pkl`, `tokenizer_config.json`, `special_tokens_map.json`, `token_bytes.pt`	Tokenizer assets (unchanged from base)
`convert_stage_delta.py`	Converts a nanochat `save_stage_delta` `.pt` file into `delta.safetensors`

Limitations — explicit list

Research preview at 1.3 B. Arithmetic answers are unreliable at this parameter scale. GSM8K in training does not close this.
The specialist learns format, not exact computation. Expect chain-of-thought style output with wrong final numbers.
Greedy decoding loops. Use repetition_penalty >= 1.1 or sampling.
No RLHF / preference tuning / safety tuning.
Assistant-only loss — validation bpb is measured on assistant tokens only; do not compare to the base's full-text train_val_bpb = 0.7209.
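To make the last point concrete, here is a minimal sketch of a masked bits-per-byte computation. It assumes the common convention of total cross-entropy in nats divided by ln 2 times the byte count of the scored tokens; the model's own evaluation code may differ in detail. `masked_bpb` and its arguments are illustrative names.

```python
import math
from typing import Sequence


def masked_bpb(
    token_nats: Sequence[float],
    token_bytes: Sequence[int],
    loss_mask: Sequence[bool],
) -> float:
    """Bits per byte over the positions where `loss_mask` is True.

    token_nats: per-token cross-entropy in nats.
    token_bytes: byte length of each target token (0 for special tokens).
    loss_mask: True for assistant tokens, False for user turns and padding.
    """
    total_nats = 0.0
    total_bytes = 0
    for nats, n_bytes, keep in zip(token_nats, token_bytes, loss_mask):
        if keep and n_bytes > 0:
            total_nats += nats
            total_bytes += n_bytes
    if total_bytes == 0:
        raise ValueError("no scored bytes under the mask")
    return total_nats / (math.log(2) * total_bytes)


# Toy example: two user tokens (masked out) followed by two assistant tokens.
nats = [0.5, 0.7, math.log(2) * 4, math.log(2) * 2]
n_bytes = [3, 2, 2, 1]
mask = [False, False, True, True]
print(masked_bpb(nats, n_bytes, mask))  # 6 bits over 3 bytes, about 2.0
```

Because the mask changes both the numerator and the byte count, a masked value and a full-text value are computed over different token populations and are not directly comparable.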

Citation

@article{jeong2026poe,
 title = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
 author = {Jeong, Jaepil},
 year = {2026},
 institution = {Cognica, Inc.},
 doi = {10.5281/zenodo.19547653},
 url = {https://doi.org/10.5281/zenodo.19547653}
}

@misc{cognica-poe-stage-math-2026,
 title = {Cognica-PoE-v1.0-1.3B-stage-math: Math-domain dual-head specialist (4-layer) over a PoE base (research preview)},
 author = {{Cognica, Inc.}},
 year = {2026},
 howpublished = {\url{https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-math}}
}

License

Apache 2.0 — see LICENSE and NOTICE. Same terms as the base model. Training datasets (GSM8K, MathInstruct) each carry their own licenses and are acknowledged in NOTICE.
