URL: https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-math

Cognica-PoE-v1.0-1.3B-stage-math

Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)

Not a production math model. This release is an empirical validation of PoE's post-hoc specialist construction at 4-layer depth, extending paper §6.5.

A 164 M-parameter math-domain SFT specialist stage trained directly on the frozen Cognica-PoE-v1.0-1.3B-base. The stage is a sibling of cognica/Cognica-PoE-v1.0-1.3B-stage-chat — both branch from the same base, not from each other. Use the cascade loader with this repo (or compose both stage heads additively in a custom inference script; see the paper).

With a 1.3 B base plus a 164 M specialist, arithmetic answers are still often wrong; the specialist contributes format and reasoning-chain structure, not exact numeric computation. Reliable arithmetic is generally reported in the literature only at around 7 B parameters and above, and a specialist head does not move that scaling ceiling.

What this artifact does demonstrate:

  1. A deeper (4-layer) specialist trains cleanly. Val bpb falls from 2.78 → 2.41 across 2 267 steps with no divergence. The decline is monotonic except for one transient blip at step 1 200 (2.691 → 2.716), which self-corrects by step 1 400.
  2. Dataset composition for math. Mixing GSM8K main/train (×20 epochs) with MathInstruct (×4 epochs) gives a balanced reasoning-style corpus at ~297 M tokens total.
  3. Additive composition at inference. The new cascade loader (v1.0.1+) folds ancestor stages' lm_head_stage into the effective lm_head_base when chaining, so stage-chat → stage-math composition produces correct summed logits.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "cognica/Cognica-PoE-v1.0-1.3B-stage-math"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Chat-format prompt (the SFT data was formatted with SmolTalk special tokens):
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762
q = "Sarah has 15 candies. She gives 3 to each of her 4 friends. How many candies does she have left?"
ids = [BOS, USR_S] + tokenizer.encode(q) + [USR_E, ASS_S]
input_ids = torch.tensor([ids], device=model.device)

with torch.no_grad():
    out = model.generate(
        input_ids, max_new_tokens=120, do_sample=False,
        repetition_penalty=1.15, pad_token_id=BOS,
    )
print(tokenizer.decode(out[0, len(ids):]))

Either repetition_penalty=1.15 (as in the quick start) or sampling (do_sample=True, temperature=0.7, top_p=0.9) is recommended: plain greedy decoding without a penalty loops on most prompts at this model scale.
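
The two recommended decoding modes can be written as keyword-argument sets for `model.generate`. This is a minimal sketch: the helper name `decoding_kwargs` is hypothetical, the values are the ones quoted in this card, and `pad_token_id=32759` reuses the BOS id from the quick start.

```python
def decoding_kwargs(sample: bool, max_new_tokens: int = 120) -> dict:
    """Return generate() keyword arguments for one of the two recommended modes.

    Hypothetical helper; the values are taken from this card's quick start.
    """
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "pad_token_id": 32759,  # BOS id, reused as the pad id as in the quick start
    }
    if sample:
        # Nucleus sampling with the settings quoted in this card.
        kwargs.update(do_sample=True, temperature=0.7, top_p=0.9)
    else:
        # Greedy decoding plus a repetition penalty to discourage loops.
        kwargs.update(do_sample=False, repetition_penalty=1.15)
    return kwargs


greedy = decoding_kwargs(sample=False)
sampled = decoding_kwargs(sample=True)
print(greedy)
print(sampled)
```

With the quick-start objects in scope, either set can be passed as `model.generate(input_ids, **sampled)`.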

Architecture

| Component | Detail |
|---|---|
| Parent | cognica/Cognica-PoE-v1.0-1.3B-base (PoE α=0.0, d24, step 26430, val bpb 0.7209) |
| New transformer layers | 4, appended at positions 24–27 (d24 → d28) |
| Frozen layers | 24 (all base layers) |
| Dual-head | Yes: additive specialist lm_head_stage (shape 32768 × 1536, zero-init at training start) |
| Final projection | logits = lm_head_base(x) + lm_head_stage(x) |
| Total params | 1,597,726,798 (~1.60 B) |
| Trainable params at training | 163,577,912 (~164 M, 10.2 % of total) |
| Shipped delta | 28 tensors, 213,909,560 params, ~428 MB (bf16 safetensors) |
| VE pattern | Preserved from base: 12 value-embeds at layers [1, 3, …, 23]; new layers carry no VE |
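
The parameter figures in this table can be cross-checked with plain arithmetic. The sketch below uses only numbers quoted in the card and assumes 2 bytes per bf16 parameter; the variable names are illustrative.

```python
TOTAL_PARAMS = 1_597_726_798     # base + stage
TRAINABLE_PARAMS = 163_577_912   # trained in this stage
SHIPPED_PARAMS = 213_909_560     # parameters stored in delta.safetensors
VOCAB, D_MODEL = 32_768, 1_536   # lm_head_stage shape
BYTES_PER_BF16 = 2               # assumption: every shipped tensor is bf16

trainable_share = TRAINABLE_PARAMS / TOTAL_PARAMS
delta_megabytes = SHIPPED_PARAMS * BYTES_PER_BF16 / 1e6
head_params = VOCAB * D_MODEL
shipped_minus_trainable = SHIPPED_PARAMS - TRAINABLE_PARAMS

print(f"trainable share of total: {trainable_share:.1%}")
print(f"delta size: {delta_megabytes:.0f} MB")
print(f"one {VOCAB} x {D_MODEL} matrix: {head_params:,} params")
print(f"shipped minus trainable: {shipped_minus_trainable:,} params")
```

The trainable share works out to about 10.2 % of the total, and the delta size to about 428 MB. The gap between the shipped and trainable counts equals exactly one 32768 × 1536 matrix; the card does not say which tensor accounts for it.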

Training

| Setting | Value |
|---|---|
| Objective | Cross-entropy over assistant turns only (user turns and padding masked out) |
| Data | openai/gsm8k main/train × 20 epochs (7 473 → ~149 k) + TIGER-Lab/MathInstruct × 4 epochs (262 k → ~1.05 M) → ~1.20 M conversations total |
| Held-out val | 512 conversations (same shuffle, outside the train subset) |
| Case augmentation | First-user greetings duplicated with case variants (1,197,104 → 1,198,173 conversations) |
| Sequence length | 2 048 tokens |
| Per-GPU batch | 8 × 2 048 tokens |
| World size | 8 GPUs (2 nodes × 4 × A100 80 GB) |
| Tokens per step | 131 072 |
| Steps | 2 267 (≈ 297 M tokens) |
| tok/param | 1.82 (164 M trainable) |
| Optimizer | MuonAdamW with per-group LR scaling |
| lm_head_stage LR | 1.00 × 10⁻⁴, weight decay 0.1 |
| AdamW master scaling | 0.7071 |
| Warmup / warmdown | 5 % / 90 % |
| Eval / save cadence | every 200 / 200 steps |
| Best checkpoint shipped | step 2 200, val bpb 2.4118 |
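
The token budget follows from the batch geometry listed above. A short worked check, assuming no gradient accumulation (which is consistent with the quoted 131 072 tokens per step):

```python
SEQ_LEN = 2_048                 # tokens per sequence
PER_GPU_BATCH = 8               # sequences per GPU per step
WORLD_SIZE = 8                  # 2 nodes x 4 GPUs
STEPS = 2_267
TRAINABLE_PARAMS = 163_577_912

tokens_per_step = PER_GPU_BATCH * SEQ_LEN * WORLD_SIZE
total_tokens = tokens_per_step * STEPS
tokens_per_param = total_tokens / TRAINABLE_PARAMS

# GSM8K main/train repeated for 20 epochs.
gsm8k_conversations = 7_473 * 20

print(tokens_per_step)             # 131072
print(total_tokens)                # 297140224
print(round(tokens_per_param, 2))  # 1.82
print(gsm8k_conversations)         # 149460
```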

Validation bpb trajectory

| Step | val bpb | Δ | Note |
|---|---|---|---|
| 200 | 2.7796 | | |
| 400 | 2.7195 | -0.0601 | |
| 600 | 2.7023 | -0.0172 | |
| 800 | 2.6994 | -0.0029 | |
| 1000 | 2.6910 | -0.0084 | |
| 1200 | 2.7159 | +0.0249 | ⚠ transient spike, self-corrected |
| 1400 | 2.5098 | -0.2061 | |
| 1600 | 2.4468 | -0.0630 | |
| 1800 | 2.4247 | -0.0221 | |
| 2000 | 2.4173 | -0.0074 | |
| 2200 | 2.4118 | -0.0055 | best, shipped |
| 2267 | | | final step, no eval |
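
The trajectory can be checked mechanically. The sketch below recomputes the step-to-step deltas from the values in the table, lists any step where val bpb went up, and picks the best evaluated checkpoint; the variable names are illustrative.

```python
VAL_BPB = {
    200: 2.7796, 400: 2.7195, 600: 2.7023, 800: 2.6994, 1000: 2.6910,
    1200: 2.7159, 1400: 2.5098, 1600: 2.4468, 1800: 2.4247,
    2000: 2.4173, 2200: 2.4118,
}

steps = sorted(VAL_BPB)

# Change in val bpb relative to the previous evaluation.
deltas = {
    cur: round(VAL_BPB[cur] - VAL_BPB[prev], 4)
    for prev, cur in zip(steps, steps[1:])
}

# Steps where the metric got worse instead of better.
regressions = [step for step, delta in deltas.items() if delta > 0]

# Lowest val bpb among the evaluated checkpoints.
best_step = min(VAL_BPB, key=VAL_BPB.get)

print(deltas)
print("regressions:", regressions)
print("best step:", best_step, VAL_BPB[best_step])
```

The only regression is at step 1 200, and the best evaluated checkpoint is step 2 200.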

Stacking further stages

base_model_name_or_path supports chaining. Point a new stage repo's config at this repo and the cascade loader will resolve base → stage-math → new stage transparently. The loader folds each ancestor's lm_head_stage into the effective lm_head_base at load time, so all specialist heads compose additively into the final projection (logits = lm_head_base + Σ lm_head_stage_k). See the paper for the formal account of this construction.
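
Folding works because a sum of linear heads is itself one linear head. The toy sketch below illustrates that identity in pure Python. It assumes every head is a bias-free linear map applied to the same final hidden state; the matrices, sizes, and helper names are made up for illustration and are not the loader's actual code.

```python
def matvec(weight, x):
    """Multiply a (rows x cols) nested-list matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def add_matrices(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Toy sizes: vocabulary of 3, hidden width of 2.
head_base = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]]
head_chat = [[0.125, 0.0], [-0.5, 1.0], [0.25, 0.25]]
head_math = [[0.0, 0.5], [0.25, 0.0], [-1.0, 0.5]]

x = [1.0, 2.0]  # final hidden state for one position

# Option 1: run every head separately and sum the logits.
summed_logits = [
    b + c + m
    for b, c, m in zip(matvec(head_base, x), matvec(head_chat, x), matvec(head_math, x))
]

# Option 2: fold the stage heads into the base head once, then run one head.
folded_head = add_matrices(add_matrices(head_base, head_chat), head_math)
folded_logits = matvec(folded_head, x)

print(summed_logits)
print(folded_logits)
```

Both options give the same logits, which is why folding at load time does not change the composed output. A zero-initialised stage head contributes nothing to the sum, so a new stage starts from its parent's logits.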

Files

| File | Purpose |
|---|---|
| config.json | Model + stage config (base_model_name_or_path, new_layers=4, frozen_layers=24, full stage_training block) |
| delta.safetensors | 28-tensor stage delta (bf16, ~428 MB) |
| modeling_cognica_poe.py | Cascade loader + _GPT with dual-head forward (same code as the base repo) |
| configuration_cognica_poe.py | CognicaPoEConfig with stage fields |
| tokenization_cognica_poe.py | Byte-level tokenizer (unchanged from base; includes the numeric-token decode fix) |
| tokenizer.pkl, tokenizer_config.json, special_tokens_map.json, token_bytes.pt | Tokenizer assets (unchanged from base) |
| convert_stage_delta.py | Converts a nanochat save_stage_delta .pt file into delta.safetensors |

Limitations — explicit list

  • Research preview at 1.3 B. Arithmetic answers are unreliable at this parameter scale. GSM8K in training does not close this.
  • The specialist learns format, not exact computation. Expect chain-of-thought style output with wrong final numbers.
  • Greedy decoding loops. Use repetition_penalty >= 1.1 or sampling.
  • No RLHF / preference tuning / safety tuning.
  • Assistant-only loss — validation bpb is measured on assistant tokens only; do not compare to the base's full-text train_val_bpb = 0.7209.
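
To see why assistant-only bpb and full-text bpb are not comparable, here is a minimal sketch of a masked bits-per-byte computation. It uses the common definition (total negative log-likelihood in bits divided by the total byte count of the scored tokens); whether this matches the repository's evaluation code exactly is an assumption, and the function name and toy numbers are illustrative.

```python
import math


def masked_bits_per_byte(token_nll_nats, token_num_bytes, loss_mask):
    """Bits per byte over the positions where loss_mask is truthy.

    token_nll_nats:  per-token negative log-likelihood, in nats
    token_num_bytes: number of bytes each target token decodes to
    loss_mask:       1 for scored tokens (e.g. assistant turns), 0 otherwise
    """
    total_nats = 0.0
    total_bytes = 0
    for nll, n_bytes, keep in zip(token_nll_nats, token_num_bytes, loss_mask):
        if keep:
            total_nats += nll
            total_bytes += n_bytes
    if total_bytes == 0:
        raise ValueError("no unmasked bytes to score")
    return total_nats / (math.log(2) * total_bytes)


LN2 = math.log(2)
nll = [1 * LN2, 2 * LN2, 6 * LN2, 4 * LN2]  # toy losses: 1, 2, 6, 4 bits
n_bytes = [2, 3, 1, 4]
assistant_mask = [0, 0, 1, 1]               # only the last two tokens are scored
full_mask = [1, 1, 1, 1]

assistant_bpb = masked_bits_per_byte(nll, n_bytes, assistant_mask)
full_bpb = masked_bits_per_byte(nll, n_bytes, full_mask)
print(assistant_bpb, full_bpb)
```

On the same toy sequence the two masks give different values (about 2.0 versus 1.3 bits per byte), because the numerator and denominator both change with the mask.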

Citation

@article{jeong2026poe,
 title = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
 author = {Jeong, Jaepil},
 year = {2026},
 institution = {Cognica, Inc.},
 doi = {10.5281/zenodo.19547653},
 url = {https://doi.org/10.5281/zenodo.19547653}
}

@misc{cognica-poe-stage-math-2026,
 title = {Cognica-PoE-v1.0-1.3B-stage-math: Math-domain dual-head specialist (4-layer) over a PoE base (research preview)},
 author = {{Cognica, Inc.}},
 year = {2026},
 howpublished = {\url{https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-math}}
}

License

Apache 2.0 — see LICENSE and NOTICE. Same terms as the base model. Training datasets (GSM8K, MathInstruct) each carry their own licenses and are acknowledged in NOTICE.
