Cognica-PoE-v1.0-1.3B-stage-chat
Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)
Not a production chat model. This release is empirical validation of PoE's dual-head specialist construction — paper §6.5.
A 107 M-parameter SFT specialist stage trained on top of the frozen Cognica-PoE-v1.0-1.3B-base. At this parameter scale (1.3 B base + 107 M specialist, d26 total), chat behavior is bounded by capacity, not by the specialist construction. What this artifact does demonstrate — reliably — is:
- Base preservation is bit-identical. The base `lm_head` and layers 0–23 are frozen during SFT. Base-style continuation (`<|bos|>` + raw text) remains strong and coherent after specialist training, as verified empirically on four scientific-knowledge probes (Sun, photosynthesis, quantum mechanics, language-learning advice). This is the live-model evidence for the paper's `Δlogit = 0.0000` claim across 12 checkpoints.
- Post-hoc specialist composition works. The specialist delta (300 MB, 16 tensors) can be loaded on demand and composed additively with the base (`logits = lm_head_base(x) + lm_head_stage(x)`). The loader supports arbitrary-depth chaining, so this stage is a building block for math / code / domain specialists that stack on top.
- Delta-only distribution scales. The shipped stage is a 300 MB delta, not a 2.6 GB full-model copy. The cascade loader pulls the base once and assembles the d26 model in memory.
What this artifact does not demonstrate, and should not be compared against:
- Instruction following in the GPT-3.5 / GPT-4 sense.
- Arithmetic or multi-step reasoning. At 1.3 B, these are capacity-limited regardless of SFT data mix; including GSM8K train does not change this.
- Multi-turn context tracking across many turns.
- Strict format constraints ("answer yes or no", "exactly 3 bullets").
If you want a production chat model, this is not it. If you are evaluating PoE's architectural claims — whether a frozen-base + trainable-specialist construction can produce a composable SFT delta without degrading base capabilities — this is the artifact to look at.
Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "cognica/Cognica-PoE-v1.0-1.3B-stage-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Chat-style inference uses the SmolTalk special-token protocol:
# <|bos|><|user_start|> USER <|user_end|><|assistant_start|> ASSISTANT <|assistant_end|>
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762

user_msg = "What is the boiling point of water in Fahrenheit?"
ids = [BOS, USR_S] + tokenizer.encode(user_msg) + [USR_E, ASS_S]
input_ids = torch.tensor([ids], device=model.device)

with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=80, do_sample=False)

print(tokenizer.decode(out[0, len(ids):]))
# -> "The boiling point of water is 100°C (212°F). At sea level, ..."
```
Base-style prompts (just <|bos|> + raw text, no chat wrapper) work better than the chat format for open-ended continuation — see the evaluation table below. For greedy decoding prefer a repetition penalty (repetition_penalty=1.15); without it most prompts loop.
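The two prompt layouts differ only in which special-token IDs wrap the encoded text. A minimal sketch, assuming the four IDs from the quick-start snippet; `build_chat_ids`, `build_base_ids`, and the stand-in `fake_encode` are hypothetical names introduced for illustration and are not part of the shipped tokenizer:

```python
from typing import Callable, List

# Special-token IDs copied from the quick-start snippet.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762

Encoder = Callable[[str], List[int]]

def build_chat_ids(encode: Encoder, user_msg: str) -> List[int]:
    """Single-turn chat prompt: <|bos|><|user_start|> USER <|user_end|><|assistant_start|>."""
    return [BOS, USR_S] + encode(user_msg) + [USR_E, ASS_S]

def build_base_ids(encode: Encoder, text: str) -> List[int]:
    """Base-style prompt: <|bos|> followed by raw text, no chat wrapper."""
    return [BOS] + encode(text)

def fake_encode(text: str) -> List[int]:
    """Hypothetical stand-in for tokenizer.encode (raw UTF-8 bytes), demo only."""
    return list(text.encode("utf-8"))

chat_ids = build_chat_ids(fake_encode, "Hi")
base_ids = build_base_ids(fake_encode, "The Sun is")
print(chat_ids)  # [32759, 32760, 72, 105, 32761, 32762]
```

With the real model, pass `tokenizer.encode` in place of `fake_encode` and wrap the result in a tensor as in the quick start.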
The cascade loader does the following when you call from_pretrained:
- Reads `base_model_name_or_path` from `config.json` and recursively loads the parent model.
- Instantiates an extended architecture at `num_hidden_layers=26` (base 24 + this stage's 2 new layers).
- Copies the parent's weights into layers 0-23, layers 0-23 KV-embeds, and the base `lm_head`.
- Loads `delta.safetensors` on top; this fills the 2 new layer blocks, the additive `lm_head_stage`, the warm-initialized `wte`, and `resid_lambdas` / `x0_lambdas` at their new length 26.
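The loader behavior listed above can be modeled with plain dictionaries standing in for repositories and state dicts. This is an illustrative sketch of the described resolve-then-overlay logic, not the code in `modeling_cognica_poe.py`; the `REPOS` registry, `load_cascade`, and the tensor names are hypothetical, and instantiating the 26-layer architecture is not modeled:

```python
from typing import Dict, Optional

# Hypothetical in-memory "hub": repo name -> config fields plus the weights stored there.
REPOS: Dict[str, dict] = {
    "base": {
        "base_model_name_or_path": None,
        "weights": {"layers.0": "base", "layers.23": "base",
                    "lm_head": "base", "wte": "base"},
    },
    "stage-chat": {
        "base_model_name_or_path": "base",
        # The delta holds only what the stage adds or overwrites.
        "weights": {"layers.24": "chat", "layers.25": "chat",
                    "lm_head_stage": "chat", "wte": "chat"},
    },
}

def load_cascade(name: str) -> Dict[str, str]:
    """Resolve the parent chain recursively, then overlay this repo's delta."""
    repo = REPOS[name]
    parent: Optional[str] = repo["base_model_name_or_path"]
    # Load the parent first and start from a copy of its weights.
    state = dict(load_cascade(parent)) if parent is not None else {}
    # Delta tensors fill the new slots and replace same-named parent tensors.
    state.update(repo["weights"])
    return state

state = load_cascade("stage-chat")
print(state["lm_head"], state["wte"])  # base chat
```

The base `lm_head` survives untouched while `wte` comes from the delta, which mirrors the split between frozen and shipped tensors described above.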
Architecture
This stage is trained per paper Section 6.5 (dual-head SFT) and Section 8.8 (elastic depth):
| Component | Detail |
|---|---|
| Parent | cognica/Cognica-PoE-v1.0-1.3B-base (PoE α=0.0, d24, step 26430, val bpb 0.7209) |
| New transformer layers | 2 appended at positions 24 and 25 (d24 → d26) |
| Frozen layers | 24 (all base layers) |
| Dual-head | Yes — additive specialist lm_head_stage (shape 32768 × 1536, zero-init at training start) |
| Final projection | logits = lm_head_base(x) + lm_head_stage(x) |
| Total params | 1,491,076,878 (~1.49 B) |
| Trainable params at training | 106,954,804 (~107 M, 7.2 %) |
| Shipped delta | 16 tensors, 157,286,452 params, 300 MB (bf16 safetensors) |
| VE pattern | Preserved from base — 12 value-embeds at layers [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23]; new layers carry no VE |
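The percentages and sizes in the table follow from the parameter counts. A quick check, assuming 2 bytes per bf16 parameter and reading "300 MB" as mebibytes (both are assumptions; safetensors header overhead is ignored):

```python
TOTAL_PARAMS = 1_491_076_878
TRAINABLE_PARAMS = 106_954_804
DELTA_PARAMS = 157_286_452
BYTES_PER_BF16 = 2

trainable_pct = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS
delta_mib = DELTA_PARAMS * BYTES_PER_BF16 / 2**20
head_params = 32768 * 1536  # shape of lm_head_stage from the table

print(f"trainable share: {trainable_pct:.1f} %")  # 7.2 %
print(f"delta size: {delta_mib:.0f} MiB")         # 300 MiB
print(f"lm_head_stage params: {head_params:,}")   # 50,331,648
```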
Why dual-head and not a full fine-tune? Freezing the base lm_head and training a zero-init additive specialist lets the stage specialize without touching the base's next-token prediction of general web text. At training step 0 the stage contributes nothing (logits identical to the base); by the final step it has learned chat-specific biases and token-level corrections on top of the frozen base — a controllable, resettable delta. The key empirical claim is that base capability is preserved exactly (Δlogit = 0.0000 across checkpoints) — see the evaluation table below for a live demonstration that base-style continuation remains strong.
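The zero-init argument can be checked numerically. A toy sketch in plain Python with a 3-token vocabulary and a 2-dimensional hidden state; the weights and the `linear` helper are made up for illustration and do not reflect the shipped implementation:

```python
from typing import List

Matrix = List[List[float]]

def linear(weight: Matrix, x: List[float]) -> List[float]:
    """Bias-free projection: one output logit per row of `weight`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def dual_head_logits(w_base: Matrix, w_stage: Matrix, x: List[float]) -> List[float]:
    """logits = lm_head_base(x) + lm_head_stage(x)."""
    return [b + s for b, s in zip(linear(w_base, x), linear(w_stage, x))]

x = [0.5, -1.0]                                 # toy hidden state
w_base = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # frozen base head (toy values)
w_stage = [[0.0, 0.0] for _ in w_base]          # specialist head, zero-initialized

base_logits = linear(w_base, x)
step0_logits = dual_head_logits(w_base, w_stage, x)
print(step0_logits == base_logits)  # True: the stage contributes nothing at step 0

# Training updates only w_stage; w_base is never written to.
w_stage[2] = [0.25, 0.0]
trained_logits = dual_head_logits(w_base, w_stage, x)
print(trained_logits)  # [0.5, -2.0, -0.375]
```

Dropping `w_stage` at any point recovers `base_logits`, which is the sense in which the delta is resettable.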
Training
| Setting | Value |
|---|---|
| Objective | Cross-entropy over assistant turns only (user turns and padding masked out) |
| Data | HuggingFaceTB/smoltalk (1.04 M convs) + cais/mmlu auxiliary_train × 3 epochs (297 k) + openai/gsm8k main/train × 4 epochs (30 k) → 1.37 M total, subsampled to 500 k (seed 42, deterministic) |
| Held-out val | 512 conversations (same shuffle, outside the train subset) |
| Case augmentation | First-user greetings duplicated with case variants (Hi / hi / HI / Hiya / …) → 500 000 → 513 210 train conversations |
| Sequence length | 2 048 |
| Per-GPU batch | 8 × 2 048 |
| World size | 8 (2 nodes × 4 × A100 80 GB) |
| Tokens per step | 131 072 |
| Steps | 2 748 (≈ 360 M tokens) |
| tok/param | 3.37 (107 M trainable) |
| Optimizer | MuonAdamW with per-group LR scaling |
| `lm_head_stage` LR | 1.00 × 10⁻⁴, weight decay 0.1 |
| AdamW master scaling | 0.7071 (paper Section 6.5.2) |
| Warmup / warmdown | 5 % / 90 % |
| Eval / save cadence | every 200 / 200 steps |
| Best checkpoint shipped | step 2 600, val bpb 2.0610 |
All numbers are also in config.json → stage_training for programmatic access.
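The token budget in the table can be reproduced from the batch settings. A short check, assuming one optimizer step per global batch (no gradient accumulation), which is what the "Tokens per step" row implies:

```python
SEQ_LEN = 2048
PER_GPU_BATCH = 8
WORLD_SIZE = 8
STEPS = 2748
TRAINABLE_PARAMS = 106_954_804

tokens_per_step = PER_GPU_BATCH * SEQ_LEN * WORLD_SIZE
total_tokens = tokens_per_step * STEPS
tokens_per_param = total_tokens / TRAINABLE_PARAMS

print(tokens_per_step)             # 131072
print(total_tokens)                # 360185856
print(round(tokens_per_param, 2))  # 3.37
```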
Validation bpb trajectory
| Step | val bpb | Δ |
|---|---|---|
| 200 | 2.2110 | — |
| 400 | 2.1856 | -0.0254 |
| 600 | 2.1670 | -0.0186 |
| 800 | 2.1511 | -0.0159 |
| 1000 | 2.1329 | -0.0182 |
| 1200 | 2.1190 | -0.0140 |
| 1400 | 2.1071 | -0.0119 |
| 1600 | 2.0981 | -0.0089 |
| 1800 | 2.0930 | -0.0052 |
| 2000 | 2.0922 | -0.0007 |
| 2200 | 2.0917 | -0.0005 |
| 2400 | 2.0744 | -0.0173 |
| 2600 | 2.0610 | -0.0134 ← best, shipped |
| 2748 | (final, no eval at this step) | — |
Assistant-only bpb. Warmdown (90 %) drives steady convergence; the acceleration at step 2400–2600 coincides with the learning rate decaying below ~0.15 of peak.
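The learning-rate note can be sanity-checked with a small schedule function. The table gives only the 5 % / 90 % fractions; the linear ramp shapes and the decay to exactly zero used here are assumptions, and `lr_multiplier` is a hypothetical helper rather than the training code:

```python
def lr_multiplier(step: int, total_steps: int = 2748,
                  warmup_frac: float = 0.05, warmdown_frac: float = 0.90) -> float:
    """Fraction of peak LR at `step` (assumed: linear warmup, plateau, linear warmdown to 0)."""
    warmup_steps = round(warmup_frac * total_steps)      # 137
    warmdown_steps = round(warmdown_frac * total_steps)  # 2473
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step <= total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps

for step in (2400, 2600):
    print(step, f"{lr_multiplier(step):.3f}")
# 2400 0.141
# 2600 0.060
```

Under these assumptions the multiplier at step 2400 is about 0.14 of peak, in line with the "below ~0.15" figure quoted above.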
Evaluation (honest)
This is a 1.3 B research preview, not a chat product. Below are the actual outputs we observed on a diverse subjective probe. We report Pass / Partial / Fail verdicts so readers can calibrate expectations.
| Probe type | # prompts | Pass | Partial | Fail | Representative observation |
|---|---|---|---|---|---|
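| Base-style continuation (`<\|bos\|>` + raw text) | 4 | 4 | 0 | 0 | Strong, coherent continuations on all four scientific-knowledge probes (Sun, photosynthesis, quantum mechanics, language-learning advice). |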
| Factual QA (chat format) | 8 | 5 | 2 | 1 | ✅ Canberra, seven continents, Au = atomic number 79, DNA structure, water boils at 100 °C / 212 °F; ⚠ Leonardo da Vinci (correct author, fabricated biographical detail), WWII year (gave 1941 — actual 1945); ❌ Pride and Prejudice author not extracted. |
| Reasoning (commonsense / spatial / syllogism) | 5 | 0 | 0 | 5 | Greedy loops; no correct conclusions. Capacity-limited at 1.3 B. |
| Arithmetic / word problems | 6 | 0 | 2 | 4 | Numerals render correctly (post tokenizer fix), but answers are wrong (17 + 25 → 100, 12 × 8 → 120). GSM8K in the data mix does not close this gap at 1.3 B. |
| Code | 5 | 1 | 2 | 2 | ✅ def sum_list(lst): return sum(lst); ⚠ string reverse syntactically near-miss; ❌ len('hello') not understood. |
| Instruction following (format constraints) | 5 | 0 | 0 | 5 | "Exactly 3 bullets" → 4+, "one sentence" → paragraph, "yes/no" → essay. |
| Creative writing (sampling) | 3 | 0 | 0 | 3 | Writes meta-commentary instead of the requested artifact. |
| Edge cases (hi / Hi / HI / empty) | 6 | 0 | 0 | 6 | All collapse to "The issue with …". Consistent across casings, so case-augmentation worked — but the fallback content is irrelevant. |
| Multi-turn context tracking | 2 | 0 | 0 | 2 | Memory test: "What is my name?" → "Alex! Alex! Alex!" loop. |
| Sampling diversity (temp=0.8, 3 samples) | 6 samples | 4 | 2 | 0 | Modest diversity with occasional drift. |
Interpretation
- Base preservation (4/4) is the primary finding this artifact contributes. Dual-head SFT did not damage the base's knowledge representation, consistent with the paper's §6.5 `Δlogit = 0.0000` result measured over 12 training checkpoints.
- Chat-format weakness is a capacity story, not an architecture story. The specialist head (107 M) plus 2 new layers cannot overcome the 1.3 B backbone's limits on arithmetic, multi-step reasoning, or strict formatting. 1.3 B-class chat SFTs in the literature (Phi-1.5, SmolLM-1.7B, TinyLlama-1.1B) exhibit the same profile; emergent reasoning in chat style typically requires 7 B+ (scaling laws).
- Greedy loops are a sampling-strategy issue. Using `repetition_penalty=1.15` or nucleus sampling (`do_sample=True, top_p=0.9, temperature=0.8`) visibly reduces loops in our probe. The README example above uses greedy for minimal surprise; production use should pick different defaults.
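To see why a repetition penalty breaks greedy loops, here is a pure-Python sketch of the rule behind the `repetition_penalty` option in `transformers`: logits of already-generated tokens are divided by the penalty when positive and multiplied by it when negative. The toy logits and the helper names are invented for illustration:

```python
from typing import List, Sequence

def apply_repetition_penalty(logits: Sequence[float], generated: Sequence[int],
                             penalty: float = 1.15) -> List[float]:
    """Down-weight every token id that already appears in `generated`."""
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] * penalty if out[tok] < 0 else out[tok] / penalty
    return out

def argmax(values: Sequence[float]) -> int:
    """Index of the largest value, i.e. the greedy choice."""
    return max(range(len(values)), key=values.__getitem__)

logits = [2.0, 1.9, -1.0]  # toy scores over a 3-token vocabulary
history = [0, 0, 0]        # token 0 has been emitted repeatedly

print(argmax(logits))                                     # 0: greedy repeats token 0
print(argmax(apply_repetition_penalty(logits, history)))  # 1: 2.0 / 1.15 < 1.9
```

The penalty only helps when a competing token is close in score, which is one reason sampling is listed as the alternative.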
Stacking further stages
base_model_name_or_path supports chaining: point a new stage repo's config at this repo and the cascade loader will resolve base → this SFT stage → new stage transparently. Each stage adds its own lm_head_stage and, at load time, the loader folds ancestor stages' lm_head_stage into the effective lm_head_base so all specialists compose additively into the final projection (logits = lm_head_base + Σ lm_head_stage_k). See the paper for the formal account of this construction. Planned sibling stages (math, code) will publish at separate repos and may be stacked on top of this one or attached directly to the base.
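Because each head is a linear projection, folding ancestor heads into an effective base head amounts to summing weight matrices. A toy sketch of that equivalence; `fold_heads` and the `w_math` stage are hypothetical, and the real loader may organize this differently:

```python
from typing import List

Matrix = List[List[float]]

def linear(weight: Matrix, x: List[float]) -> List[float]:
    """Bias-free projection: one output logit per row of `weight`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def fold_heads(heads: List[Matrix]) -> Matrix:
    """Element-wise sum of same-shaped head matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*heads)]

x = [1.0, 2.0]                        # toy hidden state
w_base = [[1.0, 0.0], [0.0, 1.0]]     # base head
w_chat = [[0.5, 0.0], [0.0, -0.5]]    # this stage (toy values)
w_math = [[0.0, 0.25], [0.25, 0.0]]   # hypothetical later stage

# logits = lm_head_base + sum over stages of lm_head_stage_k
summed = [sum(vals) for vals in zip(linear(w_base, x), linear(w_chat, x), linear(w_math, x))]
folded = linear(fold_heads([w_base, w_chat, w_math]), x)
print(summed, folded)  # [2.0, 1.25] [2.0, 1.25]
```

Folding keeps the forward pass at a single projection regardless of chain depth, at the cost of no longer being able to switch an individual ancestor head off after load.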
Files
| File | Purpose |
|---|---|
| `config.json` | `base_model_name_or_path`, `new_layers`, `frozen_layers`, `dual_head`, full `stage_training` block |
| `delta.safetensors` | 16-tensor stage delta (bf16, 300 MB) |
| `modeling_cognica_poe.py` | Cascade loader + `_GPT` with dual-head forward (same code as the base repo) |
| `configuration_cognica_poe.py` | `CognicaPoEConfig` with stage fields |
| `tokenization_cognica_poe.py` | Byte-level tokenizer (unchanged from base) |
| `tokenizer.pkl`, `tokenizer_config.json`, `special_tokens_map.json`, `token_bytes.pt` | Tokenizer assets (unchanged from base) |
| `convert_stage_delta.py` | Converts a nanochat `save_stage_delta` `.pt` file into `delta.safetensors` |
Limitations — explicit list
- Research preview at 1.3 B. Chat behavior is bounded by capacity. Do not deploy in production chat surfaces.
- Arithmetic and reasoning are unreliable at this parameter scale regardless of SFT data. GSM8K in training does not close this.
- Greedy decoding loops. Use `repetition_penalty >= 1.1` or sampling.
- Strict format constraints are not respected (numbered lists, length limits, yes/no).
- Multi-turn context is fragile beyond 2–3 turns.
- No RLHF / preference tuning / safety tuning. Output may be factually wrong or include fabricated biographical details (we observed this on art-history prompts).
- Assistant-only loss: validation bpb `2.0610` is measured on assistant tokens only; do not compare to the base's full-text `train_val_bpb = 0.7209`.
Citation
If you use this model, please cite the companion paper and the nanochat toolkit:
```bibtex
@article{jeong2026poe,
  title       = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
  author      = {Jeong, Jaepil},
  year        = {2026},
  institution = {Cognica, Inc.},
  doi         = {10.5281/zenodo.19547653},
  url         = {https://doi.org/10.5281/zenodo.19547653}
}

@misc{cognica-poe-stage-chat-2026,
  title        = {Cognica-PoE-v1.0-1.3B-stage-chat: Dual-head SFT specialist over a PoE base (research preview)},
  author       = {{Cognica, Inc.}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-chat}}
}
```
License
Apache 2.0 — see LICENSE and NOTICE. Same terms as the base model. Training datasets (SmolTalk, MMLU, GSM8K) each carry their own licenses (Apache 2.0, MIT, MIT respectively) and are acknowledged in NOTICE.