
Cognica-PoE-v1.0-1.3B-stage-chat

Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)

Not a production chat model. This release is empirical validation of PoE's dual-head specialist construction — paper §6.5.

A 107 M-parameter SFT specialist stage trained on top of the frozen Cognica-PoE-v1.0-1.3B-base. At this parameter scale (1.3 B base + 107 M specialist, d26 total), chat behavior is bounded by capacity, not by the specialist construction. What this artifact does demonstrate — reliably — is:

Base preservation is bit-identical. The base lm_head and layers 0–23 are frozen during SFT. Base-style continuation (<|bos|> + raw text) remains strong and coherent after specialist training — empirically verified on four scientific-knowledge probes (Sun, photosynthesis, quantum mechanics, language-learning advice). This is the live-model evidence for the paper's Δlogit = 0.0000 claim across 12 checkpoints.
Post-hoc specialist composition works. The specialist delta (300 MB, 16 tensors) can be loaded on demand and composed additively with the base (logits = lm_head_base(x) + lm_head_stage(x)). The loader supports arbitrary-depth chaining, so this stage is a building block for math / code / domain specialists that stack on top.
Delta-only distribution scales. The shipped stage is a 300 MB delta, not a 2.6 GB full-model copy. The cascade loader pulls the base once and assembles the d26 model in memory.

What this artifact does not demonstrate, and should not be compared against:

Instruction following in the GPT-3.5 / GPT-4 sense.
Arithmetic or multi-step reasoning. At 1.3 B, these are capacity-limited regardless of SFT data mix; including GSM8K train does not change this.
Multi-turn context tracking across many turns.
Strict format constraints ("answer yes or no", "exactly 3 bullets").

If you want a production chat model, this is not it. If you are evaluating PoE's architectural claims — whether a frozen-base + trainable-specialist construction can produce a composable SFT delta without degrading base capabilities — this is the artifact to look at.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "cognica/Cognica-PoE-v1.0-1.3B-stage-chat"

# trust_remote_code=True lets Transformers run the custom tokenizer and
# cascade-loader code shipped in this repo.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Chat-style inference uses the SmolTalk special-token protocol:
# <|bos|><|user_start|> USER <|user_end|><|assistant_start|> ASSISTANT <|assistant_end|>
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762
user_msg = "What is the boiling point of water in Fahrenheit?"

# Wrap the user message by hand and leave the assistant turn open for generation.
ids = [BOS, USR_S] + tokenizer.encode(user_msg) + [USR_E, ASS_S]
input_ids = torch.tensor([ids], device=model.device)

# Greedy decoding; decode only the newly generated tokens.
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=80, do_sample=False)
print(tokenizer.decode(out[0, len(ids):]))
# -> "The boiling point of water is 100°C (212°F). At sea level, ..."

Base-style prompts (just <|bos|> + raw text, no chat wrapper) work better than the chat format for open-ended continuation — see the evaluation table below. For greedy decoding prefer a repetition penalty (repetition_penalty=1.15); without it most prompts loop.
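The two prompt layouts described above can be wrapped in a small helper. This is a minimal sketch, not part of the shipped code: `build_prompt_ids` is a hypothetical name, the token IDs are the ones from the quick-start example, and it assumes that `encode` returns plain token IDs without adding special tokens of its own (worth checking against the actual tokenizer).

```python
from typing import Callable, List

# Special-token IDs copied from the quick-start example.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762


def build_prompt_ids(
    encode: Callable[[str], List[int]], text: str, chat: bool = True
) -> List[int]:
    """Return prompt token IDs in chat format or base-style format.

    `encode` is any text -> token-ID function, e.g. `tokenizer.encode`.
    Assumption: it does not prepend <|bos|> or other special tokens itself.
    """
    body = list(encode(text))
    if chat:
        # <|bos|><|user_start|> USER <|user_end|><|assistant_start|>
        return [BOS, USR_S] + body + [USR_E, ASS_S]
    # Base-style continuation: <|bos|> + raw text, no chat wrapper.
    return [BOS] + body


# Stand-in encoder (UTF-8 bytes) so the sketch runs without downloading the model.
def toy_encode(s: str) -> List[int]:
    return list(s.encode("utf-8"))


print(build_prompt_ids(toy_encode, "Hi", chat=True))
print(build_prompt_ids(toy_encode, "Hi", chat=False))
```

With the real tokenizer, `build_prompt_ids(tokenizer.encode, user_msg)` would replace the hand-built `ids` list in the quick-start example.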

The cascade loader does the following when you call from_pretrained:

1. Reads base_model_name_or_path from config.json and recursively loads the parent model.
2. Instantiates an extended architecture at num_hidden_layers=26 (base 24 + this stage's 2 new layers).
3. Copies the parent's weights into layers 0-23, layers 0-23 KV-embeds, and the base lm_head.
4. Loads delta.safetensors on top. This fills the 2 new layer blocks, the additive lm_head_stage, the warm-initialized wte, and resid_lambdas / x0_lambdas at their new length 26.
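The loader steps above can be sketched with plain dictionaries. This illustrates the control flow only and is not the code in `modeling_cognica_poe.py`: `REPOS`, `load_cascade`, the tensor key names, and the toy repo IDs are all hypothetical, and tensors are replaced by strings naming the repo that supplied them. Instantiating the deeper architecture is omitted.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for model repos: config plus shipped weights.
REPOS = {
    "toy/base": {
        "config": {"base_model_name_or_path": None, "num_hidden_layers": 24},
        "weights": {"layers.0": "base", "layers.23": "base", "lm_head": "base"},
    },
    "toy/stage-chat": {
        "config": {"base_model_name_or_path": "toy/base", "num_hidden_layers": 26},
        # Delta only: new blocks, additive head, and tensors whose shape or init changed.
        "weights": {
            "layers.24": "chat",
            "layers.25": "chat",
            "lm_head_stage": "chat",
            "wte": "chat",
            "resid_lambdas": "chat",
        },
    },
}


def load_cascade(repo_id: str) -> Dict[str, str]:
    """Recursively assemble a state dict: parent weights first, delta on top."""
    repo = REPOS[repo_id]
    parent_id: Optional[str] = repo["config"]["base_model_name_or_path"]
    # Load the parent first (if any) and start from a copy of its weights.
    state = load_cascade(parent_id) if parent_id else {}
    # Overlay this repo's own tensors; a delta may overwrite keys it re-ships.
    state.update(repo["weights"])
    return state


state = load_cascade("toy/stage-chat")
print(sorted(state))
```

Because the recursion bottoms out at the repo whose `base_model_name_or_path` is empty, the same function handles a chain of any depth.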

Architecture

This stage is trained per paper Section 6.5 (dual-head SFT) and Section 8.8 (elastic depth):

Component	Detail
Parent	`cognica/Cognica-PoE-v1.0-1.3B-base` (PoE α=0.0, d24, step 26430, val bpb 0.7209)
New transformer layers	2 appended at positions 24 and 25 (d24 → d26)
Frozen layers	24 (all base layers)
Dual-head	Yes — additive specialist `lm_head_stage` (shape 32768 × 1536, zero-init at training start)
Final projection	`logits = lm_head_base(x) + lm_head_stage(x)`
Total params	1,491,076,878 (~1.49 B)
Trainable params at training	106,954,804 (~107 M, 7.2 %)
Shipped delta	16 tensors, 157,286,452 params, 300 MB (bf16 safetensors)
VE pattern	Preserved from base — 12 value-embeds at layers [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23]; new layers carry no VE
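The size figures in the table can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the table and assumes 2 bytes per bf16 parameter and that "300 MB" means MiB (2^20 bytes).

```python
TOTAL_PARAMS = 1_491_076_878
TRAINABLE_PARAMS = 106_954_804
DELTA_PARAMS = 157_286_452
VOCAB_ROWS, HIDDEN = 32_768, 1_536  # lm_head_stage shape from the table
BYTES_PER_BF16 = 2                  # assumption: every shipped tensor is bf16

trainable_pct = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS
delta_mib = DELTA_PARAMS * BYTES_PER_BF16 / 2**20
head_params = VOCAB_ROWS * HIDDEN
extra_in_delta = DELTA_PARAMS - TRAINABLE_PARAMS

print(f"trainable share:   {trainable_pct:.1f} %")
print(f"delta size:        {delta_mib:.0f} MiB")
print(f"32768 x 1536 head: {head_params:,} params")
print(f"delta - trainable: {extra_in_delta:,} params")
```

The delta carries exactly one 32768 × 1536 tensor's worth of parameters more than the trainable count. That would be consistent with the delta also shipping the warm-initialized wte mentioned in the loader description, although the card does not state this explicitly.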

Why dual-head and not a full fine-tune? Freezing the base lm_head and training a zero-init additive specialist lets the stage specialize without touching the base's next-token prediction of general web text. At training step 0 the stage contributes nothing (logits identical to the base); by the final step it has learned chat-specific biases and token-level corrections on top of the frozen base — a controllable, resettable delta. The key empirical claim is that base capability is preserved exactly (Δlogit = 0.0000 across checkpoints) — see the evaluation table below for a live demonstration that base-style continuation remains strong.
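The zero-init argument above can be illustrated with a dependency-free toy. This is a sketch under stated assumptions, not the model's forward pass: both heads are treated as bias-free linear maps applied to the same hidden state `x`, the matrices are tiny plain lists, and `matvec` / `dual_head_logits` are hypothetical helper names. How the real model routes hidden states through the two new layers is not covered here.

```python
from typing import List

Matrix = List[List[float]]


def matvec(weight: Matrix, x: List[float]) -> List[float]:
    """Plain-Python matrix-vector product (one logit per weight row)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def dual_head_logits(w_base: Matrix, w_stage: Matrix, x: List[float]) -> List[float]:
    """logits = lm_head_base(x) + lm_head_stage(x)."""
    base = matvec(w_base, x)
    stage = matvec(w_stage, x)
    return [b + s for b, s in zip(base, stage)]


x = [0.5, -1.0, 2.0]                          # toy hidden state
w_base = [[1.0, 0.0, 0.5], [0.0, 2.0, -1.0]]  # frozen: never updated
w_stage = [[0.0] * 3 for _ in range(2)]       # zero-init specialist head

base_logits = matvec(w_base, x)
step0_logits = dual_head_logits(w_base, w_stage, x)
print(step0_logits == base_logits)  # stage contributes nothing at step 0

w_stage[0][2] = 0.25                          # stand-in for a training update
trained_logits = dual_head_logits(w_base, w_stage, x)
correction = [t - b for t, b in zip(trained_logits, base_logits)]
print(correction)                             # what the specialist adds
```

Resetting `w_stage` to zeros recovers the base logits again, which is the sense in which the delta is resettable.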

Training

Setting	Value
Objective	Cross-entropy over assistant turns only (user turns and padding masked out)
Data	`HuggingFaceTB/smoltalk` (1.04 M convs) + `cais/mmlu` auxiliary_train × 3 epochs (297 k) + `openai/gsm8k` main/train × 4 epochs (30 k) → 1.37 M total, subsampled to 500 k (seed 42, deterministic)
Held-out val	512 conversations (same shuffle, outside the train subset)
Case augmentation	First-user greetings duplicated with case variants (Hi / hi / HI / Hiya / …) → 500 000 → 513 210 train conversations
Sequence length	2 048
Per-GPU batch	8 × 2 048
World size	8 (2 nodes × 4 × A100 80 GB)
Tokens per step	131 072
Steps	2 748 (≈ 360 M tokens)
tok/param	3.37 (107 M trainable)
Optimizer	MuonAdamW with per-group LR scaling
`lm_head_stage` LR	1.00 × 10⁻⁴, weight decay 0.1
AdamW master scaling	0.7071 (paper Section 6.5.2)
Warmup / warmdown	5 % / 90 %
Eval / save cadence	every 200 / 200 steps
Best checkpoint shipped	step 2 600, val bpb 2.0610

All numbers are also in config.json → stage_training for programmatic access.
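The throughput rows of the training table follow from one another. The sketch below recomputes them from the table's own values, assuming no gradient accumulation (the per-GPU batch times world size already matches the stated tokens per step).

```python
PER_GPU_BATCH = 8           # sequences per GPU
SEQ_LEN = 2_048             # tokens per sequence
WORLD_SIZE = 8              # 2 nodes x 4 GPUs
STEPS = 2_748
TRAINABLE_PARAMS = 106_954_804

tokens_per_step = PER_GPU_BATCH * SEQ_LEN * WORLD_SIZE
total_tokens = tokens_per_step * STEPS
tokens_per_param = total_tokens / TRAINABLE_PARAMS

# Rounded conversation counts from the Data row, before subsampling to 500 k.
mix_total = 1_040_000 + 297_000 + 30_000

print(f"tokens per step:  {tokens_per_step:,}")
print(f"total tokens:     {total_tokens:,}")
print(f"tokens per param: {tokens_per_param:.2f}")
print(f"data mix total:   {mix_total:,}")
```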

Validation bpb trajectory

Step	val bpb	Δ
200	2.2110	—
400	2.1856	-0.0254
600	2.1670	-0.0186
800	2.1511	-0.0159
1000	2.1329	-0.0182
1200	2.1190	-0.0140
1400	2.1071	-0.0119
1600	2.0981	-0.0089
1800	2.0930	-0.0052
2000	2.0922	-0.0007
2200	2.0917	-0.0005
2400	2.0744	-0.0173
2600	2.0610	-0.0134 ← best, shipped
2748	(final, no eval at this step)	—

Assistant-only bpb. Warmdown (90 %) drives steady convergence; the acceleration at step 2400–2600 coincides with the learning rate decaying below ~0.15 of peak.
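The remark about the learning rate near step 2400 can be checked against a simple schedule. The exact shape is an assumption here: the sketch uses a linear warmup over the first 5 % of steps, a flat plateau, and a linear warmdown to zero over the last 90 %, with `lr_multiplier` as a hypothetical helper name. The card gives only the two percentages.

```python
def lr_multiplier(
    step: int,
    total_steps: int = 2_748,
    warmup_frac: float = 0.05,
    warmdown_frac: float = 0.90,
    final_frac: float = 0.0,  # assumption: decay all the way to zero
) -> float:
    """Fraction of peak learning rate at `step` for a trapezoidal schedule."""
    warmup_steps = round(warmup_frac * total_steps)
    warmdown_steps = round(warmdown_frac * total_steps)
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step <= total_steps - warmdown_steps:
        return 1.0
    remaining = (total_steps - step) / warmdown_steps
    return remaining + (1.0 - remaining) * final_frac


for step in (0, 200, 1400, 2400, 2600):
    print(step, f"{lr_multiplier(step):.3f}")
```

Under these assumptions the multiplier is roughly 0.14 at step 2400 and 0.06 at step 2600, which agrees with the "below ~0.15 of peak" observation.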

Evaluation (honest)

This is a 1.3 B research preview, not a chat product. The table below summarizes what we observed on a small, subjective probe set covering ten probe types. We report Pass / Partial / Fail verdicts so readers can calibrate expectations.

Probe type	# prompts	Pass	Partial	Fail	Representative observation
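Base-style continuation (`<|bos|>` + raw text)	4	4	0	0	Coherent continuation on all four probes (Sun, photosynthesis, quantum mechanics, language-learning advice).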
Factual QA (chat format)	8	5	2	1	✅ Canberra, seven continents, Au = atomic number 79, DNA structure, water boils at 100 °C / 212 °F; ⚠ Leonardo da Vinci (correct author, fabricated biographical detail), WWII year (gave 1941 — actual 1945); ❌ Pride and Prejudice author not extracted.
Reasoning (commonsense / spatial / syllogism)	5	0	0	5	Greedy loops; no correct conclusions. Capacity-limited at 1.3 B.
Arithmetic / word problems	6	0	2	4	Numerals render correctly (post tokenizer fix), but answers are wrong (17 + 25 → 100, 12 × 8 → 120). GSM8K in the data mix does not close this gap at 1.3 B.
Code	5	1	2	2	✅ `def sum_list(lst): return sum(lst)`; ⚠ string reverse is a syntactic near-miss; ❌ `len('hello')` not understood.
Instruction following (format constraints)	5	0	0	5	"Exactly 3 bullets" → 4+, "one sentence" → paragraph, "yes/no" → essay.
Creative writing (sampling)	3	0	0	3	Writes meta-commentary instead of the requested artifact.
Edge cases (hi / Hi / HI / empty)	6	0	0	6	All collapse to "The issue with …". Consistent across casings, so case-augmentation worked — but the fallback content is irrelevant.
Multi-turn context tracking	2	0	0	2	Memory test: "What is my name?" → "Alex! Alex! Alex!" loop.
Sampling diversity (temp=0.8, 3 samples)	6 samples	4	2	0	Modest diversity with occasional drift.
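The verdict counts can be tallied and sanity-checked in a few lines. The rows below are copied from the table; the zero fail count for base-style continuation is implied by 4 passes out of 4 prompts, and the last row counts samples, not prompts.

```python
# (probe type, items, pass, partial, fail)
ROWS = [
    ("Base-style continuation", 4, 4, 0, 0),
    ("Factual QA", 8, 5, 2, 1),
    ("Reasoning", 5, 0, 0, 5),
    ("Arithmetic / word problems", 6, 0, 2, 4),
    ("Code", 5, 1, 2, 2),
    ("Instruction following", 5, 0, 0, 5),
    ("Creative writing", 3, 0, 0, 3),
    ("Edge cases", 6, 0, 0, 6),
    ("Multi-turn context tracking", 2, 0, 0, 2),
    ("Sampling diversity", 6, 4, 2, 0),
]

# Every row's verdicts should add up to its item count.
for name, items, passed, partial, failed in ROWS:
    assert passed + partial + failed == items, name

# Column sums, skipping the name column.
total_items, total_pass, total_partial, total_fail = (
    sum(col) for col in list(zip(*ROWS))[1:]
)
print(total_items, total_pass, total_partial, total_fail)
print(f"overall pass rate: {total_pass / total_items:.0%}")
```

Most of the passes come from the base-style, factual QA, and sampling rows, which matches the interpretation that chat-format weakness dominates the failures.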

Interpretation

Base preservation (4/4) is the primary finding this artifact contributes. Dual-head SFT did not damage the base's knowledge representation, consistent with the paper's §6.5 Δlogit = 0.0000 result measured over 12 training checkpoints.
Chat-format weakness is a capacity story, not an architecture story. The specialist head (107 M) plus 2 new layers cannot overcome the 1.3 B backbone's limits on arithmetic, multi-step reasoning, or strict formatting. 1.3 B-class chat SFTs in the literature (Phi-1.5, SmolLM-1.7B, TinyLlama-1.1B) exhibit the same profile — emergent reasoning in chat style typically requires 7 B+ (scaling laws).
Greedy loops are a sampling-strategy issue. Using repetition_penalty=1.15 or nucleus sampling (do_sample=True, top_p=0.9, temperature=0.8) visibly reduces loops in our probe. The quick-start example above uses greedy decoding so that its output is deterministic; for anything beyond a smoke test, prefer one of these settings.
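To see why a repetition penalty breaks greedy loops, here is a dependency-free sketch of the multiplicative rule from the CTRL paper, which is the form the Transformers `repetition_penalty` option is documented to follow. The function names and toy logits are illustrative only.

```python
from typing import List, Sequence


def apply_repetition_penalty(
    logits: Sequence[float], generated_ids: Sequence[int], penalty: float = 1.15
) -> List[float]:
    """Down-weight every token that already appears in the generated sequence."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive logits shrink; negative logits move further from zero.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def greedy_pick(logits: Sequence[float]) -> int:
    """Index of the largest logit, i.e. the greedy next token."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.9, -1.0]   # token 0 narrowly beats token 1
history = [0, 0, 2]         # token 0 has already been emitted twice

print(greedy_pick(logits))                                     # keeps repeating token 0
print(greedy_pick(apply_repetition_penalty(logits, history)))  # switches to token 1
```

In practice the same effect comes from passing `repetition_penalty=1.15` to `model.generate`, or from switching to sampling with `do_sample=True, top_p=0.9, temperature=0.8`, as noted above.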

Stacking further stages

base_model_name_or_path supports chaining: point a new stage repo's config at this repo and the cascade loader will resolve base → this SFT stage → new stage transparently. Each stage adds its own lm_head_stage and, at load time, the loader folds ancestor stages' lm_head_stage into the effective lm_head_base so all specialists compose additively into the final projection (logits = lm_head_base + Σ lm_head_stage_k). See the paper for the formal account of this construction. Planned sibling stages (math, code) will publish at separate repos and may be stacked on top of this one or attached directly to the base.
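Folding ancestor heads into an effective base head works because linear maps add. The sketch below shows this on toy matrices. It assumes bias-free linear heads that all read the same hidden state, and `fold_stage_heads` is a hypothetical name, not the loader's API.

```python
from typing import List

Matrix = List[List[float]]


def matvec(weight: Matrix, x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def fold_stage_heads(w_base: Matrix, stage_heads: List[Matrix]) -> Matrix:
    """Return W_base + sum_k W_stage_k as a single effective head."""
    folded = [row[:] for row in w_base]  # copy so the base stays untouched
    for head in stage_heads:
        for i, row in enumerate(head):
            for j, value in enumerate(row):
                folded[i][j] += value
    return folded


# Values are chosen to be exactly representable so equality checks are exact.
x = [1.0, 2.0]
w_base = [[1.0, 0.5], [0.0, -1.0]]
w_chat = [[0.25, 0.0], [0.0, 0.5]]   # e.g. this SFT stage
w_next = [[0.0, 0.25], [1.0, 0.0]]   # e.g. a later stage stacked on top

separate = [
    b + c + n
    for b, c, n in zip(matvec(w_base, x), matvec(w_chat, x), matvec(w_next, x))
]
folded = matvec(fold_stage_heads(w_base, [w_chat, w_next]), x)
print(separate, folded)
```

Folding keeps inference cost at one head projection per step regardless of how many ancestor stages are in the chain.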

Files

File	Purpose
`config.json`	`base_model_name_or_path`, `new_layers`, `frozen_layers`, `dual_head`, full `stage_training` block
`delta.safetensors`	16-tensor stage delta (bf16, 300 MB)
`modeling_cognica_poe.py`	Cascade loader + `_GPT` with dual-head forward (same code as the base repo)
`configuration_cognica_poe.py`	`CognicaPoEConfig` with stage fields
`tokenization_cognica_poe.py`	Byte-level tokenizer (unchanged from base)
`tokenizer.pkl`, `tokenizer_config.json`, `special_tokens_map.json`, `token_bytes.pt`	Tokenizer assets (unchanged from base)
`convert_stage_delta.py`	Converts a nanochat `save_stage_delta` `.pt` file into `delta.safetensors`

Limitations — explicit list

Research preview at 1.3 B. Chat behavior is bounded by capacity. Do not deploy in production chat surfaces.
Arithmetic and reasoning are unreliable at this parameter scale regardless of SFT data. GSM8K in training does not close this.
Greedy decoding loops. Use repetition_penalty >= 1.1 or sampling.
Strict format constraints are not respected (numbered lists, length limits, yes/no).
Multi-turn context is fragile beyond 2–3 turns.
No RLHF / preference tuning / safety tuning. Output may be factually wrong or include fabricated biographical details (we observed this on art-history prompts).
Assistant-only loss — validation bpb 2.0610 is measured on assistant tokens only; do not compare to the base's full-text train_val_bpb = 0.7209.
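The reason the two bpb figures are not comparable is easiest to see from how a masked bits-per-byte number is formed. The following is a sketch under assumptions, not the evaluation code used for this model: it takes per-token cross-entropy in nats and per-token byte lengths, keeps only assistant positions, and normalizes by the bytes of those positions. `masked_bpb` and the toy values are hypothetical.

```python
import math
from typing import Sequence


def masked_bpb(
    token_nats: Sequence[float],
    token_bytes: Sequence[int],
    is_assistant: Sequence[bool],
) -> float:
    """Bits per byte over the positions where `is_assistant` is True."""
    nats = sum(n for n, keep in zip(token_nats, is_assistant) if keep)
    n_bytes = sum(b for b, keep in zip(token_bytes, is_assistant) if keep)
    if n_bytes == 0:
        raise ValueError("mask selects no bytes")
    # Convert nats to bits, then normalize by the number of bytes scored.
    return nats / (math.log(2) * n_bytes)


# Two user tokens (masked out) followed by two assistant tokens.
token_nats = [3.0, 2.0, 4 * math.log(2), 2 * math.log(2)]
token_bytes = [4, 3, 2, 2]
mask = [False, False, True, True]

print(f"assistant-only bpb: {masked_bpb(token_nats, token_bytes, mask):.3f}")
print(f"full-text bpb:      {masked_bpb(token_nats, token_bytes, [True] * 4):.3f}")
```

The two calls score different token sets and divide by different byte counts, so the resulting numbers sit on different scales even for the same model.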

Citation

If you use this model, please cite the companion paper and this model release:

@article{jeong2026poe,
 title = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
 author = {Jeong, Jaepil},
 year = {2026},
 institution = {Cognica, Inc.},
 doi = {10.5281/zenodo.19547653},
 url = {https://doi.org/10.5281/zenodo.19547653}
}

@misc{cognica-poe-stage-chat-2026,
 title = {Cognica-PoE-v1.0-1.3B-stage-chat: Dual-head SFT specialist over a PoE base (research preview)},
 author = {{Cognica, Inc.}},
 year = {2026},
 howpublished = {\url{https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-chat}}
}

License

Apache 2.0 — see LICENSE and NOTICE. Same terms as the base model. Training datasets (SmolTalk, MMLU, GSM8K) each carry their own licenses (Apache 2.0, MIT, MIT respectively) and are acknowledged in NOTICE.
