
Cognica-PoE-v1.0-1.3B-stage-tool

Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)

Research preview: function-call emission at 1.3B scale. This release empirically validates PoE's post-hoc specialist construction at 4-layer depth on the structured-output task of JSON function calling, extending §6.5 of the paper.

A 164 M-parameter tool/function-calling SFT specialist stage trained directly on the frozen Cognica-PoE-v1.0-1.3B-base. The stage is a sibling of cognica/Cognica-PoE-v1.0-1.3B-stage-chat, -stage-math, and -stage-code — all four branch from the same base.

TL;DR

164 M trainable params (4 new transformer layers + additive lm_head_stage) on top of 1.3 B frozen base
Training data: glaive-function-calling-v2 (x4 epochs) + Salesforce/xlam-function-calling-60k (x3 epochs), 947 k case-augmented conversations, 2 619 optimizer steps
Best val bpb 1.4288 @ step 2600 (vs a glaive-only ablation that plateaued at 1.883)
Emits <functioncall> {...} syntax when the prompt format matches training (multi-line indented JSON)
Full weights fit in a 408 MB delta.safetensors; loads via cascade on top of the base repo

CRITICAL: prompt format

At 1.3 B scale this model is a template matcher, not a prompt-format generalizer. Function-call emission only fires reliably when the system prompt uses the exact indentation / whitespace of the glaive training distribution: multi-line JSON with 4-space indent, produced by json.dumps(fn, indent=4). Compact single-line JSON crosses the distribution boundary and the model falls back to natural-language answers even when a function is clearly applicable.

Working template (produces <functioncall> {...} output):

import json

weather_fn = {
    "name": "get_weather",
    "description": "Get current weather by city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

system = (
    "You are a helpful assistant with access to the following functions. "
    "Use them if required -\n"
    + json.dumps(weather_fn, indent=4)
)
user_query = "Can you tell me the current weather in Seoul?"
content = f"{system}\n\n{user_query}"

Breaking template (produces natural-language fallback):

# Do NOT collapse the JSON to a single line — the model will not emit a function call.
content = "... {\"name\": \"get_weather\", \"parameters\": {...}} ... Can you tell me the current weather in Seoul?"

This is a real research finding about small-scale SFT specialists and is called out explicitly in the paper's limitations section.
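One way to avoid crossing the format boundary by accident is to route every prompt through a single helper that always renders the schema with indent=4. The sketch below is illustrative only: build_content and looks_training_formatted are hypothetical names, and the guard is a cheap heuristic rather than a check the model card prescribes.

```python
import json

# Prefix copied from the working template above; the helpers are hypothetical.
SYSTEM_PREFIX = (
    "You are a helpful assistant with access to the following functions. "
    "Use them if required -\n"
)


def build_content(fn_schema: dict, user_query: str) -> str:
    """Return the user-turn text with the schema rendered as 4-space-indented JSON."""
    rendered = json.dumps(fn_schema, indent=4)
    return f"{SYSTEM_PREFIX}{rendered}\n\n{user_query}"


def looks_training_formatted(content: str) -> bool:
    """Heuristic guard: at least one line must start with a 4-space-indented JSON key."""
    return any(line.startswith('    "') for line in content.splitlines())


schema = {"name": "get_weather", "parameters": {"type": "object"}}
good = build_content(schema, "Can you tell me the current weather in Seoul?")
# Compact single-line JSON, i.e. the breaking variant described above.
bad = SYSTEM_PREFIX + json.dumps(schema) + "\n\nCan you tell me the current weather in Seoul?"
```

Calling looks_training_formatted on a prompt before generation gives an early warning when some upstream code has collapsed the schema to one line.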

Quick start

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "cognica/Cognica-PoE-v1.0-1.3B-stage-tool",
    trust_remote_code=True,
    torch_dtype=torch.float32,
).eval()
tok = AutoTokenizer.from_pretrained(
    "cognica/Cognica-PoE-v1.0-1.3B-stage-tool",
    trust_remote_code=True,
)

# Special-token ids used to frame the turn: BOS, user start, user end, assistant start.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762

weather_fn = {
    "name": "get_weather",
    "description": "Get current weather by city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}
system = (
    "You are a helpful assistant with access to the following functions. "
    "Use them if required -\n" + json.dumps(weather_fn, indent=4)
)
content = f"{system}\n\nCan you tell me the current weather in Seoul?"
ids = [BOS, USR_S] + tok.encode(content) + [USR_E, ASS_S]

out = model.generate(
    torch.tensor([ids]),
    max_new_tokens=80,
    do_sample=False,
    repetition_penalty=1.15,
    pad_token_id=BOS,
)
print(tok.decode(out[0].tolist()[len(ids):]))
# -> '<functioncall> {"name": "get_weather", "arguments": \'{"city": "Seoul", "country": "South Korea"}\'}'

Architecture

Base model: cognica/Cognica-PoE-v1.0-1.3B-base at step 26 430 (frozen)
Frozen layers: 0..23 (24 base transformer blocks, unchanged from base)
New layers: 24..27 (4 transformer blocks, randomly initialized and trained on tool data)
lm_head_stage: additive head; logits = lm_head_base(x) + lm_head_stage(x) at inference
Vocab / tokenizer: identical to base (same token_bytes.pt, same <|bos|>, <|user_*|>, <|assistant_*|>, etc.)
Delta size: 408 MB in bf16 (28 tensors, 213 909 560 params)
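Summing the two heads' logits before the softmax is mathematically the same as multiplying their individual softmax distributions and renormalizing. The toy check below illustrates that identity with made-up three-token logits; it does not use the repo's modeling code, and all names and values in it are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_product(p, q):
    """Elementwise product of two distributions, renormalized to sum to 1."""
    prod = [a * b for a, b in zip(p, q)]
    total = sum(prod)
    return [v / total for v in prod]


# Made-up logits standing in for lm_head_base(x) and lm_head_stage(x).
base_logits = [2.0, 0.5, -1.0]
stage_logits = [-0.5, 1.5, 0.0]

# Additive combination, as in the architecture description.
summed = softmax([b + s for b, s in zip(base_logits, stage_logits)])
# Product of the two heads' own distributions.
product = normalized_product(softmax(base_logits), softmax(stage_logits))
```

The two lists agree up to floating-point error, which is why an additive stage head can be read as one more expert multiplied into the base distribution.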

Training

Setting	Value
Parent	`cognica/Cognica-PoE-v1.0-1.3B-base` step 26430 (frozen)
New layers	4 (indices 24–27, trained)
Optimizer	MuonAdamW hybrid, AdamW scaling 0.707
Matrix LR	0.002, init 0.2x, warmup 5%, warmdown 90%
`lm_head_stage` LR	1e-4, weight decay 0.1
Batch	131 072 tokens/step, seq len 2048
Datasets	glaive-function-calling-v2 (x4) + Salesforce/xlam-function-calling-60k (x3)
Convs	631 840 raw, 947 571 case-augmented
Steps	2 619 (2 600 shipped)
Hardware	2 nodes x 4 A100 80 GB (asia-southeast1-c)
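The batch and step settings imply a token budget that is easy to recompute. The arithmetic below uses only the values from the table; the even split across GPUs is an assumption, since the card does not describe the data-parallel layout.

```python
# Values copied from the training table.
TOKENS_PER_STEP = 131_072
SEQ_LEN = 2_048
TOTAL_STEPS = 2_619
SHIPPED_STEP = 2_600
NUM_GPUS = 2 * 4  # 2 nodes x 4 A100

# Sequences consumed per optimizer step.
sequences_per_step = TOKENS_PER_STEP // SEQ_LEN

# Assumption: sequences are divided evenly across all GPUs.
sequences_per_gpu = sequences_per_step // NUM_GPUS

# Tokens processed over the full run and up to the shipped checkpoint.
tokens_total = TOKENS_PER_STEP * TOTAL_STEPS
tokens_at_shipped = TOKENS_PER_STEP * SHIPPED_STEP

print(sequences_per_step, sequences_per_gpu, tokens_total, tokens_at_shipped)
```

This works out to 64 sequences per step and roughly 343 M tokens over the full 2 619 steps, of which about 341 M had been seen at the shipped step.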

Validation bpb trajectory (every 200 steps)

Step	Val bpb	Δ
200	2.1707	—
400	2.1156	−0.055
600	2.0368	−0.079
800	1.8921	−0.145
1000	1.7406	−0.151
1200	1.6313	−0.109
1400	1.5513	−0.080
1600	1.4979	−0.053
1800	1.4707	−0.027
2000	1.4510	−0.020
2200	1.4384	−0.013
2400	1.4318	−0.007
2600	1.4288	−0.003 (shipped)

Comparison to sibling specialists (4-layer @ same base, same step budget)

Specialist	Val bpb @ shipped step	Notes
stage-math	2.4118 @ s2200	GSM8K + MathInstruct
stage-code	2.3610 @ s2000	CodeAlpaca + Magicoder-Evol-Instruct
stage-tool	1.4288 @ s2600	glaive + xlam (structured JSON)

Structured tool-call data has much lower token entropy than math or code prose, which is why BPB converges so much lower. This does not mean the model is better at reasoning — it means the data distribution is narrower.
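For reference, bits per byte normalizes the summed next-token loss by the number of bytes the target tokens cover. The sketch below is a minimal illustration of that definition, not the evaluation code behind the numbers above; the function name, the convention of skipping zero-byte tokens, and the toy inputs are all assumptions.

```python
import math


def bits_per_byte(token_nll_nats, token_byte_lengths):
    """Total loss in bits divided by total bytes of the scored target tokens.

    token_nll_nats: per-token negative log-likelihood in nats.
    token_byte_lengths: byte length of each target token; 0 marks tokens
    that are skipped (assumed here to be special tokens).
    """
    total_nats = 0.0
    total_bytes = 0
    for nll, n_bytes in zip(token_nll_nats, token_byte_lengths):
        if n_bytes == 0:
            continue
        total_nats += nll
        total_bytes += n_bytes
    return total_nats / (math.log(2) * total_bytes)


# Toy input: two 4-byte tokens, each costing 8 bits, plus one skipped token.
toy_nll = [8 * math.log(2), 8 * math.log(2), 5.0]
toy_bytes = [4, 4, 0]
toy_bpb = bits_per_byte(toy_nll, toy_bytes)
```

Because the metric is averaged per byte, long runs of highly predictable bytes such as JSON keys and punctuation pull it down regardless of how hard the remaining tokens are.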

Evaluation — honest

Working cases (exact training format):

Query	Emitted
"Can you tell me the current weather in Seoul?"	`<functioncall> {"name": "get_weather", "arguments": '{"city": "Seoul", ...}'}`
"What is the temperature in New York?"	`<functioncall> {"name": "get_weather", "arguments": '{"city": "New York", ...}'}`

Failure modes:

Compact / non-indented JSON in the system prompt → natural-language answer, no function call.
No function defined that matches the query → "I'm sorry, as an AI I don't have the capability..." (glaive refusal template learned verbatim).
After emitting the function call, greedy decoding sometimes enters a ]}]}]}] loop because the model has not learned a strong stop signal after the closing brace. Use a stop-token set that includes <|endoftext|> and <|assistant_end|>, or cap max_new_tokens at 80–100.
Argument hallucination: the model may add extra fields (e.g. "country") that were not in the function schema — a known glaive training artifact.

This artifact is a research preview; treat its emissions as schema suggestions, not as validated tool invocations. Always schema-check before dispatching.
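A schema check of the kind recommended above can be small. The sketch below parses the emitted string in the shape shown in the quick-start output (arguments wrapped in single quotes) and reports missing or unexpected arguments. parse_call and schema_problems are hypothetical helpers, and the regular expression assumes the output follows that exact shape.

```python
import json
import re

# Matches: <functioncall> {"name": "...", "arguments": '{...}'}
CALL_RE = re.compile(
    r"""<functioncall>\s*\{\s*"name":\s*"([^"]+)",\s*"arguments":\s*'(\{.*?\})'\s*\}""",
    re.DOTALL,
)


def parse_call(text):
    """Return {'name': ..., 'arguments': {...}} or None if nothing parseable is found."""
    match = CALL_RE.search(text)
    if match is None:
        return None
    try:
        arguments = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    return {"name": match.group(1), "arguments": arguments}


def schema_problems(call, fn_schema):
    """List every way the call disagrees with the declared function schema."""
    problems = []
    if call["name"] != fn_schema["name"]:
        problems.append(f"unknown function: {call['name']}")
    params = fn_schema.get("parameters", {})
    allowed = set(params.get("properties", {}))
    required = set(params.get("required", []))
    given = set(call["arguments"])
    problems += [f"missing required argument: {k}" for k in sorted(required - given)]
    problems += [f"argument not in schema: {k}" for k in sorted(given - allowed)]
    return problems


weather_fn = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
emitted = '<functioncall> {"name": "get_weather", "arguments": \'{"city": "Seoul", "country": "South Korea"}\'}'
call = parse_call(emitted)
problems = schema_problems(call, weather_fn)
```

On the quick-start output this flags the hallucinated "country" field, so the caller can drop it or reject the call before dispatching. The check covers argument names only; value types would need a full JSON Schema validator.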

Stacking further stages

The cascade loader's from_pretrained supports recursive parent-stage chaining: the parent's lm_head_stage folds into lm_head, so a child stage trained on top of this one starts from logits = lm_head_base + lm_head_tool + lm_head_child. This chaining is the mechanism behind the paper's §6.5 claim that "multiple specialists stack". This release is a sibling of the other stages rather than a chained stage, but the loader already supports the chained case.
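If the heads are plain linear maps over the same hidden state, folding a parent head into lm_head is just a weight sum, because (W_base + W_tool) x = W_base x + W_tool x. The toy check below illustrates that identity with tiny made-up matrices. It assumes bias-free heads and is not the loader's actual code.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def add_matrices(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def add_vectors(*vectors):
    """Elementwise sum of any number of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


# Made-up 3-token, 2-dimensional heads (hypothetical values).
W_base = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]
W_tool = [[0.1, 0.2], [0.0, 0.3], [-0.4, 0.0]]
W_child = [[0.05, 0.0], [0.0, -0.1], [0.2, 0.2]]
x = [0.5, -2.0]

# Chained case: the parent head is folded into lm_head, the child head stays additive.
W_folded = add_matrices(W_base, W_tool)
logits_folded = add_vectors(matvec(W_folded, x), matvec(W_child, x))

# Explicit three-way sum from the formula above.
logits_explicit = add_vectors(matvec(W_base, x), matvec(W_tool, x), matvec(W_child, x))
```

Both paths give the same logits up to floating-point error, so folding leaves the child stage with a single additive head however deep the chain gets.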

Files

delta.safetensors — 408 MB bf16 stage delta (28 tensors, 213 909 560 params)
config.json — stage + base config (points to cognica/Cognica-PoE-v1.0-1.3B-base)
modeling_cognica_poe.py, configuration_cognica_poe.py, tokenization_cognica_poe.py — Hugging Face trust_remote_code modules
convert_stage_delta.py — reproduces delta.safetensors from a nanochat stage checkpoint
tokenizer.pkl, tokenizer_config.json, special_tokens_map.json, token_bytes.pt — tokenizer (identical to base)
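The stated delta size can be cross-checked from the parameter count. The arithmetic below assumes 2 bytes per bf16 parameter and ignores the small safetensors header; under that assumption the 408 figure corresponds to binary megabytes (MiB).

```python
# Values from the Files list; bf16 stores each parameter in 2 bytes.
PARAMS = 213_909_560
BYTES_PER_BF16 = 2

payload_bytes = PARAMS * BYTES_PER_BF16
size_mib = payload_bytes / 2**20   # binary megabytes
size_mb = payload_bytes / 10**6    # decimal megabytes

print(f"{payload_bytes} bytes = {size_mib:.1f} MiB = {size_mb:.1f} MB")
```

A file manager that reports decimal megabytes will therefore show roughly 428 MB for the same file.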

Limitations

1.3 B scale. Function-call emission is template-level pattern completion, not general tool-use reasoning. Does not compose well with unseen schemas or ambiguous instructions.
Prompt-format brittle. Requires training-matched multi-line indented JSON; single-line JSON causes silent failure.
Stop-token weak. Greedy decoding enters closing-brace loops; cap max_new_tokens and post-process to extract the first complete JSON object.
Argument hallucination. Glaive training data occasionally invents fields; downstream code must validate arguments against the real schema.
Refusal template baked in. If no function in the system prompt matches the query, the model will emit the glaive refusal string near-verbatim.
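The post-processing step mentioned under "Stop-token weak" can be done with a brace matcher that stops at the first balanced object and discards any trailing closing-brace loop. This is a minimal sketch: first_balanced_object is a hypothetical helper, and it tracks only double-quoted strings, which suits the output shape shown in the quick start.

```python
def first_balanced_object(text):
    """Return the first brace-balanced {...} span in text, or None if none closes.

    Braces inside double-quoted strings are ignored, and backslash escapes
    inside those strings are honored. Single quotes are treated as plain
    characters, so the single-quoted arguments wrapper passes through.
    """
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # generation was cut off before the object closed


# Example output with the trailing loop described in the failure modes.
raw = '<functioncall> {"name": "get_weather", "arguments": \'{"city": "Seoul"}\'} ]}]}]}]'
clean = first_balanced_object(raw)
```

The returned span still needs argument validation against the real schema. A None result means the output was truncated or contained no object and should be treated as a failed call.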

Citation

@article{jeong2026poe,
 title = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
 author = {Jeong, Jaepil},
 year = {2026},
 institution = {Cognica, Inc.},
 doi = {10.5281/zenodo.19547653},
 url = {https://doi.org/10.5281/zenodo.19547653}
}

@misc{cognica-poe-stage-tool-2026,
 title = {Cognica-PoE-v1.0-1.3B-stage-tool: Dual-head SFT function-calling specialist over a PoE base (research preview)},
 author = {{Cognica, Inc.}},
 year = {2026},
 howpublished = {\url{https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-tool}}
}

License

Apache 2.0 — see LICENSE and NOTICE. Same terms as the base model. Training datasets (glaive-function-calling-v2, xlam-function-calling-60k) each carry their own licenses (Apache 2.0, CC-BY-4.0 respectively) and are acknowledged in NOTICE.
