Cognica-PoE-v1.0-1.3B-stage-tool
Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)
Research preview — function-call emission at 1.3B scale. This release is empirical validation of PoE's post-hoc specialist construction at 4-layer depth, applied to the structured-output task of JSON function calling — paper §6.5 extended.
A 164 M-parameter tool/function-calling SFT specialist stage trained directly on the frozen Cognica-PoE-v1.0-1.3B-base. The stage is a sibling of cognica/Cognica-PoE-v1.0-1.3B-stage-chat, -stage-math, and -stage-code — all four branch from the same base.
TL;DR
- 164 M trainable params (4 new transformer layers + additive lm_head_stage) on top of 1.3 B frozen base
- Training data: glaive-function-calling-v2 (x4 epochs) + Salesforce/xlam-function-calling-60k (x3 epochs), 947 k case-augmented conversations, 2 619 optimizer steps
- Best val bpb 1.4288 @ step 2600 (vs a glaive-only ablation that plateaued at 1.883)
- Emits <functioncall> {...} syntax when the prompt format matches training (multi-line indented JSON)
- Full weights fit in a 408 MB delta.safetensors; loads via cascade on top of the base repo
CRITICAL: prompt format
At 1.3 B scale this model is a template matcher, not a prompt-format generalizer. Function-call emission only fires reliably when the system prompt uses the exact indentation / whitespace of the glaive training distribution: multi-line JSON with 4-space indent, produced by json.dumps(fn, indent=4). Compact single-line JSON crosses the distribution boundary and the model falls back to natural-language answers even when a function is clearly applicable.
Working template (produces <functioncall> {...} output):
import json
weather_fn = {
    "name": "get_weather",
    "description": "Get current weather by city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}
system = (
    "You are a helpful assistant with access to the following functions. "
    "Use them if required -\n"
    + json.dumps(weather_fn, indent=4)
)
user_query = "Can you tell me the current weather in Seoul?"
content = f"{system}\n\n{user_query}"
Breaking template (produces natural-language fallback):
# Do NOT collapse the JSON to a single line — the model will not emit a function call.
content = "... {\"name\": \"get_weather\", \"parameters\": {...}} ... Can you tell me the current weather in Seoul?"
This is a real research finding about small-scale SFT specialists and is called out explicitly in the paper's limitations section.
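Because the formatting requirement is easy to break by accident, it can help to route every prompt through one helper. The following is a minimal sketch, not part of the released code: `build_content` and `looks_training_formatted` are hypothetical names, and the newline separator between multiple functions is an assumption, since this card only demonstrates a single function.

```python
import json

# Header string copied from the working template above.
HEADER = (
    "You are a helpful assistant with access to the following functions. "
    "Use them if required -\n"
)


def build_content(functions, user_query):
    """Build the user-turn content with training-matched JSON formatting."""
    # indent=4 yields the multi-line, 4-space-indented layout described above.
    blocks = [json.dumps(fn, indent=4) for fn in functions]
    # Assumption: several functions are joined by a newline (not confirmed
    # by the model card, which shows only the single-function case).
    system = HEADER + "\n".join(blocks)
    return f"{system}\n\n{user_query}"


def looks_training_formatted(content):
    """Cheap guard: reject prompts whose JSON was collapsed to one line."""
    return '{\n    "name"' in content


weather_fn = {
    "name": "get_weather",
    "description": "Get current weather by city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}
content = build_content([weather_fn], "Can you tell me the current weather in Seoul?")
```

The guard only checks that the schema starts with an indented "name" key, which matches the layout of weather_fn; schemas with a different first key would need a different check.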
Quick start
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the stage; the remote code cascades it on top of the base repo.
model = AutoModelForCausalLM.from_pretrained(
    "cognica/Cognica-PoE-v1.0-1.3B-stage-tool",
    trust_remote_code=True,
    torch_dtype=torch.float32,
).eval()
tok = AutoTokenizer.from_pretrained(
    "cognica/Cognica-PoE-v1.0-1.3B-stage-tool",
    trust_remote_code=True,
)
# Ids of the special tokens that frame the user turn and open the assistant turn.
BOS, USR_S, USR_E, ASS_S = 32759, 32760, 32761, 32762
weather_fn = {
    "name": "get_weather",
    "description": "Get current weather by city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}
# indent=4 is required: it reproduces the training prompt format.
system = (
    "You are a helpful assistant with access to the following functions. "
    "Use them if required -\n" + json.dumps(weather_fn, indent=4)
)
content = f"{system}\n\nCan you tell me the current weather in Seoul?"
ids = [BOS, USR_S] + tok.encode(content) + [USR_E, ASS_S]
# Greedy decoding with a short cap, since the model has a weak stop signal.
out = model.generate(
    torch.tensor([ids]),
    max_new_tokens=80,
    do_sample=False,
    repetition_penalty=1.15,
    pad_token_id=BOS,
)
print(tok.decode(out[0].tolist()[len(ids):]))
# -> '<functioncall> {"name": "get_weather", "arguments": \'{"city": "Seoul", "country": "South Korea"}\'}'
Architecture
- Base model: cognica/Cognica-PoE-v1.0-1.3B-base at step 26 430 (frozen)
- Frozen layers: 0..23 (24 base transformer blocks, unchanged from base)
- New layers: 24..27 (4 transformer blocks, randomly initialized and trained on tool data)
- lm_head_stage: additive head; logits = lm_head_base(x) + lm_head_stage(x) at inference
- Vocab / tokenizer: identical to base (same token_bytes.pt, same <|bos|>, <|user_*|>, <|assistant_*|>, etc.)
- Delta size: 408 MB in bf16 (28 tensors, 213 909 560 params)
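The additive head is what makes this a product of experts: adding two sets of logits and applying softmax gives the same distribution as multiplying the two softmax distributions and renormalizing. The sketch below checks that identity on toy numbers. The logits are invented for illustration and do not come from the model, and the paper's exact formulation may differ in detail.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a 4-token vocabulary (made-up numbers, not model outputs).
base_logits = [2.0, 0.5, -1.0, 0.0]
stage_logits = [-0.5, 1.5, 0.0, -2.0]

# Route 1, additive head: logits = lm_head_base(x) + lm_head_stage(x).
combined = [b + s for b, s in zip(base_logits, stage_logits)]
p_sum = softmax(combined)

# Route 2, product of experts: multiply the distributions, then renormalize.
p_base, p_stage = softmax(base_logits), softmax(stage_logits)
prod = [pb * ps for pb, ps in zip(p_base, p_stage)]
z = sum(prod)
p_poe = [v / z for v in prod]

# The two routes agree up to floating-point error.
max_gap = max(abs(a - b) for a, b in zip(p_sum, p_poe))
```

A token only receives high probability when neither expert assigns it a very low score, which is why a small stage head can sharply redirect the frozen base toward function-call syntax.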
Training
| Setting | Value |
|---|---|
| Parent | cognica/Cognica-PoE-v1.0-1.3B-base step 26430 (frozen) |
| New layers | 4 (indices 24–27, trained) |
| Optimizer | MuonAdamW hybrid, AdamW scaling 0.707 |
| Matrix LR | 0.002, init 0.2x, warmup 5%, warmdown 90% |
| lm_head_stage LR | 1e-4, weight decay 0.1 |
| Batch | 131 072 tokens/step, seq len 2048 |
| Datasets | glaive-function-calling-v2 (x4) + Salesforce/xlam-function-calling-60k (x3) |
| Convs | 631 840 raw, 947 571 case-augmented |
| Steps | 2 619 (2 600 shipped) |
| Hardware | 2 nodes x 4 A100 80 GB (asia-southeast1-c) |
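The token budget implied by the table can be worked out directly. This is plain arithmetic on the values above; it assumes every step consumes a full batch and says nothing about how much of each sequence is padding.

```python
TOKENS_PER_STEP = 131_072
SEQ_LEN = 2_048
TOTAL_STEPS = 2_619
SHIPPED_STEP = 2_600

# Number of 2048-token sequences packed into one optimizer step.
sequences_per_step = TOKENS_PER_STEP // SEQ_LEN
# Tokens processed over the full run and up to the shipped checkpoint.
tokens_total = TOTAL_STEPS * TOKENS_PER_STEP
tokens_at_shipped = SHIPPED_STEP * TOKENS_PER_STEP

print(sequences_per_step)   # 64
print(tokens_total)         # 343277568
print(tokens_at_shipped)    # 340787200
```

So the shipped checkpoint has seen roughly 341 M token positions, a small budget relative to the 1.3 B frozen base.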
Validation bpb trajectory (every 200 steps)
| Step | Val bpb | Δ |
|---|---|---|
| 200 | 2.1707 | — |
| 400 | 2.1156 | −0.055 |
| 600 | 2.0368 | −0.079 |
| 800 | 1.8921 | −0.145 |
| 1000 | 1.7406 | −0.151 |
| 1200 | 1.6313 | −0.109 |
| 1400 | 1.5513 | −0.080 |
| 1600 | 1.4979 | −0.053 |
| 1800 | 1.4707 | −0.027 |
| 2000 | 1.4510 | −0.020 |
| 2200 | 1.4384 | −0.013 |
| 2400 | 1.4318 | −0.007 |
| 2600 | 1.4288 | −0.003 (shipped) |
Comparison to sibling specialists (4-layer @ same base, same step budget)
| Specialist | Val bpb @ shipped step | Notes |
|---|---|---|
| stage-math | 2.4118 @ s2200 | GSM8K + MathInstruct |
| stage-code | 2.3610 @ s2000 | CodeAlpaca + Magicoder-Evol-Instruct |
| stage-tool | 1.4288 @ s2600 | glaive + xlam (structured JSON) |
Structured tool-call data has much lower token entropy than math or code prose, which is why BPB converges so much lower. This does not mean the model is better at reasoning — it means the data distribution is narrower.
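For readers unfamiliar with the metric, bits per byte normalizes the summed cross-entropy by the number of UTF-8 bytes the evaluated tokens cover, so it is comparable across tokenizers. The sketch below shows the usual conversion. It assumes the validation figures in this card were computed this way and that per-token byte lengths are available (plausibly what token_bytes.pt stores); neither is confirmed here. `bits_per_byte` is a hypothetical helper name.

```python
import math


def bits_per_byte(token_nll_nats, token_byte_lengths):
    """Convert per-token negative log-likelihoods (in nats) to bits per byte."""
    total_nats = sum(token_nll_nats)
    total_bytes = sum(token_byte_lengths)
    if total_bytes == 0:
        raise ValueError("no bytes to normalize over")
    # Dividing by ln(2) converts nats to bits.
    return total_nats / (math.log(2) * total_bytes)


# Toy example with made-up numbers: three tokens covering 4, 3, and 5 bytes.
nll = [2.0, 1.5, 3.0]
lengths = [4, 3, 5]
bpb = bits_per_byte(nll, lengths)
```

Highly repetitive JSON scaffolding (keys, quotes, braces) contributes many bytes at very low loss, which pulls the ratio down without implying any gain in reasoning ability.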
Evaluation — honest
Working cases (exact training format):
| Query | Emitted |
|---|---|
| "Can you tell me the current weather in Seoul?" | <functioncall> {"name": "get_weather", "arguments": '{"city": "Seoul", ...}'} |
| "What is the temperature in New York?" | <functioncall> {"name": "get_weather", "arguments": '{"city": "New York", ...}'} |
Failure modes:
- Compact / non-indented JSON in the system prompt → natural-language answer, no function call.
- No function defined that matches the query → "I'm sorry, as an AI I don't have the capability..." (glaive refusal template learned verbatim).
- After emitting the function call, greedy decoding sometimes enters a ]}]}]}] loop because the model has not learned a strong stop signal for the closing brace. Use a stop-token set that includes <|endoftext|> and <|assistant_end|>, or cap max_new_tokens at 80–100.
- Argument hallucination: the model may add extra fields (e.g. "country") that were not in the function schema; this is a known glaive training artifact.
This artifact is a research preview; treat its emissions as schema suggestions, not as validated tool invocations. Always schema-check before dispatching.
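A schema check before dispatch can be very small. The sketch below is one possible shape, assuming the call has already been parsed into a dict with "name" and "arguments" keys and that schemas follow the weather_fn layout used in this card. `check_arguments` is a hypothetical helper; it checks field names only, not value types.

```python
def check_arguments(call, schema):
    """Compare an emitted call against a function schema.

    Returns a list of problems; an empty list means the call passed
    these minimal checks.
    """
    problems = []
    if call.get("name") != schema.get("name"):
        problems.append(f"unknown function: {call.get('name')!r}")
        return problems
    params = schema.get("parameters", {})
    allowed = set(params.get("properties", {}))
    required = set(params.get("required", []))
    given = set(call.get("arguments", {}))
    # Fields the model invented that the schema does not define.
    for field in sorted(given - allowed):
        problems.append(f"unexpected argument: {field!r}")
    # Fields the schema demands that the model left out.
    for field in sorted(required - given):
        problems.append(f"missing required argument: {field!r}")
    return problems


weather_fn = {
    "name": "get_weather",
    "description": "Get current weather by city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}
call = {
    "name": "get_weather",
    "arguments": {"city": "Seoul", "country": "South Korea"},
}
print(check_arguments(call, weather_fn))
# -> ["unexpected argument: 'country'"]
```

Whether to reject such a call or silently drop the extra field is a policy decision for the calling application.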
Stacking further stages
The cascade loader's from_pretrained supports recursive parent-stage chaining; the parent's lm_head_stage folds into lm_head so a child stage trained on top of this one starts from logits = lm_head_base + lm_head_tool + lm_head_child. This chaining is the mechanism behind the paper's §6.5 claim that "multiple specialists stack". The stage released here is a sibling of the other specialists rather than a child of one, but the loader already supports the chained case.
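Folding works because linear heads add: applying two heads to the same hidden state and summing the logits equals applying one head whose weight matrix is the sum of the two. The toy check below illustrates this. It assumes the heads are bias-free linear maps over the same final hidden state and that "folds" means weight addition; the loader's actual implementation is not shown in this card, and all numbers are invented.

```python
def matvec(w, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def add_matrices(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy 3-token vocabulary and 2-dim hidden state (made-up numbers).
w_base = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]
w_tool = [[0.25, 0.5], [-0.5, 0.0], [1.0, 1.0]]
w_child = [[0.0, -0.25], [0.75, 0.5], [-1.0, 0.0]]
x = [2.0, -1.0]

# Unfolded: evaluate every head separately and sum the logits.
unfolded = [
    b + t + c
    for b, t, c in zip(matvec(w_base, x), matvec(w_tool, x), matvec(w_child, x))
]

# Folded: merge the parent heads once, then add only the child head.
w_folded = add_matrices(w_base, w_tool)
folded = [p + c for p, c in zip(matvec(w_folded, x), matvec(w_child, x))]

print(unfolded)  # [2.25, 2.0, -3.0]
print(folded)    # [2.25, 2.0, -3.0]
```

Under these assumptions a chained stack never needs more than two head evaluations at inference, regardless of depth.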
Files
- delta.safetensors: 408 MB bf16 stage delta (28 tensors, 213 909 560 params)
- config.json: stage + base config (points to cognica/Cognica-PoE-v1.0-1.3B-base)
- modeling_cognica_poe.py, configuration_cognica_poe.py, tokenization_cognica_poe.py: Hugging Face trust_remote_code modules
- convert_stage_delta.py: reproduces delta.safetensors from a nanochat stage checkpoint
- tokenizer.pkl, tokenizer_config.json, special_tokens_map.json, token_bytes.pt: tokenizer (identical to base)
Limitations
- 1.3 B scale. Function-call emission is template-level pattern completion, not general tool-use reasoning. Does not compose well with unseen schemas or ambiguous instructions.
- Prompt-format brittle. Requires training-matched multi-line indented JSON; single-line JSON causes silent failure.
- Stop-token weak. Greedy decoding enters closing-brace loops; cap max_new_tokens and post-process to extract the first complete JSON object.
- Argument hallucination. Glaive training data occasionally invents fields; downstream code must validate arguments against the real schema.
- Refusal template baked in. If no function in the system prompt matches the query, the model will emit the glaive refusal string near-verbatim.
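The post-processing step recommended above can be sketched as a brace-balanced scan that stops at the first complete object and ignores any trailing ]}]}]}] loop. This is an illustrative sketch, not released code: `first_balanced_object` and `parse_function_call` are hypothetical helpers, and the single-quote unwrapping is a heuristic based on the sample output shown in the Quick start.

```python
import json

MARKER = "<functioncall>"


def first_balanced_object(text):
    """Return the first brace-balanced {...} span in `text`, or None."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a double-quoted string: only look for its end.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None  # ran out of text before the object closed


def parse_function_call(generated):
    """Parse the first function call in a decoded generation, or return None."""
    _, found, tail = generated.partition(MARKER)
    if not found:
        return None  # natural-language fallback, no call emitted
    raw = first_balanced_object(tail)
    if raw is None:
        return None
    # The sample output wraps the arguments object in single quotes; unwrap
    # it so the span is plain JSON. Heuristic: this would misfire if an
    # argument value itself contained '{ or }'.
    unwrapped = raw.replace("'{", "{").replace("}'", "}")
    try:
        return json.loads(unwrapped)
    except json.JSONDecodeError:
        return None


sample = (
    '<functioncall> {"name": "get_weather", '
    '"arguments": \'{"city": "Seoul"}\'}]}]}]}]'
)
parsed = parse_function_call(sample)
```

A None result covers both the refusal template and malformed output, so callers can treat it uniformly as "no tool call".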
Citation
@article{jeong2026poe,
title = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
author = {Jeong, Jaepil},
year = {2026},
institution = {Cognica, Inc.},
doi = {10.5281/zenodo.19547653},
url = {https://doi.org/10.5281/zenodo.19547653}
}
@misc{cognica-poe-stage-tool-2026,
title = {Cognica-PoE-v1.0-1.3B-stage-tool: Dual-head SFT function-calling specialist over a PoE base (research preview)},
author = {{Cognica, Inc.}},
year = {2026},
howpublished = {\url{https://huggingface.co/cognica/Cognica-PoE-v1.0-1.3B-stage-tool}}
}
License
Apache 2.0 — see LICENSE and NOTICE. Same terms as the base model. Training datasets (glaive-function-calling-v2, xlam-function-calling-60k) each carry their own licenses (Apache 2.0, CC-BY-4.0 respectively) and are acknowledged in NOTICE.