Atem-Pharaoh-3B
Ancient logic. Modern intelligence.
The 3B chain-of-thought model: explicit reasoning traces at scale.
Overview
Atem-Pharaoh-3B is the Stage 2 release of the 3B Atem series: a chain-of-thought fine-tune built on top of Atem-3B, trained to produce explicit <think>...</think> reasoning traces before arriving at a final answer. Where Atem-3B was trained to answer directly, Pharaoh is trained to think out loud.
Training used approximately 38,000 examples drawn from a pool of ~63,500 CoT-annotated records across mathematics, code, science, and general reasoning. A deliberate 75%/25% think/no-think split was applied: the model was trained on structured reasoning traces for the majority of examples and direct answers for the remainder, so that it can operate in both modes depending on how it is prompted.
Design note: Atem-Pharaoh-3B has a confirmed tendency toward verbose outputs and, on open-ended questions with many valid answers, occasional think trace runaways. Custom system prompts are strongly recommended to control verbosity, chain-of-thought depth, and output length. See the Prompting Guidance section below.
The Atem Series
1.5B Series
| Model | Stage | Capability |
|---|---|---|
| Atem v1 | Stage 1: SFT | Fast, direct reasoning |
| Atem-Wisdom | Stage 2: CoT | Explicit thinking traces |
| Atem-Pharaoh-1.5B (planned) | Stage 3: DPO/IPO | Preference-aligned reasoning |
3B Series
| Model | Stage | Capability |
|---|---|---|
| Atem-3B | Stage 1: SFT | Direct reasoning at 3B scale |
| Atem-Pharaoh-3B | Stage 2: CoT | Explicit reasoning traces at 3B scale |
| Atem-Pharaoh-3B-DPO (planned) | Stage 3: DPO/IPO | Preference-aligned reasoning |
Model Details
| Property | Value |
|---|---|
| Base model | EphAsad/Atem-3B |
| Training method | LoRA SFT, Stage 2 (CoT think traces) |
| LoRA config | r=32, alpha=64, dropout=0.05 |
| Parameters | ~3.09B |
| Trainable parameters | 59,867,136 (1.90%) |
| Training records | 38,157 (after token length filtering) |
| Think / no-think split | 75% / 25% |
| Epochs | 2 |
| Final val loss | 0.9494 |
| Hardware | NVIDIA A100-SXM4-80GB |
| Max sequence length | 4,096 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |
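The trainable-parameter figure can be sanity-checked by hand. A LoRA adapter of rank r on a linear layer mapping d_in to d_out adds r × (d_in + d_out) weights. The sketch below assumes the Qwen2.5-3B layer shapes (hidden size 2048, MLP size 11008, 36 layers, 2 key/value heads of dimension 128) and assumes adapters on all seven attention and MLP projections. Neither assumption is stated in this card, but together they reproduce the count in the table.

```python
# Assumed Qwen2.5-3B shapes; these are NOT stated in this card.
HIDDEN = 2048
MLP = 11008
LAYERS = 36
KV_DIM = 2 * 128  # 2 key/value heads x head dimension 128

RANK = 32  # r from the LoRA config row above

# (d_in, d_out) for each projection assumed to carry an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in, d_out, rank=RANK):
    """Weights added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(*shape) for shape in PROJECTIONS.values())
total = per_layer * LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumptions the total comes to 59,867,136, matching the reported trainable parameter count.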
Output Format
Atem-Pharaoh-3B produces responses in one of two formats depending on the prompt and training signal:
Think mode (75% of training):
<think>
{step-by-step reasoning trace}
</think>
{final answer}
Direct mode (25% of training):
{direct answer, no think tags}
The model defaults to think mode for most queries. To reliably suppress or encourage CoT, use a custom system prompt (see below).
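Downstream code usually needs the trace and the final answer separately. The helper below is a minimal sketch of one way to split the two formats described above; the function name `split_response` is hypothetical, and the pattern also tolerates a repeated opening tag.

```python
import re

# One or more opening tags, a lazily matched trace, the closing tag, then the answer.
_THINK_RE = re.compile(r"^\s*(?:<think>\s*)+(.*?)</think>\s*(.*)$", re.DOTALL)


def split_response(text):
    """Return (trace, answer). trace is None for direct-mode responses."""
    match = _THINK_RE.match(text)
    if match is None:
        # No complete think block at the start: treat the whole text as the answer.
        return None, text.strip()
    return match.group(1).strip(), match.group(2).strip()


trace, answer = split_response("<think>\n2 + 2 is 4\n</think>\nThe answer is 4.")
print(trace)
print(answer)
```

A response whose think block is never closed (for example, one cut off by the token limit) falls through to the direct-mode branch, so callers may want to check for a stray opening tag in the returned answer.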
Prompting Guidance
Atem-Pharaoh-3B follows system prompt instructions. The default identity is baked into the chat template and produces think traces on most inputs. For deployments where verbosity, output length, or CoT depth needs to be controlled, the following prompt patterns are recommended.
Suppress CoT: direct answers only
You are Atem, a precise and analytical assistant. Respond directly and concisely.
Do not show internal reasoning. Answer the question and stop.
Calibrate length to question complexity
You are Atem, a precise and analytical assistant. Match your response length to
the complexity of the question: a single sentence for simple questions, full
reasoning for complex ones. Do not over-explain.
Full CoT: maximise reasoning depth
You are Atem, a precise and analytical assistant. Think through every problem
step by step before answering. Show your full reasoning inside <think> tags,
then give your final answer.
Cap think trace length
You are Atem, a precise and analytical assistant. When you reason through a
problem, keep your thinking concise: aim for no more than 150 words inside
<think> tags. Then give a clear, direct final answer.
Without a custom prompt, the model will use the default identity and tend toward longer, more structured outputs. On open-ended questions with many valid answers, this can result in extended reasoning traces. Prompting with an explicit length or format constraint reliably corrects this.
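The prompt patterns above can be kept in one place and attached as a system message. This is a minimal sketch: the mode names and the `build_messages` helper are hypothetical, and it assumes the chat template accepts a leading system message in place of the default identity.

```python
# Prompt texts copied from the patterns above; the dictionary keys are hypothetical.
SYSTEM_PROMPTS = {
    "direct": (
        "You are Atem, a precise and analytical assistant. Respond directly and "
        "concisely. Do not show internal reasoning. Answer the question and stop."
    ),
    "calibrated": (
        "You are Atem, a precise and analytical assistant. Match your response "
        "length to the complexity of the question: a single sentence for simple "
        "questions, full reasoning for complex ones. Do not over-explain."
    ),
    "full_cot": (
        "You are Atem, a precise and analytical assistant. Think through every "
        "problem step by step before answering. Show your full reasoning inside "
        "<think> tags, then give your final answer."
    ),
    "capped": (
        "You are Atem, a precise and analytical assistant. When you reason through "
        "a problem, keep your thinking concise: aim for no more than 150 words "
        "inside <think> tags. Then give a clear, direct final answer."
    ),
}


def build_messages(user_prompt, mode=None):
    """Build a chat message list; mode=None leaves the default identity in place."""
    messages = []
    if mode is not None:
        messages.append({"role": "system", "content": SYSTEM_PROMPTS[mode]})
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages("What is the capital of France?", mode="direct")
print(messages[0]["role"])
```

The resulting list can be passed to `tokenizer.apply_chat_template` in the same way as the `messages` variable in the Usage section.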
Training Data
Stage 2 training used approximately 38,000 examples after token-length filtering, drawn from a pool of ~63,500 CoT-annotated records. Chinese-language reasoning traces from Kimi K2.5 were filtered using an ASCII character ratio threshold before inclusion; non-English traces were downgraded to the no-think pool rather than discarded entirely. OpenR1-Math examples were filtered to correctness_llama == True only.
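An ASCII character ratio filter of this kind can be sketched in a few lines. The threshold value (0.9), the record field names and the function names below are assumptions for illustration; the card does not state the values actually used.

```python
def ascii_ratio(text):
    """Fraction of characters in text that are ASCII (1.0 for empty text)."""
    if not text:
        return 1.0
    return sum(ch.isascii() for ch in text) / len(text)


def assign_pool(record, threshold=0.9):
    """Route a record to the think pool only if its trace looks English.

    threshold is a hypothetical value; records with a missing or mostly
    non-ASCII trace are downgraded to the no-think pool rather than dropped.
    """
    trace = record.get("trace")
    if trace and ascii_ratio(trace) >= threshold:
        return "think"
    return "no_think"


english = {"trace": "Step 1: add the two numbers.", "answer": "4"}
chinese = {"trace": "首先，我们把两个数相加。", "answer": "4"}
print(assign_pool(english), assign_pool(chinese))
```

Downgrading keeps the final answer as a direct-mode training example, which is consistent with the description above of non-English traces being moved rather than discarded.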
The think/no-think split was enforced programmatically: after all datasets were loaded into a think pool and a no-think pool, records were flipped from think → no-think until the no-think pool reached 25% of the total corpus.
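The flipping step can be sketched as follows. This is an illustration under assumptions, not the training script: the `response` field name, the random selection of records to flip, and the regex used to strip the trace are all hypothetical.

```python
import math
import random
import re

_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(response):
    """Remove the first think block so only the final answer remains."""
    return _THINK_BLOCK.sub("", response, count=1).strip()


def rebalance(think_pool, no_think_pool, target=0.25, seed=0):
    """Flip think records to no-think until no-think is `target` of the corpus."""
    total = len(think_pool) + len(no_think_pool)
    needed = math.ceil(target * total) - len(no_think_pool)
    if needed <= 0:
        return list(think_pool), list(no_think_pool)
    shuffled = list(think_pool)
    random.Random(seed).shuffle(shuffled)  # assumed: records to flip are picked at random
    flipped = [
        {**rec, "response": strip_think(rec["response"])} for rec in shuffled[:needed]
    ]
    return shuffled[needed:], list(no_think_pool) + flipped


think = [{"response": f"<think>work {i}</think>\nanswer {i}"} for i in range(90)]
direct = [{"response": f"answer {i}"} for i in range(10)]
think, direct = rebalance(think, direct)
print(len(think), len(direct))
```

Flipping moves records between pools without changing the corpus size, so the target count only needs to be computed once.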
| Dataset | Count | Type |
|---|---|---|
| Modotte/CodeX-2M-Thinking | 10,000 | Code CoT |
| nvidia/OpenCodeReasoning | 10,000 | Code reasoning |
| Jackrong/Kimi-K2.5 (×3 configs) | 15,000 | General / Math / PhD reasoning |
| mitroitskii/OpenR1-Math-220k-formatted | 7,000 | Mathematics (correctness filter) |
| Jackrong/Claude-opus-4.6-TraceInversion-9000x | 7,000 | Inverted reasoning traces |
| trjxter/DeepSeek-V4-Pro-Reasoning-8000x | 8,014 | Reasoning distillation |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | Mixed reasoning |
| FreedomIntelligence/medical-o1-reasoning-SFT | 3,000 | Medical reasoning |
Loss curve:
| Step | Train Loss | Val Loss |
|---|---|---|
| 250 | 1.0215 | 0.9931 |
| 500 | 0.9615 | 0.9663 |
| 750 | 0.9516 | 0.9556 |
| 1000 | 0.9425 | 0.9502 |
| 1194 (final) | 0.9897 | 0.9494 |
Training loss descent is steady across both epochs. The slight uptick at the final step is normal end-of-epoch behaviour on a cosine schedule.
Evaluation
A/B Comparison: Atem-Pharaoh-3B vs Qwen2.5-3B-Instruct
Evaluated on 30 questions calibrated to 3B model capability across coding, mathematics, analytical reasoning, and language tasks. Both models ran on identical prompts with no system prompt override.
| Metric | Base (Qwen2.5-3B) | Atem-Pharaoh-3B |
|---|---|---|
| Think traces | 0 / 30 | 30 / 30 |
| Avg response length | 152 words | 427 words |
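Both metrics in the table are simple to recompute from raw responses. The sketch below assumes whitespace-separated word counts that include the trace text; the card does not say how words were counted, so treat this as one plausible definition.

```python
def summarize(responses):
    """Return (number of responses with a complete think block, mean word count)."""
    if not responses:
        raise ValueError("responses must not be empty")
    with_trace = sum("<think>" in r and "</think>" in r for r in responses)
    avg_words = sum(len(r.split()) for r in responses) / len(responses)
    return with_trace, avg_words


sample = ["<think>a b</think> c", "d e"]
print(summarize(sample))
```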
Qualitative findings:
Coding tasks (is_even, count_vowels, list vs tuple, find_max, for vs while): Atem-Pharaoh-3B is consistently correct, with additional edge-case handling and alternative approaches in the trace. Base model answers are correct but minimal.
Mathematical tasks: Both models correct. Pharaoh's traces show full working.
Analytical tasks (student score, shop visitors, correlation/causation, hiring/queuing): Pharaoh produces richer, more structured responses with clearer explanations. The queuing theory response (Q16) demonstrates genuine reasoning depth with well-constructed analogies.
Language tasks: Both models perform comparably. Pharaoh tends toward over-structuring simple tasks.
Known limitations observed in evaluation:
Think trace runaways: On open-ended questions where valid answers are unbounded, the think trace can degenerate into extended enumeration rather than converging on an answer. This was observed on Q27 (sentence ambiguity) in this evaluation and is consistent with behaviour observed in separate testing. The final answer typically recovers correctly, but the trace itself becomes incoherent. Custom system prompts with explicit trace length constraints are the recommended mitigation (see Prompting Guidance).
Verbosity mismatch: Response length does not scale to question complexity. Simple questions receive the same structural treatment as complex ones. A system prompt instructing the model to match length to complexity resolves this reliably.
Occasional tag artifacts: A small number of responses produced nested <think><think> opening tags. This is a minor formatting artifact with no effect on answer quality.
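The limitations above can be flagged automatically in a deployment. The checker below is a minimal sketch: the function name, the issue labels and the 150-word default (borrowed from the capped-trace prompt) are assumptions, and an unclosed tag is used here only as a rough proxy for a runaway trace.

```python
import re


def check_response(text, max_trace_words=150):
    """Return a list of issue labels found in one model response."""
    issues = []
    if re.search(r"<think>\s*<think>", text):
        issues.append("nested_think_tags")
    opened = "<think>" in text
    if opened and "</think>" not in text:
        # Often means generation hit the token limit while still reasoning.
        issues.append("unclosed_think")
    if opened:
        body = text.split("<think>", 1)[1]
        trace = body.split("</think>", 1)[0].replace("<think>", " ")
        if len(trace.split()) > max_trace_words:
            issues.append("trace_too_long")
    return issues


print(check_response("<think><think>short</think>done"))
```

Responses that come back with `unclosed_think` or `trace_too_long` are candidates for a retry with one of the length-constraining system prompts.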
Usage
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-Pharaoh-3B"

# Load the tokenizer and the bfloat16 weights, placing layers automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Explain why a binary search is faster than a linear search.",
    }
]

# Apply the chat template (default identity included) and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True,
)
print(response)
Unsloth (faster inference)
from unsloth import FastLanguageModel
import torch

# Load the model in 4-bit for lower memory use.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-Pharaoh-3B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode

messages = [
    {
        "role": "user",
        "content": "Write a Python function to check if a number is prime.",
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
Ollama
# Recommended: best speed/quality balance
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q4_K_M
# Higher quality
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q5_K_M
# Near-lossless
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q8_0
llama.cpp
llama-server -hf EphAsad/Atem-Pharaoh-3B:Q4_K_M
Available Files
| File | Size | Description |
|---|---|---|
| model-00001-of-00002.safetensors + model-00002-of-00002.safetensors | ~6.2 GB | Full bfloat16 weights |
| Atem-Pharaoh-3B.Q4_K_M.gguf | ~1.93 GB | 4-bit, recommended |
| Atem-Pharaoh-3B.Q5_K_M.gguf | ~2.22 GB | 5-bit |
| Atem-Pharaoh-3B.Q8_0.gguf | ~3.29 GB | 8-bit, near-lossless |
System Prompt
Atem-Pharaoh-3B's identity is baked into the chat template. For production use, override with a custom system prompt tailored to your use case (see Prompting Guidance above). The default identity:
You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically, identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.
Roadmap
| Stage | Status | Description |
|---|---|---|
| Stage 1: SFT | Complete | Atem-3B, direct reasoning |
| Stage 2: CoT SFT | Complete | Atem-Pharaoh-3B, this model |
| Stage 3: DPO/IPO | Planned | Preference-aligned reasoning |
Citation
@misc{atem_pharaoh_3b_2026,
author = {Asad, Zain},
title = {Atem-Pharaoh-3B: Chain-of-Thought Reasoning via Stage 2 CoT SFT},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/EphAsad/Atem-Pharaoh-3B}},
}
License
Released under the Apache 2.0 License, consistent with the base model lineage (Qwen2.5-3B-Instruct → Atem-3B → Atem-Pharaoh-3B).
Built independently by EphAsad