Atem-Pharaoh-3B

Ancient logic. Modern intelligence.

The 3B chain-of-thought model — explicit reasoning traces at scale.

Overview

Atem-Pharaoh-3B is the Stage 2 release of the 3B Atem series — a chain-of-thought fine-tune built on top of Atem-3B, trained to produce explicit <think>...</think> reasoning traces before arriving at a final answer. Where Atem-3B was trained to answer directly, Pharaoh is trained to think out loud.

Training used approximately 38,000 examples drawn from a pool of ~63,500 CoT-annotated records across mathematics, code, science, and general reasoning. A deliberate 75%/25% think/no-think split was applied — the model was trained on structured reasoning traces for the majority of examples and direct answers for the remainder, ensuring it can operate in both modes depending on how it is prompted.

Design note: Atem-Pharaoh-3B tends toward verbose outputs and, on open-ended questions with many valid answers, occasionally produces runaway think traces. Custom system prompts are strongly recommended to control verbosity, chain-of-thought depth, and output length. See the Prompting Guidance section below.

The Atem Series

1.5B Series

Model	Stage	Capability
Atem v1	Stage 1 — SFT	Fast, direct reasoning
Atem-Wisdom	Stage 2 — CoT	Explicit thinking traces
Atem-Pharaoh-1.5B (planned)	Stage 3 — DPO/IPO	Preference-aligned reasoning

3B Series

Model	Stage	Capability
Atem-3B	Stage 1 — SFT	Direct reasoning at 3B scale
Atem-Pharaoh-3B	Stage 2 — CoT	Explicit reasoning traces at 3B scale
Atem-Pharaoh-3B-DPO (planned)	Stage 3 — DPO/IPO	Preference-aligned reasoning

Model Details

Property	Value
Base model	EphAsad/Atem-3B
Training method	LoRA SFT — Stage 2 (CoT think traces)
LoRA config	r=32, alpha=64, dropout=0.05
Parameters	~3.09B
Trainable parameters	59,867,136 (1.90%)
Training records	38,157 (after token length filtering)
Think / no-think split	75% / 25%
Epochs	2
Final val loss	0.9494
Hardware	NVIDIA A100-SXM4-80GB
Max sequence length	4,096 tokens
Precision	bfloat16
License	Apache 2.0
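
The trainable-parameter figure in the table can be reproduced with a short calculation. The sketch below assumes the Qwen2.5-3B layer dimensions (hidden size 2048, MLP size 11008, 36 layers, 2 key/value heads of dimension 128) and assumes LoRA adapters on all seven linear projections in each block. Neither the dimensions nor the target modules are stated in this card, so treat both as assumptions.

```python
# Assumed Qwen2.5-3B dimensions (not stated in this card).
HIDDEN = 2048
INTERMEDIATE = 11008
LAYERS = 36
KV_DIM = 2 * 128  # 2 key/value heads x head dimension 128
RANK = 32         # LoRA r from the table above

# (in_features, out_features) of each projection assumed to carry an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for each adapted projection."""
    per_layer = sum(
        rank * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values()
    )
    return per_layer * LAYERS


trainable = lora_params(RANK)
base = 3.09e9  # approximate base parameter count from the table
share = 100 * trainable / (base + trainable)

print(trainable)         # 59867136
print(f"{share:.2f}%")   # 1.90%
```

Under these assumptions the count matches the table exactly, and the 1.90% share corresponds to a denominator that includes the adapter weights.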

Output Format

Atem-Pharaoh-3B produces responses in one of two formats, depending on the prompt:

Think mode (75% of training):

<think>
{step-by-step reasoning trace}
</think>

{final answer}

Direct mode (25% of training):

{direct answer — no think tags}

The model defaults to think mode for most queries. To reliably suppress or encourage CoT, use a custom system prompt (see below).
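
Downstream code usually needs the reasoning trace and the final answer as separate strings. The helper below is a minimal sketch of one way to split them. The name `split_response` is hypothetical, and it assumes the `<think>` tags survive decoding as plain text.

```python
import re
from typing import Tuple

# Non-greedy match so only the first think block is captured.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty for direct-mode output."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # Direct mode: no think tags, the whole text is the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


reasoning, answer = split_response("<think>\n2 + 2 = 4\n</think>\n\n4")
print(reasoning)  # 2 + 2 = 4
print(answer)     # 4
```

Returning an empty reasoning string for direct-mode output lets callers handle both formats with the same code path.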

Prompting Guidance

Atem-Pharaoh-3B follows system-prompt instructions. The default identity is baked into the chat template and produces think traces on most inputs. For deployments where verbosity, output length, or CoT depth must be controlled, the following prompt patterns are recommended.

Suppress CoT — direct answers only

You are Atem, a precise and analytical assistant. Respond directly and concisely.
Do not show internal reasoning. Answer the question and stop.

Calibrate length to question complexity

You are Atem, a precise and analytical assistant. Match your response length to
the complexity of the question — a single sentence for simple questions, full
reasoning for complex ones. Do not over-explain.

Full CoT — maximise reasoning depth

You are Atem, a precise and analytical assistant. Think through every problem
step by step before answering. Show your full reasoning inside <think> tags,
then give your final answer.

Cap think trace length

You are Atem, a precise and analytical assistant. When you reason through a
problem, keep your thinking concise — aim for no more than 150 words inside
<think> tags. Then give a clear, direct final answer.

Without a custom prompt, the model will use the default identity and tend toward longer, more structured outputs. On open-ended questions with many valid answers, this can result in extended reasoning traces. Prompting with an explicit length or format constraint reliably corrects this.
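
The prompt patterns above are supplied as a system message ahead of the user turn. The sketch below shows one way to assemble the message list. `build_messages` is a hypothetical helper, and it assumes the chat template lets an explicit system message replace the default identity.

```python
from typing import Dict, List, Optional

# The "Suppress CoT" pattern from this section, joined into one string.
CONCISE_PROMPT = (
    "You are Atem, a precise and analytical assistant. Respond directly and "
    "concisely. Do not show internal reasoning. Answer the question and stop."
)


def build_messages(
    user_prompt: str, system_prompt: Optional[str] = None
) -> List[Dict[str, str]]:
    """Build a chat message list, prepending a system message if one is given."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages("What is 17 * 6?", CONCISE_PROMPT)
```

The resulting list can be passed to `tokenizer.apply_chat_template` in place of the single-turn `messages` used in the Usage section.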

Training Data

Stage 2 training used approximately 38,000 examples after token-length filtering, drawn from a pool of ~63,500 CoT-annotated records. Chinese-language reasoning traces from Kimi K2.5 were filtered using an ASCII character ratio threshold before inclusion; non-English traces were downgraded to the no-think pool rather than discarded entirely. OpenR1-Math examples were filtered to correctness_llama == True only.
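
The language filter described above can be illustrated with a minimal sketch. The function names and the 0.9 threshold are hypothetical; the card does not state the threshold used in training.

```python
def ascii_ratio(text: str) -> float:
    """Fraction of characters in `text` that are ASCII (1.0 for empty text)."""
    if not text:
        return 1.0
    return sum(ch.isascii() for ch in text) / len(text)


def route_trace(trace: str, threshold: float = 0.9) -> str:
    """Send mostly-ASCII traces to the think pool, the rest to no-think.

    The threshold is an illustrative value, not the one used in training.
    """
    return "think" if ascii_ratio(trace) >= threshold else "no_think"


print(route_trace("First compute 2 + 2, which gives 4."))  # think
print(route_trace("我们先计算二加二等于四。"))                # no_think
```

An ASCII ratio is a cheap proxy for "English"; it will also penalise traces that are heavy in Unicode math symbols, which is one reason to tune the threshold on a sample before applying it.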

The think/no-think split was enforced programmatically: after all datasets were loaded into a think pool and a no-think pool, records were flipped from think→no-think until the no-think pool reached 25% of the total corpus.
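
The rebalancing step can be sketched as follows. The record layout (a dict with `reasoning` and `answer` keys), the random choice of which records to flip, and the function name are all assumptions made for illustration; the card only states that records were flipped until the no-think pool reached 25%.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, object]


def rebalance(
    think: List[Record],
    no_think: List[Record],
    target: float = 0.25,
    seed: int = 0,
) -> Tuple[List[Record], List[Record]]:
    """Flip think records to no-think until no-think is `target` of the corpus."""
    think = list(think)
    no_think = list(no_think)
    total = len(think) + len(no_think)  # flipping never changes the total

    random.Random(seed).shuffle(think)  # assumption: flipped records chosen at random
    while think and len(no_think) < target * total:
        record = think.pop()
        # Flipping drops the reasoning trace but keeps the final answer.
        no_think.append({**record, "reasoning": None})
    return think, no_think


think_pool = [{"reasoning": "step by step", "answer": str(i)} for i in range(90)]
no_think_pool = [{"reasoning": None, "answer": str(i)} for i in range(10)]
think_pool, no_think_pool = rebalance(think_pool, no_think_pool)
print(len(think_pool), len(no_think_pool))  # 75 25
```

Applied to the 38,157 training records, a 25% target corresponds to roughly 9,500 direct-answer examples.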

Dataset	Count	Type
Modotte/CodeX-2M-Thinking	10,000	Code CoT
nvidia/OpenCodeReasoning	10,000	Code reasoning
Jackrong/Kimi-K2.5 (×3 configs)	15,000	General / Math / PhD reasoning
mitroitskii/OpenR1-Math-220k-formatted	7,000	Mathematics (correctness filter)
Jackrong/Claude-opus-4.6-TraceInversion-9000x	7,000	Inverted reasoning traces
trjxter/DeepSeek-V4-Pro-Reasoning-8000x	8,014	Reasoning distillation
WithinUsAI/MiniMax_M2.7_Distilled_5k	5,000	Mixed reasoning
FreedomIntelligence/medical-o1-reasoning-SFT	3,000	Medical reasoning

Loss curve:

Step	Train Loss	Val Loss
250	1.0215	0.9931
500	0.9615	0.9663
750	0.9516	0.9556
1000	0.9425	0.9502
1194 (final)	0.9897	0.9494

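Training loss falls steadily through step 1000, and validation loss falls at every checkpoint. The uptick in training loss at the final step (0.9897, up from 0.9425 at step 1000) is consistent with normal end-of-epoch behaviour on a cosine schedule.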
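
As a quick sanity check on the table, the step-to-step changes can be computed directly. This is only a sketch over the five logged checkpoints above, not a reproduction of the training logs.

```python
# (step, train_loss, val_loss) copied from the loss table above.
CURVE = [
    (250, 1.0215, 0.9931),
    (500, 0.9615, 0.9663),
    (750, 0.9516, 0.9556),
    (1000, 0.9425, 0.9502),
    (1194, 0.9897, 0.9494),
]


def deltas(values):
    """Change between consecutive checkpoints, rounded to 4 decimals."""
    return [round(b - a, 4) for a, b in zip(values, values[1:])]


train_deltas = deltas([train for _, train, _ in CURVE])
val_deltas = deltas([val for _, _, val in CURVE])

print(train_deltas)  # [-0.06, -0.0099, -0.0091, 0.0472]
print(val_deltas)    # [-0.0268, -0.0107, -0.0054, -0.0008]
```

Validation loss decreases at every checkpoint, with shrinking gains, while the only increase in training loss is at the final logged step.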

Evaluation

A/B Comparison — Atem-Pharaoh-3B vs Qwen2.5-3B-Instruct

Evaluated on 30 questions calibrated to 3B model capability across coding, mathematics, analytical reasoning, and language tasks. Both models ran on identical prompts with no system prompt override.

Metric	Qwen2.5-3B-Instruct (baseline)	Atem-Pharaoh-3B
Think traces	0 / 30	30 / 30
Avg response length	152 words	427 words
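
The two metrics in the table can be computed from raw response strings along these lines. The sketch assumes words are counted by whitespace splitting and that the think trace is included in the count; the card does not say how the reported lengths were measured.

```python
from typing import Dict, List


def summarize(responses: List[str]) -> Dict[str, float]:
    """Count responses containing a think block and the mean word count."""
    with_trace = sum("<think>" in r and "</think>" in r for r in responses)
    avg_words = sum(len(r.split()) for r in responses) / len(responses)
    return {
        "think_traces": with_trace,
        "total": len(responses),
        "avg_words": avg_words,
    }


sample = [
    "<think>\ncheck n % 2\n</think>\n\nUse n % 2 == 0.",
    "Use n % 2 == 0.",
]
print(summarize(sample))
```

Counting the trace toward response length explains most of the gap between the two models, so it is worth reporting trace and answer lengths separately when comparing verbosity.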

Qualitative findings:

Coding tasks (is_even, count_vowels, list vs tuple, find_max, for vs while): Atem-Pharaoh-3B was consistently correct, with additional edge-case handling and alternative approaches in the trace. The Qwen2.5-3B-Instruct answers were correct but minimal.

Mathematical tasks: Both models were correct. Pharaoh's traces show full working.

Analytical tasks (student score, shop visitors, correlation/causation, hiring/queuing): Pharaoh produces richer, more structured responses with clearer explanations. The queuing theory response (Q16) demonstrates genuine reasoning depth with well-constructed analogies.

Language tasks: Both models perform comparably. Pharaoh tends toward over-structuring simple tasks.

Known limitations observed in evaluation:

Think trace runaways: On open-ended questions where valid answers are unbounded, the think trace can degenerate into extended enumeration rather than converging on an answer. This was observed on Q27 (sentence ambiguity) in this evaluation and is consistent with behaviour observed in separate testing. The final answer typically recovers correctly, but the trace itself becomes incoherent. Custom system prompts with explicit trace length constraints are the recommended mitigation (see Prompting Guidance).

Verbosity mismatch: Response length does not scale to question complexity. Simple questions receive the same structural treatment as complex ones. A system prompt instructing the model to match length to complexity resolves this reliably.

Occasional tag artifacts: A small number of responses produced nested <think><think> opening tags. This is a minor formatting artifact with no effect on answer quality.
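
Two of the limitations above can also be handled in post-processing. The sketch below collapses repeated opening tags and flags traces that look like runaways. Both function names and both thresholds are hypothetical, and the runaway check is a rough heuristic rather than a tested detector.

```python
import re


def normalize_think_tags(text: str) -> str:
    """Collapse runs of repeated opening tags into a single <think>."""
    return re.sub(r"(?:<think>\s*){2,}", "<think>\n", text)


def looks_like_runaway(
    trace: str, max_words: int = 600, max_repeat_ratio: float = 0.5
) -> bool:
    """Flag a trace that is very long or mostly repeated lines.

    Both thresholds are illustrative values, not tuned on this model.
    """
    if len(trace.split()) > max_words:
        return True
    lines = [ln.strip().lower() for ln in trace.splitlines() if ln.strip()]
    if len(lines) < 4:
        return False
    repeated = len(lines) - len(set(lines))
    return repeated / len(lines) > max_repeat_ratio


print(normalize_think_tags("<think><think>\nabc\n</think>\n\n42"))
print(looks_like_runaway("step one\nstep two"))  # False
```

A flagged response can be retried with a stricter system prompt or a lower `max_new_tokens`, in line with the mitigation recommended above.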

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-Pharaoh-3B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Explain why a binary search is faster than a linear search.",
    }
]

# Build the prompt with the model's chat template; return_dict=True also
# returns the attention mask alongside input_ids.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
print(response)

Unsloth (faster inference)

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-Pharaoh-3B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
# Switch the model to Unsloth's inference mode.
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Write a Python function to check if a number is prime.",
    }
]

# return_dict=True returns the attention mask alongside input_ids.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True))

Ollama

# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q8_0

llama.cpp

llama-server -hf EphAsad/Atem-Pharaoh-3B:Q4_K_M

Available Files

File	Size	Description
`model-00001-of-00002.safetensors` + `model-00002-of-00002.safetensors`	~6.2 GB	Full bfloat16 weights
`Atem-Pharaoh-3B.Q4_K_M.gguf`	~1.93 GB	4-bit — recommended
`Atem-Pharaoh-3B.Q5_K_M.gguf`	~2.22 GB	5-bit
`Atem-Pharaoh-3B.Q8_0.gguf`	~3.29 GB	8-bit — near-lossless
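
The file sizes above can be related to the parameter count as an effective number of bits per weight. This is a back-of-the-envelope sketch that assumes 1 GB = 10^9 bytes, uses the approximate 3.09B parameter count from Model Details, and ignores file metadata.

```python
PARAMS = 3.09e9  # approximate parameter count from Model Details

# Approximate file sizes in GB, copied from the table above.
FILES_GB = {
    "bf16 safetensors": 6.2,
    "Q4_K_M": 1.93,
    "Q5_K_M": 2.22,
    "Q8_0": 3.29,
}


def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Effective bits stored per parameter for a file of `size_gb` gigabytes."""
    return size_gb * 1e9 * 8 / params


for name, size in FILES_GB.items():
    print(f"{name}: {bits_per_weight(size):.1f} bits/weight")
```

Under these assumptions the files come out at roughly 16, 5.0, 5.7, and 8.5 bits per weight. The quantized figures sit above their nominal bit widths because K-quant and Q8_0 formats store per-block scales alongside the weights.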

System Prompt

Atem-Pharaoh-3B's identity is baked into the chat template. For production use, override with a custom system prompt tailored to your use case (see Prompting Guidance above). The default identity:

You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.

Roadmap

Stage	Status	Description
Stage 1 — SFT	✅ Complete	Atem-3B — direct reasoning
Stage 2 — CoT SFT	✅ Complete	Atem-Pharaoh-3B — this model
Stage 3 — DPO/IPO	🔄 Planned	Preference-aligned reasoning

Citation

@misc{atem_pharaoh_3b_2026,
 author = {Asad, Zain},
 title = {Atem-Pharaoh-3B: Chain-of-Thought Reasoning via Stage 2 CoT SFT},
 year = {2026},
 publisher = {HuggingFace},
 howpublished = {\url{https://huggingface.co/EphAsad/Atem-Pharaoh-3B}},
}

License

Released under the Apache 2.0 License, consistent with the base model lineage (Qwen2.5-3B-Instruct → Atem-3B → Atem-Pharaoh-3B).

Built independently by EphAsad
