
URL: https://huggingface.co/EphAsad/Atem-Pharaoh-3B


Atem-Pharaoh-3B

Ancient logic. Modern intelligence.

The 3B chain-of-thought model: explicit reasoning traces at scale.



Overview

Atem-Pharaoh-3B is the Stage 2 release of the 3B Atem series: a chain-of-thought fine-tune built on top of Atem-3B, trained to produce explicit <think>...</think> reasoning traces before arriving at a final answer. Where Atem-3B was trained to answer directly, Pharaoh is trained to think out loud.

Training used approximately 38,000 examples drawn from a pool of ~63,500 CoT-annotated records across mathematics, code, science, and general reasoning. A deliberate 75%/25% think/no-think split was applied: the model was trained on structured reasoning traces for the majority of examples and direct answers for the remainder, so that it can operate in both modes depending on how it is prompted.

Design note: Atem-Pharaoh-3B has a confirmed tendency toward verbose outputs and, on open-ended questions with many valid answers, occasional think trace runaways. Custom system prompts are strongly recommended to control verbosity, chain-of-thought depth, and output length. See the Prompting Guidance section below.


The Atem Series

1.5B Series

| Model | Stage | Capability |
| --- | --- | --- |
| Atem v1 | Stage 1 (SFT) | Fast, direct reasoning |
| Atem-Wisdom | Stage 2 (CoT) | Explicit thinking traces |
| Atem-Pharaoh-1.5B (planned) | Stage 3 (DPO/IPO) | Preference-aligned reasoning |

3B Series

| Model | Stage | Capability |
| --- | --- | --- |
| Atem-3B | Stage 1 (SFT) | Direct reasoning at 3B scale |
| Atem-Pharaoh-3B | Stage 2 (CoT) | Explicit reasoning traces at 3B scale |
| Atem-Pharaoh-3B-DPO (planned) | Stage 3 (DPO/IPO) | Preference-aligned reasoning |

Model Details

| Property | Value |
| --- | --- |
| Base model | EphAsad/Atem-3B |
| Training method | LoRA SFT, Stage 2 (CoT think traces) |
| LoRA config | r=32, alpha=64, dropout=0.05 |
| Parameters | ~3.09B |
| Trainable parameters | 59,867,136 (1.90%) |
| Training records | 38,157 (after token length filtering) |
| Think / no-think split | 75% / 25% |
| Epochs | 2 |
| Final val loss | 0.9494 |
| Hardware | NVIDIA A100-SXM4-80GB |
| Max sequence length | 4,096 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |
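The trainable-parameter figure can be checked by hand. The sketch below assumes the published Qwen2.5-3B layer dimensions (hidden size 2048, MLP size 11008, 36 layers, 16 query heads, 2 key/value heads) and assumes LoRA was applied to all seven linear projections in each layer. Neither the dimensions nor the target modules are stated in this card. Under those assumptions, a rank-32 adapter reproduces the count in the table.

```python
# Assumed Qwen2.5-3B dimensions; these are not stated in this card.
HIDDEN = 2048
INTERMEDIATE = 11008
NUM_LAYERS = 36
NUM_HEADS = 16
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS  # 128


def lora_trainable_params(rank: int) -> int:
    """Count LoRA parameters, assuming all seven projections are adapted."""
    kv_dim = NUM_KV_HEADS * HEAD_DIM  # 256 with grouped-query attention
    # (input features, output features) of each adapted linear layer.
    shapes = {
        "q_proj": (HIDDEN, HIDDEN),
        "k_proj": (HIDDEN, kv_dim),
        "v_proj": (HIDDEN, kv_dim),
        "o_proj": (HIDDEN, HIDDEN),
        "gate_proj": (HIDDEN, INTERMEDIATE),
        "up_proj": (HIDDEN, INTERMEDIATE),
        "down_proj": (INTERMEDIATE, HIDDEN),
    }
    # Each adapter adds A (rank x d_in) and B (d_out x rank).
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * NUM_LAYERS


print(lora_trainable_params(32))  # 59867136
```

The match with 59,867,136 is consistent with an all-linear-layers LoRA target list, but it does not confirm the actual training configuration.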

Output Format

Atem-Pharaoh-3B produces responses in one of two formats depending on the prompt and training signal:

Think mode (75% of training):

<think>
{step-by-step reasoning trace}
</think>

{final answer}

Direct mode (25% of training):

{direct answer, no think tags}

The model defaults to think mode for most queries. To reliably suppress or encourage CoT, use a custom system prompt (see below).
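Downstream code usually needs the final answer separated from the trace. The following is a minimal sketch of one way to do that, assuming the decoded text contains literal <think> and </think> tags as shown in the formats above. The helper name `split_think` is hypothetical and is not part of the model or any library.

```python
import re

# Non-greedy match of the first <think>...</think> block, across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(response: str) -> tuple[str, str]:
    """Return (trace, answer); trace is empty for direct-mode responses."""
    match = THINK_RE.search(response)
    if match is None:
        # Direct mode: no think tags, the whole response is the answer.
        return "", response.strip()
    trace = match.group(1).strip()
    answer = response[match.end():].strip()
    return trace, answer


trace, answer = split_think("<think>\n2 + 2 = 4\n</think>\n\n4")
print(repr(trace), repr(answer))  # '2 + 2 = 4' '4'
```

A response that opens a trace but never closes it (for example, one cut off by max_new_tokens) is returned unchanged as the answer; callers may want to treat that case separately.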


Prompting Guidance

Atem-Pharaoh-3B follows system prompt instructions. The default identity is baked into the chat template and produces think traces on most inputs. For deployments where verbosity, output length, or CoT depth needs controlling, the following prompt patterns are recommended.

Suppress CoT: direct answers only

You are Atem, a precise and analytical assistant. Respond directly and concisely.
Do not show internal reasoning. Answer the question and stop.

Calibrate length to question complexity

You are Atem, a precise and analytical assistant. Match your response length to
the complexity of the question: a single sentence for simple questions, full
reasoning for complex ones. Do not over-explain.

Full CoT: maximise reasoning depth

You are Atem, a precise and analytical assistant. Think through every problem
step by step before answering. Show your full reasoning inside <think> tags,
then give your final answer.

Cap think trace length

You are Atem, a precise and analytical assistant. When you reason through a
problem, keep your thinking concise and aim for no more than 150 words inside
<think> tags. Then give a clear, direct final answer.

Without a custom prompt, the model will use the default identity and tend toward longer, more structured outputs. On open-ended questions with many valid answers, this can result in extended reasoning traces. Prompting with an explicit length or format constraint reliably corrects this.


Training Data

Stage 2 training used approximately 38,000 examples after token-length filtering, drawn from a pool of ~63,500 CoT-annotated records. Chinese-language reasoning traces from Kimi K2.5 were filtered using an ASCII character ratio threshold before inclusion; non-English traces were downgraded to the no-think pool rather than discarded entirely. OpenR1-Math examples were filtered to correctness_llama == True only.
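An ASCII character ratio filter of the kind described above can be sketched in a few lines. The 0.9 threshold and the function names are assumptions for illustration only; the card does not state the threshold that was used.

```python
def ascii_ratio(text: str) -> float:
    """Fraction of characters in text that are ASCII."""
    if not text:
        return 1.0  # Treat empty text as trivially ASCII.
    return sum(ch.isascii() for ch in text) / len(text)


def is_mostly_english(trace: str, threshold: float = 0.9) -> bool:
    """Hypothetical check: keep a trace in the think pool if it passes."""
    return ascii_ratio(trace) >= threshold


print(is_mostly_english("The answer is 4."))  # True
print(is_mostly_english("答案是四。"))  # False
```

Under the scheme described in this section, a record failing such a check would be moved to the no-think pool rather than dropped.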

The think/no-think split was enforced programmatically: after all datasets were loaded into a think pool and a no-think pool, records were flipped from think→no-think until the no-think pool reached 25% of the total corpus.
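The flipping step can be sketched as follows. This is an illustration under stated assumptions, not the training script: the record layout (a dict with an optional "trace" key), the random choice of which records to flip, and the function name `rebalance` are all hypothetical.

```python
import math
import random


def rebalance(think_pool, no_think_pool, no_think_share=0.25, seed=0):
    """Flip think records to no-think until the target share is reached."""
    think = list(think_pool)
    no_think = list(no_think_pool)
    total = len(think) + len(no_think)
    target = math.ceil(no_think_share * total)

    # Shuffle so that flipped records are an arbitrary sample (an assumption).
    random.Random(seed).shuffle(think)
    while len(no_think) < target and think:
        record = dict(think.pop())  # Copy so the caller's data is untouched.
        record.pop("trace", None)   # Drop the reasoning trace.
        no_think.append(record)
    return think, no_think


think = [{"id": i, "trace": "...", "answer": "a"} for i in range(90)]
direct = [{"id": 100 + i, "answer": "a"} for i in range(10)]
new_think, new_direct = rebalance(think, direct)
print(len(new_think), len(new_direct))  # 75 25
```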

| Dataset | Count | Type |
| --- | --- | --- |
| Modotte/CodeX-2M-Thinking | 10,000 | Code CoT |
| nvidia/OpenCodeReasoning | 10,000 | Code reasoning |
| Jackrong/Kimi-K2.5 (×3 configs) | 15,000 | General / Math / PhD reasoning |
| mitroitskii/OpenR1-Math-220k-formatted | 7,000 | Mathematics (correctness filter) |
| Jackrong/Claude-opus-4.6-TraceInversion-9000x | 7,000 | Inverted reasoning traces |
| trjxter/DeepSeek-V4-Pro-Reasoning-8000x | 8,014 | Reasoning distillation |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | Mixed reasoning |
| FreedomIntelligence/medical-o1-reasoning-SFT | 3,000 | Medical reasoning |

Loss curve:

| Step | Train Loss | Val Loss |
| --- | --- | --- |
| 250 | 1.0215 | 0.9931 |
| 500 | 0.9615 | 0.9663 |
| 750 | 0.9516 | 0.9556 |
| 1000 | 0.9425 | 0.9502 |
| 1194 (final) | 0.9897 | 0.9494 |

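Validation loss falls at every checkpoint, from 0.9931 at step 250 to 0.9494 at the final step. Training loss falls steadily through step 1000; the slight uptick at the final step (0.9425 to 0.9897) is normal end-of-epoch behaviour on a cosine schedule.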
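For readers who prefer perplexity to raw loss, the conversion is a single exponential. The sketch below assumes the reported losses are mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated in this card.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_nll_nats)


# Final validation loss reported for this model.
print(round(perplexity(0.9494), 2))  # 2.58
```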


Evaluation

A/B Comparison: Atem-Pharaoh-3B vs Qwen2.5-3B-Instruct

Evaluated on 30 questions calibrated to 3B model capability across coding, mathematics, analytical reasoning, and language tasks. Both models ran on identical prompts with no system prompt override.

| Metric | Base (Qwen2.5-3B) | Atem-Pharaoh-3B |
| --- | --- | --- |
| Think traces | 0 / 30 | 30 / 30 |
| Avg response length | 152 words | 427 words |
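Both metrics in this table are simple to recompute from a list of raw responses. The sketch below is one plausible way to do so, not the evaluation script behind these numbers; in particular, it counts whitespace-separated words over the whole response including the trace, and the card does not say whether the reported averages were computed that way.

```python
def summarize(responses: list[str]) -> dict:
    """Count responses with a think trace and the mean word count."""
    total = len(responses)
    with_trace = sum("<think>" in r and "</think>" in r for r in responses)
    avg_words = sum(len(r.split()) for r in responses) / total if total else 0.0
    return {"think_traces": with_trace, "total": total, "avg_words": avg_words}


sample = ["<think>a b</think> c", "d e f"]
print(summarize(sample))
# {'think_traces': 1, 'total': 2, 'avg_words': 3.0}
```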

Qualitative findings:

Coding tasks (is_even, count_vowels, list vs tuple, find_max, for vs while): Atem-Pharaoh-3B consistently correct with additional edge case handling and alternative approaches in the trace. Base model answers are correct but minimal.

Mathematical tasks: Both models correct. Pharaoh's traces show full working.

Analytical tasks (student score, shop visitors, correlation/causation, hiring/queuing): Pharaoh produces richer, more structured responses with clearer explanations. The queuing theory response (Q16) demonstrates genuine reasoning depth with well-constructed analogies.

Language tasks: Both models perform comparably. Pharaoh tends toward over-structuring simple tasks.

Known limitations observed in evaluation:

Think trace runaways: On open-ended questions where valid answers are unbounded, the think trace can degenerate into extended enumeration rather than converging on an answer. This was observed on Q27 (sentence ambiguity) in this evaluation and is consistent with behaviour observed in separate testing. The final answer typically recovers correctly, but the trace itself becomes incoherent. Custom system prompts with explicit trace length constraints are the recommended mitigation (see Prompting Guidance).

Verbosity mismatch: Response length does not scale to question complexity. Simple questions receive the same structural treatment as complex ones. A system prompt instructing the model to match length to complexity resolves this reliably.

Occasional tag artifacts: A small number of responses produced nested <think><think> opening tags. This is a minor formatting artifact with no effect on answer quality.
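If the nested opening tags interfere with downstream parsing, they can be collapsed before the response is processed. This is a minimal post-processing sketch; `collapse_think_tags` is a hypothetical helper, not something shipped with the model.

```python
import re

# Two or more consecutive <think> tags, optionally separated by whitespace.
_DUP_OPEN = re.compile(r"(?:<think>\s*){2,}")


def collapse_think_tags(text: str) -> str:
    """Replace a run of repeated <think> opening tags with a single tag."""
    return _DUP_OPEN.sub("<think>\n", text)


raw = "<think><think>\nstep one\n</think>\nanswer"
print(repr(collapse_think_tags(raw)))
# '<think>\nstep one\n</think>\nanswer'
```

Responses with a single opening tag are returned unchanged.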


Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-Pharaoh-3B"

# Load the tokenizer and the bfloat16 weights; device_map places layers automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Explain why a binary search is faster than a linear search."
    }
]

# Render the chat template (including the default identity) and tokenize.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True
)
print(response)

Unsloth (faster inference)

from unsloth import FastLanguageModel
import torch

# Load the model in 4-bit and switch Unsloth to its inference path.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-Pharaoh-3B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Write a Python function to check if a number is prime."
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))

Ollama

# Recommended: best speed/quality balance
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-Pharaoh-3B:Q8_0
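Once a model has been pulled with one of the commands above, it can also be queried programmatically. The sketch below targets Ollama's /api/chat endpoint on its default local address and passes a custom system prompt. The helper names are hypothetical, and the host and model tag are assumptions to adjust for your setup.

```python
import json
import urllib.request

MODEL = "hf.co/EphAsad/Atem-Pharaoh-3B:Q4_K_M"


def build_chat_payload(user_text, system_prompt=None, model=MODEL):
    """Build the JSON body for a non-streaming Ollama chat request."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages, "stream": False}


def chat(user_text, system_prompt=None, host="http://localhost:11434"):
    """Send the request to a running Ollama server and return the reply text."""
    payload = build_chat_payload(user_text, system_prompt)
    request = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Calling chat("Is 91 prime?", "Respond directly and concisely.") requires the Ollama server to be running locally; build_chat_payload can be inspected on its own without a server.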

llama.cpp

llama-server -hf EphAsad/Atem-Pharaoh-3B:Q4_K_M

Available Files

| File | Size | Description |
| --- | --- | --- |
| model-00001-of-00002.safetensors + model-00002-of-00002.safetensors | ~6.2 GB | Full bfloat16 weights |
| Atem-Pharaoh-3B.Q4_K_M.gguf | ~1.93 GB | 4-bit (recommended) |
| Atem-Pharaoh-3B.Q5_K_M.gguf | ~2.22 GB | 5-bit |
| Atem-Pharaoh-3B.Q8_0.gguf | ~3.29 GB | 8-bit (near-lossless) |

System Prompt

Atem-Pharaoh-3B's identity is baked into the chat template. For production use, override with a custom system prompt tailored to your use case (see Prompting Guidance above). The default identity:

You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically: identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.

Roadmap

| Stage | Status | Description |
| --- | --- | --- |
| Stage 1 (SFT) | ✅ Complete | Atem-3B: direct reasoning |
| Stage 2 (CoT SFT) | ✅ Complete | Atem-Pharaoh-3B: this model |
| Stage 3 (DPO/IPO) | 🔄 Planned | Preference-aligned reasoning |

Citation

@misc{atem_pharaoh_3b_2026,
 author = {Asad, Zain},
 title = {Atem-Pharaoh-3B: Chain-of-Thought Reasoning via Stage 2 CoT SFT},
 year = {2026},
 publisher = {HuggingFace},
 howpublished = {\url{https://huggingface.co/EphAsad/Atem-Pharaoh-3B}},
}

License

Released under the Apache 2.0 License, consistent with the base model lineage (Qwen2.5-3B-Instruct → Atem-3B → Atem-Pharaoh-3B).


Built independently by EphAsad
