Atem-3B

Ancient logic. Modern intelligence.

The 3B foundation model of the Atem series — direct reasoning at scale.

Overview

Atem-3B is the first release in the 3B branch of the Atem model series: a Stage 1 supervised fine-tune of Qwen2.5-3B-Instruct on approximately 120,000 training examples spanning mathematics, code, reasoning, and general instruction following.

Where the 1.5B Atem line demonstrated that a small model could be meaningfully improved through careful data curation, Atem-3B applies the same methodology at twice the parameter count. The 3B base provides a stronger foundation — particularly for mathematical reasoning and structured generation — while the training corpus prioritises quality and diversity over volume.

Design philosophy: Think tags were stripped from all training data during preprocessing. Atem-3B is a direct-answer model — it does not produce <think> traces. The reasoning capacity of the 3B base is channelled into producing well-structured, considered responses rather than visible chain-of-thought. A CoT variant is planned for Stage 2.

The Atem Series

1.5B Series

Model	Stage	Capability
Atem v1	Stage 1 — SFT	Fast, direct reasoning
Atem-Wisdom	Stage 2 — CoT	Explicit thinking traces
Atem-Pharaoh (planned)	Stage 3 — DPO/IPO	Preference-aligned reasoning

3B Series

Model	Stage	Capability
Atem-3B	Stage 1 — SFT	Direct reasoning at 3B scale
Atem-3B-Wisdom (planned)	Stage 2 — CoT	Explicit thinking traces
Atem-3B-Pharaoh (planned)	Stage 3 — DPO/IPO	Preference-aligned reasoning

Model Details

Property	Value
Base model	Qwen/Qwen2.5-3B-Instruct
Training method	LoRA SFT — Stage 1 (think tags stripped)
LoRA config	r=32, alpha=64, dropout=0.05
Parameters	~3.09B
Trainable parameters	59,867,136 (1.90%)
Training records	120,043 (after token length filtering)
Epochs	1
Final val loss	0.8384
Hardware	NVIDIA A100-SXM4-80GB
Max sequence length	4,096 tokens
Precision	bfloat16
License	Apache 2.0
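
The trainable-parameter figure can be cross-checked with a little arithmetic. The sketch below assumes that LoRA adapters were attached to all seven linear projections in every transformer block, and it uses the dimensions from the public Qwen2.5-3B config (hidden size 2048, MLP size 11008, 36 layers, 2 key/value heads of dimension 128). The card does not state the target modules, so that choice is an assumption; it is shown here because it reproduces the reported count exactly.

```python
# Assumed architecture dimensions for Qwen2.5-3B (from its public config,
# not from this card).
HIDDEN = 2048
INTERMEDIATE = 11008
LAYERS = 36
KV_DIM = 2 * 128  # 2 key/value heads x head dimension 128
RANK = 32         # LoRA r, from the Model Details table

# (in_features, out_features) of each projection assumed to carry an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int = RANK) -> int:
    """Each adapted matrix gains A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * LAYERS


print(f"{lora_param_count():,}")  # 59,867,136
```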

Output Format

Atem-3B produces direct, structured responses. Think tags were stripped from all training data during preprocessing — the model was trained exclusively on clean outputs with no chain-of-thought traces.

[Direct response — reasoned, structured, no <think> tags]

This is a deliberate Stage 1 design choice. A chain-of-thought variant exposing explicit reasoning traces is planned as Stage 2.

Training Data

Stage 1 training used approximately 120,000 examples drawn from eleven sources. All reasoning traces (<think>...</think> blocks) were stripped prior to training. Records shorter than 20 characters after stripping were excluded.
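
The stripping and length filter described above can be sketched in a few lines. This is a minimal illustration, not the actual preprocessing script: the function names are hypothetical, the filter is applied to the response text only (the card does not say which field was measured), and unterminated `<think>` tags are not handled.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
MIN_CHARS = 20  # minimum length after stripping, as stated above


def strip_think(text: str) -> str:
    """Remove reasoning traces and any surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def clean_records(responses):
    """Strip think blocks, then drop responses that end up too short."""
    cleaned = (strip_think(r) for r in responses)
    return [r for r in cleaned if len(r) >= MIN_CHARS]


sample = [
    "<think>2 + 2 is 4.</think>The sum of 2 and 2 is 4, since 2 + 2 = 4.",
    "<think>Long reasoning here...</think>Yes.",
]
# The second record shrinks to "Yes." and is dropped by the length filter.
print(clean_records(sample))
```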

Dataset	Count	Focus
Modotte/CodeX-2M-Thinking	40,000	Code (think tags stripped)
Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned	23,000	General reasoning (English filtered)
open-r1/OpenThoughts-114k-math	10,000	Mathematics (correct only)
flytech/python-codes-25k	10,000	Python code
FreedomIntelligence/medical-o1-reasoning-SFT	10,000	Medical reasoning
tuanha1305/DeepSeek-R1-Distill	9,000	Reasoning distillation
EphAsad/QWENMillenium-SF	5,000	General instruction
EphAsad/MistralMillenium-SF	5,000	General instruction
WithinUsAI/MiniMax_M2.7_Distilled_5k	5,000	Mixed reasoning
Jackrong/Claude-opus-4.7-TraceInversion-5000x	4,761	Inverted reasoning
EphAsad/Phi4Millennium-SF	2,932	General instruction
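
As a quick consistency check, the per-source counts in the table can be summed and compared with the 120,043 training records reported under Model Details, which the card describes as the count after token length filtering. The arithmetic below only restates numbers from this card; attributing the whole gap to the length filter is an assumption.

```python
# Per-source counts copied from the table above.
SOURCE_COUNTS = {
    "Modotte/CodeX-2M-Thinking": 40_000,
    "Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned": 23_000,
    "open-r1/OpenThoughts-114k-math": 10_000,
    "flytech/python-codes-25k": 10_000,
    "FreedomIntelligence/medical-o1-reasoning-SFT": 10_000,
    "tuanha1305/DeepSeek-R1-Distill": 9_000,
    "EphAsad/QWENMillenium-SF": 5_000,
    "EphAsad/MistralMillenium-SF": 5_000,
    "WithinUsAI/MiniMax_M2.7_Distilled_5k": 5_000,
    "Jackrong/Claude-opus-4.7-TraceInversion-5000x": 4_761,
    "EphAsad/Phi4Millennium-SF": 2_932,
}
REPORTED_TRAINING_RECORDS = 120_043  # from the Model Details table

sampled = sum(SOURCE_COUNTS.values())
dropped = sampled - REPORTED_TRAINING_RECORDS

print(f"sources: {len(SOURCE_COUNTS)}")        # sources: 11
print(f"sampled: {sampled:,}")                 # sampled: 124,693
print(f"dropped: {dropped:,} ({dropped / sampled:.1%})")  # dropped: 4,650 (3.7%)
```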

Chinese-language records in the Kimi K2.5 dataset were filtered out using an ASCII character ratio threshold before inclusion. OpenThoughts-114k-math was restricted to examples flagged correct == True.
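
Both filters are simple to express. The sketch below is illustrative only: the threshold value of 0.9 and the `text` field name are hypothetical (the card does not state either), while the `correct` field follows the description above.

```python
ASCII_THRESHOLD = 0.9  # hypothetical value; the card does not state the threshold


def ascii_ratio(text: str) -> float:
    """Fraction of characters in the string that are plain ASCII."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


def keep_english(records, threshold: float = ASCII_THRESHOLD):
    """Keep records whose text is mostly ASCII, a rough proxy for English."""
    return [r for r in records if ascii_ratio(r["text"]) >= threshold]


def keep_correct(records):
    """Keep only records explicitly marked as correct."""
    return [r for r in records if r.get("correct") is True]


records = [
    {"text": "What is 2 + 2?", "correct": True},
    {"text": "请解释一下什么是线程。", "correct": True},
    {"text": "Solve x + 1 = 3.", "correct": False},
]
print([r["text"] for r in keep_correct(keep_english(records))])
```

An ASCII ratio is a coarse heuristic: it also penalises English text that is heavy in Unicode math symbols or emoji, so the threshold needs tuning against a sample of the data.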

Loss curve:

Step	Train Loss	Val Loss
200	0.9236	0.9011
400	0.9200	0.8796
600	0.8591	0.8685
800	0.8837	0.8585
1000	0.8455	0.8507
1200	0.8359	0.8453
1400	0.8240	0.8413
1600	0.8626	0.8391
1800	0.8940	0.8384
1876 (final)	0.8702	0.8384

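Validation loss falls at every checkpoint from 0.9011 to 0.8384 and flattens over the final steps. It never rises, so there is no overfitting signal, although training loss is noisier from step to step.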
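
The behaviour of the validation loss can be checked directly against the table above. The snippet only restates the logged values and tests that the series never increases.

```python
# Validation loss per logged step, copied from the loss curve table.
VAL_LOSS = {
    200: 0.9011,
    400: 0.8796,
    600: 0.8685,
    800: 0.8585,
    1000: 0.8507,
    1200: 0.8453,
    1400: 0.8413,
    1600: 0.8391,
    1800: 0.8384,
    1876: 0.8384,
}

losses = [VAL_LOSS[step] for step in sorted(VAL_LOSS)]

# True when each checkpoint is no worse than the one before it.
never_rises = all(later <= earlier for earlier, later in zip(losses, losses[1:]))
total_drop = round(losses[0] - losses[-1], 4)

print(never_rises, total_drop)  # True 0.0627
```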

Evaluation

Benchmark Results

Both models were evaluated with lm-evaluation-harness through its Python API under identical conditions: 4-bit quantisation, torch.float16, and the same A100-SXM4-80GB. ARC-Challenge and HellaSwag report zero-shot normalised accuracy; GSM8K is 5-shot. The Delta column is the absolute difference in percentage points.

Task	Base (3B)	Atem-3B	Delta
ARC-Challenge	48.1%	48.0%	-0.1% —
GSM8K (strict-match)	2.1%	37.1%	+35.0%
GSM8K (flexible-extract)	62.4%	64.7%	+2.3% ✓
HellaSwag	73.5%	70.4%	-3.0% ⚠
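
The evaluation setup described above could be reproduced roughly as follows. This is a sketch under stated assumptions rather than the script used for the table: it assumes the `lm_eval` package is installed with its `simple_evaluate` entry point, that bitsandbytes is available for 4-bit loading, and the helper names `build_model_args` and `run_benchmarks` are hypothetical.

```python
def build_model_args(repo_id: str, dtype: str = "float16", load_in_4bit: bool = True) -> str:
    """Build the comma-separated model_args string the harness expects."""
    return f"pretrained={repo_id},dtype={dtype},load_in_4bit={load_in_4bit}"


def run_benchmarks(repo_id: str) -> dict:
    """Run the three tasks with the shot counts given in the text."""
    import lm_eval  # imported lazily so the helper above works without it

    model_args = build_model_args(repo_id)

    # ARC-Challenge and HellaSwag are evaluated zero-shot.
    zero_shot = lm_eval.simple_evaluate(
        model="hf",
        model_args=model_args,
        tasks=["arc_challenge", "hellaswag"],
        num_fewshot=0,
    )
    # GSM8K is evaluated 5-shot.
    five_shot = lm_eval.simple_evaluate(
        model="hf",
        model_args=model_args,
        tasks=["gsm8k"],
        num_fewshot=5,
    )
    return {**zero_shot["results"], **five_shot["results"]}


if __name__ == "__main__":
    for repo in ("Qwen/Qwen2.5-3B-Instruct", "EphAsad/Atem-3B"):
        print(repo, run_benchmarks(repo))
```

Running the base model and the fine-tune through the same function keeps quantisation, dtype, and shot counts identical, which is what makes the deltas comparable.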

Note on GSM8K: lm_eval's strict-match filter uses a #### number regex that only fires when the model produces that exact token sequence. The base Qwen2.5-3B-Instruct solves problems correctly but formats answers conversationally, yielding 2.1% strict-match against a 62.4% flexible-extract — the latter being the accurate measure of base model mathematical capability. Atem-3B's training on math distillation datasets reinforced structured answer termination, producing 37.1% strict-match. The meaningful comparison is flexible-extract: 62.4% → 64.7% (+2.3%) — a genuine but modest improvement. The strict-match delta is a formatting artefact, not a 35-point gain in mathematical reasoning ability.
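
The gap between the two GSM8K filters is easy to see on a toy example. The regular expressions below are simplified stand-ins written for this illustration, not the harness's exact patterns: the strict one requires a `#### number` terminator, while the flexible one takes the last number found anywhere in the response.

```python
import re

# Simplified stand-ins for the two filters (assumption: not the exact
# patterns shipped with lm-evaluation-harness).
STRICT = re.compile(r"#### (-?[\d.,]+)")
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def strict_match(text: str):
    """Return the number after '#### ', or None when the marker is absent."""
    match = STRICT.search(text)
    return match.group(1).replace(",", "") if match else None


def flexible_extract(text: str):
    """Return the last number appearing anywhere in the text."""
    found = NUMBER.findall(text)
    return found[-1].replace(",", "") if found else None


conversational = "She sold 48 clips in April and 24 in May, so the total is 72 clips."
structured = "48 + 24 = 72\n#### 72"

print(strict_match(conversational), flexible_extract(conversational))  # None 72
print(strict_match(structured), flexible_extract(structured))          # 72 72
```

Both responses contain the correct answer, but only the structured one scores under strict matching, which is the formatting effect the note describes.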

Note on HellaSwag: The -3.0% regression is a common pattern when fine-tuning instruct models on structured reasoning and task-completion data. HellaSwag tests commonsense sentence completion in a multiple-choice format; training on problem-solving corpora shifts the model's distribution away from the casual, predictive register that HellaSwag measures. This is a known trade-off, not an indicator of general capability loss.

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-3B"

# Load the tokenizer and the bfloat16 weights; device_map="auto" places
# the layers on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between a process and a thread.",
    }
]

# Render the chat template and append the assistant turn marker.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True,
)
print(response)

Unsloth (faster inference)

from unsloth import FastLanguageModel
import torch

# Load the model in 4-bit with the same 4,096-token limit used in training.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-3B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode

messages = [
    {
        "role": "user",
        "content": "Write a Python function to find all prime numbers up to n.",
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))

Ollama

# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-3B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-3B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-3B:Q8_0

llama.cpp

llama-server -hf EphAsad/Atem-3B:Q4_K_M
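
Once `llama-server` is running, it can be queried over HTTP. The sketch below assumes the server is listening on its default local address (`127.0.0.1:8080`) and uses its OpenAI-compatible chat completions endpoint; the helper names and the sampling values (borrowed from the Transformers example) are illustrative choices, not settings prescribed by this card.

```python
import json
import urllib.request

# Assumed default address of a locally started llama-server.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_payload(prompt: str, temperature: float = 0.7, top_p: float = 0.9,
                  max_tokens: int = 1024) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


def ask(prompt: str, url: str = SERVER_URL) -> str:
    """POST the prompt to the server and return the assistant's reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask("Explain the difference between a process and a thread."))
```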

Available Files

File	Size	Description
`model-00001-of-00002.safetensors` + `model-00002-of-00002.safetensors`	~6.2 GB	Full bfloat16 weights
`Atem-3b.Q4_K_M.gguf`	~1.93 GB	4-bit — recommended
`Atem-3b.Q5_K_M.gguf`	~2.22 GB	5-bit
`Atem-3b.Q8_0.gguf`	~3.29 GB	8-bit — near-lossless

System Prompt

Atem-3B's identity is baked into the chat template and activates without an explicit system message. To override manually:

You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.
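
With Transformers, a custom system prompt is supplied as the first message in the conversation. The sketch below assumes the chat template honours a leading `system` message, as Qwen2.5-style templates generally do; `build_messages` is a hypothetical helper, and the prompt text is the one shown above.

```python
ATEM_SYSTEM_PROMPT = (
    "You are Atem, a precise and analytical reasoning assistant. You approach "
    "every problem methodically — identifying core concepts, reasoning step by "
    "step, and arriving at well-supported conclusions. You show your thinking "
    "clearly and are thorough, direct, and intellectually honest."
)


def build_messages(user_prompt: str, system_prompt: str = ATEM_SYSTEM_PROMPT) -> list:
    """Return a chat history with an explicit system message first."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("Summarise the CAP theorem in three sentences.")

# The list can then be passed to the tokenizer exactly as in the usage examples:
# inputs = tokenizer.apply_chat_template(
#     messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
# )
```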

Roadmap

Stage	Status	Description
Stage 1 — SFT	✅ Complete	Atem-3B — this model
Stage 2 — CoT SFT	🔄 Planned	Atem-3B-Wisdom — chain-of-thought traces
Stage 3 — DPO/IPO	🔄 Planned	Atem-3B-Pharaoh — preference-aligned reasoning

Citation

@misc{atem_3b_2026,
 author = {Asad, Zain},
 title = {Atem-3B: A 3B Direct-Reasoning Model via Stage 1 SFT},
 year = {2026},
 publisher = {HuggingFace},
 howpublished = {\url{https://huggingface.co/EphAsad/Atem-3B}},
}

License

Released under the Apache 2.0 License, consistent with the base model (Qwen2.5-3B-Instruct).

Built independently by EphAsad

URL: https://huggingface.co/EphAsad/Atem-3B
