Atem-3B
Ancient logic. Modern intelligence.
The 3B foundation model of the Atem series: direct reasoning at scale.
Overview
Atem-3B is the first release in the 3B branch of the Atem model series: a Stage 1 supervised fine-tune of Qwen2.5-3B-Instruct on approximately 120,000 training examples spanning mathematics, code, reasoning, and general instruction following.
Where the 1.5B Atem line demonstrated that a small model could be meaningfully improved through careful data curation, Atem-3B applies the same methodology at twice the parameter count. The 3B base provides a stronger foundation, particularly for mathematical reasoning and structured generation, while the training corpus prioritises quality and diversity over volume.
Design philosophy: Think tags were stripped from all training data during preprocessing. Atem-3B is a direct-answer model: it does not produce <think> traces. The reasoning capacity of the 3B base is channelled into producing well-structured, considered responses rather than visible chain-of-thought. A CoT variant is planned for Stage 2.
The Atem Series
1.5B Series
| Model | Stage | Capability |
|---|---|---|
| Atem v1 | Stage 1: SFT | Fast, direct reasoning |
| Atem-Wisdom | Stage 2: CoT | Explicit thinking traces |
| Atem-Pharaoh (planned) | Stage 3: DPO/IPO | Preference-aligned reasoning |
3B Series
| Model | Stage | Capability |
|---|---|---|
| Atem-3B | Stage 1: SFT | Direct reasoning at 3B scale |
| Atem-3B-Wisdom (planned) | Stage 2: CoT | Explicit thinking traces |
| Atem-3B-Pharaoh (planned) | Stage 3: DPO/IPO | Preference-aligned reasoning |
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Training method | LoRA SFT, Stage 1 (think tags stripped) |
| LoRA config | r=32, alpha=64, dropout=0.05 |
| Parameters | ~3.09B |
| Trainable parameters | 59,867,136 (1.90%) |
| Training records | 120,043 (after token length filtering) |
| Epochs | 1 |
| Final val loss | 0.8384 |
| Hardware | NVIDIA A100-SXM4-80GB |
| Max sequence length | 4,096 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |
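The trainable-parameter figure in the table can be reproduced with a few lines of arithmetic. The sketch below assumes the adapter targets all seven linear projections in every decoder layer and uses layer dimensions taken from the published Qwen2.5-3B configuration; neither detail is stated in this card, so treat both as assumptions.

```python
# Layer dimensions assumed from the published Qwen2.5-3B config (not stated in this card).
HIDDEN = 2048
INTERMEDIATE = 11008
NUM_LAYERS = 36
NUM_HEADS = 16
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS        # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM      # 256 (grouped-query attention)

RANK = 32  # r from the LoRA config row above

# (in_features, out_features) for each targeted projection.
# Assumption: all seven linear layers are adapted.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int, in_features: int, out_features: int) -> int:
    """A LoRA adapter adds A (rank x in) and B (out x rank) matrices."""
    return rank * (in_features + out_features)


def total_trainable(rank: int = RANK) -> int:
    per_layer = sum(lora_params(rank, i, o) for i, o in TARGETS.values())
    return per_layer * NUM_LAYERS


trainable = total_trainable()
# Assumption: the percentage counts the adapter weights in the denominator.
share = 100 * trainable / (3.09e9 + trainable)
print(f"{trainable:,} trainable parameters ({share:.2f}%)")
```

Under these assumptions the count matches the 59,867,136 reported above, and the 1.90% share is reproduced when the adapter weights are included in the total.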
Output Format
Atem-3B produces direct, structured responses. Think tags were stripped from all training data during preprocessing, so the model was trained exclusively on clean outputs with no chain-of-thought traces.
[Direct response: reasoned, structured, no <think> tags]
This is a deliberate Stage 1 design choice. A chain-of-thought variant exposing explicit reasoning traces is planned as Stage 2.
Training Data
Stage 1 training used approximately 120,000 examples drawn from eleven sources. All reasoning traces (<think>...</think> blocks) were stripped prior to training. Records shorter than 20 characters after stripping were excluded.
| Dataset | Count | Focus |
|---|---|---|
| Modotte/CodeX-2M-Thinking | 40,000 | Code (think tags stripped) |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned | 23,000 | General reasoning (English filtered) |
| open-r1/OpenThoughts-114k-math | 10,000 | Mathematics (correct only) |
| flytech/python-codes-25k | 10,000 | Python code |
| FreedomIntelligence/medical-o1-reasoning-SFT | 10,000 | Medical reasoning |
| tuanha1305/DeepSeek-R1-Distill | 9,000 | Reasoning distillation |
| EphAsad/QWENMillenium-SF | 5,000 | General instruction |
| EphAsad/MistralMillenium-SF | 5,000 | General instruction |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | Mixed reasoning |
| Jackrong/Claude-opus-4.7-TraceInversion-5000x | 4,761 | Inverted reasoning |
| EphAsad/Phi4Millennium-SF | 2,932 | General instruction |
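Chinese-language records from Kimi K2.5 were filtered out using an ASCII character ratio threshold before inclusion. OpenThoughts-114k-math was filtered to correct == True examples only. The per-source counts above sum to 124,693; 120,043 records remained after token length filtering (see Model Details).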
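The preprocessing steps described in this section can be sketched as follows. This is a minimal illustration, not the actual pipeline: the function names are hypothetical, and the ASCII threshold of 0.9 is an assumed value because the card does not state the one that was used.

```python
import re
from typing import Optional

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

MIN_CHARS = 20          # minimum length after stripping, as stated above
ASCII_THRESHOLD = 0.9   # hypothetical value; the real threshold is not given


def strip_think(text: str) -> str:
    """Remove all reasoning traces and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def ascii_ratio(text: str) -> float:
    """Fraction of characters that are plain ASCII."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


def clean_response(text: str, require_english: bool = False) -> Optional[str]:
    """Return the cleaned response, or None if the record should be dropped."""
    cleaned = strip_think(text)
    if len(cleaned) < MIN_CHARS:
        return None
    if require_english and ascii_ratio(cleaned) < ASCII_THRESHOLD:
        return None
    return cleaned
```

In this sketch the `require_english` flag would be switched on only for sources that mix languages, such as the Kimi K2.5 set.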
Loss curve:
| Step | Train Loss | Val Loss |
|---|---|---|
| 200 | 0.9236 | 0.9011 |
| 400 | 0.9200 | 0.8796 |
| 600 | 0.8591 | 0.8685 |
| 800 | 0.8837 | 0.8585 |
| 1000 | 0.8455 | 0.8507 |
| 1200 | 0.8359 | 0.8453 |
| 1400 | 0.8240 | 0.8413 |
| 1600 | 0.8626 | 0.8391 |
| 1800 | 0.8940 | 0.8384 |
| 1876 (final) | 0.8702 | 0.8384 |
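Validation loss decreases at every checkpoint up to step 1800 and holds at 0.8384 through the final step. It never rises, so there is no overfitting signal within the single epoch; training loss is noisier, as expected for per-step values.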
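The trend in the table can be checked mechanically. The snippet below copies the validation column and tests that it never increases. The final line is an inference rather than a reported setting: the step count happens to be consistent with an effective batch size of 64 over the 120,043 training records.

```python
import math

# Validation loss per checkpoint, copied from the table above.
val_loss = {
    200: 0.9011, 400: 0.8796, 600: 0.8685, 800: 0.8585, 1000: 0.8507,
    1200: 0.8453, 1400: 0.8413, 1600: 0.8391, 1800: 0.8384, 1876: 0.8384,
}


def is_non_increasing(values):
    """True if no value is larger than the one before it."""
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


losses = [val_loss[step] for step in sorted(val_loss)]
non_increasing = is_non_increasing(losses)
total_drop = round(losses[0] - losses[-1], 4)

# Assumption: one optimiser step per 64 records (not stated in the card).
implied_steps = math.ceil(120_043 / 64)

print(non_increasing, total_drop, implied_steps)
```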
Evaluation
Benchmark Results
Both models were evaluated with lm-evaluation-harness via its Python API under identical conditions. ARC-Challenge and HellaSwag report zero-shot normalised accuracy; GSM8K is 5-shot. Both models were loaded with 4-bit quantisation in torch.float16 on the same A100-SXM4-80GB.
| Task | Base (3B) | Atem-3B | Delta (pp) |
|---|---|---|---|
| ARC-Challenge | 48.1% | 48.0% | -0.1 |
| GSM8K (strict-match) | 2.1% | 37.1% | +35.0 |
| GSM8K (flexible-extract) | 62.4% | 64.7% | +2.3 |
| HellaSwag | 73.5% | 70.4% | -3.0 |
Note on GSM8K: lm_eval's strict-match filter uses a #### number regex that only matches when the model emits that exact answer format. The base Qwen2.5-3B-Instruct solves problems correctly but formats answers conversationally, yielding 2.1% strict-match against 62.4% flexible-extract; the latter is the more accurate measure of the base model's mathematical capability. Atem-3B's training on math distillation datasets reinforced structured answer termination, producing 37.1% strict-match. The meaningful comparison is flexible-extract: 62.4% to 64.7% (+2.3 percentage points), a genuine but modest improvement. The strict-match delta is a formatting artefact, not a 35-point gain in mathematical reasoning ability.
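The gap between the two GSM8K filters is easier to see on a concrete pair of answers. The sketch below uses simplified patterns modelled on the harness's GSM8K filters; the exact expressions and post-processing in lm_eval may differ, and the helper names are hypothetical.

```python
import re

# Simplified approximations of the two GSM8K answer filters (assumed, not copied).
STRICT = re.compile(r"#### (-?[0-9.,]+)")
FLEXIBLE = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")


def normalise(value: str) -> str:
    """Drop thousands separators, currency signs, and a trailing period."""
    return value.replace(",", "").replace("$", "").rstrip(".")


def strict_match(text: str):
    """Return the number following '#### ', or None if the marker is absent."""
    match = STRICT.search(text)
    return normalise(match.group(1)) if match else None


def flexible_extract(text: str):
    """Return the last number-like span anywhere in the text."""
    matches = FLEXIBLE.findall(text)
    if not matches:
        return None
    first_alt, second_alt = matches[-1]
    return normalise(first_alt or second_alt)


conversational = "Janet sells 9 eggs at $2 each, so she makes $18 per day."
structured = "9 * 2 = 18\n#### 18"

print(strict_match(conversational), flexible_extract(conversational))  # None 18
print(strict_match(structured), flexible_extract(structured))          # 18 18
```

Both answers are correct, but only the structured one is credited under strict-match, which is the effect behind the 2.1% base score.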
Note on HellaSwag: The 3.0-point regression is a common pattern when fine-tuning instruct models on structured reasoning and task-completion data. HellaSwag tests commonsense sentence completion in a multiple-choice format; training on problem-solving corpora shifts the model's distribution away from the casual, predictive register that HellaSwag measures. This is a known trade-off rather than evidence of a general capability loss.
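The evaluation setup described at the start of this section could be reproduced roughly as below. This is a sketch under stated assumptions, not the script used for the reported numbers: the helper names and the batch size are hypothetical, and passing `load_in_4bit=True` through `model_args` depends on the installed harness and transformers versions.

```python
from typing import Any

# (task names, few-shot count) pairs mirroring the setup described above.
RUNS = [
    (["arc_challenge", "hellaswag"], 0),
    (["gsm8k"], 5),
]


def build_eval_kwargs(model_id: str, tasks: list, num_fewshot: int) -> dict:
    """Assemble keyword arguments for lm_eval.simple_evaluate."""
    kwargs: dict[str, Any] = {
        "model": "hf",
        "model_args": f"pretrained={model_id},load_in_4bit=True,dtype=float16",
        "tasks": tasks,
        "num_fewshot": num_fewshot,
        "batch_size": 8,  # assumption: not stated in the card
    }
    return kwargs


def evaluate(model_id: str) -> dict:
    """Run both configurations and merge the per-task result dictionaries."""
    import lm_eval  # imported lazily; needs lm-evaluation-harness and a GPU

    scores = {}
    for tasks, shots in RUNS:
        output = lm_eval.simple_evaluate(**build_eval_kwargs(model_id, tasks, shots))
        scores.update(output["results"])
    return scores
```

Calling `evaluate("Qwen/Qwen2.5-3B-Instruct")` and `evaluate("EphAsad/Atem-3B")` with the same settings keeps the comparison like-for-like.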
Usage
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-3B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between a process and a thread.",
    }
]

# Render the chat template and tokenise the prompt in one step.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[0][inputs.shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
Unsloth (faster inference)
```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-3B",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
# Switch Unsloth to its optimised inference mode.
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Write a Python function to find all prime numbers up to n.",
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    output = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```
Ollama
```shell
# Recommended: best speed/quality balance
ollama run hf.co/EphAsad/Atem-3B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-3B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-3B:Q8_0
```
llama.cpp
```shell
llama-server -hf EphAsad/Atem-3B:Q4_K_M
```
Available Files
| File | Size | Description |
|---|---|---|
| model-00001-of-00002.safetensors + model-00002-of-00002.safetensors | ~6.2 GB | Full bfloat16 weights |
| Atem-3b.Q4_K_M.gguf | ~1.93 GB | 4-bit (recommended) |
| Atem-3b.Q5_K_M.gguf | ~2.22 GB | 5-bit |
| Atem-3b.Q8_0.gguf | ~3.29 GB | 8-bit (near-lossless) |
System Prompt
Atem-3B's identity is baked into the chat template and activates without an explicit system message. To override manually:
You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically, identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.
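One way to supply an override is to put a system message at the front of the conversation before applying the chat template. The helper below is a hypothetical sketch; it assumes the template honours a leading system message, as the Qwen2.5 chat template does, and the example prompt string is purely illustrative.

```python
from typing import Optional


def build_messages(user_prompt: str, system_prompt: Optional[str] = None) -> list:
    """Build a chat message list, optionally led by a custom system prompt."""
    messages = []
    if system_prompt:
        # An explicit system message is assumed to replace the built-in identity.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


# Illustrative override; any system prompt text can be passed here.
custom = build_messages(
    "Summarise the CAP theorem.",
    system_prompt="You are Atem. Answer in one short paragraph.",
)
default = build_messages("Summarise the CAP theorem.")
```

The resulting list can be passed as `messages` to `tokenizer.apply_chat_template` in the usage examples.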
Roadmap
| Stage | Status | Description |
|---|---|---|
| Stage 1: SFT | Complete | Atem-3B (this model) |
| Stage 2: CoT SFT | Planned | Atem-3B-Wisdom: chain-of-thought traces |
| Stage 3: DPO/IPO | Planned | Atem-3B-Pharaoh: preference-aligned reasoning |
Citation
```bibtex
@misc{atem_3b_2026,
  author       = {Asad, Zain},
  title        = {Atem-3B: A 3B Direct-Reasoning Model via Stage 1 SFT},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-3B}},
}
```
License
Released under the Apache 2.0 License, consistent with the base model (Qwen2.5-3B-Instruct).
Built independently by EphAsad
