Atem-8B
Ancient logic. Modern intelligence.
An 8B reasoning model trained via a single CoT-preserving SFT pass directly on Qwen3-8B, distilling multi-domain reasoning capability from frontier teacher models while keeping the base model's native thinking capability intact.
Overview
Atem-8B is an 8B-parameter reasoning model built via a single supervised fine-tuning pass on raw Qwen3-8B. Like Atem-4B, it uses a CoT-preserving single-pass design: it builds reasoning capability on top of the base model's intact native foundation rather than erasing and rebuilding thinking in separate stages. Atem-8B is trained on a larger corpus (~91K records before filtering vs ~63K for 4B) with higher per-source caps, producing a model with broader reasoning coverage across mathematics, coding, science, and general domains.
This is the most thoroughly evaluated model in the Atem series, benchmarked across nine tasks including a custom flexible GSM8K evaluator that diagnoses the formatting shift introduced by CoT training.
Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Training method | Single-pass CoT-Preserving LoRA SFT |
| LoRA config | r=64, alpha=128, dropout=0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Parameters | ~8.37B |
| Trainable (LoRA) params | 174,587,904 (2.09% of base) |
| Training records | 58,980 (after token-length filtering) |
| Think / No-think split | 85% / 15% |
| Epochs | 2 (ceiling; early stopping patience=3, never triggered) |
| Effective batch size | 64 (batch 4 × grad accum 16) |
| Learning rate | 1e-4, cosine schedule, 5% warmup |
| Max sequence length | 6,144 tokens |
| Precision | bfloat16 (full 16-bit LoRA, not QLoRA) |
| Hardware | NVIDIA A100-SXM4 80GB |
| Runtime | 7h40m |
| License | Apache 2.0 |
Design Notes
Single combined pass. The earlier Atem-0.6B pipeline erased Qwen3's native thinking mode in Stage 1, then re-imposed an externally distilled style in Stage 2. This introduced measurable capability costs: the base model's exposed reasoning self-corrected on problems the no-think version got wrong, and ARC-Challenge regressed after Stage 2. Atem-8B skips the erasure entirely: one pass, intact native reasoning, external CoT styles layered on a foundation that still works.
Full 16-bit LoRA. At 8B with an 80GB A100, full 16-bit LoRA requires ~33GB, comfortably within budget. It is both marginally faster and marginally more accurate than QLoRA at equivalent effective batch sizes, since QLoRA pays compute overhead on quantize/dequantize operations at each step.
r=64, alpha=128. r=64 on Qwen3-8B represents 2.09% of the model, somewhat lower than the proven 4B baseline of 3.11%. The base weight matrices grow roughly quadratically with hidden size while LoRA adapters grow only linearly, so proportional capacity decreases modestly as model size grows; r=96 would more closely match the 4B reference point. Not a blocker for this run, and noted for future iterations.
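The trainable-parameter figure can be reproduced from the adapter shapes. The sketch below assumes the published Qwen3-8B architecture (36 decoder layers, hidden size 4096, 32 query heads and 8 key/value heads of dimension 128, MLP width 12288); those dimensions are not stated in this card, and `lora_param_count` is a hypothetical helper rather than part of the training code.

```python
# Assumed Qwen3-8B dimensions (from the public model config, not from this card).
HIDDEN = 4096
NUM_LAYERS = 36
Q_OUT = 32 * 128   # query heads * head_dim
KV_OUT = 8 * 128   # key/value heads * head_dim (grouped-query attention)
MLP = 12288

# (in_features, out_features) of each LoRA target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Each adapted matrix adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


total_params = 8.37e9  # approximate count quoted in Model Details
for rank in (64, 96):
    n = lora_param_count(rank)
    print(f"r={rank}: {n:,} trainable params ({100 * n / total_params:.2f}% of total)")
```

Under these assumptions r=64 gives 174,587,904 trainable parameters, matching the Model Details table.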
Corpus scale. Atem-8B draws from the same eight source datasets as Atem-4B but with higher per-source caps: 91,017 total records before ratio adjustment vs ~63,563 for 4B, yielding 58,980 usable training examples after token-length filtering at 6,144.
Intended Use
Atem-8B is designed for general reasoning tasks where structured, step-by-step thinking adds value:
- Multi-step mathematical reasoning
- Code explanation, implementation, and debugging
- Analytical reasoning and argument evaluation
- Scientific explanation requiring technical depth
- Commonsense reasoning and physical intuition
- Logic, fallacy identification, and conditional reasoning
- Concept explanation across diverse domains
Training Data
Atem-8B was trained on a corpus assembled from eight sources covering mathematics, coding, general reasoning, scientific reasoning, and medical reasoning. All sources include explicit chain-of-thought reasoning traces; 85% of training records were formatted with full think traces and 15% as direct answers.
| Dataset | Records | Source / Teacher |
|---|---|---|
| mitroitskii/OpenR1-Math-220k-formatted | ~10,938 | DeepSeek-R1: Mathematics (correctness-filtered) |
| Jackrong/Claude-opus-4.6-TraceInversion-9000x | 7,000 | Claude Opus 4.6: Trace Inversion |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Math) | 8,000 | Kimi K2.5: Mathematical Reasoning |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Distillation) | 8,000 | Kimi K2.5: General Reasoning |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (PHD-Science) | 8,000 | Kimi K2.5: Scientific Reasoning |
| WithinUsAI/MiniMax_M2.7_Distilled_5k | 5,000 | MiniMax M2.7 |
| FreedomIntelligence/medical-o1-reasoning-SFT | 7,500 | Medical reasoning (English config) |
| Modotte/CodeX-2M-Thinking | 15,000 | Mixed: Coding with CoT |
| trjxter/DeepSeek-V4-Pro-Reasoning-8000x | ~8,014 | DeepSeek-V4-Pro |
| nvidia/OpenCodeReasoning | 15,000 | Mixed: Competitive coding |
| Total (pre-filter pool) | 91,017 | |
| Total (post-filter, trained on) | 58,980 | |
Non-English reasoning traces (primarily CJK) were filtered at the trace level using an ASCII-ratio threshold; records with CJK traces were retained as no-think records rather than discarded. The 34.3% filter rate reflects the same 6,144-token ceiling that filtered 32.7% of the Atem-4B corpus: the longest, most complex reasoning traces from competitive programming and advanced mathematics exceed this limit.
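A trace-level ASCII-ratio check of the kind described above could look like the following sketch. The card does not state the threshold or the exact rule used, so `ASCII_THRESHOLD`, `ascii_ratio`, and `choose_mode` are hypothetical names and values chosen purely for illustration.

```python
# Hypothetical cutoff; the card does not state the threshold actually used.
ASCII_THRESHOLD = 0.9


def ascii_ratio(text: str) -> float:
    """Fraction of characters in `text` that are ASCII (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


def choose_mode(trace: str, threshold: float = ASCII_THRESHOLD) -> str:
    """Keep a mostly-ASCII trace ("think"); otherwise keep the record
    but drop its trace ("no_think") instead of discarding the record."""
    return "think" if ascii_ratio(trace) >= threshold else "no_think"


print(choose_mode("First, add 2 and 3 to get 5."))
print(choose_mode("首先，我们把2和3相加得到5。"))
```

The design point is that a record with a non-English trace still contributes its prompt and final answer as a direct-answer example.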
Training Configuration
```python
import torch

# Key hyperparameters
lora_r = 64
lora_alpha = 128
lora_dropout = 0.05
max_seq_length = 6144
learning_rate = 1e-4
lr_scheduler = "cosine"
warmup_ratio = 0.05
batch_size = 4
grad_accumulation = 16            # effective batch size: 64
num_epochs = 2                    # ceiling; early stopping patience=3
eval_steps = 150
early_stopping_patience = 3
early_stopping_threshold = 0.001
nothink_ratio = 0.15
load_in_4bit = False              # full 16-bit LoRA
dtype = torch.bfloat16
```
Training used Unsloth with train_on_responses_only masking. Early stopping was configured with patience=3 and threshold=0.001; it did not trigger, as validation loss improved at every checkpoint throughout the full 2-epoch run.
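The step count follows from the record count and batch settings. The sketch below is a back-of-the-envelope check, assuming a single GPU and that the final partial batch of each epoch counts as one optimizer step; `optimizer_steps` is a hypothetical helper, not part of the training script.

```python
import math


def optimizer_steps(num_records: int, per_device_batch: int,
                    grad_accum: int, epochs: int) -> int:
    """Optimizer steps, counting a final partial batch as a full step."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_records / effective_batch)
    return steps_per_epoch * epochs


total_steps = optimizer_steps(58_980, per_device_batch=4, grad_accum=16, epochs=2)
eval_points = list(range(150, total_steps + 1, 150))  # eval_steps = 150

print(total_steps, len(eval_points), eval_points[-1])
```

This yields 1,844 steps with 12 scheduled evaluations (steps 150 through 1800), consistent with the final step reported in the loss curve.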
Loss Curve
| Step | Train Loss | Val Loss |
|---|---|---|
| 150 | 0.8661 | 0.8367 |
| 300 | 0.7971 | 0.8120 |
| 450 | 0.8006 | 0.7978 |
| 600 | 0.7992 | 0.7880 |
| 750 | 0.7791 | 0.7822 |
| 900 | 0.7879 | 0.7770 |
| 1050 | 0.7328 | 0.7758 |
| 1200 | 0.7357 | 0.7734 |
| 1350 | 0.7223 | 0.7711 |
| 1500 | 0.7461 | 0.7697 |
| 1650 | 0.7501 | 0.7691 |
| 1800 | 0.7691 | 0.7688 |
| Final (1844) | 0.7847 (avg) | 0.7688 |
Validation loss fell or held steady at every logged checkpoint, from 0.8367 at step 150 to 0.7688 at step 1800, and stayed within about 0.05 of training loss throughout, so the run shows no sign of overfitting. Val loss sat below train loss at steps 150, 450, 600, 900, and 1800, a known artifact when dropout is active during training but not during evaluation. From step 1050 onward, train loss ran mostly below val loss, as expected once the second epoch revisits training examples. The early stopping mechanism was never triggered.
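The early stopping behaviour can be checked against the validation losses in the table. The sketch below uses a simplified patience counter (an improvement only counts if it beats the best loss so far by more than the threshold); it is an approximation for illustration, not the exact callback used in training, and `simulate_early_stopping` is a hypothetical helper.

```python
# Validation loss at steps 150, 300, ..., 1800 (from the loss table).
VAL_LOSS = [0.8367, 0.8120, 0.7978, 0.7880, 0.7822, 0.7770,
            0.7758, 0.7734, 0.7711, 0.7697, 0.7691, 0.7688]


def simulate_early_stopping(losses, patience=3, threshold=0.001):
    """Return (index where training would stop, or None; longest streak
    of evaluations without a sufficient improvement)."""
    best = float("inf")
    streak = 0
    worst_streak = 0
    for i, loss in enumerate(losses):
        if best - loss > threshold:
            streak = 0          # improved by more than the threshold
        else:
            streak += 1         # no sufficient improvement this evaluation
        best = min(best, loss)
        worst_streak = max(worst_streak, streak)
        if streak >= patience:
            return i, worst_streak
    return None, worst_streak


stopped_at, worst_streak = simulate_early_stopping(VAL_LOSS)
print(stopped_at, worst_streak)
```

Under this simplified counter the run never reaches the patience of 3: the last two checkpoints improve by less than 0.001, so the streak ends at 2.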
Evaluation
Benchmark Results
Evaluated against base Qwen3-8B (Qwen/Qwen3-8B) using lm-evaluation-harness. Both models were loaded in 4-bit for evaluation. GSM8K flexible extraction uses a custom evaluator that accepts #### answer, \boxed{answer}, and prose formats; see the formatting shift analysis below.
| Task | Base (Qwen3-8B) | Atem-8B | Delta |
|---|---|---|---|
| ARC-Challenge (0-shot, acc_norm) | 56.5% | 56.9% | +0.4pp |
| GSM8K strict (5-shot, exact_match) | 86.7% | 83.3% | −3.4pp |
| GSM8K flexible (5-shot, custom) | 86.7% | 85.6% | −1.1pp |
| HellaSwag (0-shot, acc_norm) | 74.5% | 76.2% | +1.7pp |
| MMLU (0-shot, acc) | 72.9% | 72.9% | +0.0pp |
| Winogrande (0-shot, acc) | 67.2% | 71.8% | +4.6pp |
| PIQA (0-shot, acc) | 76.2% | 78.1% | +1.9pp |
| OpenBookQA (0-shot, acc_norm) | 41.4% | 43.2% | +1.8pp |
| BoolQ (0-shot, acc) | 85.9% | 84.3% | −1.6pp |
Winogrande (+4.6pp, 2.5σ) is the headline result and the largest gain in the evaluation set. Commonsense pronoun resolution is format-independent and tests exactly the kind of contextual reasoning that CoT training is designed to improve.
HellaSwag (+1.7pp, 2.8σ) uses normalised log-likelihood scoring over multiple-choice options, so it is format-independent and not influenced by generation style. A genuine reasoning signal.
PIQA and OpenBookQA are both positive. All four commonsense and reasoning tasks improved. The direction is consistent and matches the expected effect of training on structured reasoning traces.
MMLU exactly tied at 72.9%. The CoT training neither added nor removed knowledge breadth, which is the expected behaviour for SFT on reasoning data.
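The σ figures quoted for Winogrande and HellaSwag are consistent with an unpaired two-proportion standard error, sketched below. The split sizes (1,267 for the Winogrande validation set, 10,042 for the HellaSwag validation set) are assumptions based on the standard benchmark splits and are not stated in this card; `delta_sigma` is a hypothetical helper, and the card's own figures may have been computed differently.

```python
import math


def delta_sigma(base_acc: float, tuned_acc: float, n: int) -> float:
    """Accuracy difference in units of its (unpaired) standard error."""
    se_base = math.sqrt(base_acc * (1 - base_acc) / n)
    se_tuned = math.sqrt(tuned_acc * (1 - tuned_acc) / n)
    return (tuned_acc - base_acc) / math.hypot(se_base, se_tuned)


# Assumed sizes of the standard validation splits; not stated in this card.
N_WINOGRANDE = 1267
N_HELLASWAG = 10042

winogrande_z = delta_sigma(0.672, 0.718, N_WINOGRANDE)
hellaswag_z = delta_sigma(0.745, 0.762, N_HELLASWAG)
print(f"Winogrande: {winogrande_z:.1f} sigma, HellaSwag: {hellaswag_z:.1f} sigma")
```

Because both models answer the same items, a paired test would be the stricter choice; the unpaired form is used here only because it needs nothing beyond the reported accuracies.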
GSM8K: Formatting Shift Analysis
The strict-match GSM8K regression (−3.4pp) was investigated using a custom flexible extractor that accepts multiple answer formats: #### {number} (lm_eval standard), \boxed{number} (LaTeX, common in mathematics literature), prose declarations, and last-number fallback.
| Extraction method | Atem-8B | Base |
|---|---|---|
| Strict-match #### only | 83.3% | 86.7% |
| Flexible extraction | 85.6% | ~86.7% |
| Recovered by flexible | +2.3pp | n/a |
68% of the observed regression was a formatting artifact. The training corpus (OpenR1-Math, DeepSeek-V4-Pro, Kimi-K2.5) uses \boxed{answer} (LaTeX notation, standard in academic and competition mathematics) rather than the #### answer format specific to the GSM8K dataset. The SFT pass has shifted Atem's preferred answer format from #### toward \boxed{}. lm_eval's strict-match regex only searches for ####, so correct answers in \boxed{} format count as wrong.
The true capability gap after accounting for formatting is approximately −1.1pp, not −3.4pp. The base model retains a small genuine advantage on this benchmark, likely because it was instruction-tuned on GSM8K-format data and naturally reproduces the #### convention.
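A flexible extractor covering the four formats named above might be structured as in the sketch below. This is not the evaluator used for the reported numbers; the regular expressions, the priority order, and the `extract_answer` name are illustrative assumptions.

```python
import re

# A number with optional sign, thousands separators, and decimal part.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"

# Tried in order; the first format that matches wins.
PATTERNS = [
    ("hash", r"####\s*\$?(" + NUMBER + r")"),
    ("boxed", r"\\boxed\{\s*\$?(" + NUMBER + r")\s*\}"),
    ("prose", r"answer\s+is\s*:?\s*\$?(" + NUMBER + r")"),
]


def _to_float(text: str) -> float:
    return float(text.replace(",", ""))


def extract_answer(completion: str):
    """Return (value, method); value is None if no number is found."""
    for method, pattern in PATTERNS:
        matches = re.findall(pattern, completion, flags=re.IGNORECASE)
        if matches:
            return _to_float(matches[-1]), method
    numbers = re.findall(NUMBER, completion)
    if numbers:
        return _to_float(numbers[-1]), "last_number"  # fallback
    return None, "none"


print(extract_answer("2 * 9 = 18\n#### 18"))
print(extract_answer("So the total is \\boxed{18}."))
```

With an extractor of this shape, a correct answer written as \boxed{18} is scored the same as #### 18. The 68% figure is the recovered share of the strict-match gap: 2.3pp of 3.4pp.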
BoolQ (−1.6pp, 1.8σ) is borderline, sitting between noise and statistical significance. BoolQ requires committing to a binary yes/no answer; it's possible the more exploratory CoT training style slightly disadvantaged decisive binary classification. Worth monitoring on future runs.
Usage
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Explain why switching doors in the Monty Hall problem gives a 2/3 probability of winning.",
    }
]

# return_dict=True returns input_ids together with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
print(response)
```
Unsloth (faster inference)
```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-8B",
    max_seq_length=6144,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "A train travels from A to B at 60 km/h and returns at 90 km/h. What is the average speed?",
    }
]

# return_dict=True returns input_ids together with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True))
```
Ollama
```bash
# Recommended: best speed/quality balance
ollama run hf.co/EphAsad/Atem-8B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-8B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-8B:Q8_0
```
llama.cpp
```bash
llama-server -hf EphAsad/Atem-8B:Q4_K_M
```
Sampling Parameters
Use temperature=0.6, top_p=0.95, top_k=20 for thinking mode. This is Qwen3's published recommendation, used throughout this evaluation. Do not use greedy decoding with thinking mode enabled.
System Prompt
Atem-8B's identity is baked into the chat template and activates automatically when no system message is provided. For manual override:
```
You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically - identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.
```
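A manual override amounts to passing a system message ahead of the user turn. The sketch below only builds the message list that would then go to `tokenizer.apply_chat_template`, as in the Usage section; `build_messages` is a hypothetical helper, and the prompt string is shortened to its first sentence for brevity.

```python
def build_messages(user_text, system_prompt=None):
    """Build a chat message list, with an optional system override."""
    messages = []
    if system_prompt:
        # An explicit system message replaces the template's default identity.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return messages


# First sentence of the prompt above; paste the full text in practice.
system_prompt = "You are Atem, a precise and analytical reasoning assistant."

messages = build_messages("Is 221 a prime number?", system_prompt)
print(messages)
```

Omitting `system_prompt` leaves the list with only the user turn, so the identity baked into the chat template applies.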
Available Files
| File | Size | Description |
|---|---|---|
| model-XXXX-of-00004.safetensors (×4) | ~16.4 GB total | Full bfloat16 merged weights |
| Atem-8b.Q4_K_M.gguf | 5.03 GB | 4-bit quantised, recommended |
| Atem-8b.Q5_K_M.gguf | 5.85 GB | 5-bit quantised |
| Atem-8b.Q8_0.gguf | 8.71 GB | 8-bit quantised, near-lossless |
Known Limitations
GSM8K formatting shift. As documented in the evaluation section, the SFT corpus uses \boxed{} notation for mathematical answers rather than the #### format specific to the GSM8K benchmark. This creates a systematic measurement gap under strict-match evaluation (−3.4pp), of which 68% is a formatting artifact. Under flexible extraction the true gap is approximately −1.1pp. For production use, \boxed{answer} is standard in mathematical contexts.
6,144 token sequence ceiling. The training corpus's longest reasoning traces (competitive programming, advanced mathematics) exceed 6,144 tokens and were dropped during formatting. The model has not been exposed to very long chain-of-thought traces; raising max_new_tokens at inference time provides budget for longer outputs but does not recover training coverage of ultra-long traces.
LoRA proportional capacity. r=64 represents 2.09% of the 8B model, lower than the proven 4B baseline of 3.11%, because base weight matrices grow roughly quadratically with hidden size while LoRA adapters grow only linearly. r=96 would more closely match the 4B proportional reference. Not a blocker, but noted for future runs.
No RLHF or DPO. Atem-8B has not undergone preference optimisation. Responses are accurate and structured but may not be as reliably aligned with user preferences in open-ended creative or instructional tasks compared to models that have undergone preference training.
Roadmap
- Atem-14B: Single CoT-preserving pass on Qwen3-14B, r=128 (3.10% proportional capacity), with GSM8K-format examples added to the corpus to restore the #### answer convention
Citation
```bibtex
@misc{atem_8b_2026,
  author       = {Asad, Zain},
  title        = {Atem-8B: An 8B CoT-Preserving Reasoning Model via Single-Pass SFT on Qwen3},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EphAsad/Atem-8B}},
}
```
License
Released under the Apache 2.0 License, consistent with the base model Qwen/Qwen3-8B.
Built independently by Zain Asad (EphAsad)
