Atem-8B

Ancient logic. Modern intelligence.

An 8B reasoning model trained via a single CoT-preserving SFT pass directly on Qwen3-8B, distilling multi-domain reasoning capability from frontier teacher models while keeping the base model's native thinking capability intact.

Overview

Atem-8B is an 8B parameter reasoning model built via a single supervised fine-tuning pass on raw Qwen3-8B. Like Atem-4B, it uses a CoT-preserving single-pass design — building reasoning capability on top of the base model's intact native foundation rather than erasing and rebuilding thinking in separate stages. Atem-8B is trained on a larger corpus (~91K records before filtering vs ~63K for 4B) with higher per-source caps, producing a model with broader reasoning coverage across mathematics, coding, science, and general domains.

This is the most thoroughly evaluated model in the Atem series, benchmarked on eight tasks with nine reported metrics. GSM8K is scored twice: once with lm_eval's strict match and once with a custom flexible evaluator that diagnoses the formatting shift introduced by CoT training.

Model Details

Property	Value
Base model	Qwen/Qwen3-8B
Training method	Single-pass CoT-Preserving LoRA SFT
LoRA config	r=64, alpha=128, dropout=0.05
Target modules	q, k, v, o, gate, up, down projections
Parameters	~8.37B
Trainable (LoRA) params	174,587,904 (2.09% of base)
Training records	58,980 (after token-length filtering)
Think / No-think split	85% / 15%
Epochs	2 (ceiling; early stopping patience=3, never triggered)
Effective batch size	64 (batch 4 × grad accum 16)
Learning rate	1e-4, cosine schedule, 5% warmup
Max sequence length	6,144 tokens
Precision	bfloat16 (full 16-bit LoRA, not QLoRA)
Hardware	NVIDIA A100-SXM4 80GB
Runtime	7h40m
License	Apache 2.0

Design Notes

Single combined pass. The earlier Atem-0.6B pipeline erased Qwen3's native thinking mode in Stage 1 then re-imposed an externally-distilled style in Stage 2. This introduced measurable capability costs — the base model's exposed reasoning self-corrected on problems the no-think version got wrong, and ARC-Challenge regressed after Stage 2. Atem-8B skips the erasure entirely: one pass, intact native reasoning, external CoT styles layered on a foundation that still works.

Full 16-bit LoRA. At 8B with an 80GB A100, full 16-bit LoRA requires ~33GB — comfortably within budget. It is both marginally faster and marginally more accurate than QLoRA at equivalent effective batch sizes, since QLoRA pays compute overhead on quantize/dequantize operations at each step.

r=64, alpha=128. r=64 on Qwen3-8B represents 2.09% of the model, somewhat lower than the proven 4B baseline of 3.11%. At a fixed rank, LoRA parameters grow linearly with layer width while the dense weights they adapt grow roughly quadratically, so proportional capacity falls modestly as model size grows. r=96 would more closely match the 4B reference point. This was not a blocker for this run and is noted for future iterations.
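The trainable-parameter figure can be reproduced with a short calculation. This is a sketch that assumes the layer dimensions from the public Qwen3-8B config (hidden size 4096, MLP width 12288, 36 layers, 32 query heads, 8 key/value heads, head dimension 128); none of these are stated in this card. Each adapted linear layer of shape (in, out) contributes rank × (in + out) LoRA parameters.

```python
# Assumed Qwen3-8B dimensions (taken from the public config, not from this card).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 12288   # intermediate_size (MLP width)
LAYERS = 36            # num_hidden_layers
HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128

TOTAL_PARAMS = 8.37e9  # total parameter count quoted in Model Details


def lora_params(rank: int) -> int:
    """Count LoRA parameters when all seven projections are adapted."""
    q_out = HEADS * HEAD_DIM       # 4096
    kv_out = KV_HEADS * HEAD_DIM   # 1024 (grouped-query attention)
    shapes = [
        (HIDDEN, q_out),           # q_proj
        (HIDDEN, kv_out),          # k_proj
        (HIDDEN, kv_out),          # v_proj
        (q_out, HIDDEN),           # o_proj
        (HIDDEN, INTERMEDIATE),    # gate_proj
        (HIDDEN, INTERMEDIATE),    # up_proj
        (INTERMEDIATE, HIDDEN),    # down_proj
    ]
    # A LoRA pair for Linear(d_in, d_out) holds rank * (d_in + d_out) weights.
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * LAYERS


for r in (64, 96):
    n = lora_params(r)
    print(f"r={r}: {n:,} trainable ({100 * n / TOTAL_PARAMS:.2f}% of ~8.37B)")
# r=64: 174,587,904 trainable (2.09% of ~8.37B)
# r=96: 261,881,856 trainable (3.13% of ~8.37B)
```

Under these assumptions the r=64 count matches the figure in Model Details exactly, and r=96 lands close to the 3.11% reference quoted for the 4B model.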

Corpus scale. Atem-8B draws from the same eight source datasets as Atem-4B but with higher per-source caps — 91,017 total records before ratio adjustment vs ~63,563 for 4B, yielding 58,980 useable training examples after token-length filtering at 6,144.

Intended Use

Atem-8B is designed for general reasoning tasks where structured, step-by-step thinking adds value:

Multi-step mathematical reasoning
Code explanation, implementation, and debugging
Analytical reasoning and argument evaluation
Scientific explanation requiring technical depth
Commonsense reasoning and physical intuition
Logic, fallacy identification, and conditional reasoning
Concept explanation across diverse domains

Training Data

Atem-8B was trained on a corpus assembled from eight sources covering mathematics, coding, general reasoning, scientific reasoning, and medical reasoning. All sources include explicit chain-of-thought reasoning traces; 85% of training records were formatted with full think traces and 15% as direct answers.

Dataset	Records	Source / Teacher
mitroitskii/OpenR1-Math-220k-formatted	~10,938	DeepSeek-R1 — Mathematics (correctness-filtered)
Jackrong/Claude-opus-4.6-TraceInversion-9000x	7,000	Claude Opus 4.6 — Trace Inversion
Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Math)	8,000	Kimi K2.5 — Mathematical Reasoning
Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (General-Distillation)	8,000	Kimi K2.5 — General Reasoning
Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned (PHD-Science)	8,000	Kimi K2.5 — Scientific Reasoning
WithinUsAI/MiniMax_M2.7_Distilled_5k	5,000	MiniMax M2.7
FreedomIntelligence/medical-o1-reasoning-SFT	7,500	Medical reasoning (English config)
Modotte/CodeX-2M-Thinking	15,000	Mixed — Coding with CoT
trjxter/DeepSeek-V4-Pro-Reasoning-8000x	~8,014	DeepSeek-V4-Pro
nvidia/OpenCodeReasoning	15,000	Mixed — Competitive coding
Total (pre-filter pool)	91,017
Total (post-filter, trained on)	58,980

Non-English reasoning traces (primarily CJK) were filtered at the trace level using an ASCII-ratio threshold; records with CJK traces were retained as no-think records rather than discarded. The 34.3% filter rate reflects the same 6,144-token ceiling that filtered 32.7% of the Atem-4B corpus — the longest, most complex reasoning traces from competitive programming and advanced mathematics exceed this limit.
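The trace-level routing described above might look roughly like the following. This is a minimal sketch, not the author's pipeline: the 0.9 threshold, the function names, and the exact think-block layout (an empty block for direct answers, following the usual Qwen3 chat convention) are all assumptions.

```python
ASCII_THRESHOLD = 0.9  # hypothetical cut-off; the card does not state the value used


def ascii_ratio(text: str) -> float:
    """Fraction of characters in `text` that are ASCII (1.0 for empty text)."""
    if not text:
        return 1.0
    return sum(ch.isascii() for ch in text) / len(text)


def format_response(trace: str, answer: str) -> str:
    """Build the assistant turn, demoting non-English traces to no-think records."""
    keep_trace = ascii_ratio(trace) >= ASCII_THRESHOLD
    body = trace.strip() if keep_trace else ""
    # Assumed Qwen3-style layout: an empty think block marks a direct answer.
    return f"<think>\n{body}\n</think>\n\n{answer.strip()}"


english = format_response("First add 2 and 3 to get 5.", "5")
chinese = format_response("首先把二和三相加，得到五。", "5")
print(repr(english))  # trace kept inside the think block
print(repr(chinese))  # trace dropped, answer kept as a no-think record
```

Routing on the trace rather than the whole record keeps the answer available for training even when its reasoning is discarded, which is the behaviour the paragraph above describes.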

Training Configuration

# Key hyperparameters
lora_r = 64
lora_alpha = 128
lora_dropout = 0.05
max_seq_length = 6144
learning_rate = 1e-4
lr_scheduler = 'cosine'
warmup_ratio = 0.05
batch_size = 4
grad_accumulation = 16 # effective batch size: 64
num_epochs = 2 # ceiling — early stopping patience=3
eval_steps = 150
early_stopping_patience = 3
early_stopping_threshold = 0.001
nothink_ratio = 0.15
load_in_4bit = False # full 16-bit LoRA
dtype = "bfloat16" # i.e. torch.bfloat16

Training used Unsloth with train_on_responses_only masking. Early stopping was configured with patience=3 and threshold=0.001 — it did not trigger, as validation loss improved at every checkpoint throughout the full 2-epoch run.
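The step count and learning-rate schedule implied by these hyperparameters can be sketched as follows. This assumes that all 58,980 records form the training split, that a final partial batch still counts as an optimizer step, and that warmup is linear from zero; the card confirms none of these, and the function is a plain-Python stand-in rather than the scheduler the run used.

```python
import math

RECORDS, BATCH, GRAD_ACCUM, EPOCHS = 58_980, 4, 16, 2
PEAK_LR, WARMUP_RATIO = 1e-4, 0.05

# Optimizer steps: one per effective batch of 64 records.
steps_per_epoch = math.ceil(RECORDS / (BATCH * GRAD_ACCUM))  # 922
total_steps = steps_per_epoch * EPOCHS                        # 1844
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)          # 93


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions the total comes to 1,844 steps, which agrees with the final step reported in the loss curve.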

Loss Curve

Step	Train Loss	Val Loss
150	0.8661	0.8367
300	0.7971	0.8120
450	0.8006	0.7978
600	0.7992	0.7880
750	0.7791	0.7822
900	0.7879	0.7770
1050	0.7328	0.7758
1200	0.7357	0.7734
1350	0.7223	0.7711
1500	0.7461	0.7697
1650	0.7501	0.7691
1800	0.7691	0.7688
Final (1844)	0.7847 (avg)	0.7688

Validation loss fell at every scheduled evaluation from step 150 to step 1800 and held at 0.7688 in the final evaluation, so the run shows no sign of overfitting and the early stopping mechanism was never needed. Val loss sat slightly below train loss at several checkpoints (steps 150, 450, 600, 900, and 1800). This is a known artifact when dropout is active during training but not during evaluation, and the logged train loss is likely the noisier of the two figures. At every scheduled checkpoint from step 1050 to step 1650, train loss was below val loss, as expected in the second epoch.
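The early stopping behaviour can be replayed against the table. This is a simplified sketch of patience-based early stopping, not the trainer's actual code path: it assumes an evaluation only resets the counter when it beats the best loss so far by more than the threshold, and it uses only the twelve scheduled evaluations (the "Final" row is left out because step 1844 is not a multiple of eval_steps).

```python
# (step, validation loss) pairs copied from the loss table above.
VAL_LOSS = [
    (150, 0.8367), (300, 0.8120), (450, 0.7978), (600, 0.7880),
    (750, 0.7822), (900, 0.7770), (1050, 0.7758), (1200, 0.7734),
    (1350, 0.7711), (1500, 0.7697), (1650, 0.7691), (1800, 0.7688),
]
PATIENCE, THRESHOLD = 3, 0.001


def simulate_early_stopping(history, patience, threshold):
    """Return (stop_step or None, final count of non-improving evaluations)."""
    best = None
    bad_evals = 0
    for step, loss in history:
        if best is None or best - loss > threshold:
            bad_evals = 0          # improved by more than the threshold
        else:
            bad_evals += 1         # improvement too small to count
        if best is None or loss < best:
            best = loss            # track the best loss seen so far
        if bad_evals >= patience:
            return step, bad_evals
    return None, bad_evals


stop_step, bad_evals = simulate_early_stopping(VAL_LOSS, PATIENCE, THRESHOLD)
print(stop_step, bad_evals)  # None 2
```

Under these assumptions the last two improvements (0.0006 and 0.0003) fall below the 0.001 threshold, so the counter ends at 2, one short of the patience of 3. That is consistent with early stopping not triggering.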

Evaluation

Benchmark Results

Evaluated against base Qwen3-8B (Qwen/Qwen3-8B) using lm-evaluation-harness. Both models were loaded in 4-bit for evaluation. GSM8K flexible extraction uses a custom evaluator that accepts #### answer, \boxed{answer}, and prose formats — see note below.

Task	Base (Qwen3-8B)	Atem-8B	Delta
ARC-Challenge (0-shot, acc_norm)	56.5%	56.9%	+0.4pp —
GSM8K strict (5-shot, exact_match)	86.7%	83.3%	−3.4pp ⚠
GSM8K flexible (5-shot, custom)	86.7%	85.6%	−1.1pp —
HellaSwag (0-shot, acc_norm)	74.5%	76.2%	+1.7pp ✓
MMLU (0-shot, acc)	72.9%	72.9%	+0.0pp —
Winogrande (0-shot, acc)	67.2%	71.8%	+4.6pp ✓
PIQA (0-shot, acc)	76.2%	78.1%	+1.9pp ✓
OpenBookQA (0-shot, acc_norm)	41.4%	43.2%	+1.8pp ✓
BoolQ (0-shot, acc)	85.9%	84.3%	−1.6pp —

Winogrande (+4.6pp, 2.5σ) is the headline result — the largest gain in the evaluation set. Commonsense pronoun resolution is format-independent and tests exactly the kind of contextual reasoning that CoT training is designed to improve.

HellaSwag (+1.7pp, 2.8σ) uses normalised log-likelihood scoring over multiple-choice options — format-independent and not influenced by generation style. A genuine reasoning signal.

PIQA and OpenBookQA are both positive. Together with Winogrande and HellaSwag, all four commonsense and reasoning tasks improved. The direction is consistent and matches the expected effect of training on structured reasoning traces.

MMLU exactly tied at 72.9%. The CoT training neither added nor removed knowledge breadth — the correct expected behaviour for SFT on reasoning data.
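The σ values quoted in this section are consistent with an unpaired two-proportion z-score. The sketch below is one way to reproduce them, not necessarily the method the author used. The evaluation-set sizes (1,267 for Winogrande, 10,042 for HellaSwag, 3,270 for BoolQ) are the standard validation split sizes and are assumptions here, since the card does not list them.

```python
import math

# task: (base accuracy, Atem-8B accuracy, assumed number of evaluation examples)
RESULTS = {
    "Winogrande": (0.672, 0.718, 1267),
    "HellaSwag": (0.745, 0.762, 10042),
    "BoolQ": (0.859, 0.843, 3270),
}


def z_score(p_base: float, p_new: float, n: int) -> float:
    """Unpaired two-proportion z-score with both models scored on n items."""
    std_err = math.sqrt(p_base * (1 - p_base) / n + p_new * (1 - p_new) / n)
    return (p_new - p_base) / std_err


for task, (p_base, p_new, n) in RESULTS.items():
    print(f"{task:<11} {z_score(p_base, p_new, n):+.1f} sigma")
# Winogrande  +2.5 sigma
# HellaSwag   +2.8 sigma
# BoolQ       -1.8 sigma
```

Because both models answer the same items, an unpaired estimate is typically conservative. A paired test such as McNemar's would be the sharper tool, but it needs per-item predictions that are not published here.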

GSM8K — Formatting Shift Analysis

The strict-match GSM8K regression (−3.4pp) was investigated using a custom flexible extractor that accepts multiple answer formats: #### {number} (lm_eval standard), \boxed{number} (LaTeX, common in mathematics literature), prose declarations, and last-number fallback.

Extraction method	Atem-8B	Base
Strict-match `####` only	83.3%	86.7%
Flexible extraction	85.6%	~86.7%
Recovered by flexible	+2.3pp	—

68% of the observed regression (2.3 of 3.4pp) was a formatting artifact. The training corpus (notably OpenR1-Math, DeepSeek-V4-Pro, and Kimi-K2.5) uses \boxed{answer} (LaTeX notation, standard in academic and competition mathematics) rather than the #### answer format specific to the GSM8K dataset. The SFT pass has shifted Atem's preferred answer format from #### toward \boxed{}. lm_eval's strict-match regex only searches for ####, so correct answers in \boxed{} format count as wrong.

The gap remaining after accounting for formatting is approximately −1.1pp, not −3.4pp. The base model appears to retain a small advantage on this benchmark, plausibly because its instruction tuning included GSM8K-format data and it reproduces the #### convention naturally.
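A flexible extractor of the kind described above could be sketched as follows. This is an illustration only, not the custom evaluator used for the reported numbers: the regular expressions, the priority order, and the function name are assumptions.

```python
import re

NUMBER = r"-?\d[\d,]*(?:\.\d+)?"  # integers or decimals, optional thousands commas

PATTERNS = [
    rf"####\s*\$?({NUMBER})",                      # GSM8K / lm_eval convention
    rf"\\boxed\{{\s*\$?({NUMBER})\s*\}}",           # LaTeX \boxed{...}
    rf"(?i)answer\s*(?:is|:|=)\s*\$?({NUMBER})",    # prose declaration
]


def extract_answer(text: str):
    """Return the final numeric answer as a string, or None if nothing is found."""
    for pattern in PATTERNS:
        matches = re.findall(pattern, text)
        if matches:
            return matches[-1].replace(",", "")
    numbers = re.findall(NUMBER, text)               # last-number fallback
    return numbers[-1].replace(",", "") if numbers else None


print(extract_answer("She earns 18 dollars.\n#### 18"))        # 18
print(extract_answer("Thus the total is $\\boxed{1,250}$."))   # 1250
print(extract_answer("The answer is 42."))                     # 42
print(extract_answer("He starts with 10 and has 7 left"))      # 7
```

Trying the explicit formats before the fallback keeps intermediate numbers in the reasoning from being mistaken for the final answer whenever the model has marked one.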

BoolQ (−1.6pp, 1.8σ) is borderline — sitting between noise and statistical significance. BoolQ requires committing to a binary yes/no answer; it's possible the more exploratory CoT training style slightly disadvantaged decisive binary classification. Worth monitoring on future runs.

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "EphAsad/Atem-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Explain why switching doors in the Monty Hall problem gives a 2/3 probability of winning.",
    }
]

# return_dict=True returns input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
        repetition_penalty=1.1,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
print(response)

Unsloth (faster inference)

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="EphAsad/Atem-8B",
    max_seq_length=6144,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "A train travels from A to B at 60 km/h and returns at 90 km/h. What is the average speed?",
    }
]

# return_dict=True returns input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=2000,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))

Ollama

# Recommended — best speed/quality balance
ollama run hf.co/EphAsad/Atem-8B:Q4_K_M

# Higher quality
ollama run hf.co/EphAsad/Atem-8B:Q5_K_M

# Near-lossless
ollama run hf.co/EphAsad/Atem-8B:Q8_0

llama.cpp

llama-server -hf EphAsad/Atem-8B:Q4_K_M

Sampling Parameters

Use temperature=0.6, top_p=0.95, top_k=20 for thinking mode — Qwen3's published recommendation, used throughout this evaluation. Do not use greedy decoding with thinking mode enabled.

System Prompt

Atem-8B's identity is baked into the chat template and activates automatically when no system message is provided. For manual override:

You are Atem, a precise and analytical reasoning assistant. You approach
every problem methodically — identifying core concepts, reasoning step by
step, and arriving at well-supported conclusions. You show your thinking
clearly and are thorough, direct, and intellectually honest.

Available Files

File	Size	Description
`model-XXXX-of-00004.safetensors` (×4)	~16.4 GB total	Full bfloat16 merged weights
`Atem-8b.Q4_K_M.gguf`	5.03 GB	4-bit quantised — recommended
`Atem-8b.Q5_K_M.gguf`	5.85 GB	5-bit quantised
`Atem-8b.Q8_0.gguf`	8.71 GB	8-bit quantised — near-lossless

Known Limitations

GSM8K formatting shift. As documented in the evaluation section, the SFT corpus uses \boxed{} notation for mathematical answers rather than the #### format specific to the GSM8K benchmark. This creates a systematic measurement gap under strict-match evaluation (−3.4pp), of which 68% is a formatting artifact. Under flexible extraction the true gap is approximately −1.1pp. For production use, \boxed{answer} is standard in mathematical contexts.

6,144 token sequence ceiling. The training corpus's longest reasoning traces (competitive programming, advanced mathematics) exceed 6,144 tokens and were dropped during formatting. The model has not been exposed to very long chain-of-thought traces; raising max_new_tokens at inference time provides budget for longer outputs but does not recover training coverage of ultra-long traces.

LoRA proportional capacity. r=64 represents 2.09% of the 8B model — lower than the proven 4B baseline of 3.11% due to the quadratic scaling of total parameters relative to linear scaling of LoRA capacity. r=96 would more closely match the 4B proportional reference. Not a blocker, but noted for future runs.

No RLHF or DPO. Atem-8B has not undergone preference optimisation. Responses are accurate and structured but may not be as reliably aligned with user preferences in open-ended creative or instructional tasks compared to models that have undergone preference training.

Roadmap

Atem-14B: Single CoT-preserving pass on Qwen3-14B, r=128 (3.10% proportional capacity), with GSM8K-format examples added to the corpus to restore #### answer convention

Citation

@misc{atem_8b_2026,
 author = {Asad, Zain},
 title = {Atem-8B: An 8B CoT-Preserving Reasoning Model via
 Single-Pass SFT on Qwen3},
 year = {2026},
 publisher = {HuggingFace},
 howpublished = {\url{https://huggingface.co/EphAsad/Atem-8B}},
}

License

Released under the Apache 2.0 License, consistent with the base model Qwen/Qwen3-8B.

Built independently by Zain Asad — EphAsad
