Axion1-350K-A250K

DeepSeek-V3 architecture scaled down to 344k parameters (~160k active per token). Runs entirely on CPU.

Built from scratch as a proof-of-concept that the real DeepSeek-V3 architectural innovations (MLA + DeepSeekMoE + auxiliary-loss-free load balancing) work correctly even at extreme miniaturization.

Architecture

This is not a distilled or quantized version of DeepSeek. Every component was implemented from scratch in pure PyTorch, faithfully following the DeepSeek-V3 technical report (arXiv:2412.19437).

Component	DeepSeek-V3	Axion1
Attention	MLA (Multi-head Latent Attention)	✅ Identical MLA
FFN	DeepSeekMoE (256 routed experts)	✅ MoE (4 routed, top-2)
Load balancing	Auxiliary-loss-free (dynamic bias)	✅ Section 2.3.2
Position	RoPE	✅ RoPE
Normalization	RMSNorm	✅ RMSNorm
Activation	SwiGLU	✅ SwiGLU
Total params	671B	344k
Active params/token	37B	~160k
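The "Load balancing" row can be illustrated with a minimal pure-Python sketch of bias-based, auxiliary-loss-free routing as described in the DeepSeek-V3 report: a per-expert bias is added to the affinity scores only when selecting the top-k experts, the gating weights still come from the unbiased scores, and after each step the bias is nudged down for overloaded experts and up for underloaded ones. The function names, the step size `gamma`, and the example scores are illustrative assumptions, not code from model.py.

```python
def route_top_k(scores, bias, k=2):
    """Select k experts by biased score; weight them by the unbiased score.

    `scores` are assumed to be positive affinities (e.g. sigmoid outputs).
    """
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]
    total = sum(scores[i] for i in chosen)
    # The bias influences only *which* experts are chosen, not their weights.
    return {i: scores[i] / total for i in chosen}


def update_bias(bias, load, gamma=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


# Hypothetical example with 4 routed experts and top-2 routing.
scores = [0.9, 0.8, 0.1, 0.2]
print(route_top_k(scores, bias=[0.0, 0.0, 0.0, 0.0]))   # experts 0 and 1 win
print(route_top_k(scores, bias=[-1.0, 0.0, 0.0, 0.0]))  # expert 0 is pushed out
print(update_bias([0.0] * 4, load=[10, 2, 4, 4], gamma=0.1))
```

Because the correction acts through the selection bias rather than an extra loss term, the language-modeling gradient is left untouched; that is the motivation given in the report.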

Model Details

d_model : 64
n_layers : 4
n_heads : 4 (MLA)
d_head : 16
kv_lora_rank : 8 (MLA KV compression)
q_lora_rank : 16 (MLA Q compression)
n_shared_experts : 1
n_routed_experts : 4 (top-2 activated)
d_ff : 64 (per expert)
vocab_size : 1024 (BPE, trained on GSM8K)
max_seq_len : 512
total_params : 343,616
active_params/tok : ~160,000
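A few quantities follow directly from the values listed above. The sketch below is a rough accounting only: `AxionConfig` is a hypothetical name, and it assumes each SwiGLU expert consists of three bias-free projection matrices (gate, up, down), which may not match model.py exactly. It makes no attempt to reproduce the reported totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxionConfig:  # hypothetical container; values copied from the list above
    d_model: int = 64
    n_heads: int = 4
    d_head: int = 16
    kv_lora_rank: int = 8
    n_routed_experts: int = 4
    top_k: int = 2
    d_ff: int = 64
    vocab_size: int = 1024


def embedding_params(cfg):
    """Size of the token embedding table."""
    return cfg.vocab_size * cfg.d_model


def swiglu_expert_params(cfg):
    """Assumption: gate, up and down projections, no biases."""
    return 3 * cfg.d_model * cfg.d_ff


def mha_kv_values_per_token(cfg):
    """Values a standard multi-head KV cache stores per token per layer."""
    return 2 * cfg.n_heads * cfg.d_head


cfg = AxionConfig()
assert cfg.n_heads * cfg.d_head == cfg.d_model

print("embedding params:          ", embedding_params(cfg))
print("params per routed expert:  ", swiglu_expert_params(cfg))
print("routed params per layer:   ", cfg.n_routed_experts * swiglu_expert_params(cfg))
print("active routed per layer:   ", cfg.top_k * swiglu_expert_params(cfg))
print("MHA KV values/token/layer: ", mha_kv_values_per_token(cfg))
print("MLA latent width:          ", cfg.kv_lora_rank)
```

Note that MLA also caches a small decoupled RoPE key alongside the compressed latent. Its width is not listed above, so the latent width of 8 is a lower bound on the per-token cache size, to be compared with 128 values for a conventional KV cache at these dimensions.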

Training

Dataset: GSM8K — grade school math, converted to plain text with question / reasoning / answer format
Tokenizer: BPE trained from scratch, vocab size 1024
Hardware: AMD Ryzen 5 5600G — CPU only, 12 threads, 32 GB RAM
Speed: ~1,000–1,100 tokens/sec on CPU
Epochs: 20 | Final val loss: ~3.2 | Total time: ~115 minutes
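For readers unfamiliar with how a BPE vocabulary is trained from scratch, here is a minimal sketch of the classic merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, repeat. It is a generic illustration under simplifying assumptions (whitespace pre-tokenization, character-level start), not the project's `BPETokenizer` implementation; all function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges


print(train_bpe("aa aa ab", num_merges=5))
```

In a scheme like this, the final vocabulary size is roughly the number of base symbols plus the number of learned merges plus any special tokens, so a 1024-entry vocabulary implies on the order of several hundred merges.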

Training Curve

Epoch	Val Loss
1	5.49
2	4.59
3	4.30
5	3.88
7	3.66
9	3.54
20	~3.2
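To put these losses in perspective, they can be converted to perplexity and compared against a uniform guess over the vocabulary. The sketch assumes the reported validation loss is the mean per-token cross-entropy in nats (the PyTorch default), which the card does not state explicitly.

```python
import math


def perplexity(loss_nats):
    """Perplexity of a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


VOCAB_SIZE = 1024
# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(VOCAB_SIZE)

for epoch, loss in [(1, 5.49), (9, 3.54), (20, 3.2)]:
    print(f"epoch {epoch:>2}: loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
print(f"uniform baseline: loss {uniform_loss:.2f} -> perplexity {VOCAB_SIZE}")
```

Under that assumption, a final loss of about 3.2 corresponds to a perplexity of roughly 24.5, against a uniform baseline of 1024.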

Usage

from transformers import AutoModelForCausalLM, LogitsProcessor, LogitsProcessorList
from tokenizer import BPETokenizer  # the project's own BPE tokenizer class
import torch

model = AutoModelForCausalLM.from_pretrained(
    "AxionLab-official/Axion1-350k-A250k",
    trust_remote_code=True,
)
model.eval()

tok = BPETokenizer.load("model.vocab", "model.model")

# Blocks EOS and PAD for the first min_tokens generated tokens.
# The counter is stateful, so create a fresh instance for every generate() call.
class MinNewTokens(LogitsProcessor):
    def __init__(self, min_tokens: int, eos_id: int, pad_id: int):
        self.min_tokens = min_tokens
        self.bad = [eos_id, pad_id]
        self.generated = 0

    def __call__(self, input_ids, scores):
        if self.generated < self.min_tokens:
            for bid in self.bad:
                scores[:, bid] = float("-inf")
        self.generated += 1
        return scores

eos_id = tok.token2id["<eos>"]
pad_id = tok.token2id["<pad>"]

# Prompt literal kept as in the original example (Portuguese for
# "# Question: How much is 5 + 3? -- # Answer:").
prompt = "# Pergunta:\nQuanto é 5 + 3?\n--\n# Resposta:\n"
ids = tok.encode(prompt, add_bos=True, add_eos=False)
input_ids = torch.tensor([ids])

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=80,
        temperature=0.9,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        eos_token_id=eos_id,
        pad_token_id=pad_id,
        use_cache=False,
        logits_processor=LogitsProcessorList([
            MinNewTokens(min_tokens=5, eos_id=eos_id, pad_id=pad_id)
        ]),
    )

# Keep only the newly generated tokens (drop the prompt).
new_tokens = output[0][len(ids):].tolist()
# Strip a trailing EOS token if present.
if new_tokens and new_tokens[-1] == eos_id:
    new_tokens = new_tokens[:-1]

print("Answer:", tok.decode(new_tokens))
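The `temperature`, `top_k` and `top_p` arguments in the `generate()` call above jointly shape the distribution that the next token is sampled from. The following pure-Python sketch shows one common way the three filters compose: temperature scaling, then top-k truncation, then nucleus (top-p) truncation on the renormalized remainder. It is a simplified illustration, not the transformers implementation, and `filtered_distribution` is a hypothetical helper.

```python
import math


def filtered_distribution(logits, temperature=0.9, top_k=50, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k and top-p."""
    # 1. Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax

    # 2. Top-k: keep only the k most likely tokens, then renormalize.
    order = sorted(range(len(exps)), key=lambda i: exps[i], reverse=True)[:top_k]
    total_k = sum(exps[i] for i in order)
    probs = {i: exps[i] / total_k for i in order}

    # 3. Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total_p = sum(probs[i] for i in kept)
    return {i: probs[i] / total_p for i in kept}


# Hypothetical 4-token vocabulary.
dist = filtered_distribution([2.0, 1.0, 0.0, -10.0], temperature=1.0, top_k=3, top_p=0.9)
print(dist)  # tokens 0 and 1 survive; their probabilities sum to 1
```

With a 1024-token vocabulary, `top_k=50` discards most of the tail before the nucleus filter is applied, which helps limit the noise in samples from such a small model.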

Scaling Roadmap

Version	Params	Status
Axion1-v0.1 (this)	344k	✅ Released
Axion1-v0.2	~1.5M	🔜 Next
Axion1-v0.3	~6M	📅 Planned
Axion1-v0.4	~24M	📅 Planned
Axion1-v0.5	~100M	📅 Planned

Files

├── model.py # Full DeepSeek-V3 architecture (MLA + MoE)
├── modeling_axion.py # HuggingFace wrapper
├── config.json # Model configuration
├── model.safetensors # Trained weights
├── model.vocab # BPE vocabulary
└── model.model # BPE merge rules

Limitations

With only 344k parameters, the model has learned mathematical vocabulary and co-occurrence patterns from GSM8K but cannot reliably solve problems or maintain syntactic coherence. This is expected: the purpose of this release is to demonstrate that the DeepSeek-V3 architectural components can be implemented and trained end to end even at a very small scale, and to serve as a foundation for the scaling roadmap above.

Citation

@article{deepseekv3,
 title = {DeepSeek-V3 Technical Report},
 author = {DeepSeek-AI},
 year = {2024},
 url = {https://arxiv.org/abs/2412.19437}
}

License

MIT — free to use, modify, and build upon.

Made by AxionLab

URL: https://huggingface.co/AxionLab-Co/AxionMoE-350k-A250k
