nanochat-d24-speedrun
A 1.4B-parameter GPT-2-style model trained from scratch with nanochat on 16× H100 GPUs.
Training
- Architecture: 24-layer Transformer, 1536 hidden dim, 12 heads, 32K vocab
- Training data: 5.8B tokens (ClimbMix), param:data ratio = 8
- Precision: FP8 (tensorwise scaling)
- Hardware: 16× NVIDIA H100
- Throughput: 1.58M tok/sec, 47.6% bf16 MFU
- Pretraining time: 62 minutes
- Total pipeline time: 1h 26m (pretrain + base eval + SFT + chat eval + report)
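
The figures above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the nanochat pipeline. It assumes that throughput was roughly constant for the whole pretraining run and that step 5568 (the base checkpoint step listed under Files) is the final step, with every step consuming the same number of tokens. The variable names are illustrative.

```python
# Figures copied from this model card.
THROUGHPUT_TOK_PER_SEC = 1.58e6   # reported throughput
PRETRAIN_MINUTES = 62             # reported pretraining time
REPORTED_TOKENS = 5.8e9           # reported training tokens
HIDDEN_DIM = 1536
NUM_HEADS = 12
FINAL_STEP = 5568                 # base checkpoint step (assumed to be the last step)

# Tokens implied by throughput x wall-clock time (assumes constant throughput).
tokens_from_throughput = THROUGHPUT_TOK_PER_SEC * PRETRAIN_MINUTES * 60

# Per-head dimension implied by the architecture line.
head_dim = HIDDEN_DIM // NUM_HEADS

# Average tokens per optimizer step (assumes a fixed batch size throughout).
tokens_per_step = REPORTED_TOKENS / FINAL_STEP

print(f"tokens from throughput: {tokens_from_throughput / 1e9:.2f}B")
print(f"reported tokens:        {REPORTED_TOKENS / 1e9:.2f}B")
print(f"head dim:               {head_dim}")
print(f"tokens per step:        {tokens_per_step / 1e6:.2f}M")
```

The throughput-based estimate lands within about 2% of the reported token count, so the throughput, duration, and dataset size are mutually consistent at the precision given.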
Results
| Metric | Base | SFT |
|---|---|---|
| Val BPB | 0.715 | - |
| CORE | 0.247 | - |
| ChatCORE | - | 0.360 |
| ARC-Easy | - | 61.3% |
| ARC-Challenge | - | 48.9% |
| MMLU | - | 36.4% |
| HumanEval | - | 11.0% |
| GSM8K | - | 9.8% |
| SpellingBee | - | 99.6% |
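
For readers unfamiliar with the Val BPB row: bits per byte normalizes the language-modeling loss by the byte length of the text rather than by token count, which makes it comparable across tokenizers. The sketch below shows the standard conversion from a summed cross-entropy in nats. It is an assumption that nanochat's evaluation follows exactly this definition, and the helper name and example numbers are hypothetical, not taken from this run.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) to bits per byte.

    total_loss_nats: sum of per-token losses over the evaluation set.
    total_bytes: number of UTF-8 bytes in the text those tokens encode.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_loss_nats / (math.log(2) * total_bytes)


# Hypothetical example: 1000 tokens at a mean loss of 2.4 nats per token,
# covering 4800 bytes of text (4.8 bytes per token on average).
example_bpb = bits_per_byte(total_loss_nats=2.4 * 1000, total_bytes=4800)
print(f"example BPB: {example_bpb:.3f}")
```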
Files
- `base_checkpoints/`: Pretrained base model (step 5568)
- `chatsft_checkpoints/`: SFT fine-tuned chat model (step 482)
- `tokenizer/`: Custom BPE tokenizer (32K vocab)
- `report.md`: Full training report
Usage
```python
# Requires the nanochat repo
from nanochat.gpt import GPT, GPTConfig
from nanochat.checkpoint_manager import load_checkpoint

model, metadata = load_checkpoint("path/to/base_checkpoints")
```
Acknowledgments
Compute resources provided by WestAI. Thanks to the WestAI team for their generous compute contributions.
