OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

OctoThinker-3B-Long-Base

The OctoThinker family starts from Llama-3 base models and applies carefully studied mid-training insights to produce base language models that are friendly to reinforcement learning.
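A minimal loading sketch, assuming the Hugging Face `transformers` and `torch` packages are installed and that the checkpoint loads through the standard `AutoModelForCausalLM` path. This card does not give an official usage snippet, so the helper names `load_model` and `complete` and the generation settings are illustrative assumptions; only the repository id and the BF16 dtype come from this page.

```python
MODEL_ID = "OctoThinker/OctoThinker-3B-Long-Base"  # repository id from this page


def load_model(model_id: str = MODEL_ID):
    """Load tokenizer and model in BF16 (the tensor type listed on this page)."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()
    return tokenizer, model


def complete(prompt: str, max_new_tokens: int = 256) -> str:
    """Greedy continuation of `prompt`; a base model continues text, it does not chat."""
    import torch

    tokenizer, model = load_model()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `complete("Question: What is 12 * 7?\nAnswer:")` would download the weights on first use and return the model's continuation of the prompt. The values chosen here for `max_new_tokens` and greedy decoding are placeholders, not recommended settings.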

Training Recipe

[Figure: Data Pipeline]

Evaluation Results

Note that we evaluate these base language models with few-shot prompting.
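To illustrate what few-shot prompting means for a base model, here is a minimal sketch of a prompt builder. The `Question:`/`Answer:` template, the demonstrations, and the function name `build_few_shot_prompt` are assumptions for illustration; the exact prompts and benchmarks used in the evaluation are not given on this page.

```python
from typing import List, Tuple


def build_few_shot_prompt(demos: List[Tuple[str, str]], question: str) -> str:
    """Concatenate solved demonstrations, then the unsolved target question."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    # The final block stops at "Answer:" so the base model continues with its answer.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical demonstrations, not taken from any benchmark.
demos = [
    ("What is 2 + 3?", "2 + 3 = 5. The answer is 5."),
    ("What is 4 * 6?", "4 * 6 = 24. The answer is 24."),
]
prompt = build_few_shot_prompt(demos, "What is 7 * 8?")
print(prompt)
```

Because a base model has no chat template, the demonstrations are what establish the answer format; the model's continuation after the final `Answer:` is then parsed for the result.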

[Figure: Data Pipeline]

More about OctoThinker

[Figure: Data Pipeline]

Citation

Check out our paper for more details. If you use our models or datasets, or find our work useful, please cite:

@article{wang2025octothinker,
 title={OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling},
 author={Wang, Zengzhi and Zhou, Fan and Li, Xuefeng and Liu, Pengfei},
 year={2025},
 journal={arXiv preprint arXiv:2506.20512},
 note={Preprint}
}


Safetensors
Model size: 3B params
Tensor type: BF16

Model tree for OctoThinker/OctoThinker-3B-Long-Base

Base model: meta-llama/Llama-3.2-3B
Finetuned (460): this model
Quantizations: 2 models


Collection including OctoThinker/OctoThinker-3B-Long-Base

What makes a base language model suitable for RL? Through controlled experiments, we identify key factors, then leverage them to scale up mid-training. (6 items, updated Jul 6, 2025)

Paper for OctoThinker/OctoThinker-3B-Long-Base

arXiv:2506.20512, published Jun 25, 2025

URL: https://huggingface.co/OctoThinker/OctoThinker-3B-Long-Base
