Model Details
This is a decoder-only model with approximately 0.7B parameters. The architecture largely follows the Qwen-3 design, with the following key hyperparameters:
- Hidden Size: 1024
- Attention Heads: 16
- Layers: 28
- Sequence Length: 4096
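As a rough sanity check on the "approximately 0.7B" figure, the parameter count can be estimated from the hyperparameters above. This is a minimal sketch, not the authors' accounting. The card does not state the key/value head count, head dimension, MLP intermediate size, exact vocabulary size, or whether embeddings are tied. The values below (8 KV heads, head dimension 128, intermediate size 3072, tied embeddings, vocabulary of 262,144) are assumptions borrowed from the smallest Qwen3 configuration and from rounding "262K" to a power of two.

```python
def estimate_params(hidden, layers, heads, kv_heads, head_dim,
                    intermediate, vocab, tied_embeddings=True):
    """Rough parameter count for a Qwen3-style decoder (no biases)."""
    embed = vocab * hidden
    # Attention: Q and O projections use all heads, K and V use the KV heads.
    attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden * intermediate
    # Two RMSNorm weights per layer plus per-head Q/K norms (assumed).
    norms = 2 * hidden + 2 * head_dim
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tied_embeddings:
        total += embed  # separate output projection
    return total


# Values from the card: hidden=1024, layers=28, heads=16.
# Everything else is an ASSUMPTION (see the paragraph above).
total = estimate_params(hidden=1024, layers=28, heads=16,
                        kv_heads=8, head_dim=128,
                        intermediate=3072, vocab=262_144)
print(f"estimated parameters: {total / 1e9:.2f}B")
```

Under these assumptions the estimate lands close to 0.7B, with the large vocabulary accounting for a substantial share of the total through the embedding matrix.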
Training Data
The training data is a diverse mixture of high-quality English, code, and math corpora. The total token budget for training is 100 billion tokens. The training mixture consists of the following datasets:
- English: A mixture of Nemotron-CC high-actual, medium-high-actual, medium-actual, DCLM, finepdfs and finepdfs-edu datasets, as well as arxiv, pes2o and Wiki subsets of olmo-mix.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ and MegaMath (text-code-block and web-pro) datasets.
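The token budget can be cross-checked against other figures in this card: the 4096-token sequence length and the final iteration number, 47684, given in the Intermediate Checkpoints section. The sketch below derives the implied number of sequences per training step. The resulting global batch size is an inference from these numbers and is not stated in the card.

```python
TOKEN_BUDGET = 100_000_000_000  # 100 billion tokens (from the card)
SEQ_LEN = 4096                  # sequence length (from the card)
FINAL_ITER = 47_684             # final iteration (from the card)


def implied_sequences_per_step(token_budget, seq_len, iterations):
    """Sequences per step needed to consume the budget in `iterations` steps."""
    return token_budget / (seq_len * iterations)


seqs_per_step = implied_sequences_per_step(TOKEN_BUDGET, SEQ_LEN, FINAL_ITER)

# ASSUMPTION: the true global batch size is the nearest integer.
assumed_batch = round(seqs_per_step)
tokens_seen = assumed_batch * SEQ_LEN * FINAL_ITER

print(f"implied sequences per step: {seqs_per_step:.2f}")
print(f"tokens seen with batch {assumed_batch}: {tokens_seen:,}")
```

If the assumed batch size is right, the final iteration count is what one would expect from stopping at the first step that reaches the 100 billion token budget.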
Tokenizer
The model uses a custom openeurollm tokenizer with a vocabulary size of 262K.
Training Information
The model was trained using the NVIDIA Megatron-LM framework on the LUMI supercomputer. Training used 16 AMD MI250X nodes and took approximately 1500 GPU hours in total.
Intermediate Checkpoints
We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 4000 training steps.
Checkpoint names follow the pattern iter_XXXXXXX, where XXXXXXX is the iteration number zero-padded to seven digits. For example, the checkpoint for 16000 iterations is named iter_0016000. The available checkpoints range from iter_0004000 up to iter_0047684. The final checkpoint, iter_0047684, is located in the main branch.
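The naming scheme can be reproduced programmatically, which is convenient when iterating over checkpoints. The following is a minimal sketch: it assumes that each intermediate branch is named exactly like its checkpoint, and that the checkpoints load through the Hugging Face transformers `revision` argument. Neither is confirmed by this card, and `repo_id` is a placeholder you must supply.

```python
def checkpoint_name(iteration: int) -> str:
    """Format an iteration number as iter_ plus seven zero-padded digits."""
    return f"iter_{iteration:07d}"


def released_checkpoints(interval: int = 4000, final: int = 47684) -> list[str]:
    """Names of all checkpoints: every `interval` steps, plus the final one."""
    names = [checkpoint_name(i) for i in range(interval, final, interval)]
    names.append(checkpoint_name(final))
    return names


def branch_for(name: str, final: int = 47684) -> str:
    """ASSUMPTION: branch name equals checkpoint name; the final one is on main."""
    return "main" if name == checkpoint_name(final) else name


def load_checkpoint(repo_id: str, name: str):
    """Load one checkpoint (assumes a transformers-compatible repository)."""
    from transformers import AutoModelForCausalLM  # imported lazily

    return AutoModelForCausalLM.from_pretrained(repo_id, revision=branch_for(name))


names = released_checkpoints()
print(names)
```

Calling `load_checkpoint(repo_id, "iter_0016000")` with the real repository identifier would then fetch the weights from the corresponding branch, provided the assumptions above hold.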
