URL: https://huggingface.co/openeurollm/datamix-0.7b-en_baseline-100bt

openeurollm/datamix-0.7b-en_baseline-100bt · Hugging Face


Model Details

This is a decoder-only model with approximately 0.7B parameters. The architecture largely follows the Qwen-3 design, with the following key hyperparameters:

  • Hidden Size: 1024
  • Attention Heads: 16
  • Layers: 28
  • Sequence Length: 4096
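
As a rough sanity check on the "approximately 0.7B" figure, the parameter count can be estimated from the hyperparameters above. This is only a sketch: the head dimension, number of key/value heads, MLP width, tied embeddings, and the exact vocabulary size are not stated in this card and are assumed here from the public Qwen3-0.6B configuration and from the "262K" tokenizer figure.

```python
# Values stated in the model card.
HIDDEN_SIZE = 1024
NUM_HEADS = 16
NUM_LAYERS = 28

# ASSUMPTIONS (not stated in the card), borrowed from the public Qwen3-0.6B config.
HEAD_DIM = 128            # assumed per-head dimension
NUM_KV_HEADS = 8          # assumed grouped-query attention
INTERMEDIATE_SIZE = 3072  # assumed gated-MLP width
VOCAB_SIZE = 262_144      # the card says "262K"; the exact value is assumed
TIED_EMBEDDINGS = True    # assumed: input and output embeddings share weights


def estimate_parameters() -> int:
    """Estimate the total parameter count under the assumptions above."""
    q_proj = HIDDEN_SIZE * NUM_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN_SIZE * NUM_KV_HEADS * HEAD_DIM
    o_proj = NUM_HEADS * HEAD_DIM * HIDDEN_SIZE
    mlp = 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE   # gate, up, and down projections
    norms = 2 * HIDDEN_SIZE + 2 * HEAD_DIM      # two layer norms plus q/k norms
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    embeddings = VOCAB_SIZE * HIDDEN_SIZE * (1 if TIED_EMBEDDINGS else 2)
    final_norm = HIDDEN_SIZE
    return NUM_LAYERS * per_layer + final_norm + embeddings


print(f"{estimate_parameters() / 1e9:.2f}B parameters")
```

Under these assumptions the estimate comes to about 0.71B, of which roughly 0.27B is the embedding table. A different MLP width or untied embeddings would change the total noticeably.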

Training Data

The training data is a diverse mixture of high-quality English, code, and math corpora. The total token budget for training is 100 billion tokens. The training mixture comprises the following datasets:

  • English: A mixture of Nemotron-CC high-actual, medium-high-actual, medium-actual, DCLM, finepdfs and finepdfs-edu datasets, as well as arxiv, pes2o and Wiki subsets of olmo-mix.
  • Code: The StarCoder dataset.
  • Math: The FineMath 4+ and MegaMath (text-code-block and web-pro) datasets.
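
The token budget can be related to the sequence length and to the final iteration number listed under Intermediate Checkpoints. The sketch below is plain arithmetic on figures from this card; the global batch size of 512 sequences is a hypothesis that the numbers happen to fit, not something the card states.

```python
import math

# Values stated in the model card.
TOKEN_BUDGET = 100_000_000_000   # 100 billion tokens
SEQUENCE_LENGTH = 4096
FINAL_ITERATION = 47_684         # from the final checkpoint name iter_0047684

# Average tokens and sequences consumed per training step.
tokens_per_step = TOKEN_BUDGET / FINAL_ITERATION
sequences_per_step = tokens_per_step / SEQUENCE_LENGTH

# HYPOTHESIS (not stated in the card): a global batch of 512 sequences per step.
ASSUMED_GLOBAL_BATCH = 512
implied_steps = math.ceil(TOKEN_BUDGET / (ASSUMED_GLOBAL_BATCH * SEQUENCE_LENGTH))

print(f"tokens per step:    {tokens_per_step:,.0f}")
print(f"sequences per step: {sequences_per_step:.2f}")
print(f"steps implied by a batch of {ASSUMED_GLOBAL_BATCH}: {implied_steps}")
```

With a batch of 512 sequences of 4096 tokens, the implied step count matches the final iteration number, which suggests (but does not confirm) that configuration.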

Tokenizer

The model uses a custom openeurollm tokenizer with a vocabulary size of 262K.

Training Information

The model was trained using the NVIDIA Megatron-LM framework on the LUMI supercomputer. Training used 16 AMD MI250X nodes and took approximately 1500 GPU hours in total.
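
The stated compute and token budget give a rough average throughput. The sketch below only divides the two figures from this card; it makes no assumption about how a "GPU" is counted on an MI250X node, so the result is per whatever unit the 1500 GPU hours refer to.

```python
# Values stated in the model card.
TOKEN_BUDGET = 100_000_000_000  # 100 billion tokens
GPU_HOURS = 1500                # approximate, as reported

# Average training throughput per GPU (as counted in the reported GPU hours).
tokens_per_gpu_hour = TOKEN_BUDGET / GPU_HOURS
tokens_per_gpu_second = tokens_per_gpu_hour / 3600

print(f"{tokens_per_gpu_hour / 1e6:.1f}M tokens per GPU hour")
print(f"{tokens_per_gpu_second:,.0f} tokens per GPU second")
```

Because the GPU-hour figure is approximate, these numbers are an order-of-magnitude guide rather than a benchmark.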

Intermediate Checkpoints

We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 4000 training steps.

Checkpoints are named iter_ followed by the iteration number, zero-padded to seven digits. For example, the checkpoint for 16000 iterations is named iter_0016000. The available checkpoints range from iter_0004000 up to iter_0047684. The final checkpoint, iter_0047684, is located in the main branch.
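
A minimal sketch of how the checkpoint names could be generated and a given checkpoint loaded with the `transformers` library. It assumes that each branch is named exactly like its checkpoint (for example `iter_0016000`) and that the weights load through `AutoModelForCausalLM`; neither is confirmed by this card. The helper names are hypothetical.

```python
REPO_ID = "openeurollm/datamix-0.7b-en_baseline-100bt"
CHECKPOINT_INTERVAL = 4000   # stated in the card
FINAL_ITERATION = 47_684     # stated in the card; stored in the main branch


def checkpoint_name(iteration: int) -> str:
    """Format an iteration number as iter_ plus seven zero-padded digits."""
    return f"iter_{iteration:07d}"


def intermediate_checkpoints() -> list[str]:
    """List the intermediate checkpoint names, excluding the final one."""
    steps = range(CHECKPOINT_INTERVAL, FINAL_ITERATION, CHECKPOINT_INTERVAL)
    return [checkpoint_name(step) for step in steps]


def load_checkpoint(iteration: int):
    """Load one checkpoint; requires network access and `transformers`.

    ASSUMPTION: the branch name equals the checkpoint name.
    """
    from transformers import AutoModelForCausalLM  # imported lazily

    revision = "main" if iteration == FINAL_ITERATION else checkpoint_name(iteration)
    return AutoModelForCausalLM.from_pretrained(REPO_ID, revision=revision)


print(intermediate_checkpoints())
```

Calling `load_checkpoint(16000)` would then request the `iter_0016000` branch, while `load_checkpoint(47684)` falls back to `main`.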

Safetensors

  • Model size: 0.7B params
  • Tensor type: BF16