
URL: https://huggingface.co/Tralalabs/TralaBPE-32K-EnglishMix-v1

Tralalabs/TralaBPE-32K-EnglishMix-v1 · Hugging Face


License

This tokenizer is released under the Apache License 2.0.

Training data attribution

This tokenizer was trained on a mixture of publicly available English text datasets:

  • HuggingFaceFW/fineweb-edu, CC-MAIN-2024-18
  • HuggingFaceFW/fineweb, CC-MAIN-2024-18
  • wikimedia/wikipedia, 20231101.en
  • allenai/c4, en
  • HuggingFaceFW/finepdfs-edu, eng_Latn

FineWeb, FineWeb-Edu, and FinePDFs-Edu are released under the Open Data Commons Attribution License (ODC-By) v1.0, and their use is also subject to the Common Crawl Terms of Use.
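
For readers who want to inspect the same sources, the list above can be written down as data. This is only a sketch: it assumes the second item on each line is the subset/config name on the Hugging Face Hub, that each subset has a "train" split, and that streaming is acceptable. The card does not describe how the sources were sampled or weighted, so nothing here reproduces the actual training mix. The names `Source`, `SOURCES`, and `stream_sources` are hypothetical.

```python
from typing import Iterator, NamedTuple, Tuple


class Source(NamedTuple):
    repo_id: str  # dataset repository on the Hugging Face Hub
    config: str   # ASSUMPTION: the subset/config name taken from the list above


# The five sources from the attribution list, in the same order.
SOURCES = [
    Source("HuggingFaceFW/fineweb-edu", "CC-MAIN-2024-18"),
    Source("HuggingFaceFW/fineweb", "CC-MAIN-2024-18"),
    Source("wikimedia/wikipedia", "20231101.en"),
    Source("allenai/c4", "en"),
    Source("HuggingFaceFW/finepdfs-edu", "eng_Latn"),
]


def stream_sources(split: str = "train") -> Iterator[Tuple[Source, object]]:
    """Yield (source, streaming dataset) pairs.

    Requires the `datasets` package and network access. The import is done
    lazily so that SOURCES can be used without either.
    """
    from datasets import load_dataset

    for source in SOURCES:
        # streaming=True avoids downloading a full dump before iterating.
        dataset = load_dataset(
            source.repo_id, name=source.config, split=split, streaming=True
        )
        yield source, dataset
```

Iterating over `stream_sources()` opens each dataset in turn; how many documents to take from each one is left to the reader, since the card does not state the proportions.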

