License
This tokenizer is released under the Apache License 2.0.
Training data attribution
This tokenizer was trained on a mixture of publicly available English text datasets:
- HuggingFaceFW/fineweb-edu, CC-MAIN-2024-18
- HuggingFaceFW/fineweb, CC-MAIN-2024-18
- wikimedia/wikipedia, 20231101.en
- allenai/c4, en
- HuggingFaceFW/finepdfs-edu, eng_Latn
FineWeb, FineWeb-Edu, and FinePDFs-Edu are released under the Open Data Commons Attribution License (ODC-By) v1.0 and are subject to the Common Crawl Terms of Use.
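For anyone who wants to inspect the attributed sources, the sketch below records the five repository and configuration pairs listed above and shows one way they could be streamed with the Hugging Face `datasets` library. It is an illustration only: the `Source` type, the `stream_source` helper, and the `train` split are assumptions, and it does not reproduce the mixture proportions or preprocessing used to train this tokenizer, which are not stated here.

```python
from typing import Iterator, NamedTuple


class Source(NamedTuple):
    repo_id: str  # dataset repository on the Hugging Face Hub
    config: str   # configuration (subset) name from the attribution list


# The five (repository, configuration) pairs from the attribution list.
SOURCES = [
    Source("HuggingFaceFW/fineweb-edu", "CC-MAIN-2024-18"),
    Source("HuggingFaceFW/fineweb", "CC-MAIN-2024-18"),
    Source("wikimedia/wikipedia", "20231101.en"),
    Source("allenai/c4", "en"),
    Source("HuggingFaceFW/finepdfs-edu", "eng_Latn"),
]


def stream_source(source: Source, split: str = "train") -> Iterator[dict]:
    """Lazily iterate over one source without downloading it in full.

    Hypothetical helper: the "train" split is an assumption, not something
    the attribution list specifies.
    """
    # Imported here so SOURCES can be inspected without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(
        source.repo_id,
        name=source.config,
        split=split,
        streaming=True,  # fetch records on demand instead of the full dump
    )
    return iter(dataset)
```

Calling `next(stream_source(SOURCES[2]))` would fetch a single record from the Wikipedia source; it needs network access and the `datasets` package. Any reuse of the streamed text remains subject to each dataset's own license terms.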
