HALvest
Open Scientific Papers Harvested from HAL (Unfiltered)
Dataset Summary
Overview
This is the unfiltered version of HALvest, comprising the full text of open papers found on Hyper Articles en Ligne (HAL), with extra fields for potential filtering. Our dump is mostly English and French, but it gathers papers written in 56 languages across 13 domains.
You can download the dataset using Hugging Face datasets:
```python
from datasets import load_dataset

# The second argument selects the dataset configuration, here "en".
ds = load_dataset("almanach/HALvest", "en")
```
Details
Building the dataset is a three-step process: data fetching from HAL, data merging, and data enrichment.
- We first query HAL's API to gather open research papers and parse the results, effectively sorting papers by language. Then, we download the PDFs of the fetched papers.
- Using GROBID, we convert each PDF to an `xml-tei` format in order to have structured data. We convert each `xml-tei` file to a `txt` format before concatenating it with the paper's.
- Finally, we compute some statistics about each document.
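The `xml-tei` to `txt` step can be illustrated with a minimal sketch. It is not the converter used to build HALvest: the function name `tei_to_text`, the choice to keep only the title and body paragraphs, and the toy document are all assumptions made for illustration. It relies only on the standard TEI namespace and Python's standard library.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace, as used in TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_text(tei_xml: str) -> str:
    """Return the title and body paragraphs of a TEI document as plain text.

    Hypothetical helper: which elements to keep is an assumption,
    not the rule applied when building the dataset.
    """
    root = ET.fromstring(tei_xml)
    parts = []

    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    if title is not None and title.text:
        parts.append(title.text.strip())

    for paragraph in root.iterfind(".//tei:text/tei:body//tei:p", TEI_NS):
        # itertext() also collects text nested in inline elements such as <ref>;
        # split/join collapses runs of whitespace.
        text = " ".join("".join(paragraph.itertext()).split())
        if text:
            parts.append(text)

    return "\n".join(parts)


# Toy document standing in for a real GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>A Toy Paper</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head>
      <p>First paragraph with a <ref>citation</ref>.</p>
      <p>Second   paragraph.</p>
    </div>
  </body></text>
</TEI>"""

print(tei_to_text(SAMPLE_TEI))
```

Section headings, figures, and references are dropped in this sketch; a real converter would need to decide how to handle each of them.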
Languages
Please note that the token counts in the raw version of the dataset are highly inflated because badly encoded PDFs yield gibberish text.
| ISO-639 | Language | # Documents | # mT5 Tokens |
|---|---|---|---|
| en | English | 464,679 | 8,158,933,235 |
| fr | French | 199,216 | 9,018,529,985 |
| es | Spanish | 2,975 | 69,221,667 |
| it | Italian | 1,172 | 48,747,986 |
| pt | Portuguese | 934 | 32,918,832 |
| de | German | 652 | 12,225,960 |
| ru | Russian | 245 | 5,763,532 |
| zh | Chinese | 160 | 2,861,585 |
| eu | Basque | 113 | 2,297,485 |
| ar | Arabic | 92 | 2,167,431 |
| ja | Japanese | 92 | 547,861 |
| el | Greek | 54 | 1,738,878 |
| pl | Polish | 43 | 987,878 |
| ro | Romanian | 39 | 1,298,901 |
| uk | Ukrainian | 34 | 837,793 |
| vi | Vietnamese | 29 | 436,660 |
| ca | Catalan | 28 | 975,078 |
| da | Danish | 27 | 961,955 |
| oc | Occitan | 26 | 285,334 |
| br | Breton | 24 | 998,088 |
| sr | Serbian | 24 | 336,878 |
| ko | Korean | 17 | 226,268 |
| fa | Persian | 17 | 213,903 |
| tr | Turkish | 17 | 149,718 |
| hu | Hungarian | 14 | 577,568 |
| eo | Esperanto | 14 | 105,286 |
| hy | Armenian | 10 | 127,988 |
| cs | Czech | 9 | 712,263 |
| bg | Bulgarian | 9 | 208,763 |
| sq | Albanian | 9 | 98,009 |
| id | Indonesian | 9 | 53,075 |
| he | Hebrew | 8 | 61,283 |
| hr | Croatian | 8 | 40,621 |
| et | Estonian | 7 | 20,405 |
| sv | Swedish | 6 | 270,642 |
| no | Norwegian | 6 | 62,767 |
| az | Azerbaijani | 5 | 52,762 |
| fi | Finnish | 4 | 60,507 |
| tet | Tetum | 4 | 18,485 |
| lt | Lithuanian | 3 | 16,572 |
| mr | Marathi | 3 | 16,386 |
| hi | Hindi | 3 | 3,490 |
| ie | Interlingue | 2 | 140,383 |
| ta | Tamil | 2 | 77,087 |
| sw | Swahili | 2 | 73,921 |
| tl | Tagalog | 2 | 35,962 |
| gl | Galician | 2 | 29,688 |
| mk | Macedonian | 2 | 14,654 |
| th | Thai | 1 | 70,909 |
| tk | Turkmen | 1 | 66,104 |
| bs | Bosnian | 1 | 63,018 |
| kk | Kazakh | 1 | 41,839 |
| sl | Slovenian | 1 | 22,844 |
| sk | Slovak | 1 | 12,997 |
| co | Corsican | 1 | 9,083 |
| gn | Guarani | 1 | 1,566 |
| bo | Tibetan | 1 | 579 |
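One way to read the table above is to look at the average number of mT5 tokens per document for each language. The sketch below copies three rows from the table; the helper name `tokens_per_document` is hypothetical. A high average is only a hint that some documents may be unusually long or garbled, not proof of it.

```python
# (documents, mT5 tokens) copied from the table above.
ROWS = {
    "en": (464_679, 8_158_933_235),
    "fr": (199_216, 9_018_529_985),
    "es": (2_975, 69_221_667),
}


def tokens_per_document(documents: int, tokens: int) -> float:
    """Average number of tokens per document."""
    return tokens / documents


averages = {lang: tokens_per_document(*row) for lang, row in ROWS.items()}

for lang, average in averages.items():
    print(f"{lang}: {average:,.0f} tokens per document")
```

With these figures, the French subset averages more than twice as many tokens per document as the English subset.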
Domains
Please note that the token counts in the raw version of the dataset are highly inflated because badly encoded PDFs yield gibberish text.
| Domain | Code | # Documents | # mT5 Tokens |
|---|---|---|---|
| Humanities and Social Sciences | shs | 156,566 | 5,614,423,171 |
| Computer Science | info | 148,316 | 2,573,673,455 |
| Life Sciences | sdv | 115,744 | 3,145,323,780 |
| Engineering Sciences | spi | 102,751 | 2,254,653,825 |
| Physics | phys | 65,991 | 1,503,190,749 |
| Mathematics | math | 62,921 | 1,638,500,361 |
| Chemical Science | chim | 40,012 | 899,507,319 |
| Environmental Science | sde | 31,575 | 579,076,669 |
| Sciences of the Universe | sdu | 23,557 | 682,356,264 |
| Cognitive science | scco | 11,772 | 227,487,096 |
| Statistics | stat | 10,579 | 184,678,350 |
| Quantitative Finance | qfin | 3,451 | 68,518,636 |
| Nonlinear Sciences | nlin | 1,972 | 30,694,088 |
You can browse all domains and sub-domains here: https://hal.science/browse/domain.
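Since badly encoded PDFs yield gibberish text, users of the unfiltered version may want a quick check before training on a document. The following is a minimal sketch of one possible heuristic, not the filtering applied by the dataset authors: the function names and the `0.7` threshold are assumptions and would need tuning on real samples.

```python
def letter_ratio(text: str) -> float:
    """Share of characters that are Unicode letters or whitespace."""
    if not text:
        return 0.0
    readable = sum(ch.isalpha() or ch.isspace() for ch in text)
    return readable / len(text)


def looks_garbled(text: str, min_ratio: float = 0.7) -> bool:
    """Flag text whose letter ratio falls below a (hypothetical) threshold."""
    return letter_ratio(text) < min_ratio


clean = "We study the convergence of gradient descent on convex functions."
garbled = "\ufffd\ufffd#(%)&*+ 0/1\x02\x03 $$ @@!! \ufffd 3)4(5"

print(looks_garbled(clean))
print(looks_garbled(garbled))
```

Because `str.isalpha` works on Unicode letters, the check is not limited to Latin scripts, but documents that are legitimately heavy in formulas or tables could be flagged by mistake.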
Considerations for Using the Data
The corpus is extracted from HAL's open archive, which distributes scientific publications following open access principles. The corpus is made up of both Creative Commons licensed and copyrighted documents (distribution authorized on HAL by the publisher). This must be considered prior to using this dataset for any purpose other than training deep learning models, data mining, etc. We do not own any of the text from which this data has been extracted.
Citation
@misc{kulumba2026halvestcontrastiveretrievallikeauthorshipattribution,
title={HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction},
author={Francis Kulumba and Wissam Antoun and Guillaume Vimont and Laurent Romary and Florian Cafiero},
year={2026},
eprint={2407.20595},
archivePrefix={arXiv},
primaryClass={cs.DL},
url={https://arxiv.org/abs/2407.20595},
}
Dataset Copyright
The licence terms for HALvest strictly follow those of HAL. Please refer to the licence below when using this dataset.
