MIRAGE-Bench [NAACL'25]
Dataset collection from the MIRAGE-Bench paper.
Paper • 2410.13716 • Published Oct 17, 2024
Viewer • Updated Jun 10, 2024 • 13k • 265 • 3
Viewer • Updated Mar 21, 2025 • 10k • 20 • 1
Viewer • Updated Mar 19 • 299k • 38 • 1
NoMIRACL Dataset [EMNLP'24]
A collection of multilingual relevance assessment datasets. We also have SFT fine-tuned models (Mistral-7B & Llama-3 8B).
Paper • 2312.11361 • Published Dec 18, 2023 • 1
Updated Nov 23, 2024 • 134 • 12
Viewer • Updated Nov 23, 2024 • 23.9k • 22 • 2
Viewer • Updated Jan 5, 2023 • 77.2M • 3.7k • 52
GPL BEIR Datasets [NAACL'22]
Generative Pseudo Labeling training datasets for all domains in BEIR.
Paper • 2112.07577 • Published Dec 14, 2021
Viewer • Updated Nov 25, 2023 • 230k • 10 • 1
Viewer • Updated Nov 24, 2023 • 221k • 6 • 1
Viewer • Updated Dec 1, 2023 • 239k • 13 • 1
Multilingual SFT & DPO Datasets
These SFT or DPO datasets were either translated from English using Mistral-7B-Instruct-v0.2 or taken from other sources.
Viewer • Updated Aug 9, 2024 • 24.4k • 13
Viewer • Updated Mar 21, 2024 • 1.54M • 12
Viewer • Updated Mar 21, 2024 • 73.6k • 4
Viewer • Updated Mar 21, 2024 • 231k • 36
🦢 SWIM-IR Dataset [NAACL'24]
29 million Synthetic Wikipedia-based Multilingual Retrieval Training Pairs.
Paper • 2311.05800 • Published Nov 10, 2023 • 4
Viewer • Updated Apr 28, 2024 • 15.4M • 447 • 9
Viewer • Updated Apr 28, 2024 • 3.17M • 123 • 10
Viewer • Updated Apr 28, 2024 • 93k • 230 • 2