nthakur

AI & ML interests

NLP, IR, QA

Recent Activity

new activity 22 days ago

BeIR/fiqa:VORTEXRAG: 7-Layer RAG — EM 74.8 on QA benchmarks, solves Semantic Drift [open source]

new activity 22 days ago

BeIR/hotpotqa:VORTEXRAG: 7-Layer RAG — EM 74.8 on QA benchmarks, solves Semantic Drift [open source]

new activity 22 days ago

BeIR/nq:VORTEXRAG: 7-Layer RAG — EM 74.8 on QA benchmarks, solves Semantic Drift [open source]

Posts 2

Last year, I curated and generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages with the mistralai/Mistral-7B-Instruct-v0.2 model.

I hope they help the community with pretraining and instruction tuning of multilingual LLMs! I added a small diagram that briefly shows which datasets are included and where they come from.

I'm happy to collaborate, whether on using these datasets for instruction fine-tuning or on extending the translations to newer English SFT/DPO datasets!

nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74

🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7

URL: https://huggingface.co/nthakur

nthakur (Nandan Thakur)

Collections 5

Papers 18

Models 37

Datasets 58