Arabic Sentence Segmentation Shared Task 2026 Arabic sentence segmentation datasets for the Arabic Sentence Segmentation Shared Task 2026 Viewer • Updated May 22 • 658 • 258 • 1 Viewer • Updated May 22 • 658 • 134 • 2 Viewer • Updated May 22 • 658 • 103 • 1 Viewer • Updated May 22 • 658 • 667 • 1
MediX-R1 Open Ended Medical Reinforcement Learning MediX-R1 Medical AI Demo 🏥 1 Medical image analysis and chat with MediX-R1 Image-Text-to-Text • 31B • Updated Feb 27 • 260 • 5 Image-Text-to-Text • 9B • Updated Feb 27 • 805 • 8 Image-Text-to-Text • 2B • Updated Feb 27 • 42 • 3
OpenEarthAgent The OpenEarthAgent Collection brings together the OpenEarthAgent model and its accompanying large-scale tool-augmented geospatial reasoning data. Text Generation • 4B • Updated Feb 20 • 67 • 4 Viewer • Updated Mar 3 • 1.2k • 441 • 4
VideoMathQA VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos Viewer • Updated Jun 6, 2025 • 2.1k • 535 • 11
CASS Large-scale dataset and model suite for cross-architecture GPU code transpilation between CUDA and HIP at both source and assembly levels Viewer • Updated May 28, 2025 • 135k • 924 • 7 Viewer • Updated Apr 27, 2025 • 40 • 10 • 3
BiMediX2 BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities Image-Text-to-Text • 8B • Updated Jun 3, 2025 • 284 • 1 Image-Text-to-Text • 8B • Updated Dec 15, 2024 • 68 • 8 Image-Text-to-Text • 8B • Updated Dec 15, 2024 • 41 Image-Text-to-Text • 71B • Updated Dec 15, 2024 • 3 • 4
VideoGPT+ VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding Updated Jun 17, 2024 • 7 Updated Jun 17, 2024 • 1 Viewer • Updated Feb 9 • 4.35k • 771 • 3 Viewer • Updated Jun 17, 2024 • 139k • 70 • 7
Video-ChatGPT "Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos. Visual Question Answering • Updated Jun 8, 2023 • 44 Viewer • Updated Sep 29, 2023 • 100k • 131 • 49
PALO PALO: A Polyglot Large Multimodal Model for 5B People Text Generation • Updated Mar 25, 2024 • 9 Text Generation • Updated Mar 25, 2024 • 6 Text Generation • Updated Mar 25, 2024 • 183 Preview • Updated Mar 3, 2024 • 43 • 3
GeoChat GeoChat is the first grounded Large Vision Language Model, specifically tailored to Remote Sensing (RS) scenarios. Text Generation • Updated Mar 1, 2024 • 894 • 26 Preview • Updated Mar 5, 2024 • 127 • 5 Updated Mar 5, 2024 • 475 • 22
Video-CoM Video-CoM: Interactive Video Reasoning via Chain of Manipulations 8B • Updated Apr 12 • 4 • 1 8B • Updated Apr 12 • 10
DeepfakeJudge A framework for deepfake detection and reasoning supervision at scale. Viewer • Updated Feb 22 • 2.05k • 150 • 4 8B • Updated Feb 22 • 19 4B • Updated Feb 22 • 12 4B • Updated Feb 22 • 30
MedMO Medical Foundation Model Image-Text-to-Text • 9B • Updated Apr 8 • 391 • 11 Image-Text-to-Text • 4B • Updated Apr 8 • 24 • 15 Paper • 2602.06965 • Published Feb 6 • 7 Image-Text-to-Text • 9B • Updated Apr 8 • 574 • 15
FinMMEval Lab @CLEF'2026 Training datasets for FinMMEval Lab @CLEF'2026 Viewer • Updated Dec 19, 2025 • 249 • 28 Viewer • Updated Dec 18, 2025 • 167 • 69 Viewer • Updated Dec 17, 2025 • 183 • 49 Viewer • Updated Dec 19, 2025 • 274 • 30
NADI 2025 Sub-task 3 datasets Official training and dev datasets for NADI 2025 Subtask 3 (Diacritic Restoration) Shared Task Viewer • Updated Oct 31, 2025 • 46.2k • 377 • 32 Viewer • Updated Oct 1, 2025 • 9.71k • 663 • 24 Viewer • Updated Sep 4, 2025 • 5.25k • 47 • 1 Viewer • Updated May 28, 2025 • 65.8k • 53 • 2
GeoPixel Pixel Grounding Large Multimodal Model in Remote Sensing Updated Feb 20, 2025 • 7.35k • 5 Updated Feb 20, 2025 • 65 • 2 Paper • 2501.13925 • Published Jan 23, 2025 • 8 Viewer • Updated Feb 26, 2025 • 18.7k • 46 • 4
ArTST - Arabic Text Speech Transformer Open source project for Arabic Speech Recognition and Generation Automatic Speech Recognition • 0.2B • Updated Sep 10, 2025 • 12 • 2 Text-to-Speech • Updated Sep 10, 2025 • 1.68k • 32 ArtstTTS 🔥 5 ArtstASR 💭 3
GLaMM Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated. Text Generation • Updated Apr 27, 2024 • 280 • 7 Updated Apr 17, 2024 • 259 • 15 Preview • Updated Mar 21, 2024 • 119 • 12 Text Generation • Updated Dec 26, 2023 • 525 • 4
LLaVA++ (LLaMA-3 and Phi-3-Mini) Extending Visual Capabilities of LLaVA with LLaMA-3 and Phi-3 LLaVA++ (LLaMA-3-V) 👁 33 Start a chatbot server for text-based interactions LLaVA++ (Phi-3-V) 👁 26 Launch a chatbot with image and text understanding Text Generation • 4B • Updated Apr 27, 2024 • 183 • 21 Text Generation • 8B • Updated Apr 27, 2024 • 27 • 12
MobiLlama Collection of MobiLlama Language Models. Text Generation • Updated Feb 28, 2024 • 314 • 42 Text Generation • 1B • Updated Feb 28, 2024 • 16 • 19 Text Generation • Updated Feb 28, 2024 • 7 • 17 Text Generation • Updated Feb 28, 2024 • 10 • 6
Satmae++ Collection of ViT models trained using SatMAE++ approach. Updated Mar 26, 2024 • 1 Updated Mar 26, 2024 • 2 Updated Mar 26, 2024 • 1 Updated Mar 26, 2024