URL: https://thenewstack.io/5-useful-datasets-for-training-multimodal-ai-models/


5 Useful Datasets for Training Multimodal AI Models

Five leading multimodal datasets for developers to use, along with descriptions of what the datasets include and what they can be used for.
Jan 15th, 2025 9:30am by Kimberley Mok

With the ability to perform tasks across a range of combined modalities like text, image, audio, video and more, multimodal AI systems are fast becoming more versatile and powerful. However, building useful multimodal AI models requires good multimodal datasets, which are the necessary fuel for training these polyvalent systems — allowing them to expand their understanding of the world beyond one dimension or modality.

For instance, tasks like image captioning require a set of training data that combines both images and relevant, descriptive text, which can be used to train an AI model. After the training process, the AI model can then be deployed, using natural language processing and computer vision techniques to recognize the contents of a new image and to generate the associated text.

The same idea applies to a wide range of tasks, like video analysis, audio-visual speech recognition, cross-modal retrieval, medical diagnostics and more. This is because multimodal datasets empower AI models to learn more complex semantic relationships between objects and their context, thus boosting model performance and accuracy.

With so many multimodal datasets out in the wild, it can be difficult to know where to start. In this post, we’ll cover the most notable multimodal datasets that are currently available, and briefly describe what they include and what they can potentially be used for.

1. Flickr30K Entities

As an extension to the popular image-captioning Flickr30K dataset, this dataset contains more than 31,000 images sourced from Flickr, with each image associated with five crowd-sourced captions. The Flickr30K Entities dataset augments the original 158,000 captions with 244,000 coreference chains, on top of adding bounding box annotation for all entities (i.e. people or objects) referred to in the captions.

One important advantage of the Flickr30K Entities dataset is that it provides more in-depth annotations for image-text tasks, and helps models better describe the contents of an image — in addition to locating the entities within the image.
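To make the annotation structure concrete, here is a minimal sketch of pulling entity mentions out of one annotated caption. It assumes the bracketed markup `[/EN#<chain id>/<type> phrase]` for entity mentions in the caption files; the sample sentence, its IDs and the `parse_caption` helper are illustrative only, so check them against the official release before relying on them.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed markup for one entity mention: [/EN#<chain id>/<entity type(s)> <phrase>]
ENTITY_PATTERN = re.compile(r"\[/EN#(\d+)/(\S+) ([^\]]+)\]")


@dataclass
class EntityMention:
    chain_id: str     # coreference chain shared by mentions of the same entity
    entity_type: str  # e.g. "people" or "clothing"
    phrase: str       # the text span inside the caption


def parse_caption(annotated: str) -> tuple[str, list[EntityMention]]:
    """Return the plain caption and the entity mentions found in it."""
    mentions = [EntityMention(*m.groups()) for m in ENTITY_PATTERN.finditer(annotated)]
    # Replace each bracketed mention with just its phrase to recover the caption text.
    plain = ENTITY_PATTERN.sub(lambda m: m.group(3), annotated)
    return plain, mentions


# Made-up caption in the assumed format (not taken from the dataset).
sample = "[/EN#1/people A man] in [/EN#2/clothing a red jacket] rides [/EN#3/other a bike] ."
plain, mentions = parse_caption(sample)
print(plain)
for mention in mentions:
    print(mention.chain_id, mention.entity_type, mention.phrase)
```

In a layout like this, the chain ID is the key that would connect a caption phrase to its bounding boxes, which is what lets a model both describe and locate an entity.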

Applications: Real-time image captioning; image search.

License: Use of images must abide by Flickr’s Terms of Use; the dataset can be used by researchers and educators for non-commercial purposes.

Figure: Examples from the Flickr30K Entities dataset.

2. InternVid

Developed for video-related tasks like video captioning, video retrieval and video generation, InternVid is a relatively new video-text dataset. It includes 7 million videos, totaling almost 760,000 hours, that cover various types of objects and activities. The videos are broken down into an impressive 234 million clips, paired with richly descriptive captions that total over 4.1 billion words.

One of the biggest benefits of this dataset is its breadth: it covers 16 distinct types of scenes and over 6,000 distinct actions.
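The headline figures are easier to picture as per-clip averages. The short calculation below is a back-of-the-envelope sketch that uses only the rounded numbers quoted above, so the results are approximations rather than official statistics.

```python
# Rounded headline figures quoted in the text above.
videos = 7_000_000
total_hours = 760_000
clips = 234_000_000
caption_words = 4_100_000_000

# Derived averages (approximate, because the inputs are rounded).
avg_clip_seconds = total_hours * 3600 / clips
avg_words_per_caption = caption_words / clips
avg_clips_per_video = clips / videos
avg_video_minutes = total_hours * 60 / videos

print(f"Average clip length:    {avg_clip_seconds:.1f} s")
print(f"Average caption length: {avg_words_per_caption:.1f} words")
print(f"Average clips/video:    {avg_clips_per_video:.1f}")
print(f"Average video length:   {avg_video_minutes:.1f} min")
```

On these figures, a typical clip runs a little under 12 seconds and carries a caption of roughly 17 to 18 words, with about 33 clips cut from each video of around six and a half minutes.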

Applications: Video chatbots; personalized e-learning.

License: Apache License 2.0.

3. MuSe-CaR (Multimodal Sentiment Analysis in Car Reviews)

This intriguing text-image-audio dataset is designed for studying sentiment and emotional engagement in user-generated video reviews of products, in this case cars. The MuSe-CaR dataset consists of over 40 hours of extensively annotated, high-quality, user-generated video recordings, which provide insights into emotional nuances that might show up in faces, voices, gestures or body language.

The aim of the dataset is to advance multimodal sentiment analysis by providing in-depth annotations for understanding complex human emotions as they are expressed across these different channels.

Applications: Mental health chatbots or assistants; automated sentiment analysis systems for evaluating customer satisfaction with products.

License: Non-commercial under an End User Licence Agreement (EULA).

Figure: Examples from the MuSe-CaR dataset.

4. MovieQA

MovieQA is a text-video-question-answer multimodal dataset designed for evaluating story comprehension and performing video question-answering (VideoQA) tasks. It consists of 15,000 multiple-choice questions paired with subtitled film clips that have been taken from over 400 movies of high semantic diversity.

Answering the questions correctly requires the model to have a sufficient understanding of the visual and textual context contained within the video clip, such as sequential events, human interactions, intent, and the text used to describe them. This dataset is unique in that it contains multiple sources of information: video clips, plots, subtitles, scripts and DVS (Descriptive Video Service).
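Because the questions are multiple choice, a model can be scored with plain accuracy. The sketch below shows one way that evaluation might look; the `MultipleChoiceQuestion` record, its field names and the two sample questions are hypothetical and do not reflect the dataset's actual file format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MultipleChoiceQuestion:
    clip_id: str        # hypothetical identifier of the associated film clip
    question: str
    choices: list[str]  # candidate answers
    correct_index: int  # position of the right answer in `choices`


def accuracy(questions: list[MultipleChoiceQuestion], predictions: list[int]) -> float:
    """Fraction of questions for which the predicted choice index is correct."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        return 0.0
    correct = sum(q.correct_index == p for q, p in zip(questions, predictions))
    return correct / len(questions)


# Invented examples for illustration only.
questions = [
    MultipleChoiceQuestion("clip_001", "Why does the hero leave town?",
                           ["To find work", "To escape a rival", "To visit family"], 1),
    MultipleChoiceQuestion("clip_002", "Who opens the letter?",
                           ["The detective", "The neighbor", "The mayor"], 0),
]
predictions = [1, 2]  # indices a model might have picked
print(f"Accuracy: {accuracy(questions, predictions):.2f}")
```

A useful sanity check for this kind of benchmark is the random-guess baseline, which is one divided by the number of choices per question.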

Applications: Automated film analysis, summary and categorization.

License: Not specified.

Figure: Examples from the MovieQA dataset.

5. MINT-1T

MINT-1T is a massive, open source dataset from Salesforce AI Research that contains one trillion text tokens and 3.4 billion images, making it nearly ten times larger than the next largest open source dataset. This is an incredibly diverse, multimodal, interleaved dataset that integrates text and images in a way that imitates documents in the real world, such as web pages and scientific papers, including PDFs and arXiv papers.

The sheer scale of this dataset means that models can be more broadly versed in the existing online corpus of scientific and technological research. According to the research team, the goal was to create a dataset that features “free-form interleaved sequences of images and text,” suitable for training large multimodal AI models.
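To illustrate what an "interleaved" sequence means in practice, the sketch below models a document as an ordered list of text and image segments and flattens it into a single text stream with image placeholders. The segment classes, the `<image>` placeholder and the example URL are assumptions made for illustration; they are not MINT-1T's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    url: str  # hypothetical reference to the image file


Segment = Union[TextSegment, ImageSegment]

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where an image occurs


def flatten(document: list[Segment]) -> tuple[str, list[str]]:
    """Join segments in reading order, keeping image positions as placeholders."""
    parts: list[str] = []
    image_urls: list[str] = []
    for segment in document:
        if isinstance(segment, ImageSegment):
            parts.append(IMAGE_PLACEHOLDER)
            image_urls.append(segment.url)
        else:
            parts.append(segment.text)
    return " ".join(parts), image_urls


# Invented document: text, then an image, then more text.
document = [
    TextSegment("Figure 1 shows the model architecture."),
    ImageSegment("https://example.com/figure1.png"),
    TextSegment("The encoder output feeds the decoder."),
]
sequence, images = flatten(document)
print(sequence)
print(images)
```

The point of preserving order is that the text before and after each image gives the model context for it, unlike datasets of isolated image-caption pairs.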

Applications: Developing AI assistants that are more context-aware; as a massive open dataset, it also helps level the playing field for researchers and businesses with smaller budgets.

License: CC-BY-4.0.

Conclusion

New datasets are continuously emerging, so here are some other recent multimodal datasets that are also worth a mention:

  • BigDocs: This open and “permissively licensed” dataset is designed to train models for extracting information from documents, using enhanced OCR, layout and diagram analysis, and table detection.
  • Newsmediabias-plus (NMB+): Combining textual and visual data from news articles, this dataset from the Vector Institute is designed for the detection and analysis of media bias and disinformation.

These are but a handful of the vast number of multimodal datasets that are available, not to mention the multilingual datasets that are also coming to the fore. With so many options out there, you are likely to find a dataset that suits the AI model you want to train. For more information, check out our posts on tools for building multimodal AI applications, plus some open source and small-scale multimodal AI models.
