URL: https://thenewstack.io/5-multimodal-ai-models-that-are-actually-open-source/

5 Multimodal AI Models That Are Actually Open Source

To get up to speed on the latest open source multimodal AI systems, here are five leading options — including their features and uses.
Dec 13th, 2024 6:00am by Kimberley Mok

Multimodal AI is attracting a lot of attention, thanks to the tantalizing promise of AI systems that are designed to be jacks of all trades — capable of processing a combination of text, image, audio, and video.

But while there is already a constellation of powerful, proprietary multimodal AI systems on the market, smaller multimodal AI models and open source alternatives are also rapidly gaining ground, as users continue to seek out options that are more accessible and adaptable, and prioritize transparency and collaboration. To get you up to speed on the latest open source multimodal AI systems, we’ll outline some of the more popular options — including their features and uses.

1. Aria

The recently introduced Aria AI model from Rhymes AI is touted as the world’s first open source, multimodal native mixture-of-experts (MoE) model that can process text, code, images, and video — all within one architecture.

This versatile model performs competitively against even larger models, yet is more efficient, because it selectively activates only the relevant subsets (or “mini-experts”) of its network for each task. Its architecture is designed to scale easily, since new “experts” could be added to address new tasks without straining the system. Aria also excels at long multimodal input understanding, meaning that it is adept at quickly and accurately parsing long documents and videos.
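To make the idea of selective expert activation concrete, here is a minimal Python sketch of generic top-k routing, the pattern commonly used in mixture-of-experts layers. It is not Aria's actual implementation: the toy experts, the router scores and the choice of k=2 are all assumptions made for illustration, and in a real model the router scores would come from a learned layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_forward(token, experts, router_scores, k=2):
    """Run only the selected experts and blend their outputs by weight."""
    weights = route_top_k(router_scores, k)
    output = [0.0] * len(token)
    for idx, weight in weights.items():
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, sorted(weights)

# Hypothetical toy experts: each is just a function on a feature vector.
experts = [
    lambda x: [v * 2.0 for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [v * v for v in x],
]

# Assumed router scores; a real router would compute these from the token.
out, used = moe_forward([1.0, 2.0], experts, router_scores=[0.1, 3.0, -1.0, 2.5], k=2)
print(used)  # indices of the experts that actually ran
```

Only two of the four toy experts run for this token, which is where the efficiency gain comes from: the cost per token depends on k rather than on the total number of experts.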

Figure: Aria’s architecture.

2. Leopard

Developed by an interdisciplinary team of researchers from the University of Notre Dame, Tencent AI Seattle Lab, and the University of Illinois Urbana-Champaign (UIUC), Leopard is an open source multimodal model that is specifically designed for text-rich image tasks.

Leopard is intended to tackle two of the biggest challenges in the multimodal AI space: the scarcity of high-quality multi-image datasets, and the trade-off between image resolution and sequence length. To achieve this, the model is trained on a curated dataset of over 1 million high-quality samples, both human-made and synthetic, that have been collected from real-world examples. The dataset is also openly available for use in other models.

“Leopard stands out with its novel adaptive high-resolution encoding module, which dynamically optimizes the allocation of visual sequence lengths based on the original aspect ratios and resolutions of the input images,” Wenhao Yu, a senior researcher at Tencent America and one of the creators of Leopard, explained to The New Stack. “Additionally, it uses pixel shuffling to losslessly compress long visual feature sequences into shorter ones. This design enables the model to handle multiple high-resolution images without sacrificing detail or clarity.”

These features make Leopard an excellent tool for multi-page document understanding (think slide decks, scientific and financial reports), data visualization, webpage comprehension, and the deployment of multimodal AI agents capable of handling tasks in visually complex environments.
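The pixel shuffling step that Yu describes can be pictured as a space-to-depth rearrangement: neighboring visual features are folded into one longer feature, so the sequence gets shorter without any values being discarded. The Python sketch below assumes a 2x2 merge factor and a tiny hypothetical feature grid; Leopard's real module, grid sizes and factor may differ.

```python
def pixel_shuffle_compress(grid, factor=2):
    """Merge each factor x factor block of feature vectors into one longer vector.

    grid: list of rows, each row a list of feature vectors (lists of floats).
    Returns a smaller grid whose vectors are factor * factor times longer.
    """
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid sides must be divisible by factor")
    compressed = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            merged = []
            for dy in range(factor):
                for dx in range(factor):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        compressed.append(row)
    return compressed

def pixel_shuffle_restore(compressed, factor=2):
    """Invert pixel_shuffle_compress, showing that no values were lost."""
    channels = len(compressed[0][0]) // (factor * factor)
    height, width = len(compressed) * factor, len(compressed[0]) * factor
    grid = [[None] * width for _ in range(height)]
    for r, row in enumerate(compressed):
        for c, merged in enumerate(row):
            for i in range(factor * factor):
                dy, dx = divmod(i, factor)
                start = i * channels
                grid[r * factor + dy][c * factor + dx] = merged[start:start + channels]
    return grid

# A hypothetical 4x4 grid of 3-dimensional visual features (16 tokens).
grid = [[[float(r), float(c), float(r * 4 + c)] for c in range(4)] for r in range(4)]
small = pixel_shuffle_compress(grid)
tokens_before = len(grid) * len(grid[0])
tokens_after = len(small) * len(small[0])
print(tokens_before, tokens_after, len(small[0][0]))  # prints: 16 4 12
```

The sequence shrinks from 16 tokens to 4 while each token grows from 3 to 12 values, and `pixel_shuffle_restore(small)` returns the original grid, which is the sense in which the compression is lossless.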

Figure: Leopard’s overall model pipeline.

3. CogVLM

CogVLM, short for Cognitive Visual Language Model, is an open source, state-of-the-art visual language foundation model that uses deep fusion techniques to attain high performance. It can be used for visual question answering (VQA) and image captioning.

CogVLM uses an attention-based fusion mechanism that fuses text and image embeddings, and freezes network layers to keep performance high. It also employs an EVA2-CLIP-E visual encoder and a multi-layer perceptron (MLP) adapter for co-mapping visual and text features onto the same space.
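As a rough illustration of what attention-based fusion means, the Python sketch below runs plain scaled dot-product attention over a single sequence that holds both image and text embeddings. This is the generic mechanism, not CogVLM's actual code: the two-dimensional embeddings and the query vector are hypothetical, and the sketch assumes the adapter has already mapped both modalities into the same space.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return mixed, weights

# Hypothetical 2-dimensional embeddings already mapped into a shared space.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.5, 0.5]]
joint = image_tokens + text_tokens  # one sequence holding both modalities

# A single query attends over image and text positions together.
fused, weights = attend([2.0, 0.0], joint, joint)
print([round(w, 3) for w in weights])
```

The resulting vector is a weighted blend of image and text positions, with the largest weight going to the image token that best matches the query.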

4. LLaVA

Large Language and Vision Assistant (LLaVA) is another open source, state-of-the-art option. It pairs a CLIP vision encoder with the Vicuna language model, which decodes the output text. The model has been trained on instruction-following data generated by ChatGPT and GPT-4. LLaVA uses a trainable projection matrix to map visual representations onto the language embedding space.

As a versatile visual assistant, LLaVA can be used to create more advanced chatbots that can handle text- and image-based queries.
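The projection step mentioned above is simple enough to sketch in a few lines of Python. The dimensions and matrix values below are made up for illustration (real models use much wider embeddings, and the matrix is learned during training), so treat this as a picture of the idea rather than LLaVA's implementation.

```python
def project(features, matrix):
    """Apply a projection matrix (rows = output dims) to each visual feature."""
    return [[sum(w * x for w, x in zip(row, feat)) for row in matrix] for feat in features]

# Hypothetical sizes: 3-dim visual features, 2-dim language embeddings.
visual_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
projection = [  # would be learned during training; fixed here for the example
    [0.5, 0.0, 0.25],
    [0.0, 1.0, -0.5],
]
text_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

visual_tokens = project(visual_features, projection)
# Projected image tokens now share the text embedding width, so they can be
# placed in the same input sequence as the text embeddings.
sequence = visual_tokens + text_embeddings
print(visual_tokens)  # prints: [[1.0, -1.0], [0.25, 0.5]]
```

Once image features have the same width as word embeddings, the language model can treat them as additional positions in its input.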

5. xGen-MM

Also known as BLIP-3, this state-of-the-art, open source suite of multimodal models from Salesforce features a line of variants, including a base pretrained model, an instruction-tuned model, and a safety-tuned model that is intended to reduce harmful outputs.

One crucial development is that the systems were trained using a massive, open source, trillion-token dataset of “interleaved” image and text data, which the researchers characterize as “the most natural form of multimodal data.” That means the models are skilled at handling inputs that mix text with multiple images, which could be useful in a wide range of settings, such as autonomous vehicles, medical image analysis and disease diagnosis in healthcare, interactive educational tools, or promotional marketing materials.
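For readers unfamiliar with the term, “interleaved” data simply means that images and text appear in one sequence in their original order, the way they do on a web page. Here is a small Python sketch of such a record; the class names, file names and the `<image>` placeholder token are hypothetical and are not taken from the xGen-MM dataset format.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextSegment:
    text: str

@dataclass
class ImageSegment:
    path: str  # hypothetical file reference

Segment = Union[TextSegment, ImageSegment]

def flatten(document: List[Segment], image_token: str = "<image>") -> List[str]:
    """Turn an interleaved document into one token list, keeping the original order."""
    tokens: List[str] = []
    for segment in document:
        if isinstance(segment, ImageSegment):
            tokens.append(image_token)  # placeholder later swapped for visual features
        else:
            tokens.extend(segment.text.split())
    return tokens

document = [
    TextSegment("Quarterly revenue rose as shown in"),
    ImageSegment("chart_q3.png"),
    TextSegment("while costs stayed flat in"),
    ImageSegment("chart_costs.png"),
]
tokens = flatten(document)
print(tokens.count("<image>"), len(tokens))  # prints: 2 13
```

Because each image placeholder stays next to the words that refer to it, a model trained on data like this can learn which text goes with which image.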

Conclusion

There is still an ongoing, vigorous debate surrounding the actual definition of open source AI, peppered with accusations of large tech companies “open washing” their AI models in order to gain wider credibility and cachet.

Regardless of how the open source AI debate unfolds, it’s clear that there is still a need for truly open source systems, and datasets, that emphasize transparency, collaboration and accessibility, and that actually live up to the open source ethos.

Kimberley Mok is a tech and design reporter who covers artificial intelligence, robotics, quantum computing, tech culture and science stories for The New Stack. Trained as an architect, she is also an illustrator and multidisciplinary designer who has been passionate...