Summary
- Multimodal AI models will usher in a new era of highly intuitive and dynamic AI applications.
- Smartphones from Google and Samsung are already using multimodal AI tech.
- Open-source multimodal AI models will lower the entry barrier and drive widespread adoption in 2024.
- Next-gen video game NPCs and VR experiences powered by multimodal AI will be exciting and far-reaching.
The year 2023 belonged to large language models (LLMs) like ChatGPT, Google Bard, and many more. It was a frenetic, unprecedented year of AI developments led by fresh new tech that people hadn't experienced before. Whether you believe that ChatGPT, the OG AI chatbot, reigns supreme or you're torn between ChatGPT, Microsoft Copilot, and Google Bard, you're probably not prepared for what's about to follow this year.
As impressive as ChatGPT, DALL-E, MusicLM, and countless other AI models are, they're still unimodal, accepting a single mode of input, usually text. But multimodal AI will be a game-changer in this nascent space. With the ability to handle multiple kinds of input, such as text, voice, video, and thermal data, multimodal AI models like GPT-4V, Google Gemini, and Meta ImageBind are set to usher in a groundbreaking era of intuitive and dynamic AI applications.
Multimodal AI is already here
It's probably in your phone
You might not know it, but multimodal AI has been in development for quite some time now, with heavyweights like Google, Meta, and OpenAI among the first movers. Even your phone probably has some form of multimodal AI if you're using a Google Pixel 8 or Samsung Galaxy S24 series device. While Google uses its new Gemini model in the Pixel phones, Samsung uses Gemini and some proprietary tech in what it calls Galaxy AI.
Currently, these phones are limited to a few impressive use cases like live translation and interpretation during calls, chat assist features, and generative editing in photos. But Google is planning to bring Gemini to Google Search, Google Chrome, Google Ads, and Duet AI (for collaborative workspaces). Other prominent multimodal AI models like GPT-4V are already being used by ChatGPT Plus customers.
Outside of phones, we'll start seeing many other products leveraging multimodal AI, like LG Electronics' smart home AI assistant, which can be your home manager and smart companion thanks to its ability to analyze multiple inputs and engage in complex conversations. Samsung has also showcased its own robot assistant, Ballie, upgraded with all-new AI capabilities that let it learn from users and offer personalized services.
Open models will accelerate multimodal adoption
Every company will jump on the bandwagon
Many existing multimodal AI models from Google, OpenAI, and other players are proprietary. But 2024 will see the rise of more and more open models that anyone can access. Meta already offers its Llama 2 language model openly, and Mistral AI makes its Mixtral-8x7B freely available to everyone. Before long, open models like these will lower the entry barrier for enterprises to leverage the power of multimodal AI.
Whether it's workspace productivity, intelligent decision-making, or other boldly intuitive features in new applications that come out this year, multimodal AI has the unique ability to offer much more than unimodal AI models. The power to contextualize text inputs in light of voice tone, facial expressions, body movements, and past interactions will be extraordinary. It will catapult AI models from note-takers and productivity tools to intelligent assistants that can function as valued team members.
And open-source models accessible to one and all will be the key to realizing the widespread adoption of multimodal AI in 2024.
Multimodal will unlock next-gen virtual experiences
Gaming NPCs, customer service bots, and more
Personally, I'm most excited to see how multimodal AI models will transform video games and other virtual experiences this year. Nvidia has already showcased its Avatar Cloud Engine (ACE), a set of technologies that developers can use to power non-player characters (NPCs) with top-notch generative AI models. It won't be long before the next big AAA game has you interacting with any NPC not just through text but with your voice as well.
Inworld AI is another character engine that allows developers to craft NPCs that can interact using natural language, voice, animations, and emotions. I'm intrigued to see how these technologies will come into play in VR games and other mixed-reality scenarios. And not just for gaming: companies can make use of this game-changing tech to create incredibly lifelike customer chatbots that can react to your every word, movement, and emotion.
Multimodal AI is set to flood your feeds
Despite the massive potential of multimodal AI, there will inevitably be companies simply trying to cash in on the hype. As a result, the term multimodal will be inescapable in all your social feeds and online touchpoints. Whether end users or enterprises, no one is yet able to grasp how this AI revolution will play out. All we can do is stay informed and stay away from the frivolous implementations of this new technology.
The real impact of multimodal AI will be driven by developers who truly understand customer needs and behaviors, and whose applications leverage this technology to craft pinpoint solutions to address them.
