Summary
- Llama 3 may debut soon with enhancements like a larger context window to rival top AI competitors like Gemini.
- Considering Mixtral's MoE system, Llama 3 might adopt a similar approach to optimize computational efficiency.
- Multilingual support and multi-modality are expected upgrades for Llama 3 to cater to a broader user base more effectively.
The artificial intelligence LLM race has been slowly heating up since its jumpstart with ChatGPT. We've seen Google's Bard become Gemini, Microsoft's Bing Chat become Copilot, and we've seen Meta release its open source Llama 2 model that anyone can run on their own computer. Now, as the competition advances, Meta is expected to launch Llama 3 later this year — possibly as soon as July according to reports.
With that, there are a number of improvements we'd love to see from Llama 3 when it releases in order for it to keep up with the competition. These are some of our most wanted features and improvements.
1 Larger context window
Part of what makes Gemini so powerful
A context window is essentially how much text an LLM can "see" at any given time, and part of what makes Gemini so powerful is its support for a context window of up to 10 million tokens. While the amount of memory required for that is absurd, a larger context window would still be amazing. According to Llama 2's model card, it currently has a context window of just 4K tokens, even for the 70B parameter model. That's not a lot of context, and it puts Llama 2 on the back foot compared to what's currently out there.
As already mentioned, there are memory limitations here, but advancements in this area may let Meta at least increase the context window, even if it doesn't come anywhere near the 32K limit of GPT-4.
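To give a feel for why long context windows are so memory-hungry, here is a rough sketch of the key-value cache a model has to keep around during generation. The layer, head, and dimension values are the published Llama 2 70B configuration; 16-bit storage is an assumption, and the larger context lengths are purely hypothetical for this model shape (this says nothing about how Gemini is actually built).

```python
# Rough KV-cache size estimate for a decoder-only transformer.
# Shape values: published Llama 2 70B config. 16-bit storage is an assumption.

GIB = 1024 ** 3

def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return tokens * per_token

# 4K is Llama 2's real limit; the other lengths are hypothetical for this shape.
for context in (4_096, 32_768, 10_000_000):
    print(f"{context:>10,} tokens -> {kv_cache_bytes(context) / GIB:,.2f} GiB")
```

Under these assumptions, 4K tokens cost about 1.25 GiB of cache on top of the model weights, 32K tokens cost about 10 GiB, and 10 million tokens would need roughly 3 TiB.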
2 Mixture of Experts
How Mixtral manages to compete with GPT-3.5
Meta could learn from Mixtral 8x7B, a model made by Mistral AI that manages to compete with GPT-3.5 and can run locally on people's machines. The full Mixtral 8x7B model requires some incredibly beefy hardware to run, but so does Llama 2 70B.
Mixtral employs an MoE architecture to process incoming tokens, directing them to specialized neural networks within the system based on their relevance. The Mixtral 8x7B model features eight such experts. Notably, it's possible to structure these experts in a hierarchical manner, where an expert itself may be another MoE. Upon receiving a prompt, Mixtral 8x7B utilizes a routing network to determine the most suitable expert for each token. In this setup, each token is evaluated by two experts, and the final response is a blend of their outputs.
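The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Mixtral's actual code: the experts here just scale their input (real experts are feed-forward networks), and the `router_weights` values are made up for the example.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and blend their outputs."""
    # One router score per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep only the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Normalize the chosen scores so the blend weights sum to 1.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(token)
    for gate, index in zip(gates, chosen):
        expert_out = experts[index](token)  # only the chosen experts run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

# Toy setup: 8 "experts" that multiply the input by 1..8 (hypothetical stand-ins).
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
router_weights = [[0.1 * (i + 1), -0.02 * i] for i in range(8)]  # made-up values

output, chosen = moe_layer([1.0, 2.0], router_weights, experts)
print("experts used:", chosen)
print("blended output:", output)
```

The point to notice is that six of the eight experts never run for this token, which is where the inference-time savings come from.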
The MoE approach offers several benefits, particularly in terms of computational efficiency during the initial training phase, although it can be prone to overfitting during the fine-tuning stage. Overfitting occurs when a model becomes too familiar with its training data, leading to a tendency to reproduce it exactly in its responses. Another advantage of MoEs is their potential for faster inference times, as they activate only a subset of experts for each query. However, accommodating a model like Mixtral, with its 47 billion parameters, requires substantial RAM. The model's overall parameter count is 47 billion rather than 56 billion because it shares many parameters across all experts and does not simply multiply the seven billion parameters of each expert by eight.
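The 47 billion versus 56 billion figure can be checked with some back-of-the-envelope arithmetic. The sketch below uses the shape values from Mixtral 8x7B's published configuration and assumes that only the feed-forward blocks are duplicated per expert, while attention, embeddings, and normalization layers are shared.

```python
# Parameter count for a Mixtral-8x7B-shaped model.
# Shape values: published Mixtral 8x7B config. The breakdown is a simplified sketch.
HIDDEN, LAYERS, FFN, VOCAB = 4096, 32, 14336, 32000
HEADS, KV_HEADS, EXPERTS, ACTIVE = 32, 8, 8, 2
HEAD_DIM = HIDDEN // HEADS

def total_params(experts_counted):
    """Parameters when `experts_counted` experts per layer are included."""
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * HEAD_DIM  # q, o + k, v
    expert = 3 * HIDDEN * FFN        # gate, up and down projections of one expert
    router = HIDDEN * EXPERTS        # scores every expert, so always full size
    norms = 2 * HIDDEN
    per_layer = attention + experts_counted * expert + router + norms
    embeddings = 2 * VOCAB * HIDDEN  # input embedding + output head
    return LAYERS * per_layer + embeddings + HIDDEN  # + final norm

stored = total_params(EXPERTS)  # everything that has to sit in memory
active = total_params(ACTIVE)   # what a single token actually passes through

print(f"stored: {stored / 1e9:.1f}B parameters")
print(f"active per token: {active / 1e9:.1f}B parameters")
```

With these numbers the stored total comes to about 46.7 billion parameters, while each token only touches about 12.9 billion of them. That gap is why an MoE model needs the RAM of a large model but runs closer to the speed of a much smaller one.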
With this approach, Llama 3 could even use an MoE in smaller models, improving inference time and decreasing the RAM required. You'll still need a powerful PC, but nothing unachievable.
3 Multilingual support
Any language other than English is currently "out of scope"
According to the model card of Llama 2, use in any language other than English is out of scope. While most LLMs are trained on data that's predominantly in English, international users would still like to converse with an LLM in their own language, too. The likes of ChatGPT, Google Gemini, and even Mixtral support multiple languages, but Meta doesn't account for it with Llama 2 at all.
| Language | Percent | Language | Percent |
|---|---|---|---|
| English | 89.70% | Ukrainian | 0.07% |
| Unknown | 8.38% | Korean | 0.06% |
| German | 0.17% | Catalan | 0.04% |
| French | 0.16% | Serbian | 0.04% |
| Swedish | 0.15% | Indonesian | 0.03% |
| Simplified Chinese | 0.13% | Czech | 0.03% |
| Spanish | 0.13% | Finnish | 0.03% |
| Russian | 0.13% | Hungarian | 0.03% |
| Dutch | 0.12% | Norwegian | 0.03% |
| Italian | 0.11% | Romanian | 0.03% |
| Japanese | 0.10% | Bulgarian | 0.02% |
| Polish | 0.09% | Danish | 0.02% |
| Portuguese | 0.09% | Slovenian | 0.01% |
| Vietnamese | 0.08% | Croatian | 0.01% |
As a result, this is something we'd love to see change with Llama 3. The above table is taken from the Llama 2 research paper, where "Unknown" is partially made up of programming data. In other words, other languages pale in comparison to English in this dataset. Including more languages would widen the number of people who can use Llama, as right now, it doesn't make sense for those who don't speak English to use it.
4 Multi-modality
Support for media other than just text
Meta has already been testing language models that support multiple modalities, with Meta's research team publishing a paper detailing "AnyMAL" in September of last year. AnyMAL is described as "a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses," and it builds on Llama 2. Given that Meta has worked on this already, it seems likely that the advancements made here will be found in Llama 3.
As for why it matters, it means that you could give data to Llama 3 in the form of an image and have it respond to you with the image as context. This includes finding things in pictures, finding problems with a photo, or understanding video and audio data. Many platforms are beginning to support this if they don't already, and it's something that Meta's Llama 2 lacks right now.
5 A more middle-ground parameter option
Why not 30B?
Llama 2 is currently available in 7B, 13B, and 70B parameter options. There's obviously a massive jump in parameters between the last two, and it would be great for Meta to take cues from competitors and release a model that comes in at around the 30B parameter mark. It's still a massive model, but it means enthusiasts with modest systems can still take part in the fun with bigger models running locally.
On that note, a smaller 1B or 2B model to compete with the likes of Gemini Nano could be fun as well. Something like that would run on basically anything, and it means that even more people can try out running an LLM for the first time. With part of Meta's mission with Llama being to democratize AI models, there's no better way to do that than ensuring as many people as possible can try them out.
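To put those sizes in perspective, here is a quick estimate of how much memory the weights alone would take at each parameter count. The 1B and 30B sizes are the hypothetical models discussed above, the 4-bit figure assumes a quantized build, and the estimate ignores the context cache and runtime overhead, so real-world usage will be higher.

```python
# Weights-only memory estimate. Ignores KV cache, activations and runtime overhead.
GIB = 1024 ** 3

def weight_gib(params_billion, bits_per_weight):
    """GiB needed to hold the weights at a given precision."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB

# 7B, 13B and 70B are real Llama 2 sizes; 1B and 30B are hypothetical.
for size in (1, 7, 13, 30, 70):
    fp16 = weight_gib(size, 16)
    q4 = weight_gib(size, 4)  # assumes 4-bit quantization
    print(f"{size:>3}B: {fp16:6.1f} GiB at 16-bit, {q4:5.1f} GiB at 4-bit")
```

By this estimate, a 30B model quantized to 4 bits would need about 14 GiB for its weights, which is within reach of a high-end consumer GPU, while a 70B model at the same precision would need about 33 GiB.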
Llama 3 is likely just around the corner
We expect Llama 3 to arrive sometime in the summer, possibly in July according to The Information. Given that Llama's purpose was to be an open source model, we expect the same to be true of Llama 3, too. We're excited for when it does launch, as competition in the space is always a good thing. Let's hope Meta improves on it substantially! For now, you can try out the Llama 2 models using LM Studio.
