While the world marveled at Google's recent Gemini update to Bard, Mixtral 8x7B was quietly released. It uses a Mixture of Experts (MoE) architecture to generate human-like responses, which differs from how the likes of ChatGPT and Google Bard work. Not only does it produce good responses, but Mistral AI, the company behind Mixtral, says that it's a 46.7 billion parameter model with the compute cost of one just a fraction of that size.

What makes Mixtral 8x7B even more exciting is that it matches or outperforms both ChatGPT's GPT-3.5 model and Meta's Llama 2 70B model. It's released under Apache 2.0, a permissive license, and is free for anyone to download and use. It handles contexts of up to 32k tokens, works in English, French, Italian, German, and Spanish, and can generate code.

Who is Mistral AI? What is Mixtral?

Mistral AI is a French artificial intelligence company founded earlier this year by researchers who previously worked at Meta and Google. It recently raised around 450 million euros, and its most recent model, Mixtral 8x7B, was released as a nondescript torrent magnet link shared on Twitter.

Mixtral uses an MoE to generate responses: each incoming token is routed to a subset of the experts in the system. Each expert is a neural network, and Mixtral 8x7B has eight of them. (As an aside, you can even have hierarchical MoEs, where an expert is itself another MoE.) When you submit a prompt to Mixtral 8x7B, a router network selects which experts will process each token most effectively. With Mixtral, two experts are chosen per token, and the output is a weighted combination of the two.
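To make the routing step more concrete, here is a minimal sketch in plain Python of how a top-2 router could pick experts and blend their outputs. It is an illustration under stated assumptions, not Mistral AI's implementation: the toy hidden size, the random weights, the single-matrix "experts", and the helper names (`make_matrix`, `matvec`, `route`, `moe_layer`) are all hypothetical. Only the eight experts and the two experts per token come from the description above.

```python
import math
import random

NUM_EXPERTS = 8  # from the article: Mixtral 8x7B has eight experts
TOP_K = 2        # from the article: two experts are chosen per token
DIM = 4          # toy hidden size (assumption, far smaller than a real model)


def make_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random toy weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def route(token, router_weights, top_k=TOP_K):
    """Return [(expert_index, gate_weight), ...] for the top_k experts."""
    logits = matvec(router_weights, token)  # one score per expert
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the selected scores only, so the gate weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(token, router_weights, experts):
    """Blend the outputs of the selected experts using their gate weights."""
    output = [0.0] * len(token)
    for index, gate in route(token, router_weights):
        # Toy expert: a single linear map (real experts are larger networks).
        expert_out = matvec(experts[index], token)
        output = [acc + gate * value for acc, value in zip(output, expert_out)]
    return output


rng = random.Random(0)
router_weights = make_matrix(NUM_EXPERTS, DIM, rng)
experts = [make_matrix(DIM, DIM, rng) for _ in range(NUM_EXPERTS)]
token = [0.5, -1.0, 0.25, 2.0]  # stand-in for one token's hidden state

selection = route(token, router_weights)
result = moe_layer(token, router_weights, experts)
print("chosen experts and gate weights:", selection)
print("combined output:", result)
```

The other six experts are never evaluated for this token, which is where the compute savings come from.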

MoEs have advantages and disadvantages, and one of the advantages shows up immediately in training. These models are compute-efficient to pretrain but can fall victim to overfitting during fine-tuning. Overfitting, in this case, means a model fits its training data too closely, to the point of memorizing it and repeating it verbatim in responses rather than generalizing.

Another advantage is that inference tends to be faster, since only some of the experts are used for each token. However, you'll still need enough RAM to hold a 47B parameter model. The total is 47B rather than 56B (8 x 7B) because many of the model's parameters are shared between all the experts, so only part of each 7B model is actually replicated eight times.
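The arithmetic behind the 47B figure can be sketched in a few lines of Python. The layer dimensions below are not stated in this article; they are assumptions taken from the model configuration published alongside the weights, and the breakdown (three projection matrices per expert, shared attention, a small router, and normalization weights per layer) is a simplified accounting rather than an official one.

```python
# Assumed dimensions (not from the article; taken from the published model config).
HIDDEN = 4096        # hidden size
FFN = 14336          # feed-forward (expert) inner size
LAYERS = 32          # transformer layers
EXPERTS = 8          # experts per layer (from the article)
ACTIVE_EXPERTS = 2   # experts used per token (from the article)
Q_HEADS = 32         # query heads
KV_HEADS = 8         # key/value heads
HEAD_DIM = 128       # dimension per head
VOCAB = 32000        # vocabulary size

# Parameters that exist once per expert: three projection matrices.
expert = 3 * HIDDEN * FFN

# Parameters shared by every expert in a layer.
attention = 2 * HIDDEN * Q_HEADS * HEAD_DIM + 2 * HIDDEN * KV_HEADS * HEAD_DIM
router = HIDDEN * EXPERTS
norms = 2 * HIDDEN

# Shared total: per-layer shared parts, input and output embeddings, final norm.
shared = LAYERS * (attention + router + norms) + 2 * VOCAB * HIDDEN + HIDDEN

total = shared + LAYERS * EXPERTS * expert          # must all fit in memory
active = shared + LAYERS * ACTIVE_EXPERTS * expert  # touched for a single token

print(f"total parameters:  {total / 1e9:.1f}B")
print(f"active per token:  {active / 1e9:.1f}B")
print(f"memory at 16 bits per weight: {total * 2 / 1e9:.0f} GB")
```

Under these assumptions the total comes to about 46.7B parameters, while only about 12.9B are used for any one token. That gap is why the model runs quickly yet still needs enough memory for all of its weights.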

How good is Mixtral 8x7B, and how can I use it?

Mixtral 8x7B matches or outperforms GPT-3.5 and Llama 2 70B in most benchmarks, making it the best open-weight model available. Mistral AI shared a number of benchmark results for the LLM so far, and they are quite impressive, to say the least.

Source: Mistral AI

It's completely user-tunable, and anybody can deploy it. If you have a powerful enough computer, you can run it locally in LM Studio. There are also guardrails you can enable to protect against dangerous or harmful content, though they aren't enabled by default. The benchmarks above include MMLU, or Massive Multitask Language Understanding, which tests world knowledge and problem-solving abilities across more than 50 subjects such as math, physics, history, law, medicine, and ethics. It is one of the most important benchmarks that most LLMs (including Gemini) target.

If you want to try out Mixtral 8x7B but don't want to or can't run it locally, it's available on Hugging Face. Hugging Face's implementation has guardrails enabled by default, so the experience will be similar to ChatGPT running GPT-3.5 at present, both in performance and in what you can ask it. It doesn't specialize in anything in particular; it's a general-purpose, catch-all LLM.

More large language models are around the corner

There are always new advancements to be made in technology, and 2023 has been the year of generative AI. We expect more models to be released over the next year and beyond, each improving on the last. With rumors swirling around OpenAI and the apparent advent of Artificial General Intelligence, things are likely to get even crazier in the near future.