Agentic AI / Generative AI

Power Text-Generation Applications with Mistral NeMo 12B Running on a Single GPU

👁 Decorative image of a model with multiple apps.

Jul 26, 2024

By Anjali Shah and Chintan Patel

Discuss (3)

AI-Generated Summary

Dislike

Mistral NeMo is a 12B-parameter language model developed by NVIDIA and Mistral AI, achieving leading performance across various benchmarks with a context length of 128K.
The model is trained using NVIDIA Megatron-LM and optimized with TensorRT-LLM engines for higher inference performance, allowing it to fit on a single NVIDIA GPU such as the A100 or H100.
Mistral NeMo is packaged as an NVIDIA NIM inference microservice, enabling streamlined deployment and high-throughput AI inference, and is available with an open Apache 2.0 permissive license for customization and commercial use.

AI-generated content may summarize information incompletely. Verify important information. Learn more

NVIDIA collaborated with Mistral to co-build the next-generation language model that achieves leading performance across benchmarks in its class. With a growing number of language models purpose-built for select tasks, NVIDIA Research and Mistral AI combined forces to offer a versatile, open language model that’s performant and runs on a single GPU, such as NVIDIA A100 or H100 GPUs.

This post explores the benefits of Mistral NeMo, training and inference optimizations, applicability for various use cases, and the ease of deployment with NVIDIA NIM.

Mistral NeMo 12B

Mistral NeMo is a 12B-parameter, text decoder-only, dense transformer model trained on 131K multilingual vocabulary size. It delivers leading accuracy on popular benchmarks across common sense reasoning, world knowledge, coding, math, and multilingual and multi-turn chat tasks.

Model	Context Window	HellaSwag (0-shot)	Winograd (0-shot)	NaturalQ (5-shot)	TriviaQA (5-shot)	MMLU (5-shot)	OpenBookQA (0-shot)	CommonSenseQA (0-shot)	TruthfulQA (0-shot)	MBPP (pass@1 3-shots)
Mistral NeMo 12B	128k	83.5%	76.8%	31.2%	73.8%	68.0%	60.6%	70.4%	50.3%	61.8%
Gemma 2 9B	8k	80.1%	74.0%	29.8%	71.3%	71.5%	50.8%	60.8%	46.6%	56.0%
Llama 3 8B	8k	80.6%	73.5%	28.2%	61.0%	62.3%	56.4%	66.7%	43.0%	57.2%

Table 1. Mistral NeMo model performance across popular benchmarks

Supporting 128K context length, the model has enhanced understanding and the capability to process extensive and complex information, leading to more coherent, accurate, and contextually relevant outputs.

Mistral NeMo is trained on Mistral’s proprietary dataset that includes a large proportion of multilingual and code data, which enables better feature learning, reduced bias, and an improved ability to handle diverse and complex scenarios.

Optimized training

The model is trained using NVIDIA Megatron-LM, an open-source, PyTorch-based library with a collection of GPU-optimized techniques, cutting-edge system-level innovations, and modular APIs for training models at large scale.

Megatron-LM, part of NVIDIA NeMo, offers the core building blocks for the distributed training of text: multimodal and mixture of experts (MoE) models natively built into the library:

Attention mechanisms
Transformer blocks and layers
Normalization layers
Embedding techniques
Activation recomputation
Distributed checkpointing

Optimized inference

Mistral NeMo is optimized with TensorRT-LLM engines for higher inference performance. TensorRT-LLM compiles the models into TensorRT engines, from model layers into optimized CUDA kernels using pattern matching and fusion, to maximize inference performance. These engines are executed by the TensorRT-LLM runtime, which includes several optimizations:

In-flight batching
KV caching
Quantization to support lower precision workloads

Inference in FP8 precision is also supported by using NVIDIA TensorRT-Model-Optimizer. Using post-training quantization (PTQ) on NVIDIA Hopper and NVIDIA Ada GPUs, you can optimize and reduce model complexity by creating smaller models with lower memory footprint, without sacrificing accuracy.

This model fits on a single GPU, improving compute efficiency, lowering compute cost, and enhancing security and privacy. Run the model on a range of commercial use cases including summarizing long documents, classification, multi-turn conversations, language translation, and code generation.

NVIDIA NIM

The Mistral NeMo model is packaged as an NVIDIA NIM inference microservice to streamline and accelerate the deployment of generative AI models across NVIDIA accelerated infrastructure anywhere, including cloud, data center, and workstations.

NIM uses inference optimization engines, industry-standard APIs, and prebuilt containers to provide high-throughput AI inference that scales with demand. It supports a wide range of generative AI models across domains including speech, image, video, healthcare, and more.

NVIDIA NIM delivers best-in-class throughput, enabling enterprises to generate tokens up to 5x faster. For generative AI applications, token processing is the key performance metric, and increased token throughput directly translates to higher revenue for enterprises.

Mistral NeMo has an open Apache 2.0 permissive license that provides enterprises the flexibility to customize the model and integrate it into their commercial applications.

Here’s how to test the Mistral NeMo model for programming tasks.

Coding copilot

Coding is a popular use case supported by the model to enhance developer productivity with AI-powered code suggestions. A coding copilot offers not only inline code suggestions but also generates code, documentation, comments, unit tests, and error fixes, all of which can be easily accessed through industry-standard APIs.

Here’s the syntactically and functionally correct code generated by Mistral NeMo with an English language prompt. You can paste the following in your development environment. To generate other code samples, see the NVIDIA API Catalog.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the neural network
class Net(nn.Module):
 def __init__(self):
 super(Net, self).__init__()
 # Convolutional layers
 self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
 self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
 # Fully connected layer
 self.fc = nn.Linear(32 * 8 * 8, 10) # Assuming input image size is 32x32

 def forward(self, x):
 # Add convolutional layers with ReLU activation and max pooling
 x = F.relu(self.conv1(x))
 x = F.max_pool2d(x, 2)
 x = F.relu(self.conv2(x))
 x = F.max_pool2d(x, 2)

 # Flatten the tensor before passing it to the fully connected layer
 x = x.view(-1, 32 * 8 * 8)

 # Add fully connected layer with log softmax for multi-class classification
 x = self.fc(x)
 output = F.log_softmax(x, dim=1)
 return output

# Create an instance of the neural network
net = Net()

# Print the model architecture
print(net)

# Test the forward pass with a dummy input
dummy_input = torch.randn(1, 3, 32, 32) # Batch size of 1, 3 channels, 32x32 image size
output = net(dummy_input)
print("Test output:\n", output)

You may also want to fine-tune the model with your domain data to generate higher-accuracy responses. NVIDIA offers tools to align the model for your use case.

Model customization

The instruction-tuned variant of the Mistral NeMo model offers strong performance amongst similarly sized LLMs across several benchmarks such as MT Bench, MixEval-Hard, IFEval-v5, and WildBench.

You can further customize it for your specific needs using NVIDIA NeMo, an end-to-end platform for developing custom generative AI, anywhere.

NeMo offers state-of-the-art fine-tuning and alignment support with parameter-efficient fine-tuning (PEFT) techniques, including p-tuning, low-rank adaption (LoRA), and its quantized version (QLoRA). These techniques are useful for creating custom models without requiring a lot of computing power.

NeMo also supports supervised fine-tuning (SFT) and alignment techniques such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and NeMo SteerLM. These techniques enable further steering the model responses and aligning them with human preferences, making the LLMs ready to integrate into custom applications.

Get started

To experience Mistral NeMo NIM microservice, see the Artificial Intelligence solution page. You will also find popular models, such as Llama 3.1 405B, Mixtral 8X22B, and Gemma 2B.

With free NVIDIA cloud credits, you can start testing the model at scale and build a proof of concept (POC) by connecting your application to the NVIDIA-hosted API endpoint running on a fully accelerated stack.

Discuss (3)

About the Authors

👁 Avatar photo

About Anjali Shah
Anjali Shah is a senior deep learning scientist at NVIDIA within the Developer Advocate Engineering group helping clients build generative AI solutions. Early in her career, as a software engineer, she built mission-critical platforms for the world's leading financial services firms. She then spent several years in the healthcare sector, architecting and implementing large scale healthcare (EHR) systems. Before joining NVIDIA, she spent several years at a leading tech company, working across different industries helping clients build innovative data and AI solutions. She has a Ph.D. in biomedical informatics and applied statistics and an M.S. and B.S. in computer science and engineering.

View all posts by Anjali Shah

👁 Avatar photo

About Chintan Patel
Chintan Patel is a senior product manager at NVIDIA focused on bringing GPU-accelerated solutions to the HPC community. He leads the management and offering of the HPC application containers on the NVIDIA GPU Cloud registry. Prior to NVIDIA, he held product management, marketing and engineering positions at Micrel, Inc. He holds an MBA from Santa Clara University and a bachelor's degree in electrical engineering and computer science from UC Berkeley.

View all posts by Chintan Patel

URL: https://developer.nvidia.com/blog/power-text-generation-applications-with-mistral-nemo-12b-running-on-a-single-gpu/