Streamline Evaluation of LLMs for Accuracy with NVIDIA NeMo Evaluator

Mar 27, 2024

By Nirmal Kumar Juluru, Nikhil Srihari, Aleksander Ficek, Nik Spirin, Eileen Long, Suseella Panguluri, Rohit Watve and Ashton Sharabiani


Large language models (LLMs) have demonstrated remarkable capabilities, from tackling complex coding tasks to crafting compelling stories to translating natural language. Enterprises are customizing these models for even greater application-specific effectiveness to deliver higher accuracy and improved responses to end users.

However, customizing LLMs for specific tasks can cause the model to “forget” previously learned tasks, a problem known as catastrophic forgetting. Therefore, as enterprises adopt LLMs into their applications, they need to evaluate the models on both the original and the newly learned tasks, and keep optimizing them to provide a better experience. In practice, evaluating a customized model means re-running the foundation and alignment evaluations to detect any regressions.
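As a rough illustration of this regression check, the sketch below compares benchmark scores from a base model and a customized model and flags any benchmark that dropped by more than a tolerance. It is not part of NeMo Evaluator: the function name `find_regressions`, the benchmark names, the scores, and the tolerance are all hypothetical, and it assumes that higher scores are better.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    customized: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return benchmarks whose score dropped by more than `tolerance`.

    Assumes higher scores are better. Benchmarks missing from
    `customized` are skipped because they were not re-run.
    """
    regressions = []
    for name, base_score in baseline.items():
        new_score = customized.get(name)
        if new_score is None:
            continue
        if base_score - new_score > tolerance:
            regressions.append(name)
    return sorted(regressions)


# Hypothetical scores, for illustration only.
baseline_scores = {"reasoning": 0.71, "translation": 0.64, "support_tickets": 0.40}
customized_scores = {"reasoning": 0.70, "translation": 0.55, "support_tickets": 0.82}

regressed = find_regressions(baseline_scores, customized_scores)
print(regressed)  # ['translation']
```

In this made-up example the customized model improves on the new task but loses ground on translation, which is the pattern that catastrophic forgetting would produce.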

To simplify the evaluation of LLMs, the NVIDIA NeMo team has announced an early access program for NeMo Evaluator, a cloud-native microservice that provides automated benchmarking capabilities. It assesses state-of-the-art foundation models and custom models using a diverse, curated set of academic benchmarks, customer-provided benchmarks, or LLM-as-a-judge.

NeMo Evaluator simplifies generative AI model evaluation

NVIDIA NeMo is an end-to-end platform for developing custom generative AI, anywhere. It includes tools for training, fine-tuning, retrieval-augmented generation, guardrailing, and data curation, as well as pretrained models. It has offerings across the tech stack, from frameworks to higher-level APIs, managed endpoints, and microservices.

The NeMo Evaluator microservice, recently launched as part of the NeMo microservices suite, comprises a set of API endpoints that provide the easiest path for enterprises to get started with LLM evaluation. To learn more, see Simplify Custom Generative AI Development with NVIDIA NeMo Microservices.

Along with the NVIDIA NeMo Customizer microservice, enterprises can continuously customize and evaluate models to enhance their performance (Figure 1).

Supported evaluation methods in early access

The NeMo Evaluator microservice supports automated evaluation on a curated set of academic benchmarks and user-provided evaluation datasets. It also supports using LLM-as-a-judge to perform a holistic evaluation of model responses, which is relevant for generative tasks where the ground truth may be undefined. The supported evaluation methods are described below.

Automated evaluation on academic benchmarks

Academic benchmarks offer a comprehensive evaluation of LLM performance across diverse language understanding and generation tasks. They serve as valuable tools for comparing different models and assisting in the selection of the most suitable LLM for specific needs. Additionally, benchmarks offer insights into areas where models may underperform, directing efforts to improve performance in those specific areas.

The NeMo Evaluator currently supports popular academic benchmarks, including:

Beyond the Imitation Game benchmark (BIG-bench): A collaborative benchmark intended to probe LLMs and extrapolate their future capabilities. It includes more than 200 tasks such as summarization, paraphrasing, solving sudoku puzzles, and more.
Multilingual: A benchmark that consists of classification and generative tasks to understand multilingual capabilities across a wide variety of languages. This benchmark tests the LLMs on various tasks, including common-sense reasoning, multilingual question-and-answer, and multilingual translation across 101 languages.
Toxicity: A benchmark to measure the toxicity of an LLM. Model toxicity is defined as content that is inappropriate, disrespectful, or unreasonable. The toxicity benchmark here is based on RealToxicityPrompts, a set of 100,000 prompts and toxicity scores.
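To make the toxicity scores more concrete, here is a minimal sketch of two aggregate measures reported in the RealToxicityPrompts paper: expected maximum toxicity and toxicity probability. Whether NeMo Evaluator reports these exact measures is not stated in this post. The function name, the 0.5 threshold default, and the scores are assumptions for illustration, and the per-generation scores are assumed to come from some external toxicity classifier.

```python
from statistics import mean
from typing import Dict, List


def summarize_toxicity(
    scores_by_prompt: Dict[str, List[float]],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Aggregate per-generation toxicity scores (0 to 1) across prompts.

    expected_max_toxicity: mean over prompts of the worst score per prompt.
    toxicity_probability: share of prompts with at least one generation
    scoring at or above `threshold`.
    """
    if not scores_by_prompt or any(not s for s in scores_by_prompt.values()):
        raise ValueError("every prompt needs at least one scored generation")
    max_scores = [max(scores) for scores in scores_by_prompt.values()]
    flagged = [1.0 if score >= threshold else 0.0 for score in max_scores]
    return {
        "expected_max_toxicity": mean(max_scores),
        "toxicity_probability": mean(flagged),
    }


# Hypothetical scores for two prompts with two generations each.
summary = summarize_toxicity({
    "prompt_1": [0.125, 0.75],
    "prompt_2": [0.25, 0.0625],
})
print(summary)
```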

Automated evaluation on custom datasets

Standard academic datasets and benchmarks often fail to meet the distinctive requirements of enterprises because they overlook crucial aspects such as domain expertise, cultural nuances, localization, and other specific considerations. That’s why enterprises turn to experts to build custom datasets and run evaluations that fit their needs.

For evaluation on custom datasets, the NeMo Evaluator microservice provides popular natural language processing (NLP) metrics that measure the similarity between ground truth labels and LLM-generated responses, such as:

Accuracy measures the proportion of correctly predicted instances out of the total instances in the dataset.
Bilingual Evaluation Understudy (BLEU) is a metric for automatically evaluating machine-translated text. The BLEU score ranges from 0 to 1 and measures how similar a machine translation is to high-quality reference translations. A score of 0 indicates no match (low quality), and a score of 1 signifies a perfect match (high quality).
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures the quality of automatic text summarization, as well as text composition, by comparing the overlap between machine-generated summaries and human-generated reference summaries.
F1 combines precision and recall into a single score, providing a balance between them. It is commonly used for evaluating the performance of classification models as well as question-answering systems.
Exact match measures the proportion of predictions that exactly match the ground truth or expected output.
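Three of the metrics above are simple enough to sketch in a few lines. The code below is an illustrative implementation of accuracy, exact match, and a token-level F1 of the kind often used for question answering. It is not the NeMo Evaluator implementation. The text normalization (lowercasing and stripping punctuation before comparison) is an assumption of this sketch, and all function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Sequence


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split on whitespace (an assumption)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Share of predicted labels that equal the expected labels."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Share of predictions identical to the reference after normalization."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall for one example."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(accuracy(["pos", "neg", "pos"], ["pos", "pos", "pos"]))
print(exact_match(["Paris.", "berlin"], ["paris", "Rome"]))          # 0.5
print(round(token_f1("the cat sat", "the cat sat down"), 3))         # 0.857
```

In the last call, all three predicted tokens appear in the reference (precision 1.0) but the reference has a fourth token (recall 0.75), which gives an F1 of about 0.857.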

Automated evaluation with LLM-as-a-judge

Using humans to evaluate LLM responses is a time-consuming and expensive process. However, employing LLM-as-a-judge has shown promising results in terms of scalability and efficiency. LLMs can rapidly assess numerous responses, potentially reducing evaluation time and costs while maintaining reliable judgment standards. For more details, see Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.

The NeMo Evaluator microservice can leverage any NVIDIA NIM-supported LLM listed in the NVIDIA API catalog with the MT-Bench dataset or custom datasets for evaluating models customized with NVIDIA NeMo Customizer.
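The general LLM-as-a-judge pattern can be sketched as follows: format the question and the model's answer into a grading prompt, send it to a judge model, and parse a numeric rating from the reply. The prompt wording and the `[[rating]]` convention are loosely modeled on MT-Bench single-answer grading. The function names are hypothetical, and the judge is passed in as a plain callable because this sketch does not assume any particular NeMo Evaluator or NIM API.

```python
import re
from typing import Callable, Optional

# Illustrative grading prompt; not the exact MT-Bench or NeMo Evaluator prompt.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the assistant's answer to the user "
    "question on a scale of 1 to 10 for helpfulness, relevance, and accuracy.\n"
    "Reply with the rating in the form [[rating]], for example [[5]].\n\n"
    "[Question]\n{question}\n\n[Answer]\n{answer}\n"
)


def parse_rating(judge_output: str) -> Optional[int]:
    """Extract the first [[N]] rating; return None if absent or out of range."""
    match = re.search(r"\[\[(\d+)\]\]", judge_output)
    if match is None:
        return None
    rating = int(match.group(1))
    return rating if 1 <= rating <= 10 else None


def judge_answer(
    question: str,
    answer: str,
    judge: Callable[[str], str],
) -> Optional[int]:
    """Build the grading prompt, call the judge model, and parse its rating."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    return parse_rating(judge(prompt))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge LLM call, so the sketch runs offline."""
    return "The answer is correct and concise. Rating: [[8]]"


score = judge_answer("What is 2 + 2?", "4", fake_judge)
print(score)  # 8
```

Injecting the judge as a callable keeps the parsing logic testable without network access. In a real setup, `fake_judge` would be replaced by a call to whichever judge model endpoint is in use.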

Apply for early access

To get started, apply for NeMo Evaluator early access. Applications are reviewed, and a link to access the microservice containers will be sent upon approval.

As part of the early access program, you can also request access to the NVIDIA NeMo Curator and NVIDIA NeMo Customizer microservices. Together, these microservices enable enterprises to easily build enterprise-grade custom generative AI and bring solutions to market faster.


About the Authors


About Nirmal Kumar Juluru
Nirmal Kumar Juluru is a product marketing manager at NVIDIA driving the adoption of AI software, models, and APIs in the NVIDIA NGC Catalog and NVIDIA AI Foundation models and endpoints. He previously worked as a software developer. Nirmal holds an MBA from Carnegie Mellon University and a bachelor's degree in computer science from BITS Pilani.


About Nikhil Srihari
Nikhil Srihari is a deep learning software technical marketing engineer at NVIDIA. He has experience working in a wide range of deep learning and machine learning applications in natural language processing, computer vision, and speech processing. Nikhil previously worked at Fidelity Investments and Amazon. His educational background includes a master's degree in computer science from the University at Buffalo, and a bachelor's degree from the National Institute of Technology Karnataka, Surathkal, India.


About Aleksander Ficek
Aleksander Ficek is a senior research engineer at NVIDIA, focusing on LLMs and NLP on both the engineering and research fronts. His past work includes shipping multiple LLM products such as NeMo Inference Microservice (NIM) and NeMo Evaluator, alongside research in retrieval-augmented generation and parameter-efficient fine-tuning. He graduated from the University of Waterloo in 2022 and runs a popular AI meetup in Zurich.


About Nik Spirin
Nik Spirin is a director for GenAI/LLMOps at NVIDIA. He leads the development of tools and workflows to enable end-to-end GenAI/LLM lifecycles, including model training, evaluation, and optimization, among others. He has 15+ years of experience working on AI/ML as a researcher, engineer, product manager, and founder. He holds a PhD in computer science from the University of Illinois at Urbana-Champaign.


About Eileen Long
Eileen Long, director of AI Services at NVIDIA, leads the Riva Speech and NeMo LLM Production Modeling & MLOps Engineering team. She gained 10 years of experience in ML at Meta and Google prior to joining NVIDIA. She has worked in multiple startups on web communities, embedded systems, and microprocessors, having held engineering, product, and marketing roles. Eileen holds a bachelor’s degree and a master’s degree in Electrical Engineering from Stanford, and an MBA from Santa Clara University. She champions STEM careers in schools and colleges, and mentors adults changing careers through Dress for Success.


About Suseella Panguluri
Suseella Panguluri is a senior engineering manager at NVIDIA specializing in LLMOps. She leads the development of tools and workflows to enable end-to-end GenAI/LLM lifecycles, including training data preparation, model fine-tuning, and evaluation of LLMs. She has more than 18 years of experience working as an engineering manager and as an engineer. She previously worked at Amazon, PayPal, and IBM and holds a bachelor's degree from the National Institute of Technology, Warangal, India.


About Rohit Watve
Rohit Watve is an engineering manager at NVIDIA specializing in LLMOps. He leads the development of tools and workflows to enable end-to-end GenAI/LLM lifecycles, including training data preparation, model fine-tuning, and evaluation of LLMs. He has more than 15 years of industry experience. He previously worked at Meta Platforms and Cisco Systems and holds a master’s degree in Electrical Engineering from Stanford University.


About Ashton Sharabiani
Ashton Sharabiani is a senior technical program manager at NVIDIA specializing in LLMOps. He has more than 12 years of experience working as a technical program manager, technical product manager, data science manager, and as a data scientist. He previously worked at Amazon, Deloitte Consulting, and Exelon Corporation, and holds a PhD in Industrial Engineering and Operations Research from the University of Illinois at Chicago.


URL: https://developer.nvidia.com/blog/streamline-evaluation-of-llms-for-accuracy-with-nvidia-nemo-evaluator/