URL: https://thenewstack.io/nvidia-shaves-up-to-30-off-large-language-model-training-times/

Nvidia Shaves up to 30% off Large Language Model Training Times - The New Stack


Nvidia Shaves up to 30% off Large Language Model Training Times

Nvidia revs its NeMo Megatron LLM stack, implementing novel techniques and a new tool to speed training and enable larger models
Jul 28th, 2022 9:05am by Andrew Brust
Nvidia is announcing today that its NeMo Megatron product, an open source full-stack framework for developing and managing large language models (LLMs), will ship with several improvements that reduce LLM training times. Since LLMs are colossal in size, often having hundreds of billions or even on the order of a trillion tunable parameters, even small improvements can be highly impactful. But these improvements are not small; Nvidia says they can trim training times by as much as 30%.

LLMs are a specific type of deep learning/neural network model, used for a variety of natural language use cases, including content generation, text summarization, chatbots and other conversational AI applications. LLMs are also quite versatile, with pre-trained models being generally applicable to numerous tasks, rather than custom-designed for particular ones, as is the case with other types of neural network models. LLMs’ complexity delivers a big benefit, but that only comes as a reward for a great deal of work.

Main Ingredients

The New Stack spoke with Ujval Kapasi, Nvidia’s VP of Deep Learning Software, who said that “a lot of the work we’ve been doing at Nvidia over the last few years has been to build hardware and software optimized to accelerate the training and the inference and deployment of these neural networks.”

That definitely seems to be the credo in place for these NeMo Megatron improvements, which come down to:

  • Two novel approaches in training LLMs: selective activation recomputation and sequence parallelism.
  • A new hyperparameter tool that optimizes training based on the desired model size and infrastructure resources available.

Kapasi explained each of these technological advancements in refreshingly plain English. In colloquial terms, they all come down to working smarter, not harder. I’ll attempt to convey how each of the NeMo Megatron improvements does this.

Go Back and Do It Again

Training deep learning models in general, and LLMs specifically, involves a process of iterative improvement. Kapasi explained that at first, a model produces naive predictions: “the basic approach is… it starts out with completely randomized data…and the neural network makes predictions [that are] completely wrong.” But as those predictions are compared to their actual ground truth values, weightings can be adjusted, and results get progressively better.
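To make that loop concrete, here is a minimal sketch in plain Python. It is a toy with a single weight, not anything resembling NeMo Megatron; the `train` function, the learning rate and the data (targets that are always three times the input) are all assumptions made up for illustration.

```python
import random

def train(examples, steps=200, learning_rate=0.01, seed=0):
    """Fit prediction = weight * x by repeated small corrections."""
    rng = random.Random(seed)
    weight = rng.uniform(-1.0, 1.0)  # random start, so early predictions are wrong
    for _ in range(steps):
        for x, target in examples:
            prediction = weight * x        # forward pass
            error = prediction - target    # compare with the ground truth
            gradient = 2 * error * x       # backward pass: d(error**2) / d(weight)
            weight -= learning_rate * gradient  # adjust the weighting
    return weight

# Hypothetical ground truth: the target is always 3 times the input.
examples = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
learned = train(examples)
print(round(learned, 4))
```

The weight starts somewhere random and, after enough passes, settles very close to 3. An LLM does the same thing with billions of weights at once, which is where the memory and compute pressure comes from.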

As the forward pass to generate the predictions is completed, a lot of memory may be required to retain the intermediate values (the activations) for the backward pass, where the weightings are adjusted. To avoid the memory hit, the values can instead be recomputed, but that drives up the compute resources required. Neither choice seems pleasant, but simply recomputing everything has been the norm.

Turns out, there is a better way. Selective activation recomputation (SAR) offers a compromise. It prioritizes recomputation of values that take a significant amount of memory and whose calculations have relatively small compute needs. This then leaves more memory that can be used to cache activation values that would involve more resource-intensive recomputation.
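The trade-off can be sketched as a simple greedy choice. This is not Nvidia's implementation; the `Activation` record, the `choose_recompute` function, the layer names and every number below are hypothetical, picked only to show the idea of giving up the values that free the most memory for the least recompute work.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Activation:
    name: str
    memory_mb: float       # memory needed to keep it for the backward pass
    recompute_cost: float  # relative compute needed to regenerate it

def choose_recompute(activations, memory_budget_mb):
    """Greedy sketch: drop (and later recompute) the values that free the
    most memory per unit of recompute work, until the rest fits the budget."""
    stored = list(activations)
    recomputed = []
    by_payoff = sorted(
        activations,
        key=lambda a: a.memory_mb / a.recompute_cost,
        reverse=True,
    )
    for act in by_payoff:
        if sum(a.memory_mb for a in stored) <= memory_budget_mb:
            break
        stored.remove(act)
        recomputed.append(act)
    return stored, recomputed

# Made-up sizes and costs, for illustration only.
layers = [
    Activation("attention_scores", memory_mb=800, recompute_cost=1.0),
    Activation("attention_dropout", memory_mb=400, recompute_cost=0.8),
    Activation("mlp_hidden", memory_mb=300, recompute_cost=8.0),
    Activation("qkv_projection", memory_mb=200, recompute_cost=6.0),
]
stored, recomputed = choose_recompute(layers, memory_budget_mb=600)
print("keep in memory:", [a.name for a in stored])
print("recompute:     ", [a.name for a in recomputed])
```

With these invented numbers, the two large but cheap-to-regenerate values are recomputed, and the two that are expensive to regenerate stay cached.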

Parallelism and Heuristics

Another facet of LLM training involves parallelization within a model’s transformer layer. While many tasks can be tensor parallelized across multiple GPUs, others are simply replicated on each one. But the new sequence parallelism (SP) technology in NeMo Megatron parallelizes these tasks as well, along the sequence dimension, further reducing compute resource requirements and speeding the training process.

Parallelism modes within a transformer layer. Credit: Nvidia
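A rough way to picture splitting along the sequence dimension: if an operation treats each token independently, each worker can handle its own slice of the sequence instead of every worker repeating the whole thing. The sketch below simulates that in plain Python with a per-token normalization. The choice of operation, the `split_sequence` helper and the four simulated workers are assumptions for illustration, not a description of NeMo Megatron's internals.

```python
import math

def layer_norm(token):
    """Normalize one token's features; depends on that token only."""
    mean = sum(token) / len(token)
    variance = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(variance + 1e-5) for v in token]

def split_sequence(sequence, num_workers):
    """Cut the sequence into contiguous chunks, one per worker."""
    chunk = math.ceil(len(sequence) / num_workers)
    return [sequence[i:i + chunk] for i in range(0, len(sequence), chunk)]

# A toy sequence: 8 tokens with 4 features each.
sequence = [[float((t + 1) * (f + 1) ** 2) for f in range(4)] for t in range(8)]

# Replicated: every worker would normalize all 8 tokens.
replicated = [layer_norm(tok) for tok in sequence]

# Sequence parallel: each of 4 simulated workers normalizes only its 2 tokens.
shards = split_sequence(sequence, num_workers=4)
sharded = [[layer_norm(tok) for tok in shard] for shard in shards]
gathered = [tok for shard in sharded for tok in shard]

print(gathered == replicated)
print([len(shard) for shard in shards])
```

The gathered result matches the replicated one, but each simulated worker only touched a quarter of the tokens.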

Finally, there is the issue of moving past parameters and instead tuning hyperparameters, which govern the training approach taken. Rather than cycling through a range of values by brute force, NeMo Megatron’s hyperparameter tool (HP tool) sets these values based on the compute environment and requirements, for example, the number of GPUs/size of the GPU cluster and the desired size of the model. While some range testing is still involved, there’s much less of it, which speeds the hyperparameter tuning process and optimizes the training strategy, thereby speeding up the broader training process as well.
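To show why knowing the cluster and model size up front shrinks the search, here is a hedged sketch. It is not the HP tool's logic; the `candidate_configs` function, the 80 GB of memory per GPU, the 16 bytes of training state per parameter and the lists of split sizes are all assumptions chosen for the example, and activation memory is ignored entirely.

```python
def candidate_configs(num_gpus, model_params_billions,
                      gpu_memory_gb=80, bytes_per_param=16):
    """Keep only (tensor, pipeline) splits whose model shard fits on one GPU.

    bytes_per_param is an assumed rule of thumb covering weights, gradients
    and optimizer state.
    """
    # billions of parameters * bytes per parameter = gigabytes
    model_gb = model_params_billions * bytes_per_param
    candidates = []
    for tensor_parallel in (1, 2, 4, 8):
        for pipeline_parallel in (1, 2, 4, 8, 16, 32):
            model_parallel = tensor_parallel * pipeline_parallel
            if num_gpus % model_parallel != 0:
                continue  # must divide the cluster evenly
            if model_gb / model_parallel > gpu_memory_gb:
                continue  # shard would not fit in GPU memory
            data_parallel = num_gpus // model_parallel
            candidates.append((tensor_parallel, pipeline_parallel, data_parallel))
    return candidates

configs = candidate_configs(num_gpus=1024, model_params_billions=175)
print(len(configs), "of 24 combinations left to test")
for tensor_parallel, pipeline_parallel, data_parallel in configs:
    print(tensor_parallel, pipeline_parallel, data_parallel)
```

Under these assumptions, only 6 of the 24 combinations survive, so the remaining range testing has far less ground to cover.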

Bottom Line

Again, these three advances together provide training speed-ups of up to 30%, according to Nvidia. The company says that a 175-billion-parameter model can now be trained on 1,024 Nvidia A100 GPUs in 24 days. That may still sound like a lot, but it represents a time reduction of 10 days, which works out to saving about 250,000 hours of GPU compute compared with building such models without SAR, SP and the HP tool. Multiply that 250,000 number by your cloud provider’s hourly GPU compute cost and pretty soon it adds up to real money (with apologies to Senator Dirksen).
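The arithmetic behind those figures is easy to check. In the snippet below, the GPU count and day counts come from Nvidia's claim; the 34-day baseline is inferred from 24 days plus the 10 saved, and the hourly rate is a hypothetical placeholder to swap for your own provider's price.

```python
gpus = 1024
days_now = 24
days_saved = 10

gpu_hours_saved = gpus * days_saved * 24   # 245,760, or "about 250,000"
days_before = days_now + days_saved        # inferred 34-day baseline
fraction_saved = days_saved / days_before  # roughly 0.29, or "up to 30%"

hourly_rate_usd = 2.00  # hypothetical price per GPU-hour
dollars_saved = gpu_hours_saved * hourly_rate_usd

print(f"{gpu_hours_saved:,} GPU-hours saved")
print(f"{fraction_saved:.1%} shorter training run")
print(f"${dollars_saved:,.0f} at ${hourly_rate_usd:.2f} per GPU-hour")
```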

While non-data scientists may find all of this a bit esoteric, the downstream benefits should be clear: a greater number of bigger, more accurate LLMs will be available more quickly, for mainstream developers to use in their own applications. And it all comes down to efficiency, parallelization and better training strategy.

Nvidia says the new NeMo Megatron capabilities are available to early access customers to run on Nvidia DGX SuperPODs and Nvidia DGX Foundry, as well as on the Microsoft Azure cloud. They’re also available on Nvidia LaunchPad, a free hands-on lab platform.
