
URL: https://thenewstack.io/how-meta-is-reinforcing-its-global-network-for-ai-traffic/

How Meta Is Reinforcing its Global Network for AI Traffic - The New Stack


2024-09-12 11:45:22
AI / Networking / Operations

How Meta Is Reinforcing its Global Network for AI Traffic

In 2022, Meta engineers realized they needed to deal with the incoming tsunami of AI data traffic that was about to overwhelm their networks.
Sep 12th, 2024 11:45am by Joab Jackson
Image of Jyotsna Sundaresan of Meta, presenting at the company’s Networking @Scale 2024 conference.

It was in 2022 when Meta engineers started to see the first clouds of an incoming storm: how much AI would change the nature, and the volume, of the company’s network traffic.

“Starting 2022, we started seeing a whole other picture,” said Jyotsna Sundaresan, a Meta network strategist, in a talk Wednesday at Meta’s Networking @Scale 2024 conference, being held this week both virtually and at the Santa Clara Convention Center in California.

Mind you, Meta owns one of the world’s largest private backbones, a global network physically connecting 25 data centers and 85 points of presence with millions of miles of fiber optic cable, buried under both land and sea. Its reach and throughput allow someone on an Australian beach to see videos being posted by their friend in Greece nearly instantaneously.

And for the past five years, this global capacity has grown consistently by 30% a year.
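To put that growth rate in perspective, here is a minimal sketch of the compounding arithmetic. The function name is ours, and it assumes the 30% figure compounds evenly each year, which the talk did not spell out.

```python
def compound_growth(rate: float, years: int) -> float:
    """Return the capacity multiple after growing by `rate` each year for `years` years."""
    return (1.0 + rate) ** years


# Assumption: 30% growth compounds evenly over each of the five years.
factor = compound_growth(0.30, 5)
print(f"Capacity multiple after five years: {factor:.2f}x")
```

Under that assumption, five years of 30% growth works out to roughly 3.7 times the starting capacity.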

Yet the growing AI demands on the backbone are bumpy and difficult to predict.

“The impact of large clusters, GenAI, and AGI is yet to be learned,” Sundaresan said. “We haven’t yet fully flushed out what that means for the backend.”

Nonetheless, the networking team has gotten creative in coming up with ways to mitigate the ravenous networking demands of AI, while still increasing its throughput.

The Full AI Data Life Cycle

Back in 2022, Facebook, WhatsApp, Instagram and other product groups all started requesting fleets of GPUs for their AI efforts.

There had been requests in earlier years, but they resulted in smaller clusters that did not generate a lot of traffic across data centers, so they largely went unnoticed by Meta’s networking team.

But in 2022, demand for the GPUs grew by 100% year over year.

For the networking department, this resulted in “a higher-than-anticipated uptick in growth of traffic on the backbone,” Sundaresan said. “We were not ready for this.”

Initially, the group assumed that most of the AI traffic these clusters generated would move only from storage to the GPUs. “We had missed several critical elements of this AI life cycle,” she said.

“AI workloads are just not as fungible with hardware heterogeneity”
— Jyotsna Sundaresan, of Meta

Data replication and data placement turned out to be two considerable challenges. The team did not fully anticipate how much traffic these would cause.

AI requires fresh data, and this data is generated from everywhere, both by users and by machines, all the time.

All this data then has to be placed somewhere, usually in one or more remote data centers. It also has to be backed up to other locations to meet various quality control and regulatory mandates.

As a result, “There’s a lot of movement across different regions,” Sundaresan said, noting that it can total exabytes each day.

Worse, AI workloads are not quite as fungible as other workloads, Sundaresan noted, meaning the hardware and software requirements can be a lot more fussy than a more generic workload.

For instance, an Nvidia A100 GPU has different network interface preferences than an H100 GPU.

Bending the Demand Curve

You can think of this process as an AI data life cycle, said Abishek Gopalan, a Meta network engineer working on global infrastructure who also presented this talk.

The network management team set out to find ways to optimize the network to better handle the characteristics of these data movements, a process called “backbone dimensioning,” Gopalan noted.

Abishek Gopalan, of Meta, presenting at the company’s Networking @Scale 2024 conference.

“The backbone is a precious resource, and it’s a shared infrastructure that we’re using to support all of Meta’s products and platforms,” Gopalan said.

The networking team had to take a “holistic view” of the traffic, looking not only at network capacity, but also at compute and storage resources.

Once created, fresh data would need to be copied to so many other locations that the cost of moving it would exceed the cost of the original computation.

So the networking team worked with the storage team on caching and better data placement strategies, and deployed instrumentation to help figure out what data doesn’t get used by AI, so it does not have to be moved at all.

[Figure: Data placement chart.]

Also, not all data needs to get moved right away, so better understanding latency requirements of the data itself helped smooth the flow of traffic across the network. Not every batch of data comes with the same service-level requirements. This allows networking to be divided into differentiated classes for different workloads, through the use of advanced scheduling tools.
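As a rough illustration of the idea, the sketch below packs transfers into hourly capacity slots by deadline, so urgent data goes first and bulk data fills whatever capacity is left later. It is not Meta’s scheduler; the `Transfer` class, the transfer names, the sizes and the link capacity are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    name: str
    size_tb: float
    deadline_hours: int  # how long the data can wait before it must arrive


def schedule(transfers, capacity_tb_per_hour, horizon_hours):
    """Earliest-deadline-first packing of transfers into hourly capacity slots."""
    remaining = [float(capacity_tb_per_hour)] * horizon_hours
    plan = {}
    for t in sorted(transfers, key=lambda t: t.deadline_hours):
        left = t.size_tb
        slots = []
        # Only hours before the deadline are eligible for this transfer.
        for hour in range(min(t.deadline_hours, horizon_hours)):
            if left <= 0:
                break
            used = min(left, remaining[hour])
            if used > 0:
                remaining[hour] -= used
                left -= used
                slots.append((hour, used))
        plan[t.name] = {"slots": slots, "unscheduled_tb": left}
    return plan


# Hypothetical workload: one urgent transfer, one that can wait a few hours.
transfers = [
    Transfer("backup_replica", size_tb=20.0, deadline_hours=4),
    Transfer("fresh_training_data", size_tb=10.0, deadline_hours=1),
]
plan = schedule(transfers, capacity_tb_per_hour=10, horizon_hours=4)
for name, entry in plan.items():
    print(name, entry)
```

In this toy run the urgent transfer takes the whole first hour, and the backup copy is pushed into the second and third hours instead of competing for the same capacity.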

Without this work, the amount of traffic that would end up on the backbone would be “untenable,” Gopalan said.

Llama 3 Differed from Earlier AI Requirements

What casual observers may not readily appreciate is how much the creation of large language models (LLMs) requires specialized cluster and networking topologies, ones that differ from those used in earlier AI work.

In another talk at the conference, a pair of Meta production networking engineers, Pavan Balaji and Adi Gangidi, examined the network optimizations needed to support Meta’s own 405 billion parameter Llama 3 LLM.

Adi Gangidi, of Meta, speaks at the company’s Networking @Scale 2024 conference.

To run LLMs, you need both accuracy and speed, hence the need for the large clusters of GPUs attached to fast network and storage systems, Balaji said.

To train and serve the first two iterations of Llama, Meta used existing GPU clusters originally built for ranking and recommendation work. Ranking works best in a mesh-like communication topology across all the GPUs.

These clusters proved suboptimal for the larger Llama 3, however, which works best with a more hierarchical communication pattern.

So the company built two additional clusters, each with 24,000 GPUs, just for Llama 3.

Instead of a full mesh model, these clusters separated GPUs into zones of 3,000 each with full bisection bandwidth.

These partitions were connected together with “aggregate training switches,” which did not offer full bisection bandwidth, but rather oversubscription, which assumed not all nodes would be using the switch at once.

This was just fine, because “generative AI workloads have hierarchical collectives which produce traffic patterns like trees or rings, and for these patterns, they can tolerate oversubscription just fine,” Gangidi added.
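The numbers above can be sketched in a few lines. The zone count follows directly from the figures in the talk; the oversubscription ratio is computed here from hypothetical bandwidth values, since the article does not give the actual ratio Meta uses.

```python
import math


def zone_count(total_gpus: int, gpus_per_zone: int) -> int:
    """Number of zones needed to hold all GPUs."""
    return math.ceil(total_gpus / gpus_per_zone)


def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of bandwidth facing the nodes to bandwidth facing the next tier up."""
    return downlink_gbps / uplink_gbps


# From the talk: 24,000 GPUs per cluster, split into zones of 3,000.
zones = zone_count(24_000, 3_000)

# HYPOTHETICAL bandwidth figures, chosen only to illustrate the calculation.
ratio = oversubscription_ratio(downlink_gbps=1600, uplink_gbps=400)

print(f"Zones per cluster: {zones}")
print(f"Example oversubscription: {ratio:.0f}:1")
```

A ratio of 1:1 would correspond to full bisection bandwidth; anything higher means the upper tier is counting on not every node sending across zones at once.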

Further fine-tuning was still required, through load balancers, routing techniques such as advanced Equal Cost Multi-Path (ECMP) and other methods of traffic engineering.
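For readers unfamiliar with ECMP, the basic idea is that a router hashes a flow’s identifying fields and uses the result to pick one of several equal-cost paths, so packets of the same flow stay on the same path. The sketch below shows only that basic mechanism, not the advanced variant Meta described; the path names and flow values are made up, and SHA-256 stands in for whatever hash real switch hardware uses.

```python
import hashlib


def ecmp_pick(flow, paths):
    """Pick one equal-cost path for a flow by hashing its identifying fields."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(paths)
    return paths[index]


# Hypothetical equal-cost paths and flows (src IP, dst IP, protocol, src port, dst port).
paths = ["spine-1", "spine-2", "spine-3", "spine-4"]
flow_a = ("10.0.0.1", "10.0.1.1", "udp", 4791, 4791)
flow_b = ("10.0.0.2", "10.0.1.7", "udp", 4791, 4791)

print(flow_a, "->", ecmp_pick(flow_a, paths))
print(flow_b, "->", ecmp_pick(flow_b, paths))
```

Because the choice depends only on the hash, a handful of very large flows can land on the same path, which is one reason plain ECMP often needs additional traffic engineering on top.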

Bigger Backbone Still Needed

While all this work aims to reduce demand quite a bit, Gopalan acknowledged there is still work to be done on the other side of the equation: fortifying the supply curve.

In other words, the core network will still have to be aggressively expanded, reinforced with more fiber optic cable, and given more storage and power support as well.

“We intentionally design our backbone to allow for more flexible demand patterns, as well as allow for more workload optionality,” Gopalan said, “so that it allows our backbone to really serve potential spikes or changes in demand patterns, which aren’t always easy to predict.”

Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 30 years, including stints at IDG and Government Computer News. Before that, he...