URL: https://thenewstack.io/openai-recovers-30000-cpu-cores-with-fluent-bit-tweak/


OpenAI Recovers 30,000 CPU Cores With Fluent Bit Tweak

By profiling the system and disabling a single unnecessary function, the AI giant recovered over 35% of the CPU cycles Fluent Bit was using. A report from KubeCon.
Nov 13th, 2025 8:30am by Joab Jackson
Fabian Ponce at Kubecon 2025 (TNS).

ATLANTA — When systems grow large enough, even very small optimizations can lead to very large savings.

This was the lesson that OpenAI Technical Staff Member Fabian Ponce imparted before the keynote crowd at KubeCon+CloudNativeCon North America 2025, being held this week in Atlanta.

OpenAI’s Observability Challenge at Scale

Each iteration of OpenAI’s ChatGPT has brought big improvements, along with more Kubernetes clusters and greater volumes of traffic, “and orders of magnitude more telemetry to keep it all running,” Ponce said.

In order to make it all run smoothly, OpenAI requires “an absolutely massive amount of telemetry and making it fast, queryable and actionable at scale,” he said.

Fluent Bit’s Critical Role in Data Telemetry

OpenAI runs Fluent Bit, an observability platform stewarded by the Cloud Native Computing Foundation, on every Kubernetes node. It digests log files and enriches them with samples of network streams, formats the results and sends them to the appropriate data stores.

With this architecture, Fluent Bit generates 10PB of data a day, stored in ClickHouse.

The Drive for Resource Efficiency Amidst Massive Growth

OpenAI, Ponce admitted, has an “absolutely insatiable appetite” for GPUs. OpenAI CEO Sam Altman has plans for the company to use over 1 million GPUs by the end of the year, and promises to increase that number 100x.

And all those GPUs will also need CPUs to run.

Despite these gargantuan purchase orders, the company’s observability engineers are still mindful of using resources efficiently. One of their missions is to make Fluent Bit as “lean as possible.”

Using perf, a Linux tool for gathering performance data, the observability team looked at the CPU cycles Fluent Bit was using. Ponce hypothesized that most of the work Fluent Bit was doing would be in preparing and formatting the incoming data.

Uncovering a Surprising CPU Bottleneck With perf

But what surprised Ponce was that this wasn’t the case at all. Instead, at least 35% of the CPU cycles were chewed up by a single function (fstatat64) whose purpose was to figure out how large log files were before reading them.

So the team turned off this capability — and the results were immediately apparent:

“The results speak for themselves,” Fabian Ponce told the crowd. “We have a new load pattern here that uses about half as much CPU while doing exactly the same work.”

Every time a log file is written to, Fluent Bit executes fstatat64 to read the size of the file.

“If the process is continually emitting new logs, line by line, then Fluent Bit is going to race that, and continue to run fstatat64 every time that happens,” Ponce explained. “That is going to burn a ton of extra compute.”

And it turns out the company didn’t really need that information, at least not at that level of nuance.
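
The pattern Ponce describes can be illustrated with a minimal Python sketch. This is not Fluent Bit's C code or its actual fix: the function name tail_writes, the simulated millisecond clock and the min_interval_ms parameter are all assumptions made for illustration, and os.stat simply stands in for fstatat64. The sketch counts how many size checks a tailer makes when it checks on every write, compared with when it ignores notifications that arrive too soon after the last check.

```python
import os
import tempfile


def tail_writes(path, n_lines, min_interval_ms=0):
    """Append n_lines to path, one per simulated millisecond, and return
    (stat_calls, bytes_seen) for a tailer following the file.

    min_interval_ms=0 models one size check per write notification.
    A larger value skips notifications that arrive within that many
    milliseconds of the previous check (hypothetical knob, not a real
    Fluent Bit option).
    """
    stat_calls = 0
    last_check_ms = None
    with open(path, "a") as log:
        for now_ms in range(n_lines):
            log.write(f"log line {now_ms}\n")
            log.flush()  # make the new bytes visible to os.stat
            due = last_check_ms is None or now_ms - last_check_ms >= min_interval_ms
            if due:
                os.stat(path)  # stand-in for the fstatat64 size check
                stat_calls += 1
                last_check_ms = now_ms
    # One final catch-up check so no trailing bytes are missed.
    bytes_seen = os.stat(path).st_size
    stat_calls += 1
    return stat_calls, bytes_seen


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        every_write = tail_writes(os.path.join(workdir, "a.log"), 1000)
        coalesced = tail_writes(os.path.join(workdir, "b.log"), 1000, min_interval_ms=100)
    print("check on every write:", every_write)
    print("check at most every 100 ms:", coalesced)
```

In this toy run both tailers end up seeing the same number of bytes, but the first makes 1,001 size checks and the second makes 11. The real savings depend on how often a process writes and how stale a file size the pipeline can tolerate.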

The Impact of Disabling a Hungry Function

While the maintenance team knew the change would reduce CPU usage, they could be forgiven for not realizing how large the savings would be.

In fact, when Fluent Bit was modified system-wide, it ended up “returning about 30,000 CPU cores to our Kubernetes clusters,” Ponce said.

“If we can return a CPU to every node, then maybe that’s one more microservice that we can fit into a given host,” he said.

The team went on to optimize Fluent Bit in other ways as well, though this one tweak had the biggest overall impact. The company’s engineers are preparing a patch for Fluent Bit that would allow users to specify a lower threshold of notifications.

Key Takeaways for Performance Optimization

The takeaway for Ponce was clear: There is always value in breaking out your “profiler of choice, and seeing what is happening under the hood.”

As famed Golang programmer Rob Pike once advised in his Five Rules of Programming: “You can’t tell where a program will spend its time. Bottlenecks occur in surprising places.”

And in large distributed systems, those little bottlenecks can be expensive unless they are uncorked.

Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 30 years, including stints at IDG and Government Computer News. Before that, he...
ClickHouse and the Cloud Native Computing Foundation are sponsors of The New Stack.
TNS owner Insight Partners is an investor in: Bit, OpenAI.