ATLANTA — When systems grow large enough, even very small optimizations can lead to very large savings.
This was the lesson that OpenAI Technical Staff Member Fabian Ponce imparted before the keynote crowd at KubeCon+CloudNativeCon North America 2025, being held this week in Atlanta.
Each iteration of OpenAI’s ChatGPT has brought big improvements, along with more Kubernetes clusters and greater volumes of traffic, “and orders of magnitude more telemetry to keep it all running,” Ponce said.
In order to make it all run smoothly, OpenAI requires “an absolutely massive amount of telemetry and making it fast, queryable and actionable at scale,” he said.
OpenAI runs Fluent Bit, an observability platform stewarded by the Cloud Native Computing Foundation, on every Kubernetes node. It digests log files and enriches them with samples of network streams, formats the results and sends them to the appropriate data stores.
With this architecture, Fluent Bit generates 10 PB of data a day, which is stored in ClickHouse.
OpenAI, Ponce admitted, has an “absolutely insatiable appetite” for GPUs. OpenAI CEO Sam Altman has plans for the company to use over 1 million GPUs by the end of the year, and promises to increase that number 100x.
And all those GPUs will also need CPUs to run.
Despite these gargantuan purchase orders, the company’s observability engineers, at least, remain mindful of using resources efficiently. One of their missions is to make Fluent Bit as “lean as possible.”
Using perf, a Linux tool for gathering performance data, the observability team looked at the CPU cycles Fluent Bit was using. Ponce hypothesized that most of the work Fluent Bit was doing would be in preparing and formatting the incoming data.
But what surprised Ponce was that this wasn’t the case at all. Instead, at least 35% of the CPU cycles were chewed up by a single function (fstatat64), whose purpose was to figure out how large log files were before reading them.
So the team turned off this capability — and the results were immediately apparent:
“The results speak for themselves,” Fabian Ponce told the crowd. “We have a new load pattern here that uses about half as much CPU while doing exactly the same work.”
Every time a file is written to, Fluent Bit calls fstatat64 to read the size of the file.
“If the process is continually emitting new logs, line by line, then Fluent Bit is going to race that, and continue to run fstatat64 every time that happens,” Ponce explained. “That is going to burn a ton of extra compute.”
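The cost pattern Ponce describes can be illustrated with a small sketch. It is not Fluent Bit's code or OpenAI's patch: the function names, the simulated millisecond timestamps and the 100 ms interval are all assumptions made for illustration. It counts how many file-metadata lookups a tailer makes if it checks the file on every write notification, compared with one that checks at most once per interval.

```python
import os
import tempfile


def count_stats_eager(path, event_times_ms):
    """Stat the file once per write notification (the costly pattern)."""
    calls = 0
    for _ in event_times_ms:
        os.stat(path)  # one metadata syscall per notification
        calls += 1
    return calls


def count_stats_coalesced(path, event_times_ms, min_interval_ms):
    """Stat at most once per min_interval_ms, however many notifications arrive."""
    calls = 0
    last_check = None
    for t in event_times_ms:
        if last_check is None or t - last_check >= min_interval_ms:
            os.stat(path)
            calls += 1
            last_check = t
    return calls


def demo():
    # Hypothetical workload: 1,000 write notifications, one per millisecond.
    event_times_ms = list(range(1000))
    with tempfile.NamedTemporaryFile() as f:
        eager = count_stats_eager(f.name, event_times_ms)
        coalesced = count_stats_coalesced(f.name, event_times_ms, min_interval_ms=100)
    return eager, coalesced


if __name__ == "__main__":
    eager, coalesced = demo()
    print(f"eager: {eager} stat calls, coalesced: {coalesced} stat calls")
```

With these assumed numbers, the eager version makes 1,000 lookups and the coalesced version makes 10. The trade-off is that a coalescing reader may notice new data slightly later.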
And it turns out the company didn’t really need that information, at least not at that level of nuance.
While the maintenance team knew the change would reduce CPU usage, they could perhaps be forgiven for not realizing how large the savings would be.
In fact, when Fluent Bit was modified system-wide, it ended up “returning about 30,000 CPU cores to our Kubernetes clusters,” Ponce said.
“If we can return a CPU to every node, then maybe that’s one more microservice that we can fit into a given host,” he said.
The team went on to optimize Fluent Bit in other ways as well, though this one tweak had the biggest overall impact. The company’s engineers are preparing a patch for Fluent Bit that would allow users to specify a lower threshold of notifications.
The takeaway for Ponce was clear: There is always value in breaking out your “profiler of choice, and seeing what is happening under the hood.”
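As a rough illustration of that advice, here is a minimal sketch using Python's built-in cProfile module rather than perf. The pipeline is entirely hypothetical (`format_record`, `file_size` and `process` are invented for this example). It mimics the shape of the problem: a formatting step that looks like the main work, plus a per-line metadata lookup that is easy to overlook.

```python
import cProfile
import os
import pstats
import tempfile


def format_record(line):
    """The work we expect to dominate: clean up one log line."""
    return line.strip().upper()


def file_size(path):
    """The work that is easy to forget: a metadata lookup on every line."""
    return os.stat(path).st_size


def process(path, lines):
    """Hypothetical pipeline: check the file size, then format each line."""
    out = []
    for line in lines:
        file_size(path)
        out.append(format_record(line))
    return out


def profile_process(n=10_000):
    """Profile process() and return {function name: (calls, cumulative seconds)}."""
    lines = [f"log line {i}\n" for i in range(n)]
    with tempfile.NamedTemporaryFile() as f:
        profiler = cProfile.Profile()
        profiler.runcall(process, f.name, lines)
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) ->
    # (primitive calls, total calls, own time, cumulative time, callers)
    return {name: (nc, ct) for (_, _, name), (_, nc, _, ct, _) in stats.stats.items()}


if __name__ == "__main__":
    ranked = sorted(profile_process().items(), key=lambda kv: kv[1][1], reverse=True)
    for name, (calls, cumtime) in ranked[:5]:
        print(f"{name:45s} calls={calls:6d} cumtime={cumtime:.4f}s")
```

Timings vary by machine, so no output is shown. The point is only that the profile shows where the time goes, without anyone having to guess.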
As famed Golang programmer Rob Pike once advised in his Five Rules of Programming: “You can’t tell where a program will spend its time. Bottlenecks occur in surprising places.”
And in large distributed systems, those little bottlenecks can be expensive unless they are uncorked.