URL: https://thenewstack.io/when-99-service-level-objectives-are-overrated-and-too-expensive/

When 99% Service Level Objectives Are Overrated (and Too Expensive) - The New Stack


When 99% Service Level Objectives Are Overrated (and Too Expensive)

Not all P99s are created the same: Intentional Service Level Objectives (SLOs) vs. the flashy 99.9999% reliability.
Oct 28th, 2022 7:34am by Jessica Wachtel

The collective wisdom holds that 99% site reliability is the standard, but Alex Hidalgo, a principal reliability advocate at Nobl9, says such high standards are often unnecessary. Sometimes 80% will work just fine!

When is it worth pulling out of the race for site reliability of even 99.9%? How important is it to really understand a user’s sweet spot so resources can be put elsewhere? These are the questions Alex Hidalgo addressed in his 2022 P99 Conf talk, “Throw Away Your Nines.”

It is a myth that service level objectives (SLOs), a vital number for site reliability engineering (SRE), require 99+% site reliability. An SLO specifies the level of reliability a service aims to deliver to its users. There are many instances where 99% isn’t necessary, and offering services with such high reliability will quickly burn through the budget.

Consider a company that wants “95% of all API requests to return with a non-error state” overall. Add in the SLO target and the request morphs into something more like “99% of requests to be good every 30 days.” Hidalgo explained that, more often than not, percentiles get attached as well, because percentiles are what inform SLOs. The original request then transforms once again and is now “99th percentile of all requests to complete within 500 milliseconds 95% of the time, every 30 days.” Here is where Hidalgo says the growing problem exists: “We’re really starting to stack things on top of each other and more and more nines are being involved.”
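To make the stacking concrete, here is a minimal Python sketch of how a target like “99th percentile of all requests to complete within 500 milliseconds 95% of the time” might be checked. The measurement windows, the sample latencies and the function names are assumptions made for illustration; they do not come from Hidalgo’s talk or from any particular SLO tool.

```python
def window_is_good(latencies_ms, threshold_ms=500, pct=99):
    """True if at least pct% of requests finished within threshold_ms.

    This is equivalent to the window's P99 being at or below the threshold.
    """
    within = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    # Integer comparison avoids floating-point rounding at the boundary.
    return within * 100 >= pct * len(latencies_ms)


def stacked_slo(windows, target_pct=95):
    """Return (good_window_count, slo_met) across all measurement windows."""
    good = sum(1 for w in windows if window_is_good(w))
    return good, good * 100 >= target_pct * len(windows)


# Hypothetical data: 20 windows of 100 request latencies each, in milliseconds.
fast = [100] * 99 + [800]       # 99 of 100 within 500 ms: good window
slow = [100] * 98 + [900, 950]  # only 98 of 100 within 500 ms: bad window
windows = [fast] * 19 + [slow]

good, met = stacked_slo(windows)
print(f"{good}/{len(windows)} windows good; SLO met: {met}")
```

Two separate percentages are in play here: one inside each window (the 99) and one across windows (the 95). That is the stacking Hidalgo describes.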

What Does Latency Look Like?

In a perfect world, the latency graph looks like the drawing below. Divide the graph into 100 pieces, because history tells us that a 1% failure rate is the baseline for acceptability, and this is where the 99th percentile (P99) falls.

[Image: latency graph in the ideal case]

Rarely does perfect happen. More common is the long-tail distribution: a small ramp-up made of quick request completions, followed by the averages, and then the long tail, a gradual ledge on the right-hand side made of slower requests, thanks to all sorts of issues that arise while APIs and networking services talk across the internet. P99 is still the standard.

[Image: long-tail latency distribution]

But what happens when the latency graph looks like any of these?

[Images: three other latency graphs]

A standard formula doesn’t work every time. Latency graphs don’t always look the same. Not all P99s are created equal.
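A short Python sketch can show why two services with the same P99 may feel very different to users. The two latency samples below are invented for illustration, and the nearest-rank percentile method is only one of several common definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical samples of 1,000 request latencies each, in milliseconds.
tight = [50] * 900 + [200] * 100
long_tail = [50] * 400 + [180] * 580 + [200] * 10 + [5000] * 10

for name, sample in (("tight", tight), ("long tail", long_tail)):
    print(
        f"{name}: P50={percentile(sample, 50)} ms, "
        f"P99={percentile(sample, 99)} ms, max={max(sample)} ms"
    )
```

Both samples report a P99 of 200 ms, yet the median and the worst case differ widely, so the single number hides the shape of the distribution.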

Throw Away Your Nines

[Image]

Stacking these numbers up side by side is pretty wild. If an SLO promises five-nines reliability, 99.999%, that leaves 0.9 seconds per day, or a total of five minutes and 15 seconds every year, of unreliable time. It is perfection on paper but, in reality, incredibly difficult to achieve. No matter how robust, resilient or redundant a site is, think about the human logistics alone as they relate to five minutes and 15 seconds of unreliable time a year.

Even if only one incident happens in a year, if it occurs at 3 a.m., it will take longer than five minutes and 15 seconds for the engineer on call to log into their computer and check logs. It’s a brilliant way to burn through a budget. Dropping to three nines, 99.9%, gives a site 8 hours and 45 minutes of unreliable time per year, which looks a whole lot better in reality, but is even that necessary? And what would it mean to step away from needing the almighty all-nines approach? Where would this leave SLOs?
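The arithmetic behind these budgets is simple enough to sketch in a few lines of Python. The function name is hypothetical, and a 365-day year is assumed, which matches the figures quoted above.

```python
from datetime import timedelta


def allowed_downtime(slo_percent, period):
    """Unreliability budget: the share of `period` that the SLO does not cover."""
    return period * ((100 - slo_percent) / 100)


for slo in (99.999, 99.9, 97.2, 80.0):
    per_day = allowed_downtime(slo, timedelta(days=1))
    per_year = allowed_downtime(slo, timedelta(days=365))
    print(f"{slo}%: {per_day.total_seconds():.1f} s/day, {per_year} per year")
```

At five nines the yearly budget comes to roughly 315 seconds, while at 99.9% it is a little under nine hours.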

Set Intentional Objectives Based on Users’ Needs and Realistic Goals

What to do when third-party dependencies keep reliability down? Hidalgo discussed two clients of his who got creative.

One client, Company A, ran a web-facing API where every call relied on a call to the database behind it. Unfortunately for Company A, the database was constantly returning errors, to the tune of a 20% failure rate. Company B relied on “just about every messaging vendor imaginable,” and each vendor had a different error rate.

Both companies wanted that 99.9% but it wasn’t necessary in either case. Company A took the approach of keeping the 80% SLO target but instituting much better retry logic so users didn’t notice there was a problem. The company will have to fix the database issue down the road but in the interim, the users’ needs were identified and met, and isn’t that the point? Company B also put better retry logic in place to even out the varying latencies while it did a deeper dive into what worked best for its users. Company B landed on an SLO target of 97.2%. Anything over that and the users didn’t notice, but under that and they sure did.
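The article does not describe how either company implemented its retries, so the following Python sketch is only an illustration of the general idea. The flaky backend is a hypothetical stand-in with a 20% failure rate, failures are assumed to be independent of one another, and real retry logic would usually add backoff and a time limit.

```python
import random


def call_with_retries(operation, max_attempts=3):
    """Call operation() until it succeeds or the attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except RuntimeError as err:
            last_error = err  # remember the failure and try again
    raise last_error


def flaky_database_call(rng, failure_rate=0.2):
    """Hypothetical stand-in for a backend that fails 20% of the time."""
    if rng.random() < failure_rate:
        raise RuntimeError("database error")
    return "ok"


def observed_success_rate(max_attempts, trials=10_000, seed=42):
    """Simulate user-facing requests and return the share that succeeded."""
    rng = random.Random(seed)
    succeeded = 0
    for _ in range(trials):
        try:
            call_with_retries(lambda: flaky_database_call(rng), max_attempts)
            succeeded += 1
        except RuntimeError:
            pass
    return succeeded / trials


for attempts in (1, 2, 3):
    expected = 1 - 0.2 ** attempts  # valid only if failures are independent
    simulated = observed_success_rate(attempts)
    print(f"{attempts} attempt(s): expected {expected:.3f}, simulated {simulated:.3f}")
```

Under these assumptions, a backend that succeeds only 80% of the time can look far more reliable to the user after a couple of retries, which is why the backend SLO could stay at 80%.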

What Companies C and D have in common is that they couldn’t hit that 99.9% due to “downtime.” Company C performed long-running batch jobs that took hours and hours and failed one in every five times. Company C set its SLO at 80% because a one-in-five failure rate was what it had come to see as success.

Company D, though in the process of migrating, had a primary code repository system that went down for an hour daily to complete the backup process. Company D built its SLO around the time outside its backup window. Both Companies C and D were far from 99% reliability but still had what their users considered successful metrics.
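One way to build an SLO around a known maintenance window is to leave that window out of the measurement. The Python sketch below assumes a daily backup from 02:00 to 03:00 and a list of health-check results; both are invented for illustration, since the article does not say how Company D measured availability.

```python
from datetime import datetime, time

# Hypothetical daily backup window: 02:00 up to, but not including, 03:00.
BACKUP_START = time(2, 0)
BACKUP_END = time(3, 0)


def in_backup_window(timestamp):
    """True if the timestamp falls inside the daily backup window."""
    return BACKUP_START <= timestamp.time() < BACKUP_END


def availability_outside_backup(probes):
    """probes: list of (timestamp, succeeded) health-check results."""
    counted = [ok for ts, ok in probes if not in_backup_window(ts)]
    return sum(counted) / len(counted)


probes = [
    (datetime(2022, 10, 28, 1, 30), True),
    (datetime(2022, 10, 28, 2, 15), False),  # during backup: not counted
    (datetime(2022, 10, 28, 2, 45), False),  # during backup: not counted
    (datetime(2022, 10, 28, 3, 5), True),
    (datetime(2022, 10, 28, 9, 0), True),
    (datetime(2022, 10, 28, 14, 0), False),  # outside the window: counted
]
print(availability_outside_backup(probes))  # 0.75
```

Failures during the backup hour no longer count against the target, while a failure at any other time still does.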

In conclusion, just be intentional. Not all nines are bad, but when creating SLO targets, “There are nine numbers besides nine that you can in fact use.” There’s nothing wrong with the 99th percentile or with trying to hit 99.99999% reliability when it’s needed, as long as these metrics aren’t being used just because it’s what everyone else is doing. “What works for one site doesn’t need to work for everything else,” Hidalgo said.

Jessica Wachtel is a developer marketing writer at InfluxData where she creates content that helps make the world of time series data more understandable and accessible. Jessica has a background in software development and technical journalism.