URL: https://thenewstack.io/usenix-the-3-measures-of-successful-site-reliability-engineering/

USENIX: The 3 Measures of Successful Site Reliability Engineering - The New Stack



Jan 13th, 2021 1:47pm by Joab Jackson
Citing an economic insight from the 1970s, AppDynamics Technology Evangelist Marco Coulter warned attendees of SRECon20 not to get too hung up on specific metrics, because they may not offer complete guidance as to the overall success of the system being measured.

“Whenever a measure becomes a target, it ceases to be a good measure,” he said during his presentation at the USENIX virtual event last month, paraphrasing British economist Charles Goodhart, who was writing about managing U.K. monetary policy.

“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”

— Charles Goodhart.

Instead, the SRE must take the entire system into account, particularly in terms of customer satisfaction. “As technicians, we focus on the measure as the target, the goal,” Coulter said. The better approach is for the SRE to work with the end user to define what overall success looks like.

In his presentation, Coulter tells a story about working for a hospital service provider, specifically managing a system that would insert new lab results into the patient record, which was managed by a mainframe system. Hospital nurses complained of the time it took to update the patient records, and a quick analysis found that messages were getting caught in a queue.

To address this concern, the dev team formulated a service level agreement (SLA) with the hospital: messages had to be processed within 10 seconds, and if the queue grew to more than 100 messages, the hospital would get a refund. Coulter coded a script that would set off an alert if the queue grew close to 100, so admins could take action, and capacity planning was reworked so that queue processing would have all the server power it needed.
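
The kind of queue-depth check described above could be sketched roughly as follows. This is an illustration only, not Coulter's actual script: the 100-message limit comes from the SLA in the story, while the 80% warning fraction, the function name `check_queue_depth` and the `QueueAlert` type are assumptions made for the example.

```python
from dataclasses import dataclass

QUEUE_SLA_LIMIT = 100   # from the SLA in the story: more than 100 messages triggers a refund
WARN_FRACTION = 0.8     # assumption: start warning at 80% of the limit


@dataclass
class QueueAlert:
    level: str   # "ok", "warn" or "breach"
    depth: int   # queue depth that produced this alert


def check_queue_depth(depth: int,
                      limit: int = QUEUE_SLA_LIMIT,
                      warn_fraction: float = WARN_FRACTION) -> QueueAlert:
    """Classify a queue depth reading against the SLA limit."""
    if depth > limit:
        # The SLA is already broken at this point.
        return QueueAlert("breach", depth)
    if depth >= limit * warn_fraction:
        # Close to the limit: admins still have time to act.
        return QueueAlert("warn", depth)
    return QueueAlert("ok", depth)


if __name__ == "__main__":
    for reading in (10, 85, 120):
        print(check_queue_depth(reading))
```

Note that a check like this reports only on the queue itself. It says nothing about transactions that fail before they are ever enqueued.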

The trouble was that the system still lagged, angering the busy nurses who relied on it, even though the message queues were empty. “The transactions were timing out even before they hit the message queue,” he said. The message queue wasn’t necessarily the bottleneck that led to the dissatisfaction.

The dev team was managing the application to the metric, not the outcome.

Site Reliability Engineering in 3D

The trick of SRE is to balance the need to please the customer against the unnecessary expense of over-provisioning operations, or stifling innovation. Three key dimensions can cover this, according to Coulter.

“You need to consider all three dimensions for success,” Coulter said. Roughly, they are:

  • Service Level Indicators (SLIs): These are the numbers that describe the state of the running system. SLIs are defined at system boundaries or team boundaries. SLIs should measure system slowdowns rather than outages, which happen less often these days. The numbers could be captured by an application performance monitoring (APM) platform such as AppDynamics, Datadog or New Relic, or by any of a number of newer observability tools like Honeycomb.io or IBM’s Instana.
  • Service Level Objectives (SLOs): These are the benchmarks that the SLI numbers need to hit, as agreed upon between the service provider and the end user. They can be expressed in terms of performance curves.
  • Service Level Agreements (SLAs): These are the agreed-upon actions that the provider must take if the SLOs go unmet. It could be a refund, or perhaps the development cycle gets suspended for 28 days to address the ongoing issues.
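
To make the relationship between the three concrete, here is a minimal sketch in which the SLI is the fraction of requests that finish within a latency threshold, the SLO is a target for that fraction, and the SLA is the action taken when the target is missed. The names `Slo`, `latency_sli` and `evaluate`, the sample numbers, and the wording of the remedy are all assumptions for illustration; they do not come from the presentation or from any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    threshold_seconds: float  # a request counts as "good" if it finishes within this time
    target_fraction: float    # fraction of requests that must be good, e.g. 0.9


def latency_sli(latencies_seconds, threshold_seconds):
    """SLI: fraction of observed requests that met the latency threshold."""
    if not latencies_seconds:
        return 1.0  # assumption: no traffic is treated as no violation
    good = sum(1 for s in latencies_seconds if s <= threshold_seconds)
    return good / len(latencies_seconds)


def evaluate(latencies_seconds, slo):
    """Return (sli, slo_met, sla_action) for one measurement window."""
    sli = latency_sli(latencies_seconds, slo.threshold_seconds)
    met = sli >= slo.target_fraction
    # SLA: the agreed consequence when the SLO is missed (hypothetical wording).
    action = "none" if met else "apply SLA remedy (e.g. refund)"
    return sli, met, action


if __name__ == "__main__":
    slo = Slo(threshold_seconds=10.0, target_fraction=0.9)
    print(evaluate([1.0, 2.0, 3.0, 20.0], slo))
```

In this sketch the SLI is measured, the SLO is a threshold on that measurement, and the SLA only comes into play when the threshold is missed.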

“In a perfect world, [the SLA] is defined by the business or the customer and then you build the SLOs and SLIs underneath it,” he said.

In the case of the hospital, the cause of the slowdown was malformed packets (messages that did not meet the HL7 standard for hospital data) emitted by a proprietary application. The dev team had no control over this application, beyond filing a bug report with the vendor, but they did have control over how success was defined by the SLO, and over the expectations of the end user.

In many cases, the engineering team doesn’t need to set SLOs to the highest possible performance level. In fact, such a level could be unduly expensive for the service provider to maintain. Rather, SLOs should be set to customer expectations. (One exception to this rule is financial institutions, where the speed of a transaction is a fiercely competitive differentiator.)

The most difficult part of the measure is understanding the end user. In the case of the hospital, this involved “observing behavior in the wards and talking to nurses,” Coulter said. They found that the nurses had an “instinctive expectation” of when the lab results would come back, in about five minutes or so. Some nurses, though, would hit the submit button repeatedly, particularly when the system was slow, which made the average response time even worse.

With this knowledge, the service provider would be able to set an SLA that centered on returning the full results within five minutes, rather than on the 10-second processing time.
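
One way to picture measuring against that five-minute expectation is to time each lab order from the nurse's first submit to the moment the result appears, so that repeated clicks do not count as separate requests. The sketch below is hypothetical: the data shapes, the `order_id` key and the function names are assumptions, not details from the presentation.

```python
FIVE_MINUTES = 300.0  # the nurses' "instinctive expectation", in seconds


def end_to_end_seconds(submits, results):
    """Elapsed time per order, measured from the FIRST submit to the result.

    submits: iterable of (order_id, timestamp_seconds); may contain repeated
             submits for the same order (impatient re-clicks).
    results: dict mapping order_id -> timestamp_seconds when the result appeared.
    Orders with no result yet are left out.
    """
    first_submit = {}
    for order_id, t in submits:
        # Keep only the earliest submit for each order.
        if order_id not in first_submit or t < first_submit[order_id]:
            first_submit[order_id] = t
    return {
        order_id: results[order_id] - t0
        for order_id, t0 in first_submit.items()
        if order_id in results
    }


def within_target(elapsed, target_seconds=FIVE_MINUTES):
    """Map each order to True if it met the end-to-end target."""
    return {order_id: secs <= target_seconds for order_id, secs in elapsed.items()}


if __name__ == "__main__":
    submits = [("a", 0.0), ("a", 5.0), ("a", 9.0), ("b", 10.0)]
    results = {"a": 240.0, "b": 400.0}
    elapsed = end_to_end_seconds(submits, results)
    print(elapsed, within_target(elapsed))
```

Measuring from the first submit keeps the indicator tied to what the nurse experiences rather than to how any single internal component behaves.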

“The SLAs are not there to beat each other up. They are there to capture the mutual understanding. You reach that mutual understanding through negotiation,” Coulter said. “Negotiating is a key skill for any SRE person.”

AppDynamics and New Relic are sponsors of The New Stack.

Feature image by National Cancer Institute on Unsplash.

Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 30 years, including stints at IDG and Government Computer News. Before that, he...
TNS owner Insight Partners is an investor in: Honeycomb.io, Honeycomb.