URL: https://thenewstack.io/todays-site-reliability-engineers-are-more-healers-than-fire-fighters/

Today’s Site Reliability Engineers Are More Healers Than Fire Fighters

The type of role that best describes the emerging SRE was one of a number of assumptions examined during this lively discussion. The talk described common misperceptions about what SREs really do and how they can best support the organization.
Mar 2nd, 2021 9:02am by B. Cameron Gain
The image of the site reliability engineer (SRE) running around with a pager, always on call to put out proverbial data center fires in the middle of the night, is becoming a thing of the past. SREs’ roles vary with the organization’s size and what it does, and the super-charged, always-on-call image is not only inaccurate but also misleading about how SREs can best support DevOps teams.

Instead, the SRE’s role can and should be compared to that of a healer or medic, not to a firefighter or soldier, said noted systems engineer Alice Goldfuss during The New Stack’s recent “StackPulse Goes Live: How to Build an SRE Superhero” video streaming podcast.

“Especially in the ops tradition and the sysadmin tradition that then became the SRE tradition, there is a lot of that military hyper-masculine self-sacrificing that not only leads to burnout but also just crowds out a lot of people in the space,” she said.

“Whereas, I do believe the SRE should be seen as a very highly skilled support role like a medic, or like the healer on the team, so to speak,” she said.

To a great extent, the size of the organization will determine the SRE’s range of responsibilities. The SRE thus means “a lot of things outside of the official Google context, because the Google SRE only really works for Google-sized companies or perhaps just Google,” Goldfuss said.

“Whereas, sorry, in other companies of various sizes and technical shapes the SRE can mean someone who keeps the databases online, someone who builds reliability tooling for engineers and someone who manages the incidents,” Goldfuss said. “And yes, it can be a role of many hats, or what have you been working on under the SRE title.”

The Shift to a Blameless Culture

People don’t fail, teams do. In the past, some might have contended that because systems were designed not to fail, it was the SRE’s responsibility to ensure they never crashed, which made the SRE a lightning rod for blame when things went wrong. Today, as mentioned above, the SRE is seen more as the medic who helps teams fix things when they go wrong.

Today’s SRE works in what should be a “blameless culture,” knowing that “just everything that can fail will eventually fail,” whereas the SRE was previously expected to make sure everything “just stayed up,” said Or Elimelech, StackPulse site reliability engineer lead, during the discussion.

“In the previous generation of system admins, if something crashed, everyone was screaming — it was like a war,” Elimelech said. “You weren’t prepared for these scenarios before, because you’d expect that everything just stayed up.”

For SRE culture today, “we’re saying, ‘if we fail, let’s learn from it and make sure the mistakes don’t happen again,’” Elimelech said. This instilled culture helps to foster an environment in which “people learn,” instead of wrongly just blaming the SRE, Elimelech said.

StackPulse’s Or Elimelech: The SRE must create the infrastructure to create a common “language available for everyone” as a foundation “for the organization to rely upon.”

— The New Stack (@thenewstack) February 17, 2021

Building a Foundation for DevOps

The SRE’s support of the DevOps team can be compared to building a foundation, Elimelech said. Like the developers, Elimelech said he spends much of his time writing code, but his mission as the SRE is to write and deploy code that will help make the developer’s work life easier.

Instead of writing the business logic, Elimelech develops libraries and selects frameworks to help developers write more resilient code with “stricter APIs” so that the “developer experience is first,” Elimelech said. Comparing himself to a therapist in addition to a medic, he said the idea is to remove “everything off the developers’ desk to let them focus on the business logic.”

StackPulse’s Or Elimelech: A Zero Trust network shouldn’t require a new team member to “install something” before they “have one unified way of doing stuff.”

— The New Stack (@thenewstack) February 17, 2021

As an example, a developer should be able to just “type whatever they need in order to achieve something in the infrastructure in a sane way,” Elimelech said. Working in the background, the SRE supports this need by developing, for example, an automated process that configures a database “in a sanitized way, instead of giving a developer access to the production DB,” Elimelech said.
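
To make the idea of a sanitized, automated database request more concrete, here is a minimal sketch in Python. It is not StackPulse’s tooling or anything described in the discussion: the names `DatabaseRequest`, `ProvisioningPlan` and `plan_database`, the naming rule, the allowed environments and the MySQL-style statements are all assumptions chosen for illustration. The point is only that the developer supplies a small, validated request and never touches the production database directly.

```python
import re
import secrets
from dataclasses import dataclass

# Hypothetical naming rule: a lowercase letter, then 2-23 lowercase letters,
# digits or underscores. Rejecting everything else keeps the generated
# statements free of unexpected characters.
_NAME_RE = re.compile(r"[a-z][a-z0-9_]{2,23}")
# Hypothetical set of environments this sketch accepts.
_ALLOWED_ENVS = {"dev", "staging", "prod"}


@dataclass(frozen=True)
class DatabaseRequest:
    """What the developer asks for: a service name and an environment."""
    service: str
    environment: str


@dataclass(frozen=True)
class ProvisioningPlan:
    """What the automation would apply on the developer's behalf."""
    database: str
    username: str
    password: str       # generated here, never typed by a person
    statements: tuple   # MySQL-style statements, password left out


def plan_database(request: DatabaseRequest) -> ProvisioningPlan:
    """Validate a request and build a least-privilege provisioning plan."""
    if not _NAME_RE.fullmatch(request.service):
        raise ValueError(f"invalid service name: {request.service!r}")
    if request.environment not in _ALLOWED_ENVS:
        raise ValueError(f"unknown environment: {request.environment!r}")

    database = f"{request.service}_{request.environment}"
    username = f"{request.service}_app"
    password = secrets.token_urlsafe(24)

    # The application account gets data access only, no schema or admin rights.
    statements = (
        f"CREATE DATABASE {database};",
        f"CREATE USER '{username}'@'%' IDENTIFIED BY '<generated>';",
        f"GRANT SELECT, INSERT, UPDATE, DELETE ON {database}.* TO '{username}'@'%';",
    )
    return ProvisioningPlan(database, username, password, statements)
```

A developer-facing command could call `plan_database(DatabaseRequest("billing", "staging"))` and hand the resulting plan to whatever system actually applies it. In a real setup the generated password would presumably go to a secrets manager rather than back to the caller.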

Empathy First

It is easy to become enamored with a new technology without thinking through the consequences of what its adoption might mean for the end user, or in the case of the SRE, how a new tool or platform might impact the organization’s DevOps team members. At the same time, Goldfuss noted that it is easy to understand how, when “creating a logging system that only uses TCP packets, you could very well get back pressure on the system that takes something else down.”
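
The back pressure Goldfuss describes arises because a TCP sender can block when the receiver falls behind, so an application that writes logs synchronously can stall along with its log pipeline. One common way to contain that, sketched below in Python, is to put a bounded buffer between the application and the log shipper and to drop lines when the buffer is full. This is an illustration only, not something proposed in the discussion; the class name `DroppingLogBuffer` and its methods are hypothetical.

```python
import queue


class DroppingLogBuffer:
    """Bounded in-memory buffer placed between an application and a log shipper.

    When the buffer is full, new lines are dropped and counted instead of
    blocking the caller, so a slow log destination cannot stall the service.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            # queue.Queue treats maxsize <= 0 as unbounded, which would
            # defeat the purpose of this sketch.
            raise ValueError("capacity must be positive")
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def emit(self, line: str) -> bool:
        """Try to enqueue a log line without blocking; return True if kept."""
        try:
            self._queue.put_nowait(line)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, limit: int) -> list:
        """Remove and return up to `limit` buffered lines for the shipper."""
        batch = []
        while len(batch) < limit:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch
```

The trade-off is deliberate: losing some log lines during an overload is treated as preferable to taking down the service that produces them. Whether that trade is acceptable is exactly the kind of question that depends on the stakeholders involved.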

“But I also think empathy and how systems connections work are also needed for humans, and how humans interact with each other,” Goldfuss said. “And, in my experience, the best SREs are the ones who live in a world where they understand the consequences of their actions on others, and this is the person who when you’re shipping something they’re going to be saying: ‘not only how do these systems interact, but also who are the stakeholders, who do we need to talk to?’”

What are SRE metrics? It depends. Alice Goldfuss: “This is where empathy and communication come into play,” as you must sit down with your stakeholders and find out what actually matters for you.

— The New Stack (@thenewstack) February 17, 2021

New technologies can be highly beneficial, of course, but empathy needs to be the first consideration. “There are so many new technologies out there, and you’ll learn them and get there yourself but it really starts with empathy,” Goldfuss said.

Elimelech noted, for example, that his role as an SRE involves automating as much as possible to allow the software engineers to remain “focused on the product itself.” “You seek to remove obstacles that the developers might have,” he said, while extending that support to help a developer resolve an issue with a service, such as improving the infrastructure to better enable connections to a MySQL database.

“There’s no pressure on the developer to solve it on their own…so you can escalate stuff to the SRE team,”  Elimelech added. “Empathy, as Alice just mentioned, is everything and is a key point here.”

As a case in point, a question asked during the livestream was whether service mesh has become an essential SRE tool. While increasingly seen by many organizations as an essential way to manage Kubernetes environments, service mesh is not necessarily applicable to every SRE’s needs, nor to the needs of every organization.

Indeed, a service mesh can offer better performance, load balancing, testing “and stuff like that, and if that is something that you need and you can’t get it anywhere else from a more mature tool that has been out for years, then yes, do a service mesh,” Elimelech said. “Just make sure that you’re evaluating it based on what the tool is actually giving you and that it’s actually adding to the stability and usability of your infrastructure, rather than the fact that it’s shiny.”

The full recorded livestream, hosted by The New Stack Publisher and founder Alex Williams, can be enjoyed here:

StackPulse is a sponsor of The New Stack.

BC Gain is founder and principal analyst for ReveCom Media. His obsession with computers began when he hacked a Space Invaders console to play all day for 25 cents at the local video arcade in the early 1980s. He then...