
URL: https://thenewstack.io/the-role-of-site-reliability-engineering-in-microservices/

The Role of Site Reliability Engineering in Microservices - The New Stack



The Role of Site Reliability Engineering in Microservices

Apr 23rd, 2018 1:04pm by Alex Handy

You can always spot the hot jobs in technology: they’re the ones that didn’t exist 10 years ago. While Site Reliability Engineers (SREs) did exist a decade ago, they were mostly inside Google and a handful of other Valley innovators. Today, however, the SRE role exists everywhere, from Uber to Goldman Sachs. Everyone is now in the business of keeping their sites online and stable.

While SREs are hotshots in the industry, their role in a microservices environment is not a natural fit that goes hand in hand, like peanut butter and jelly. Instead, while SREs and microservices evolved in parallel inside the world’s software companies, the latter actually make life far more difficult for the former.

That’s because SREs live and die by their full-stack view of the entire system they are maintaining and optimizing. The role combines the skills of a developer with those of an admin, producing an employee capable of debugging applications in production environments when things go completely sideways.

As Google engineers essentially invented the role, the company offers a great deal of insight into how it manages systems that handle up to 100 billion requests a day. It treats reliability as an essential element, every bit as desirable as velocity and innovation.

“The initial step is taking seriously that reliability and manageability are important. People I talk to are spending a lot of time thinking about features and velocity, but they don’t spend time thinking about reliability as a feature,” said Todd Underwood, an SRE director at Google.

Underwood said reliability and availability should be considered at every level of a project. As an example, he cites the way Gmail fails by dropping back to a bare HTML experience, rather than by halting altogether. “I’ll take the ugly HTML [version], but I can read my email. Availability is a feature and the most important feature. If you’re not available, you don’t have users to evaluate your other characteristics. Organizations need to choose to prioritize reliability.”
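The fallback behavior Underwood describes can be sketched in a few lines. This is only an illustration of graceful degradation, not how Gmail is built: the function names `render_with_fallback`, `rich_inbox`, and `basic_html_inbox` are hypothetical, and the outage is simulated.

```python
from typing import Callable


def render_with_fallback(primary: Callable[[], str],
                         fallback: Callable[[], str]) -> str:
    """Try the full-featured renderer; if it fails, serve the degraded one."""
    try:
        return primary()
    except Exception:
        # Degrade instead of failing the whole request.
        return fallback()


def rich_inbox() -> str:
    # Hypothetical rich UI path; the failure of a dependency is simulated here.
    raise RuntimeError("script bundle service unavailable")


def basic_html_inbox() -> str:
    # Hypothetical bare HTML view that needs no extra dependencies.
    return "<html><body><ul><li>Message 1</li></ul></body></html>"


page = render_with_fallback(rich_inbox, basic_html_inbox)
print(page)
```

The user gets an uglier page, but the request still succeeds, which is the sense in which availability is treated as a feature.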

Underwood stipulated that every organization is different and that some of the issues Google encounters are not typical. But he did advocate for some more holistic practices.

“For distributed applications, we’re running some kind of Paxos consistent system. We have a whole chapter on distributed consensus. It seems like a computer science, nerdy thing, but really if you want to have processes and know which ones are where, it’s not possible without Paxos in place,” said Underwood. Paxos is an algorithm for distributed consensus, often used to get the nodes of a distributed system to agree on a single value despite the inconsistencies that can arise between them.
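For readers unfamiliar with Paxos, the following is a heavily simplified, in-memory sketch of single-decree Paxos. It assumes no networking, no failures, and no retries, and says nothing about how Google's systems are implemented. The class and function names (`Acceptor`, `propose`) and the example values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                          # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None  # (number, value) last accepted

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one proposal round; return the chosen value, or None if rejected."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    granted = [prev for ok, prev in replies if ok]
    if len(granted) < majority:
        return None
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior = [p for p in granted if p is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, n=1, value="leader=node-a")
second = propose(acceptors, n=2, value="leader=node-b")
stale = propose(acceptors, n=1, value="leader=node-c")
print(first, second, stale)
```

Once a value has been chosen, a later proposer with a different value ends up re-proposing the chosen one, and a proposer with a stale number is rejected. That is the property that lets processes agree on "which ones are where."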

Underwood also highlights another essential aspect of the SRE job: visibility. When microservices are throwing billions of packets across constantly changing ecosystems of cloud-based servers, containers, and databases, finding out what went wrong where is essential to troubleshooting any type of problem. This is where the full-stack aspects of an SRE’s job come into play.

Google recently introduced a number of tools just for this type of work.

“The whole market over the last few years has been shifting very deliberately towards microservices. We see this with Kubernetes and Istio, and the general move to the cloud from the data center. There are some challenges along the way. If you have 100 containers, things like doing a stack trace on a monolith become very difficult. You need a distributed trace,” said Morgan McLean, product manager on Google Cloud Platform.


To remedy this, Google recently released Stackdriver Trace, Stackdriver Debugger, and Stackdriver Profiler. There’s a reason these tools sound like old-school testing and operations tools from traditional enterprise vendors: they perform the more traditional troubleshooting tasks developers and operations people are used to, but with a focus on microservices and performing these duties in the cloud.

Stackdriver Profiler, which is in beta, allows for direct CPU utilization monitoring on applications running in the cloud. Stackdriver Debugger offers a way to essentially insert breakpoints into cloud-based, microservices-based applications, and Stackdriver Trace offers the full-stack tracing capabilities McLean alluded to.
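The core idea behind a distributed trace can be sketched without any vendor tooling: every hop in a request carries the same trace ID, and each hop records a timed span. The example below is a single-process illustration under that assumption and does not use the Stackdriver Trace API. The names `Span`, `traced_call`, `COLLECTED`, and the three services are hypothetical.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Span:
    trace_id: str
    name: str
    parent: Optional[str]
    duration_ms: float = 0.0


COLLECTED: List[Span] = []  # stand-in for a trace backend


def traced_call(name: str, trace_id: str, parent: Optional[str],
                work: Callable[[str, str], object]):
    """Run `work`, record a span, and pass the trace context downstream."""
    span = Span(trace_id, name, parent)
    start = time.perf_counter()
    try:
        # The callee receives the trace ID and this span's name as its parent.
        return work(trace_id, name)
    finally:
        span.duration_ms = (time.perf_counter() - start) * 1000
        COLLECTED.append(span)


def inventory(trace_id: str, parent: str) -> str:
    return "in stock"


def pricing(trace_id: str, parent: str) -> float:
    return 9.99


def checkout(trace_id: str, parent: str):
    # Each downstream call reuses the same trace ID.
    stock = traced_call("inventory", trace_id, parent, inventory)
    price = traced_call("pricing", trace_id, parent, pricing)
    return stock, price


trace_id = uuid.uuid4().hex
result = traced_call("checkout", trace_id, None, checkout)
for span in COLLECTED:
    print(span.trace_id[:8], span.name, span.parent, f"{span.duration_ms:.3f} ms")
```

In a real deployment the trace ID would travel in request headers between separate processes; here it is passed as a function argument to keep the sketch self-contained.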

“This is really powerful for general performance improvements and powerful for cost reduction,” said McLean of Stackdriver Profiler. “Snapchat tried it out, and within a day of collecting data they realized a very small piece of code, I think it was a regular expression, which should not have even shown up in Profiler, was actually consuming a fairly large amount of CPU. This could happen to anyone. It happens to Google. The Snapchat demonstration was just a really great demonstration of the power of this profiling technology.”

“Without tools like this, this generally isn’t possible. Tracing was becoming a common industry practice. Profiling and production debugging are a little more unique in our offering,” said McLean.
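The kind of finding McLean describes, a small regular expression consuming unexpected CPU, can be reproduced locally with Python's built-in `cProfile`. This is a stand-in for illustration: `cProfile` is a deterministic local profiler, not the production profiler discussed above, and the functions `slow_tag_cleanup` and `handle_request` are hypothetical.

```python
import cProfile
import pstats
import re


def slow_tag_cleanup(lines):
    # Hypothetical hot spot: a regular expression applied to every line.
    return [re.sub(r"\s+", " ", line).strip() for line in lines]


def handle_request(lines):
    cleaned = slow_tag_cleanup(lines)
    return len(cleaned)


lines = ["  some   log   line  "] * 20000

profiler = cProfile.Profile()
profiler.enable()
count = handle_request(lines)
profiler.disable()

# Sort by cumulative time so the costly call path rises to the top.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

The report lists `slow_tag_cleanup` and the underlying `re.sub` calls along with their call counts and cumulative time, which is the signal that would prompt a closer look at the expression.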

New Thinking

The focus on new-style tooling is shared by Matt Chotin, senior director of technical evangelism at AppDynamics. He said that teams need to rethink the way they determine the health of entire applications once they have been moved from monolith to microservices.

“You have a myriad of systems. The joy of microservices is that you get to pick the stack that’s right for a particular piece. Each thing might have its own way of monitoring, its own metrics, etc. To understand the health of your entire application and see how a transaction is going to flow through all these different microservices, you have to have a system that is going to help you navigate that. You want something that is going to think in terms of the transaction,” Chotin said.

The engineer shouldn’t think in terms of whether the service is up or down, Chotin said. “Your DevOps team cares about looking at a service to know general availability, but as far as whether or not you are serving the business correctly, you need monitoring that can traverse the entire ecosystem, from application code to infrastructure code.”
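The difference between service-level and transaction-level health can be made concrete with a small sketch. It assumes each transaction is recorded as a list of hops with a status and latency. The names `Hop` and `transaction_health`, the latency budget, and the sample data are all made up for illustration and are not taken from any AppDynamics product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hop:
    service: str
    ok: bool
    latency_ms: float


def transaction_health(transactions: Dict[str, List[Hop]],
                       budget_ms: float) -> Dict[str, dict]:
    """Judge each transaction end to end rather than each service alone."""
    report = {}
    for txn_id, hops in transactions.items():
        total = sum(h.latency_ms for h in hops)
        failed = [h.service for h in hops if not h.ok]
        report[txn_id] = {
            "healthy": not failed and total <= budget_ms,
            "latency_ms": total,
            "failed_services": failed,
        }
    return report


# Made-up sample data: every service in txn-2 is "up", yet the
# transaction as a whole blows its latency budget.
transactions = {
    "txn-1": [Hop("cart", True, 40.0), Hop("payment", True, 90.0)],
    "txn-2": [Hop("cart", True, 120.0), Hop("payment", True, 450.0),
              Hop("email", True, 80.0)],
    "txn-3": [Hop("cart", True, 30.0), Hop("payment", False, 20.0)],
}

report = transaction_health(transactions, budget_ms=500.0)
for txn_id, summary in report.items():
    print(txn_id, summary)
```

A per-service up/down check would report nothing wrong for `txn-2`, while the transaction view flags it as unhealthy.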

Google’s Underwood said that the overall goal for SREs inside the company is to limit the growth of their own ranks while enabling Google’s growth. As Underwood puts it, “It’s super important for us that SREs grow sublinearly with Google. We’d like to continue to get more efficient.”

To that end, he said, Google SREs focus on their own applications. “We focus on a deep level on the specific services we work on. Teams that work on Google Docs, teams that work on ad serving; each team focused at a very high level of detail on those services. At the same time, we have SRE teams that build common infrastructure used across all the SRE teams.”

AppDynamics and Google are sponsors of The New Stack.

Feature image by Eberhard Grossgasteiger on Unsplash.

A 20-year veteran technology journalist, Alex Handy cut his teeth covering the launch of the first iMac. His work has appeared in Wired, the Atlanta Journal Constitution and The Austin American Statesman. He is also the founder and director...