URL: https://thenewstack.io/practical-guidance-for-first-time-site-reliability-engineers/

Practical Guidance for First-Time Site Reliability Engineers - The New Stack



Practical Guidance for First-Time Site Reliability Engineers

Here are a few strategies that might help you build up context, find the problems that really matter and turn these into a plan of action.
Aug 24th, 2023 8:57am by Ben Wheatley
Image from EtiAmmos on Shutterstock.
Incident.io sponsored this post. Insight Partners is an investor in Incident.io and TNS.

At the beginning of May, I joined incident.io as the first site reliability engineer (SRE), a very exciting but slightly daunting move.

With only some high-level knowledge of what the company and its systems looked like prior to this point, it’s fair to say that I didn’t have much certainty about exactly what I’d be working on or how I’d deliver it.

After a day or two of settling in, my mission became clear: build a roadmap for our infrastructure, and then set out to deliver it.

At this company, reliability is something that’s valued even more than at most others; as providers of the tooling you depend on to pick up when your own systems are broken, we become a critical dependency and need to have our product available whenever you need us.

This helps set some initial context for what might be going into the roadmap: an emphasis on availability and reliability.

But if you find yourself in this kind of position, how do you produce a roadmap when you’re starting with zero context?

Here are my tips and advice for breaking down this problem.

Getting Started

Starting with a blank space where you’d usually expect a roadmap is a different challenge when you’ve come from organizations that spent many years considering these kinds of topics, but it’s not an insurmountable one.

Here are a few strategies that might help you build up context, find the problems that really matter and turn these into a plan of action.

Get Well Acquainted with Infrastructure and the Code, Too

Before diving in and making any changes, it’s obviously pretty vital to get a good feeling for what the current setup is within an organization.

Part of an SRE’s job, especially within a smaller team, is to enable and accelerate engineers, so you’ll need to both build empathy around the daily processes that engineers go through and have a sufficient technical understanding to make small changes to the product when required.

Definitely spend some time in “product land.” You should have a fully functioning development environment, a good understanding of the structure of the primary codebase and be able to put this into practice by picking up some smaller changes and delivering them from start to finish.

If your organization has a model like our Product Responders, pairing with them will give you a feel for some of the gnarlier issues.

Going through this process and stepping into the shoes of a product engineer should present some great opportunities to learn not just about the core product, but also about the details of deployments, observability tooling and data stores.

Talk with as Many People as You Can

Now this may sound obvious, but learning as much context as possible from those who’ve been living and breathing the systems you’re taking on responsibility for is going to be crucial.

Whenever you’re not working on onboarding or building up technical knowledge, try to fill the gaps with chats over coffee, going outside for a walk or grabbing lunch together. Beyond building relationships, which is important in itself, this is a great opportunity to find out about current pain points, tools they wish they had and any projects that had been deferred “until we have an SRE.”

It’s especially important not to limit yourself to the grizzled veteran engineers who’ve seen it all. Also talk to the new joiners who may have useful viewpoints, the product and engineering managers who get a good aggregate view from the engineers they work with, and leadership, who will have useful input on longer-term vision and vendor relationships.

Keep a Finger on the Pulse of Your Customers

Keep an eye out for whenever your organization’s customers get in touch about issues that relate to infrastructure or shared concerns, whether that’s through asking the customer support team to keep you in the loop or monitoring your internal incidents channel. If the opportunity is there, you should join the discussion and talk with the customer directly, which will allow you to dig into the details further.

These types of interactions may be less frequent than others, but they are very valuable, as they give you an insight into what customers value (such as latency vs. availability) and how they interact with the product, and allow you to start understanding what kinds of trade-offs you can make further down the line.

Don’t Limit Yourself to Just Your Peers

As an SRE at an early-stage company, there’s a good chance that you’ll need to bring in new platforms, tools and processes. There’s also a reasonable chance that these will look different from where you’ve worked previously. Perhaps the business needs are different, or it’s just that industry trends have evolved beyond the systems you’ve worked with previously.

As you start to build up ideas for what kinds of changes you’d like to make, you might find that being the sole SRE makes it tricky to know if you’re on the right track. It’s really useful to validate ideas like this with people outside of your organization too.

Is the hot, new container technology you’re looking at not as good as it’s cracked up to be?

Perhaps contacts at similar-sized companies will have some insight. Similarly, if your company has existing relationships with platform vendors, then lean on them by sending your thoughts and proposals over to the account manager. They may be able to tell you whether you’re following best practices and what their recommendations are based on similar-sized orgs.

Pulling It All Together

If you’ve followed some of the suggestions above, then you should now have a good feeling about the issues and missing building blocks within the organization and be able to make some informed decisions about the next steps.

You’ll also have a lot of diverse inputs from different stakeholders, so you’ll need to distill all of this into a sensible roadmap. The key to this will be picking out common themes in the information by applying some grouping, but after that, you’ll need to make some calls about how to tackle the problems.

The insights you’ve gained into the business, customers, engineer workflows and the product should help you out here. Be careful not to plan too far into the future — you should focus on the problems that are causing pain right now, and then revisit them in a few months’ time, rather than trying to set out a multiyear strategy from the beginning.
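
As a rough illustration of the grouping step, here is a minimal Python sketch. It is not how the roadmap in this article was actually built. The `Note` record, the `group_by_theme` helper, the theme labels and the sample notes are all hypothetical, made up for this example. The idea is simply to tag each piece of feedback with a theme by hand, then surface the themes that were raised by the widest range of people.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    """One piece of feedback gathered while building context (hypothetical shape)."""
    source: str   # who raised it, e.g. "engineer", "customer", "manager"
    theme: str    # grouping label assigned by hand
    problem: str  # short description of the pain point


def group_by_theme(notes):
    """Group notes by theme, ordering themes by how many distinct sources raised them."""
    grouped = defaultdict(list)
    for note in notes:
        grouped[note.theme].append(note)
    # More distinct sources sorts first; ties are broken alphabetically by theme.
    return dict(
        sorted(
            grouped.items(),
            key=lambda item: (-len({n.source for n in item[1]}), item[0]),
        )
    )


# Made-up sample data, for illustration only.
notes = [
    Note("engineer", "compute", "hard to control how deploys cut over"),
    Note("customer", "compute", "occasional failed HTTP requests"),
    Note("engineer", "database", "major version upgrades need downtime"),
    Note("manager", "observability", "no application-level metrics"),
    Note("engineer", "observability", "logs are hard to search"),
    Note("engineer", "compute", "no container CPU or memory graphs"),
]

for theme, items in group_by_theme(notes).items():
    sources = sorted({n.source for n in items})
    print(f"{theme}: {len(items)} notes from {', '.join(sources)}")
```

Counting distinct sources rather than raw mentions is one possible design choice: it stops a single vocal stakeholder from dominating the ordering. The judgment calls about what to tackle first still have to be made by hand.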

To make this all more concrete, here’s the summarized roadmap that I created.

  1. Theme: Compute — How we deploy and run our core codebase
    • Better control of deployments and how we cut over to new versions.
    • Improving reliability and performance of routing HTTP requests to our application code.
    • Improving observability around the system-level metrics of our containers (such as CPU, memory, open file handles).
  2. Theme: Database — How we store the data that powers our product
    • Gaining the ability to do PostgreSQL major version upgrades with minimal downtime.
    • Improving observability about what’s happening within the database.
  3. Theme: Observability — How we monitor the health of our product and systems
    • Gaining the ability to capture application-level metrics. We work with logs and traces, but we’d like to instrument the application further.
    • Improving how we store logs and how they can be used effectively.
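
To give a feel for what “application-level metrics” in the observability theme could mean in practice, here is a small sketch using only the Python standard library. It is an assumption-laden illustration, not incident.io’s implementation or stack: the `Metrics` class, the metric names and the `handle_request` function are all hypothetical. A real setup would more likely use a dedicated metrics library and export the values to a monitoring system.

```python
import time
from collections import Counter
from contextlib import contextmanager


class Metrics:
    """Tiny in-memory metrics registry (hypothetical, for illustration only)."""

    def __init__(self):
        self.counters = Counter()
        self.timings = {}  # metric name -> list of durations in seconds

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped code returns early or raises.
            elapsed = time.perf_counter() - start
            self.timings.setdefault(name, []).append(elapsed)


metrics = Metrics()


def handle_request(path):
    """Hypothetical request handler that counts calls and errors and times itself."""
    with metrics.timer("request_duration_seconds"):
        metrics.increment("requests_total")
        if path == "/missing":
            metrics.increment("request_errors_total")
            return 404
        return 200


for path in ["/", "/incidents", "/missing"]:
    handle_request(path)

print(dict(metrics.counters))
print(len(metrics.timings["request_duration_seconds"]), "timings recorded")
```

Counters and timings like these answer questions that logs and traces make awkward, such as “what share of requests failed in the last hour?”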

Distilling all of this into a document and sharing it with key stakeholders should give you the buy-in to go after the problems you need to tackle.

Ben Wheatley is a site reliability engineer at incident.io. He's worked on building out internal platforms and highly available infrastructure for the last six years, previously having come from a background in backend software engineering.