
URL: https://thenewstack.io/site-reliability-engineering-and-the-art-of-improvisation/

Site Reliability Engineering and the Art of Improvisation
DevOps / Security / Tech Culture

Site reliability engineering depends on orchestration and improvisation. Develop your SRE by trusting your instincts and jamming.
Dec 16th, 2021 7:33am by Matt Davis
Featured image via Pixabay
Blameless sponsored this post.
Matt Davis
Matt is a senior infrastructure engineer at Blameless. His expertise includes data center operations, storage hardware and distributed databases, IT security, site reliability, support services, observability systems and TechOps leadership. Having degrees in music performance and composition, he has a passion for exploring the relationships between the artistic mind and operating distributed software architectures.

Site reliability engineering (SRE) depends on orchestration and improvisation. To develop a great SRE practice means a deep understanding of the technical infrastructure, but also the confidence to trust your instincts and just start jamming.

I run a weekly continuous learning session at Blameless that takes its title from the traditional Indonesian orchestra: the Gamelan (pronounced “gah-meh-lahn”). This orchestra is mostly percussion: lots of tuned gongs and mallet instruments, a few stringed and wind instruments, and usually a male or female singer, all joining together through rhythm and extemporaneous songwriting.

You see, a key element of gamelan is that the music is written by the group as it is practiced, with the belief that music should grow and change. As they meet time and time again, members evolve new versions of the songs every time they play. Practices begin to look like performances and vice versa. It’s a lot like improvising jazz, where the gig is merely another time to get together and play.

Some of the ways we transform and evolve our understanding at these sessions include:

  • Walkthroughs of observability toolsets, a.k.a. “Morning Vistas”: What do you observe when you open the laptop to start the day and look across the operational landscape? This provides fresh perspectives on how our colleagues approach their regular work.
  • Building decision requirements tables, for instance for the most difficult decisions faced during on-call or live maintenance of our Kubernetes clusters. These help us think about how we can make improvements to support responders making decisions under duress.
  • Team knowledge elicitation, like deeper views into NGINX Ingress logging or attempting a dependency matrix for our critical path. It’s very useful for squeezing some of that juicy knowledge out of our experts’ brains.
  • Asking the question, “Why do we have on-call?” to share mental models of how different people at the company view and engage with it. We learn about each other’s expectations and how we might alleviate the fears of being on call for the first time.
  • Spin the Wheel of Expertise! a.k.a. “Who? What? Where?” Here we explore our technology stack and services through gameplay, asking each person to spin the wheel and then show us firsthand how they would come up with the answer, or how they would escalate if they simply didn’t know.
Spin the Wheel of Expertise
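
For readers who want to try the game, the wheel can be approximated in a few lines of Python. This is only a minimal sketch, not the tool used at Blameless: the function name `spin_wheel`, the participant names and the topic list are all assumptions made for illustration, with topics loosely borrowed from subjects mentioned in this article.

```python
import random

# Hypothetical wheel segments. The article does not list the real ones;
# these are loosely based on subjects it mentions.
TOPICS = [
    "NGINX Ingress logging",
    "Kubernetes cluster maintenance",
    "Root certificate authorities and SSL/TLS",
    "Critical path dependencies",
]


def spin_wheel(participants, topics, rng=None):
    """Give each participant one randomly chosen topic to walk through."""
    rng = rng or random.Random()
    # Each spin is independent, so two people may land on the same topic.
    return {person: rng.choice(topics) for person in participants}


if __name__ == "__main__":
    # Placeholder names, not real attendees.
    for person, topic in spin_wheel(["Ana", "Ben", "Chidi"], TOPICS).items():
        print(f"{person}: show us how you would investigate '{topic}'")
```

Spins are independent on purpose: watching two people approach the same topic differently is part of what the session is for.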

What we’ve created at Blameless is an opportunity for learning: a time to come together collaboratively, share mental models and tell stories about different areas of the system in a safe, low-pressure setting so we can carry learning forward. This way, incidents become merely another time to apply our powers of intuition, because we’ve already put techniques for addressing them into practice. We call this “The Practice of Practice”: the experience we absorb when we actually do our craft, whether in improvisation, production or incidents.

My motto has sometimes been that it doesn’t much matter what we do together as long as we’re doing it together. Regardless of attendance, the discussions always dive into shared perspectives and allow participants a safe space to explore things without fear of the judgment or anxiety associated with an incident. It is impossible for any single person to know the full complexity of networked software, so it becomes critical to know where to find expertise and how to learn from doing instead of trying to follow prescriptions or hastily reviewed runbooks.

One of my favorite things about running these opportunities for learning is seeing the participants employ aspects of their regular work while we answer questions or explore one UI or another. This allows others to peek into their coworkers’ mental models. What might seem like mundane, ordinary tasks to one may illuminate an understanding for another, or even lead others to embellish their own patterns and work style.

Socio-Technical Praxis

Our themes and agendas are somewhat loose but usually planned so we’re not just staring at each other. Nevertheless, sometimes we are required to adapt. One session fell on the same day as a large vendor outage that kept us from using the portion of our own UI we needed for that day’s game. So we pivoted, and it became a session with two of our experts on the subject behind the vendor outage, which in this case was root certificate authorities and the SSL/TLS protocol.

Although there is an emphasis on the operational parts of our complex system, the participants are far from just infrastructure engineers and SREs. We have sessions including people from technical writing, software development, customer service, strategy, marketing and even management. We make the calendar invite optional and companywide, and we do not call it a meeting: It’s a session, where we can share stories and have fun in a live setting.

A session with members of various teams

In all these activities we seek to open doors that people might be afraid to go through, learning by experiencing how our peers answer questions about a service or technology. We pick up on the patterns and praxis of others, and this enriches our own set of intuitive responses, creating new pathways and new connections in our own mental models. The result is a richer view of the system and the foundation to be adaptive when responding to incidents.

Build to Adapt

In the grand socio-technical scheme of things, “the Practice of Practice” enables us to build upon the resilience that blossoms like the harmony of well-practiced jazz musicians. The magic and excitement found in discovery is food for our brains. Our synapses hunger for enriching pattern recognition, combining new experiences with old ones and other mental models to form new ones.

The superhero-like power of instantly pulling solutions out of seemingly nowhere has its origins in bringing our practiced scales, melodies, theories, rhythms and other patterns together with inspiring combinations.

Instead of suffering the stressful common-ground breakdowns during incidents that translate to a poor customer experience, we seek new ways to choreograph our socio-technical systems more confidently. We see as an organization that there is power in this kind of collaboration; participants have praised these sessions as some of the best on-the-job learning they’ve ever done.

Blameless drives reliability across the entire software lifecycle by operationalizing Site Reliability Engineering (SRE) practices. Teams better command and communicate during incidents, resolve faster, and continuously improve. Engineers stay resilient and customers stay protected.
So it is true that a firmer handle on how to cope with ambiguity, rather than eschew it, comes directly from knowing how to do our jobs better at the sharp end. But we’re not in it alone. We do this by drawing on our rich network of humans in collaborative joint activity, recognizing how our regular work interrelates and feeds into the very complexity we seek to understand.

It’s not a whole lot different from the way musicians influence and support each other through their playing. Imagine how extremely uncomfortable events can be lightened by an unassuming session on what choices you have when your very reliable servers go down. Incidents are unplanned and can thus be intimidating, but the team has got your back. This is a situation you have all practiced, so it’s just another time you’re getting together to make music.

TNS owner Insight Partners is an investor in: Pragma.