
URL: https://thenewstack.io/usenix-continuous-integration-is-just-sre-alerting-shifted-left/

Usenix: Continuous Integration Is Just SRE Alerting 'Shifted Left' - The New Stack



Usenix: Continuous Integration Is Just SRE Alerting ‘Shifted Left’

What works for Continuous Integration can work for alerting, and what doesn’t work for Continuous Integration might not work for alerting, and vice versa.
Apr 10th, 2023 10:00am by Jessica Wachtel
Feature Image by Jan Vašek from Pixabay.

Should Site Reliability Engineering (SRE) alerts be “shifted left” into the Continuous Integration (CI) stage of software delivery, that is, before the software is even deployed?

A recent Usenix opinion piece, “CI is Alerting,” written by Titus Winters, Principal Software Engineer at Google, explains how this potential practice could be useful.

As Winters points out, CI systems automate the build-and-test routine: build the code and run the tests as often as is reasonable. Bringing SRE alerting practices to CI should be possible, provided CI alerts are treated the same way and held to the same criteria as production alerts. That means CI shouldn’t be expected to have a 100% passing rate, and it should get an error budget. Brittleness is a leading cause of both non-actionable alerts and flaky tests, and it can be reduced by investing in more expressive, higher-level infrastructure.

Although CI and alerting are typically owned by different groups, the article argues that they serve the same purpose and at times even use the same data. Running CI on large-scale integration tests is the equivalent of a canary deployment, and with high-fidelity test data, the large-scale integration test failures reported in staging are basically the same failures that production alerts would surface.

The parallels have a purpose: what works for CI can work for alerting, and what doesn’t work for CI might not work for alerting, and vice versa. This sets up brittleness as a problem the two share.

“Given the higher stakes involved, it’s perhaps unsurprising that SRE has put a lot of thought into best practices surrounding monitoring and alerting, while CI has traditionally been viewed as a bit more of a luxury feature,” Winters writes. “For the next few years, the task will be to see where existing SRE practice can be reconceptualized in a CI context to help explain testing and CI.”

How Alerting Is Like CI

Here’s a production alert:

Engineer 1: “We got a 2% bump in retries in the past hour, which put us over the alerting threshold for retries per day.”

Engineer 2: “Is the system suffering as a result? Are users noticing increased latency or increased failed requests?”

Engineer 1: “No.”

Engineer 2: “Then … ignore the alert I guess. Or update the failure threshold.”

The alerting threshold is brittle, but it didn’t come out of thin air. Even if there is no fundamental truth to the specific threshold, it is correlated with what actually matters: degradation in service.
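The difference between the two kinds of alert can be sketched in a few lines of Python. This is only an illustration of the exchange above, not anything from Winters’ article: the `WindowStats` shape, the function names and every threshold value are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregate stats for one monitoring window (hypothetical shape)."""
    requests: int
    retries: int
    failed_requests: int
    p99_latency_ms: float


def brittle_alert(stats: WindowStats, max_retry_ratio: float = 0.05) -> bool:
    """Fires on a proxy signal: the share of requests that were retried."""
    if stats.requests == 0:
        return False
    return stats.retries / stats.requests > max_retry_ratio


def symptom_alert(
    stats: WindowStats,
    max_failure_ratio: float = 0.001,  # assumed tolerance for failed requests
    max_p99_ms: float = 500.0,         # assumed latency users would notice
) -> bool:
    """Fires only on what users would notice: failures or high latency."""
    if stats.requests == 0:
        return False
    failure_ratio = stats.failed_requests / stats.requests
    return failure_ratio > max_failure_ratio or stats.p99_latency_ms > max_p99_ms


# Retries are up, but requests still succeed and latency is normal.
window = WindowStats(
    requests=100_000, retries=7_000, failed_requests=20, p99_latency_ms=310.0
)
retry_alert_fires = brittle_alert(window)
user_impact_alert_fires = symptom_alert(window)
```

For this window the retry alert fires while the user-impact alert stays quiet, which is the situation the two engineers are describing.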

Here’s a unit test failure:

Engineer 1: “We got a test failure from our CI system. The image renderer test is failing after someone upgraded the JPEG compressor library.”

Engineer 2: “How is the test failing?”

Engineer 1: “Looks like we get a different sequence of bytes out of the compressor than we did previously.”

Engineer 2: “Do they render the same?”

Engineer 1: “Basically.”

Engineer 2: “Then … ignore the alert I guess. Or update the test.”

As with the alert, the test failed on a criterion that doesn’t really matter. The specific sequence of bytes is irrelevant as long as the bitmap produced by decoding it as a JPEG is well-encoded and visually similar.

This happens when there isn’t enough high-level expressive infrastructure to easily assert the condition that actually matters. The next best thing is to test or monitor for the easy-to-express-but-brittle condition.
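A small Python sketch of the two kinds of assertion follows. It is an illustration under stated assumptions, not a real image test: `zlib` stands in for the JPEG compressor, the “upgrade” is simulated by dropping the lowest bit of each pixel, and `roughly_same` with its tolerance is a hypothetical helper.

```python
import zlib


def encode_v1(pixels: list[int]) -> bytes:
    """Stand-in for the old compressor (zlib here, not a real JPEG codec)."""
    return zlib.compress(bytes(pixels), 6)


def encode_v2(pixels: list[int]) -> bytes:
    """Stand-in for the upgraded compressor: slightly lossy, different bytes."""
    quantized = bytes(p & 0xFE for p in pixels)  # drop the lowest bit
    return zlib.compress(quantized, 9)


def decode(blob: bytes) -> list[int]:
    """Decode a compressed blob back into pixel values."""
    return list(zlib.decompress(blob))


def roughly_same(a: list[int], b: list[int], max_mean_abs_diff: float = 2.0) -> bool:
    """High-level check: same size and only a small average pixel difference."""
    if len(a) != len(b):
        return False
    if not a:
        return True
    mean_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return mean_diff <= max_mean_abs_diff


pixels = [(i * 7) % 256 for i in range(64 * 64)]  # synthetic grayscale "image"
old_blob = encode_v1(pixels)
new_blob = encode_v2(pixels)

bytes_match = old_blob == new_blob                               # brittle check
images_match = roughly_same(decode(old_blob), decode(new_blob))  # what matters
```

Here the byte comparison fails after the simulated upgrade while the decoded images still pass the similarity check. A real test would need an actual image library and a perceptual metric suited to the product.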

The Trouble with Brittleness

When an end-to-end probe isn’t available but aggregate statistics are easy to collect, teams are likely to write threshold alerts on arbitrary statistics. In lieu of a high-level way to say, “Fail the test if the decoded image isn’t roughly the same as this decoded image,” teams will test byte streams. Such brittleness reduces the value of testing and alerting by triggering false positives, but it also serves as a clear indication of where it may be valuable to invest in higher-level design.

Just because brittle isn’t best doesn’t mean brittle is bad. These tests and alerts still point to something that might be actionable, and the data surrounding an alert offers clues about how important it is. This is why Winters considers flaky tests more harmful than non-actionable alerts: there is usually less data in the testing environment to show whether a test failed because of a problem in the software or a problem in the test.

What Is the Pathway Forward?

Treat every alert with the priority it deserves rather than always being on high alert. Consider giving CI the flexibility of an error budget, rather than reserving error budgets for alerting and insisting on absolutes in CI. Winters views the insistence on absolutes as a narrow perspective and recommends refining objectives and adding an error budget for CI, because a 100% passing rate on CI is just like 100% uptime: awfully expensive.
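As a rough sketch of what an error budget for CI could look like, the Python below measures recent runs against a passing-rate objective. The 95% objective, the `BudgetReport` fields and the function name are assumptions made for illustration; the article does not prescribe a specific formula.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    failure_rate: float     # share of recent CI runs that failed
    budget: float           # failure share the objective tolerates
    budget_consumed: float  # failure_rate / budget (1.0 means fully spent)
    within_budget: bool


def ci_error_budget(
    results: list[bool], pass_rate_objective: float = 0.95
) -> BudgetReport:
    """Compare recent CI results (True = pass) with a passing-rate objective."""
    if not results:
        raise ValueError("need at least one CI run")
    if not 0.0 < pass_rate_objective < 1.0:
        raise ValueError("objective must be strictly between 0 and 1")
    failure_rate = results.count(False) / len(results)
    budget = 1.0 - pass_rate_objective
    return BudgetReport(
        failure_rate=failure_rate,
        budget=budget,
        budget_consumed=failure_rate / budget,
        within_budget=failure_rate <= budget,
    )


# 200 recent runs with 6 failures, measured against a 95% objective.
recent_runs = [True] * 194 + [False] * 6
report = ci_error_budget(recent_runs)
```

With these numbers the team has used a little over half of its budget and is still within the objective. Note that an objective of exactly 100% is rejected here, since it leaves no budget to spend.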

Some other lessons learned:

Treating every alert as an equal cause for alarm isn’t generally the right approach. This is an “alarm snooze” situation: the alarm matters, but if it isn’t incredibly impactful, it’s OK to keep moving. That doesn’t mean the alarm should be thrown out the window because tomorrow is a new day.

Reconsider policies where no commits can be made unless all CI results are green. Don’t throw out the alarm: if CI is reporting an issue, investigate. If the root cause is well understood and won’t affect production, then blocking commits quite possibly isn’t the best pathway forward and could be problematic in the long run.
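One way to picture such a policy is a gate that blocks commits only for failures that are unexplained or could reach production. The sketch below is hypothetical: the `Failure` fields and the test names are invented for the example, and deciding whether a root cause is understood remains a human judgment made after investigating.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """One failing CI result after triage (hypothetical fields)."""
    test_name: str
    root_cause_understood: bool
    affects_production: bool


def should_block_commits(failures: list[Failure]) -> bool:
    """Block only on failures that are unexplained or could reach production."""
    return any(
        not f.root_cause_understood or f.affects_production for f in failures
    )


# An investigated failure with a known, harmless cause does not block commits.
triaged = [
    Failure(
        "image_renderer_bytes_test",
        root_cause_understood=True,
        affects_production=False,
    ),
]
# A failure nobody has explained yet still blocks.
untriaged = triaged + [
    Failure(
        "checkout_flow_test",
        root_cause_understood=False,
        affects_production=False,
    ),
]

block_on_triaged = should_block_commits(triaged)
block_on_untriaged = should_block_commits(untriaged)
```

The gate never ignores a red result by default. A failure stops blocking only once someone has investigated it and recorded that it is understood and harmless.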

This is a novel idea, and Winters says he’s “still figuring out how to fully draw parallels.” For the next few years, “the task will be to see where existing SRE practice can be reconceptualized in a CI context to help explain testing and CI.” He also looks to best practices in testing to help clarify goals and policies for monitoring and alerting.

Jessica Wachtel is a developer marketing writer at InfluxData where she creates content that helps make the world of time series data more understandable and accessible. Jessica has a background in software development and technical journalism.