
URL: https://thenewstack.io/the-resilience-roundtable-a-discussion-about-chaos-engineering-and-more/

The Resilience Roundtable: A Discussion About Chaos Engineering and More - The New Stack



The Resilience Roundtable: A Discussion About Chaos Engineering and More

Key takeaways from The Resilience Roundtable. Reliable technology plays a critical role in helping maintain normalcy and connectedness.
Jan 21st, 2021 3:00am by Adam LaGreca
Gremlin sponsored this post.

Adam LaGreca
Adam LaGreca is the founder of 10KMedia, a boutique PR agency for B2B DevOps. Previously, he was the Director of Communications for DigitalOcean, Datadog, and Gremlin.

2020 was an interesting year… to say the least. The pandemic changed our lives in ways that will outlast the virus itself. Add to the mix a rise in civil unrest, and I think it’s fair to say that we need a reboot in 2021.

In the world of DevOps, there’s quite a bit to be optimistic about. Funding of technology startups has actually increased over the past year. Digital transformations have accelerated, as companies across all industries prioritize the online experience in a distributed world. Modern tools like Slack and Zoom have made it possible for many of us to continue working, to stay in touch with loved ones, and even to be entertained while we are stuck at home.

Reliable technology has played a critical role in helping maintain a sense of normalcy and connectedness.

And so I wanted to bring together a panel of some of the premier thought leaders in the space. These founders and executives are on the front lines, building solutions that help companies modernize, solve problems, and become more resilient.

Panelists

  • Kolton Andrus: The CEO and co-founder of Gremlin, the world’s first fully-hosted chaos engineering platform. Previously worked on building robust systems at Amazon and Netflix.
  • Charity Majors: The CTO and co-founder of Honeycomb, an observability platform to understand production systems. Previously worked at Facebook as a production engineering manager, focusing on their backend-as-a-service platform Parse.
  • John Egan: The CEO and co-founder of Kintaba, a modern incident management platform for your entire organization. Previously built a startup that was acquired by Facebook, where he then led product for their enterprise offering Workplace.
  • Daniel “Spoons” Spoonhower: The CTO and co-founder of Lightstep, a cutting-edge observability and distributed tracing platform. Previously worked at Google and is also the co-founder of the OpenTelemetry project.
  • Shahar Fogel: The CEO of Rookout, a live debugging platform enabling developers to debug modern applications faster than ever. Previously was the CEO of Brandtix and the VP of Product at Connectik Technologies.

Key Takeaways from the Resilience Roundtable

Major Outages Impact Companies Both Big and Small

Yes, it’s true that Amazon can lose millions of dollars if it is down for even a few minutes, and that Robinhood might lose countless users each time it crashes during a major market movement. But for startups, even if they aren’t losing millions of dollars or hundreds of customers, the relative impact on their business can actually be much greater. For a startup, losing even a single big customer can mean losing a significant chunk of revenue. So while big companies make for big headlines, startups can feel the pain of major outages just as much, if not more.

Gremlin is the world’s first hosted Chaos Engineering service with a mission to help build a more reliable internet. It turns failure into resilience by enabling engineers to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.

Postmortems Should Be Shared Broadly and Publicly

Creating a culture that accepts failure and learns from it is a major and important shift for many companies. Too often, when something goes wrong within traditional organizations, people who weren’t even there (e.g. management) dole out punishment and blame as the primary response. In modern incident management, blameless postmortems are a way to formally document what went wrong and why, in an effort to better understand the incident and prevent it from happening again. These documents should not only be shared with your team; they should also be shared publicly so that anyone interested can learn from what happened. (Cross-company resilience FTW)

You Build It, You Own It!

The best way to get software developers to care about the reliability of their applications… is to put them on call! Skin in the game can make a world of difference. If the engineer knows it’s their pager that will fire in the middle of the night or over the holiday break, they are much more likely to write code that stands up.

Resilience Is Shifting Left

This is a core promise of DevOps: that the daylight between who writes the code and who is responsible for that code’s behavior in production becomes narrower and narrower. When we think of shifting more of the operational burden upfront (i.e. Proactive Ops), we may also think of the cutting-edge discipline of Chaos Engineering. Like a vaccine, Chaos Engineering injects a little failure upfront, on your own terms, in order to build longer-term resilience. And for software developers, resilience often means more than just checking if systems are up or down; it means being able to debug customer-facing issues on the fly, and provide a seamless online experience even when the unexpected happens.
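
To make the vaccine analogy concrete, here is a minimal sketch of what "injecting a little failure on your own terms" might look like in code. It is an illustration only, not how Gremlin or any other panelist's product works: the names with_faults, call_with_retry, fetch_profile and run_experiment are all hypothetical, and the failures are simulated inside a single process rather than on real infrastructure.

```python
import random


class InjectedFault(Exception):
    """Raised on purpose to simulate a failing dependency."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly `failure_rate` of calls fail deliberately."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """Try func up to `attempts` times, then degrade to a fallback value."""
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault:
            continue
    return "fallback"


def fetch_profile():
    # Hypothetical stand-in for a real downstream call.
    return "profile"


def run_experiment(failure_rate, attempts, trials, seed=0):
    """Return the fraction of trials that received a real answer."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_faults(fetch_profile, failure_rate, rng)
    ok = sum(call_with_retry(flaky, attempts) == "profile" for _ in range(trials))
    return ok / trials


if __name__ == "__main__":
    for attempts in (1, 3):
        rate = run_experiment(failure_rate=0.3, attempts=attempts, trials=1000)
        print(f"attempts={attempts}: success rate {rate:.3f}")
```

Running the experiment with one attempt versus three shows how much a simple retry absorbs the injected failures, which is the kind of weakness you would rather find in an experiment than during an outage.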

Observability Is Real, AIOps Not So Much

Among the panelists, there was a near-unanimous reaction to the term “AIOps” (eye roll). While machines solving all of our problems makes for good headlines, the truth is that humans are still very much needed to attribute value to machine-detected anomalies. You’re also adding another project for your engineers to be concerned about: before, they just wanted to improve resilience, but now they have to build and maintain the AI to help with that resilience! Simply adopting the best DevOps/SRE practices will likely get you further, for now.

Lightstep is a sponsor of The New Stack.

TNS owner Insight Partners is an investor in: Pragma, Simply, Real, Honeycomb.