As chaos engineering becomes a more mainstream way of proactively seeking out your system’s weaknesses, we see it applied to increasingly complicated circumstances and with teams of all sizes.
One such area is serverless. After all, serverless computing is the language-agnostic, pay-as-you-go way to access backend services. This makes it multitenant, stateless, highly distributed, and heavily reliant on third parties. A heck of a lot can go wrong with so much out of your control.
From higher granularity to an expanding attack surface to new failure types, serverless has many potential points of failure, noted Thundra’s Product Vice President Emrah Samdan at ChaosConf, hosted by Gremlin. Chaos engineering is one method for finding out where these potential failures are before they cripple your operations.
If there was an underlying theme of this year’s ChaosConf, it was defining just what chaos engineering is, because even among expert fire starters, explaining the concept is as much art as it is science.
For Samdan, it’s not about being a glutton for punishment, breaking your system because you feel like it. And it’s not about placing blame.
For him, chaos engineering is all about asking: “What if?”
Samdan said, “You need to ask your system: What if your databases become unreachable? What if your whole region goes down? What if my downstream Lambda times out? Any type of failure can happen in your systems. Chaos engineering answers these questions.”
He says you need to answer these questions to establish the acceptable limits of your system. He likened it to a vaccine: each experiment injects a little more resiliency and confidence into your system.
“Chaos isn’t a pit. Chaos is a ladder.” — Emrah Samdan, Thundra
Echoing another message from ChaosConf, Samdan reminds us chaos engineering also isn’t just for giant streaming companies. Anyone can do it and you can get started small. He even recommends avoiding doing it in production at the start.
“You can just start when you are staging. Start small. Start injecting into a relatively new service, but put your tools in and just grow stronger with chaos experiments,” he recommended.
Start by measuring your steady state, meaning the normal ups and downs of your system. He recommends using an observability tool to accomplish this.
The typical system-level metrics include:
Samdan says typical business-level metrics include:
Set acceptable limits for each of these metrics. Then develop a hypothesis: What happens if this happens? Some examples can be:
You can ask big questions, but then only start experimenting on the small parts. Samdan reminds you to only inject failure into a controlled piece of your system, like only injecting latency towards one function, not the entire architecture. You want to maintain that smaller blast radius.
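To make the idea of a small blast radius concrete, here is a minimal Python sketch of latency injection scoped to a single function. It is not taken from Samdan's talk or from any particular chaos tool; the names `inject_latency`, `resize_image`, and `send_receipt`, and the 50 ms delay, are all assumptions made for illustration.

```python
import functools
import time


def inject_latency(targets, delay_seconds, sleep=time.sleep):
    """Return a decorator that delays only the functions named in `targets`.

    `targets` is the blast radius: any function whose name is not listed
    runs untouched. `sleep` is injectable so the delay can be faked in tests.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if func.__name__ in targets:
                sleep(delay_seconds)  # simulate a slow dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical experiment: only `resize_image` is inside the blast radius.
chaos = inject_latency(targets={"resize_image"}, delay_seconds=0.05)


@chaos
def resize_image(event):
    return {"status": "resized", "key": event["key"]}


@chaos
def send_receipt(event):
    return {"status": "sent", "key": event["key"]}
```

Widening the experiment later is then a matter of adding a second name to `targets`, rather than touching every handler at once.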
That’s also why you only run one experiment at a time. Then you can continue, injecting latency into two, three, four functions. He says you keep going until something breaks.
“You should stop when something goes wrong, even if you are not running it in production. You should stop just to understand how you are going to roll back when such things happen,” Samdan said.
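The widen-then-stop loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not a real tool's API: `inject`, `measure_latency_ms`, and `rollback` are hypothetical callbacks you would supply, and the latency limit stands in for whatever acceptable limit you set on your steady-state metrics.

```python
def run_escalating_experiment(targets, inject, measure_latency_ms, limit_ms, rollback):
    """Inject failure into one target at a time and stop at the first breach.

    Rollback always runs, whether the limit was breached or every target
    was covered without incident.
    """
    injected = []
    observed = None
    for name in targets:
        inject(name)                      # widen the blast radius by one function
        injected.append(name)
        observed = measure_latency_ms()   # compare against the steady-state limit
        if observed > limit_ms:
            rollback(list(injected))      # something went wrong: stop and undo
            return {"breached": True, "stopped_at": name, "observed_ms": observed}
    rollback(list(injected))
    return {"breached": False, "stopped_at": None, "observed_ms": observed}
```

Returning the name of the target at which the limit was breached gives the team a concrete starting point for the post-experiment discussion.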
He echoed what Liz Fong-Jones said in her ChaosConf talk: you should deliberately schedule your chaos experiments and let everyone know ahead of time.
“You don’t need to surprise other people. You don’t need to surprise other departments. And, most importantly, in production, your customers should know about it,” he said.
So if something goes terribly wrong, they aren’t worried, because you told them ahead of time and had already shared your rollback plan.
Chaos gets way more complicated in serverless environments, which are highly distributed and event-driven. Risks with serverless tend to come from the services you don’t have insight into or control over. Essentially, serverless is chaotic at its heart.
With serverless you inherit a whole new set of failures, within its many resources, which can include:
Samdan says these are ticking time bombs if you are just communicating with the other system synchronously, waiting for a response.
In serverless, there are also failures you tend to create, like:
But all of these flaws are easy for you to interrogate with chaos experiments, like:
Everything follows the same pattern:
Samdan says latency is the most important serverless metric to experiment against because, in serverless, if the response is late, that’s often a sign the service is broken.
He says a common fix for serverless issues is to aim for asynchronous communication whenever possible and then properly tune synchronous timeouts.
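As a rough illustration of bounding how long a caller waits on a downstream call, the sketch below puts a deadline on the call and returns a fallback when the deadline passes. Python's `asyncio.wait_for` is used here only as a convenient way to express the deadline; it is not the same thing as the event-driven, asynchronous communication Samdan recommends. `call_downstream` is a hypothetical stand-in for a real dependency, and the timeout values are assumptions.

```python
import asyncio


async def call_downstream(delay_seconds):
    """Hypothetical downstream call that takes `delay_seconds` to answer."""
    await asyncio.sleep(delay_seconds)
    return "ok"


async def call_with_timeout(awaitable, timeout_seconds, fallback):
    """Wait at most `timeout_seconds`; return `fallback` if the deadline passes."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call before raising
        return fallback


if __name__ == "__main__":
    fast = asyncio.run(call_with_timeout(call_downstream(0.01), 0.5, "fallback"))
    slow = asyncio.run(call_with_timeout(call_downstream(0.5), 0.01, "fallback"))
    print(fast, slow)
```

A latency-injection experiment is a natural way to check whether a timeout like this is tuned sensibly: if the injected delay exceeds the deadline, the fallback path should take over instead of the caller hanging.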
Other serverless fixes include putting circuit breakers in place and using exponential backoff to space out retries at an acceptable rate.
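Both techniques can be sketched in a few lines of Python. This is a simplified illustration, not a production library and not code from the talk: the thresholds, delays, and the names `retry_with_backoff`, `CircuitBreaker`, and `CircuitOpenError` are assumptions chosen for the example.

```python
import random
import time


def backoff_delays(base_seconds, factor, max_seconds, count):
    """Yield `count` capped exponential delays: base, base*factor, ..."""
    for attempt in range(count):
        yield min(max_seconds, base_seconds * factor ** attempt)


def retry_with_backoff(operation, attempts=4, base_seconds=0.1, factor=2.0,
                       max_seconds=2.0, sleep=time.sleep, jitter=random.random):
    """Call `operation` up to `attempts` times (at least 1), pausing between tries."""
    delays = list(backoff_delays(base_seconds, factor, max_seconds, attempts - 1))
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # scale the delay by a random factor so retries do not synchronize
            sleep(delays[attempt] * jitter())


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            # cool-down elapsed: allow one trial call (half-open state)
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

The two pieces address different problems: backoff slows down a single caller's retries, while the breaker stops all calls to a dependency that keeps failing, so a struggling service gets room to recover.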
Samdan says chaos engineering is about learning exactly how your system is supposed to behave when something happens. And it allows you to make a plan for how you respond to issues as a team:
This systematic, continuous experimentation doesn’t just improve your system. Samdan reminds us that it also improves team communication.
“You need to just make your system ready with chaos engineering because, if it is a serverless system, you should never stop running experiments.” — Emrah Samdan, Thundra
“You should never make it harder for your teams. You should never stop to hug the ops. Incidents can happen. We are here to improve ourselves, not hurt others,” he said.
Gremlin, New Relic and Thundra are sponsors of The New Stack.