VOOZH about

URL: https://thenewstack.io/chaos-engineering-for-cloud-native/

⇱ Chaos Engineering for Cloud Native - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2021-10-14 06:35:42
Chaos Engineering for Cloud Native
contributed,sponsor-chaosnative,sponsored,sponsored-post-contributed,
Cloud Native Ecosystem / DevOps

Chaos Engineering for Cloud Native

How you can practice chaos engineering more effectively in cloud native, which I call 'cloud native chaos engineering' or CNCE.
Oct 14th, 2021 6:35am by Uma Mukkara
👁 Featued image for: Chaos Engineering for Cloud Native
ChaosNative sponsored this post. Insight Partners is an investor in ChaosNative and TNS.
Uma Mukkara
Uma is CEO of ChaosNative and a maintainer of the LitmusChaos project.

In today’s e-commerce world, customer retention, customer acquisition and customer satisfaction are top priorities of management, whether it is the operations, business or product engineering team. Service reliability is key to your business success. While this is obvious, service reliability has become more important in recent times. The reason is simple: More people are using digital services and businesses have become digital-centric. This digital transformation is one of the reasons we are seeing server administrators, database administrators and network administrators becoming site reliability engineers (SREs).  The move to microservices and containerization of applications also is fueling this transformation.

👁 Image

Digital transformation making service reliability a key business requirement

What Is Service Reliability?

The most common answer is: No outages or very long uptimes. However, we all know that outages are inevitable at some point. So service reliability is really about reduced outages. In addition, services get upgraded from time to time, so continuity of service availability during upgrades is also a key requirement. If you consider service outages as eventually inevitable, an optimal way to think of service reliability is to create a situation where business impact is minimal when a service outage happens. A service outage should not happen frequently, but when it does, quick recovery helps minimize damage to user experience.

👁 Image

Service reliability

Service reliability depends on multiple factors, such as application stability, performance at scale, performance at load, underlying infrastructure and the ability of applications to withstand a failure in any one of multiple components involved in providing the service.

Challenges in Cloud Native

IT is moving to cloud native. In a cloud native environment, there are a lot of moving pieces. You will have many infrastructure and middleware software components that come in handy when you build a service in cloud native. The main difference between a cloud native service versus a traditional one is that with cloud native you are dealing with many more components — sometimes multiple times more components — each of which are independently well-tested and stitched together. So, how does this affect service reliability? You might think that these services are naturally more reliable because of well-tested microservices, but you will be surprised how often this is not the case. The reason is that your primary application on which a service is running might not fare well when faults happen in the dependent components.

👁 Image

Cloud native & Reliability

As shown in the above diagram, in cloud native, the application that defines your service is making up only about 10% of the code in the entire stack. Reliability for that app requires the rest of the stack to be stable and work together when faults occur. Any fault or multiple faults happening in the stack should not impair service availability. This defines how good your service reliability is.

The other factor that contributes to the service reliability in cloud native is how fast changes are happening in cloud native DevOps. Containerization and advancements in continuous integration/continuous deployment (CI/CD) systems, along with the organizational push to release faster, are some of the reasons microservice release times are frequent. They have moved from quarterly to monthly to weekly, and in some cases multiple times in a week. This means you are releasing code faster, and the dependent code in your service is also going through continuous upgrades. In cloud native, something is always being upgraded. This makes it difficult to ensure that service reliability is kept at the desired level.

The above two areas show the need for service reliability in cloud native. Fortunately, if you invest in a structured approach known as chaos engineering, service reliability can definitely be achieved. Chaos engineering is an important science now more than ever. While chaos engineering’s roots go back to 2011, only in the past couple of years has it attained prominence, primarily driven by cloud native service deployments becoming more common.

Chaos Engineering 101

In simple terms, chaos engineering is about collecting, designing, implementing, orchestrating and scaling the faults in various applications and the infrastructure of a service. It is about engineering the faults around a service in a controlled way, to validate the resilience of the service and find any weaknesses. Chaos engineering has been thought of and advocated in many other forms — testing in production, breaking in production, etc. But to me, it is engineering around chaos; it is a continuous process and all principles of engineering need to be applied to chaos as well.

There is some hesitance to adopt chaos engineering into Ops or DevOps, which is to be expected. After all, it requires breaking things. However, a new school of thought is to start chaos at the beginning: Consider it as part of your service build-up, so it gets embedded into your delivery process or DevOps. Some new implementations of cloud native DevOps consider chaos engineering a structural and integral part of their testing or CI/CD. This scenario is going to become increasingly common as more success stories emerge.

So, let’s define chaos engineering one more time, but technically. Chaos engineering is the science of designing a fault along with a steady-state hypothesis, orchestrating the fault and validating the steady-state hypothesis. Steady-state definition and validation of it sometimes are more important than the introduction of a fault. Steady-state definition could vary slightly from team to team, and from time to time based on the load conditions and other state variables of the service. The easiest way is to associate the steady-state with the service level objectives (SLOs) at that time and validate them. To summarize another way, chaos engineering can also be considered continuous validation of the SLOs under random fault injections into the dependent applications or the infrastructure of a service.

👁 Image

Chaos Engineering

Cloud Native Chaos Engineering 101

We have discussed the need for chaos engineering in cloud native and considering it from the beginning rather than an afterthought. Now let’s discuss how you can practice chaos engineering more effectively in cloud native, what I call “cloud native chaos engineering” or CNCE. The cloud native definition includes some mandatory principles, such as requiring declarative config, being flexible, scalable, across multiple clouds, using declarative APIs, etc. All these principles are applied broadly when you practice cloud native chaos engineering. We have been advocating the following as principles of cloud native chaos engineering for the past three years.

Principles of Cloud Native Chaos Engineering

👁 Image

Cloud Native Chaos Engineering principles

Open Source

Cloud native communities and technologies have been revolving around open source. Chaos engineering frameworks being open source in nature benefit in building strong communities around them and help them make them more comprehensive, rugged and feature-rich.

Chaos Experiments as Building Blocks

Chaos experiments need to be simple to use, highly flexible and tunable. Chaos experiments have to be rugged, with little or no chance of resulting in false negatives or false positives. Chaos experiments are like Lego blocks: You can use them to build meaningful chaos workflows.

Manageable Chaos Experiments and API

Chaos engineering has to employ well-known software engineering practices. Managing chaos scenarios can quickly become complex, with more team members getting involved, more changes happening frequently and the requirements being altered. Upgrading chaos experiments becomes common. The chaos engineering framework should enable management of chaos experiments in an easy and simple manner, and should be done in the Kubernetes way. Developers and operators should think of chaos experiments as Kubernetes customer resources.

ChaosNative Inc. provides products and services for the reliability of cloud native DevOps built on top of the popular open source Chaos engineering project LitmusChaos. ChaosNative offers the hosted Litmus service at cloud.chaosnative.com. ChaosNative and TNS are under common control.
Learn More
The latest from ChaosNative

Scale Through GitOps

Start with low-hanging fruit in terms of obvious and simple issues. As you start fixing them, chaos scenarios become more comprehensive and large; the number of chaos scenarios also increases. Chaos scenarios need to be automated or need to be triggered when a change is made to the applications of the service. Tools around GitOps may be used to trigger chaos when a configuration change happens to either application or chaos experiments.

Open Observability

Observability and chaos engineering go together when fixing issues related to reliability. There are many observability stacks and systems that are well-developed and put into practice. Introduction of chaos engineering should not require a new observability system. Rather, the context of chaos engineering should fit nicely into the existing system. For this, chaos metrics from the system where chaos is introduced are exported into the existing observability database, and the context of chaos is painted on to the existing dashboards.

Litmus Is Built for Cloud Native

LitmusChaos is built with the cloud native chaos engineering principles mentioned above as its foundational architectural goals. Litmus is a complete chaos engineering framework built as a cloud native application. You can use it to build, orchestrate and scale chaos experiments seamlessly. Litmus comes with a bunch of ready-to-use chaos experiments hosted on an open hub called ChaosHub. These experiments form the basis for Litmus workflows, which can be closely compared with your actual chaos scenarios. SREs, QA and developers collaborate to build the required chaos workflows and orchestrate them using GitOps for scaling, as well as to trigger chaos upon a configuration change to the deployments.

👁 Image

Litmus overview

Chaos Engineering Made Easy

LitmusChaos is simple to deploy on your Kubernetes cluster or namespace. Litmus is hosted at ChaosNative Litmus Cloud as SaaS and comes with a forever-free plan. Sign up at litmuschaos.cloud for a super-quick start on chaos engineering or follow the Litmus docs to install Litmus on your own Kubernetes cluster.

👁 Image

Litmus on cloud

Chaos Engineering Community Conference

ChaosNative hosts an annual chaos engineering conference called ChaosCarnival. Global chaos engineering experts, enthusiasts and practitioners speak at this conference to share their experiences, best practices and success stories. Register for free to get updates on the conference.

ChaosNative Inc. provides products and services for the reliability of cloud native DevOps built on top of the popular open source Chaos engineering project LitmusChaos. ChaosNative offers the hosted Litmus service at cloud.chaosnative.com. ChaosNative and TNS are under common control.
Learn More
The latest from ChaosNative
TRENDING STORIES
Uma is CEO of ChaosNative and a maintainer of the LitmusChaos project.
Read more from Uma Mukkara
ChaosNative sponsored this post. Insight Partners is an investor in ChaosNative and TNS.
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: Pragma, ChaosNative.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.