
URL: https://thenewstack.io/develop-a-daily-reporting-system-for-chaos-mesh-to-improve-system-resilience/

Develop a Daily Reporting System for Chaos Mesh to Improve System Resilience - The New Stack



This article will introduce how chaos engineering helps us improve our system resilience and why we need a daily reporting system to complement Chaos Mesh.
Dec 9th, 2021 10:00am by Lei Li
Lei Li
Lei Li is a senior software engineer in the database lab at Digital China. He is passionate about open source, distributed systems, and chaos engineering. He is also an active contributor to open source projects, such as TiDB for PostgreSQL.

Chaos Mesh is a cloud native chaos engineering platform that orchestrates chaos experiments on Kubernetes environments. It allows you to test the resilience of your system by simulating problems such as network faults, file system faults, and Pod faults. After each chaos experiment, you can review the testing results by checking the logs.

But this approach is neither direct nor efficient. Therefore, I decided to develop a daily reporting system that would automatically analyze logs and generate reports. This way, it’s easy to examine the logs and identify the issues.

In this article, I will introduce how chaos engineering helps us improve our system resilience and why we need a daily reporting system to complement Chaos Mesh. I’ll also give you some insights about how to build a daily reporting system, as well as the problems I encountered during the process and how I fixed them.

What Is Chaos Mesh and How It Helps Us

With Chaos Mesh, we can conveniently simulate extreme cases in our business and test whether our system remains intact.

At my company, Digital China, we combine Chaos Mesh with our DevOps platform to provide a one-click CI/CD process. Every time a developer submits a piece of code, it triggers the CI/CD process. In this process, the system builds the code and performs unit tests and a SonarQube quality check. It then packages the image and releases it to Kubernetes. At the end of the day, our daily reporting system pulls the latest images of each project and performs chaos engineering on them.

The simulation doesn’t require any application code change; Chaos Mesh takes care of the hard work. It injects all kinds of physical node failures into the system, such as network latency, network loss, and network duplication. It also injects Kubernetes failures, such as Pod or container faults. These faults may reveal vulnerabilities in our application code or the system architecture. When the loopholes surface, we can fix them before they can do real damage in production.

Spotting these vulnerabilities isn’t easy, however: The logs must be carefully read and analyzed. This can be a difficult job for both the application developer and the Kubernetes specialist. The developer may not work well with Kubernetes; a Kubernetes specialist, on the other hand, may not understand the application logic.

This is where the Chaos Mesh daily reporting system comes in. After the daily chaos experiments, the reporting system collects logs, draws a plot, and provides a web user interface for analyzing the possible loopholes in the system.

In the following sections, I’ll explain how to run Chaos Mesh on Kubernetes, how to generate daily reports, and how to build a web application for daily reporting. You’ll also see an example of how the system helps in our production.

Run Chaos Mesh on Kubernetes

Chaos Mesh is designed for Kubernetes, which is one of the main reasons it can inject faults into the file system, Pod, or network of specific applications.

In earlier documents, Chaos Mesh offered two ways to quickly deploy a virtual Kubernetes cluster on your machine: kind and minikube. Generally, it only takes a one-line command to deploy a Kubernetes cluster as well as install Chaos Mesh. But starting Kubernetes clusters locally affects network-related fault types.

If you use the provided script to deploy a Kubernetes cluster using kind, then all the Kubernetes nodes run as Docker containers rather than separate machines. This adds difficulty when you pull the image offline. To address this issue, you can deploy the Kubernetes cluster on multiple physical machines instead, with each physical machine acting as a worker node. To expedite the image-pulling process, you can use the `docker load` command to load the required image in advance. Apart from the two problems above, you can install kubectl and Helm by following the documentation.

Before you install Chaos Mesh, you need to first create CRD resources:

git clone https://github.com/pingcap/chaos-mesh.git
cd chaos-mesh
# Create CRD resources
kubectl apply -f manifests/

After that, install Chaos Mesh using Helm:

# For Helm 2.X
helm install chaos-mesh/chaos-mesh --name=chaos-mesh --namespace=chaos-testing
# For Helm 3.X
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing

To run a chaos experiment, you have to define the experiment in a YAML file and use `kubectl apply` to start it. In the following example, I created a chaos experiment using PodChaos to simulate a Pod failure:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  value: ''
  duration: '30s'
  selector:
    namespaces:
      - chaos-demo-1
    labelSelectors:
      'app.kubernetes.io/component': 'tikv'
  scheduler:
    cron: '@every 2m'

Let’s apply the experiment:

kubectl apply -f podfail.yaml
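
If the same experiment has to run against several projects every day, the manifest can be generated rather than copied by hand. The sketch below shows one way to do that in Python. It is not part of Chaos Mesh: the function name `build_pod_failure` is made up for illustration, and the field values simply mirror the PodChaos example above. It relies on the fact that `kubectl apply -f` accepts JSON as well as YAML.

```python
import json


def build_pod_failure(name, target_namespace, component,
                      duration="30s", cron="@every 2m"):
    """Return a PodChaos manifest (as a dict) mirroring the example above."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": "chaos-testing"},
        "spec": {
            "action": "pod-failure",
            "mode": "one",
            "value": "",
            "duration": duration,
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": {"app.kubernetes.io/component": component},
            },
            "scheduler": {"cron": cron},
        },
    }


def write_manifest(manifest, path):
    """Write the manifest as JSON so it can be passed to kubectl apply -f."""
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2)


# Example (hypothetical file name):
#   write_manifest(build_pod_failure("pod-failure-example",
#                                    "chaos-demo-1", "tikv"), "podfail.json")
#   then: kubectl apply -f podfail.json
```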

Generate a Daily Report

For demonstration purposes, in this post I run all the chaos experiments on TiDB, an open source, distributed SQL database. To generate daily reports, you need to collect logs, filter errors and warnings, draw a plot, and then output a PDF.

Collect Logs

Usually, when you run chaos experiments on TiDB clusters, many errors are returned. To collect those error logs, run the `kubectl logs` command:

kubectl logs <podname> -n tidb-test --since=24h >> tidb.log

All logs generated in the past 24 hours of the specific Pod in the `tidb-test` namespace are saved to the `tidb.log` file.
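
When several Pods are involved, the same command can be wrapped in a small script. The following is a minimal sketch, assuming `kubectl` is on the PATH and already configured for the cluster. The function names are illustrative and the Pod names in the comment are hypothetical.

```python
import subprocess


def logs_command(pod, namespace="tidb-test", since="24h"):
    """Build the kubectl logs invocation as an argument list."""
    return ["kubectl", "logs", pod, "-n", namespace, f"--since={since}"]


def collect_logs(pods, out_path="tidb.log", namespace="tidb-test"):
    """Append the recent logs of each Pod to out_path."""
    with open(out_path, "ab") as out:
        for pod in pods:
            # check=True raises CalledProcessError if kubectl exits non-zero.
            result = subprocess.run(logs_command(pod, namespace),
                                    capture_output=True, check=True)
            out.write(result.stdout)


# Example (hypothetical Pod names):
#   collect_logs(["tidb-0", "tikv-0"])
```

Passing the command as a list avoids shell quoting issues, and opening the file in append mode matches the `>>` redirection used above.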

Filter Errors and Warnings

In this step, you have to filter error messages and warning messages from the logs. There are two options:

  • Use text-processing tools, such as awk. This requires a proficient understanding of Linux/Unix commands.
  • Write a script. If you’re not familiar with Linux/Unix commands, this is the better option.

The extracted error and warning messages will be used in the next step for further analysis.
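
As an illustration of the scripting option, here is a minimal Python sketch. It assumes that each log line carries its level as a bracketed field such as `[ERROR]` or `[WARN]`, which is how TiDB formats its logs. Other applications will need a different pattern, and the function names are only for illustration.

```python
import re
from collections import Counter

# Assumption: the level appears as a bracketed field, e.g. [ERROR] or [WARN].
LEVEL_RE = re.compile(r"\[(WARN|ERROR|FATAL)\]")


def filter_problems(lines):
    """Return (level, line) pairs for lines logged at WARN level or above."""
    problems = []
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            problems.append((match.group(1), line.rstrip("\n")))
    return problems


def summarize(problems):
    """Count how many matching lines were found per level."""
    return Counter(level for level, _ in problems)


# Example:
#   with open("tidb.log") as fh:
#       problems = filter_problems(fh)
#   print(summarize(problems))
```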

Draw a Plot

For plotting, I recommend gnuplot, a Linux command-line graphing utility. In the example below, I imported the stress test results and created a line graph to show how queries per second (QPS) are affected when a specific Pod becomes unavailable. Since the chaos experiment was executed periodically, the QPS exhibited a pattern: it would drop abruptly and then quickly return to normal.
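
One way to drive gnuplot from the reporting script is to write the samples to a data file and generate a short gnuplot script next to it. The sketch below assumes the stress test results are available as (seconds, QPS) pairs. The file names, image size, and function names are arbitrary choices for illustration.

```python
import subprocess


def write_qps_data(samples, path="qps.dat"):
    """Write (seconds, qps) pairs as two whitespace-separated columns."""
    with open(path, "w") as fh:
        for seconds, qps in samples:
            fh.write(f"{seconds} {qps}\n")


def gnuplot_script(data_path="qps.dat", image_path="qps.png"):
    """Return gnuplot commands that draw the QPS line graph as a PNG."""
    return "\n".join([
        "set terminal png size 800,400",
        f'set output "{image_path}"',
        'set xlabel "Time (s)"',
        'set ylabel "QPS"',
        f'plot "{data_path}" using 1:2 with lines title "QPS"',
    ]) + "\n"


def render(script_path="qps.gp"):
    """Write the script to disk and run gnuplot on it (gnuplot must be installed)."""
    with open(script_path, "w") as fh:
        fh.write(gnuplot_script())
    subprocess.run(["gnuplot", script_path], check=True)
```

The resulting PNG can then be embedded in the report.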

Generate the Report in PDF

Currently, there is no available API for generating Chaos Mesh reports or analyzing results. My suggestion is to generate the report in PDF format so it will be readable in different browsers. In my case, I used gopdf, a Go library for creating PDF files. It also lets you insert images or draw tables, which meets the needs of a chaos engineering report.

The last step is simply to run the whole system at a scheduled time every day. My choice is crond, the daemon that executes cron jobs in the background, to run the commands early each morning. So, when I start work, there is a daily report waiting for me.

Build a Web Application for Daily Reporting

But I wanted to make the report more readable and accessible. Wouldn’t it be nicer to check reports in a web application? At first, I wanted to add a backend API and a database to store all report data. That sounds feasible, but it may be too much work, since all I want to know is which reports require further troubleshooting. That information can be carried in the file name, for example, report-2021-07-09-bad.pdf. This greatly reduces the reporting system’s workload and complexity.
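
The file-name convention can be sketched in a few lines of Python. The `bad` suffix comes from the example name above. The `good` suffix for days without errors, the rule that any error marks a report as bad, and the function names are assumptions made for illustration.

```python
import re
from datetime import date

# "bad" is taken from the example file name; "good" is an assumed counterpart.
REPORT_RE = re.compile(r"^report-(\d{4}-\d{2}-\d{2})-(good|bad)\.pdf$")


def report_name(day, error_count):
    """Encode the date and the overall status in the report file name."""
    status = "bad" if error_count > 0 else "good"
    return f"report-{day.isoformat()}-{status}.pdf"


def parse_report_name(file_name):
    """Return (date, needs_attention), or None if the name does not match."""
    match = REPORT_RE.match(file_name)
    if match is None:
        return None
    return date.fromisoformat(match.group(1)), match.group(2) == "bad"
```

With this scheme, the frontend only needs a directory listing to decide which cards to highlight, so no database is required.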

Still, the backend interfaces need to be improved and the report content enriched. But for now, a working daily reporting system is just fine.

In my case, I used Vue.js to scaffold the web application with the antd UI library. The automatically generated reports are saved to the static resources folder `static`, which allows the web application to read them and render them on the frontend page. For details, check Use antd in vue-cli 3.

Below is an example of a web application that I developed for daily reporting. The red card indicates that I should check the testing report because exceptions are thrown after running chaos experiments.

[Figure: Web application for daily reporting]

Clicking the card will open the report, as shown below. I used pdf.js to render the PDF.

[Figure: Daily report in PDF]

Summary

The Chaos Mesh daily reporting system has been live in our company for four months. In that time, it has helped us discover bugs in multiple projects under extreme conditions. For example, we once injected network duplication and network loss faults into an application and set the duplication and packet loss ratios at a high level. As a result, the application hit unexpected conditions during message parsing and request dispatch. A fatal error was returned, and the program exited abnormally. With the help of the daily report, we quickly obtained the plot and logs for the specific error. We used that information to locate the cause of the exception, and we fixed the vulnerability.

Chaos Mesh enables you to simulate faults that most cloud native applications might encounter. In this article, I created a PodChaos experiment and observed that QPS in the TiDB cluster was affected when the Pod became unavailable. Analyzing the logs from such experiments helps me improve the robustness and availability of the system. I also built a daily reporting system and a web application that presents its reports for troubleshooting and debugging. You can customize the reports to meet your own requirements.
