URL: https://thenewstack.io/how-a-critical-hosting-failure-solved-a-devops-crisis/

How a Critical Hosting Failure Solved a DevOps Crisis - The New Stack


DevOps / Linux / Security

How a Critical Hosting Failure Solved a DevOps Crisis

Resilience isn’t just about solving today’s problems — it’s about building systems and cultures that can adapt to tomorrow’s challenges.
Feb 7th, 2025 2:00pm by Matan Liber
[Featured image]
Photo by Philipp Katzenberger on Unsplash.

When routine system updates caused critical hosting systems to fail and left machines unbootable, Pentera’s DevOps team found itself racing against time to fix a nightmare bug.

With operations grinding to a halt, they collaborated with the company’s in-house security researchers for a different perspective. This collaboration uncovered a flaw in the hosting platform and showcased the power of cross-discipline teamwork to resolve complex issues. This story offers a blueprint for resilience for organizations grappling with similar challenges: combining technical know-how with strategic collaboration to stay ahead of disruptions.

The Unexpected Boot Failure

In the final weeks of 2024, our DevOps team faced a surprising situation: Machines that had previously been accessible on the network suddenly failed to connect. This failure halted the team’s ability to continue developing and releasing versions to customers, making it imperative to identify and fix the issue quickly.

The team launched an exhaustive investigation, retracing their steps through environment variables and configuration files to determine what could have changed to cause this. Upon physical inspection of the affected machines, they encountered boot failures accompanied by the following error message:

[Image: boot error message]

Looking a bit further up the terminal, they could also see:

[Image: terminal output preceding the error]

Something was causing libcrypto.so.1.1 to be missing during the boot process, rendering the machine unusable.

When in Doubt (and Facing a Time Crunch): Brainstorm

Under pressure to roll out product updates on schedule, the DevOps team faced a tough decision. They didn’t know where the issue was coming from and needed to figure it out quickly. There was a strong indication that something was wrong with the initramfs, a key component in the boot process of Debian and other Linux systems, but little more than that. With that knowledge, they could reach out to the Debian team for long-term insights, but there was no predicting how long that would take, and it wouldn’t resolve the immediate challenge of getting back online quickly.

Alternatively, they could implement a workaround to bypass the issue, but that risked leaving the root cause unresolved and inviting future problems. Instead, they opted for a more innovative approach: a brainstorming session involving fresh perspectives — people unfamiliar with the problem and free from biases tied to past actions. Given my background in researching Linux systems, our VP of Research suggested I join the team to see what I could contribute.

As a research team lead within the Pentera Labs team, my experience and perspective differ from those of the DevOps team. While their experience primarily focuses on building and maintaining products, my role involves researching the latest attack trends and techniques, understanding how threat actors exploit vulnerabilities, and, in essence, figuring out how to break and exploit things effectively.

The root of the issue wasn’t immediately apparent, so I set out to investigate. Unlike my usual assignments, where I look for ways to break things, my goal here was to reverse-engineer the conditions or mechanisms that had created a denial-of-service (DoS) scenario. This shift in perspective was challenging but engaging, offering a valuable opportunity to approach the problem creatively.

Debian Mkinitramfs Flaw

I spent two weeks analyzing the system and collaborating closely with DevOps. We uncovered the root cause: a bug that had been dormant in the system until this specific scenario triggered it. Interestingly, the issue wasn’t directly related to the choice of tools or infrastructure upgrades but revealed a more significant systemic weakness within Debian.

The Culprit

A routine part of our product’s installation is upgrading the system’s packages to the latest available versions. To do that, we have compiled Python code that runs apt upgrades, and this turned out to be the root cause. During the investigation, we discovered that the bug was triggered by running an apt upgrade from inside an ELF file compiled with PyInstaller. Digging further into why, it appeared that PyInstaller packages some libraries with the executable file and then uses the environment variable LD_LIBRARY_PATH to load them. In short, LD_LIBRARY_PATH specifies directories where the system should look for dynamic libraries before searching the standard library paths.
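One way a frozen program might avoid passing this variable on to the tools it launches is to clean the environment before spawning them. The sketch below is a minimal illustration, not our production code: PyInstaller's documentation describes the bootloader saving the original value in LD_LIBRARY_PATH_ORIG, and the hypothetical helper child_env restores that value or drops LD_LIBRARY_PATH when no original exists. The commented-out apt-get call is an assumption about how such a program might run the upgrade.

```python
import os


def child_env(environ=None):
    """Return a copy of the environment suitable for launching system tools.

    If a saved LD_LIBRARY_PATH_ORIG is present, restore it; otherwise remove
    LD_LIBRARY_PATH so child processes use the standard library paths.
    """
    env = dict(os.environ if environ is None else environ)
    original = env.get("LD_LIBRARY_PATH_ORIG")
    if original is not None:
        env["LD_LIBRARY_PATH"] = original
    else:
        env.pop("LD_LIBRARY_PATH", None)
    return env


# Hypothetical usage (needs root, so it is left commented out):
# import subprocess
# subprocess.run(["apt-get", "-y", "upgrade"], env=child_env(), check=True)
```

Passing an explicit env to the child process leaves the frozen program's own library loading untouched while keeping the override away from apt and anything apt triggers.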

Removing the apt upgrade from the ELF file made the crash disappear.

This crash can be easily replicated using the following command (tested on Ubuntu 20.04):

mkdir /tmp/lib && cp /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 /tmp/lib && LD_LIBRARY_PATH="/tmp/lib" apt -y upgrade && reboot

Underlying Cause

The upgrade process can update the kernel or other critical packages, requiring changes to the initial RAM filesystem (initramfs). The initramfs contains the essential drivers and tools needed to mount the root filesystem and boot the system, so it must be regenerated whenever updates affect the boot process.

During this process, the mkinitramfs command uses a subroutine called copy_exec to copy some executables into a temporary directory, which is later compressed into the final initramfs image.

[Image: copy_exec in mkinitramfs]

copy_exec uses the ldd command to find the library dependencies of those binaries and copies them as well. For example, running ldd on /sbin/modprobe:

[Image: output of ldd on /sbin/modprobe]

We can see libcrypto.so.1.1 here as well.

At the start of the mkinitramfs script, the necessary directories for the copied libraries are created.

[Image: mkinitramfs code that creates the library directories]

However, due to the LD_LIBRARY_PATH environment variable, the output of ldd is changed.
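To make that change concrete, here is a small sketch that parses ldd-style output and reports which libraries resolve to a different path. The helper name resolved_libraries is hypothetical, and the two sample lines are illustrative stand-ins built from the paths mentioned in this article (with made-up load addresses), not captured output.

```python
import re

# Matches lines such as "libfoo.so.1 => /path/libfoo.so.1 (0x0000...)".
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)")


def resolved_libraries(ldd_output):
    """Map each library name in ldd output to the path it resolved to."""
    libs = {}
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            libs[match.group(1)] = match.group(2)
    return libs


# Illustrative ldd lines (hypothetical addresses, not captured output).
NORMAL = "\tlibcrypto.so.1.1 => /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 (0x00007f0000000000)\n"
OVERRIDDEN = "\tlibcrypto.so.1.1 => /tmp/lib/libcrypto.so.1.1 (0x00007f0000000000)\n"

before = resolved_libraries(NORMAL)
after = resolved_libraries(OVERRIDDEN)

# Libraries whose resolved path differs once LD_LIBRARY_PATH is set.
changed = {name for name in before if before[name] != after.get(name)}
print(changed)
```

Any name that ends up in changed is a library that would be copied from a non-standard directory.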

After adding some logs to the copy_file subroutine, which is used by copy_exec to do the actual copying, I got the following log:

Copying/tmp/lib/libcrypto.so.1.1 to /var/tmp/mkinitramfs_E4JSCD//tmp/lib/libcrypto.so.1.1

The directory /tmp/lib was never created inside the temporary mkinitramfs directory, causing the cp command to fail. Thus, any library inside the LD_LIBRARY_PATH directory was left out of the initramfs image.
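The failure mode can be reproduced in miniature without touching a real initramfs. The following sketch is a simplified model under stated assumptions, not the actual mkinitramfs logic: copy_into_image and simulate are hypothetical helpers, libdemo.so is a dummy file, and only the "standard" directory is created in the image ahead of time.

```python
import os
import shutil
import tempfile


def copy_into_image(fs_root, rel_path, image_root):
    """Copy fs_root/rel_path to image_root/rel_path without creating
    parent directories. Returns True on success, False if the copy fails."""
    src = os.path.join(fs_root, rel_path)
    dest = os.path.join(image_root, rel_path)
    try:
        shutil.copy(src, dest)
    except OSError:
        return False
    return True


def simulate():
    with tempfile.TemporaryDirectory() as fs_root, \
            tempfile.TemporaryDirectory() as image_root:
        # Create a dummy library in a standard and a non-standard directory.
        for rel_dir in ("usr/lib", "tmp/lib"):
            os.makedirs(os.path.join(fs_root, rel_dir))
            with open(os.path.join(fs_root, rel_dir, "libdemo.so"), "wb") as fh:
                fh.write(b"demo")
        # Only the standard library directory is pre-created in the image.
        os.makedirs(os.path.join(image_root, "usr/lib"))
        return {
            rel_dir: copy_into_image(fs_root, rel_dir + "/libdemo.so", image_root)
            for rel_dir in ("usr/lib", "tmp/lib")
        }


results = simulate()
print(results)
```

The copy into the pre-created directory succeeds, while the copy into the directory that was never created fails, so that library is absent from the resulting image.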

Remediation

Initially, the team considered avoiding the problematic feature entirely. It seemed like the most straightforward path forward — a workaround to bypass the issue. But this was a short-term band-aid that didn’t address the underlying problem. Without fixing the issue, the bug could resurface in future scenarios, possibly in ways that were harder to predict or control. Fixing the issue would ensure the entire system’s integrity for future operations.

It appears the Debian team encountered a similar issue in the past, as evidenced by the usage of ldd within copy_exec:

[Image: ldd usage within copy_exec]

The environment variable LD_PRELOAD is unset while using the ldd command.

LD_PRELOAD works very similarly to LD_LIBRARY_PATH, except that it points to a specific library rather than a directory of libraries.

So, to fix the bug we found, all that needs to be done is to add another flag where copy_exec invokes ldd:

--unset=LD_LIBRARY_PATH
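As a rough sketch of what the combined invocation could look like, the snippet below builds a command line that runs ldd with both variables removed. It assumes ldd is wrapped by the env utility, whose --unset option removes a variable from the child's environment; the exact call in copy_exec is shown only in the screenshot above, and ldd_command is a hypothetical helper.

```python
def ldd_command(binary, unset=("LD_PRELOAD", "LD_LIBRARY_PATH")):
    """Build an argv list that runs ldd with the given variables unset.

    Assumption: ldd is launched through the `env` utility, using one
    --unset=NAME option per variable.
    """
    return ["env", *[f"--unset={name}" for name in unset], "ldd", binary]


print(" ".join(ldd_command("/sbin/modprobe")))
```

With both variables unset, ldd resolves dependencies from the standard library paths regardless of what the calling process had in its environment.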

Security Perspective

As a security researcher investigating the situation, I was intrigued by the potential use of what I had found as an attack vector. The outcome would be a massive DoS attack on critical hosting services, a highly destructive endgame. However, logically, from my perspective, it’s not the most attractive tactic unless your goal from the outset is to shut down the entire operation.

Executing the attack would require very high-level permissions. As an attacker, if I had already gained access to those levels of credentials, I would have much more attractive options for an attack. I could use those permissions to access more lucrative systems, move laterally, or escalate permissions. I wouldn’t want to waste my access on an attack that would shut down the whole system, alerting the organization to an issue and potentially taking the system I have access to offline. So while this could technically be used as an attack, the more realistic outcome is precisely what happened here: a DevOps team accidentally creating these conditions, rather than a hacker actively and purposefully exploiting them.

Cross-Discipline Collaboration: A Blueprint for Resilience

This incident highlights how cross-discipline collaboration builds resilience at the organizational level. By combining the DevOps team’s operational expertise with the investigative mindset of security researchers, we avoided waiting on the Debian team for support. This approach allowed us to identify the underlying issue and develop a real fix rather than relying on a rough workaround.

For team leaders, the lesson is clear: resilience stems from encouraging diverse perspectives and fostering interdepartmental collaboration. In this case, it was security researchers teaming with DevOps, but the principle applies across any combination of specialized skill sets. Breaking down silos and inviting fresh viewpoints can transform challenges into opportunities, ensuring long-term solutions rather than quick fixes.

To make collaboration like this a repeatable process, leaders can take deliberate steps to institutionalize it. For example:

  • Establish cross-functional “tiger teams” to tackle high-priority problems that cut across disciplines
  • Create shared knowledge hubs where teams can document and exchange insights, tools, and strategies to address recurring challenges
  • Promote cross-training opportunities, so team members develop a baseline understanding of other disciplines, improving communication and trust when it’s time to collaborate

Resilience isn’t just about solving today’s problems — it’s about building systems and cultures that can adapt to tomorrow’s challenges. Strategic teamwork isn’t merely a “nice-to-have”; it’s how organizations thrive in an increasingly complex and unpredictable world.

Matan Liber is a Cyber Attack Team Lead, Security Researcher and exploit developer at Pentera. Prior to joining Pentera, Matan served in a classified unit in the IDF, specializing in malware analysis, reverse engineering and IR.
TNS owner Insight Partners is an investor in: Pentera.