URL: https://thenewstack.io/supercharge-your-disaster-recovery-plan-in-5-simple-steps/

⇱ How to Supercharge Your Disaster Recovery Plan - The New Stack


Cloud Native Ecosystem / Compliance / Data

How to Supercharge Your Disaster Recovery Plan

How can you safeguard your organization in the event of service disruption? Here are some steps you can take to prepare to recover quickly.
Feb 15th, 2022 6:21am by Maxim Melamedov
Feature image via Pixabay.
Zesty sponsored this post.
Maxim Melamedov
Maxim is the co-founder and CEO of Zesty. With over 13 years of experience in the tech industry, Maxim thrives on solving complex problems and disrupting previously established norms.

It’s every engineer’s worst nightmare: Your cloud provider has a sudden outage, causing your system to fail, your product to malfunction and angry customers to tweet up a storm that your service is down. Such disruptions can seriously damage your credibility as a business and call the reliability of your product into question.

This nightmare scenario is on the mind of every engineer, and major cloud providers are taking note. AWS CTO Werner Vogels has argued for a design-for-failure architecture, stating that a data center might be interrupted one day, no matter how good you or your cloud vendor are at data center operations. And of course, his prediction has come true.

We’ve seen even the largest and most successful companies fall victim to outages; the recent failures at Facebook, Slack and AWS are some of the most prominent examples. While not all outages can be attributed to the cloud, the recent AWS example shows that having viable, proactive business continuity (BCP) and disaster recovery (DR) plans, as well as runbooks for each, can make all the difference.

While BCP and DR are often grouped together, business continuity events tend to be more common, and planning for them less labor-intensive, than disaster recovery. BCP generally covers your typical cloud outage, while DR covers a situation in which your data is completely destroyed by malicious actors or other destructive events. For BCP plans, it’s usually adequate to have more than one copy of your data and servers, whereas DR plans require additional backups and protocols.

Two other important values to determine are your recovery point objective (RPO) and recovery time objective (RTO). RTO is the amount of time your business can afford to be offline while recovering from a disaster, whereas RPO is the amount of data, measured as a window of time (for example, the last 24 hours), that you can lose to a disaster without damaging your business’s reputation or breaching your service-level agreement (SLA).
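
To make these two objectives concrete, here is a minimal Python sketch that checks a recovery plan against them. The `RecoveryPlan` class, its field names and the sample numbers are illustrative assumptions, not part of any cloud provider's tooling. It assumes the worst case for data loss is one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    # Hypothetical plan description; field names are assumptions for this sketch.
    backup_interval: timedelta   # time between consecutive backups
    restore_duration: timedelta  # estimated time to restore and bring service back


def meets_objectives(plan: RecoveryPlan, rpo: timedelta, rto: timedelta) -> dict:
    # Worst case, the failure happens just before the next backup,
    # so up to one full backup interval of data is lost.
    worst_case_data_loss = plan.backup_interval
    return {
        "rpo_met": worst_case_data_loss <= rpo,
        "rto_met": plan.restore_duration <= rto,
    }


plan = RecoveryPlan(backup_interval=timedelta(hours=6),
                    restore_duration=timedelta(hours=3))
result = meets_objectives(plan, rpo=timedelta(hours=24), rto=timedelta(hours=2))
print(result)  # {'rpo_met': True, 'rto_met': False}
```

In this example, backing up every six hours comfortably satisfies a 24-hour RPO, but a three-hour restore misses a two-hour RTO, so the plan would need a faster recovery path.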

So now that these important factors have been established, how can you safeguard your organization in the event of another cloud outage or any other issue that may arise? Here are some steps you can take to prepare and restore your service once the worst-case scenario occurs.

Create a Multi-Availability-Zone Deployment

The easiest and most common architecture for BCP is to use at least two availability zones (AZs) within the same region. On AWS, for example, each region is made up of multiple AZs (typically three or more), which are located relatively close to one another and connected by dedicated, low-latency fiber. This allows you to keep your service afloat and continue serving your customers when one AZ fails.

Cloud providers tend to spread their services across multiple AZs (for example Amazon S3, Amazon DynamoDB, Google Cloud Spanner, etc.). Hence, they are built to handle AZ failure by design.

Take into consideration that such architecture may involve inter-AZ costs that need to be accounted for during the design phase.

Use a Single AZ in a Multiregion Deployment

In this scenario, you deploy your application and databases in one AZ in each of two different regions. This keeps your service available when one region experiences downtime.

These two AZs can be deployed in a few ways:

  1. Each region serves 50% of the workload, using a load balancer or DNS (Domain Name System) routing.
  2. The main region serves most or all of the traffic, and the second region stands by to serve users in case of a failure. If you choose this route, you may want to automate the failover.
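
The two options above can be sketched as a small function that decides what share of traffic each region receives. This is an illustration only: the function name, the mode names and the `eu-west-1` region are assumptions, and a real setup would express the same rules in a load balancer or DNS routing policy.

```python
def routing_weights(healthy: dict[str, bool], mode: str, primary: str) -> dict[str, float]:
    """Return the share of traffic each region should receive."""
    up = [region for region, ok in healthy.items() if ok]
    if not up:
        raise RuntimeError("no healthy region available")
    if mode == "active-active":
        # Split traffic evenly across every healthy region.
        return {r: (1 / len(up) if r in up else 0.0) for r in healthy}
    if mode == "active-passive":
        # Send everything to the primary; otherwise fail over to the first healthy region.
        target = primary if healthy.get(primary) else up[0]
        return {r: (1.0 if r == target else 0.0) for r in healthy}
    raise ValueError(f"unknown mode: {mode}")


# Both regions healthy: a 50/50 split.
print(routing_weights({"us-east-1": True, "eu-west-1": True}, "active-active", "us-east-1"))
# Primary region down: all traffic moves to the standby region.
print(routing_weights({"us-east-1": False, "eu-west-1": True}, "active-passive", "us-east-1"))
```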

Use a Multi-AZ and Multiregion Deployment

At the time of writing, the most recent AWS outage took place in the Northern Virginia (US-East-1) region and affected the entire region because of a networking problem. This was a rare situation in which a network failure affected more than one AZ. If all your essential workloads were running in that region, your services would have been out until the AWS services there were restored. Talk about putting all your eggs in one basket!

The best protection for such a scenario is to run different workloads and backups in various locations so if one region goes down, you can continue to serve your customers from a different region.

Of course, the more regions you’re running across, the more complexity is added to your environment and the more expensive your cloud bill can become. So be strategic about where and across how many regions you want to build, taking into consideration how critical your product actually is. Is it lifesaving? If so, you’ll need to go the extra mile to ensure your product functions in all situations, which warrants diversifying regions as much as possible.

Run on a Multicloud or Hybrid Deployment

Running on more than one cloud provider has become increasingly popular. But the reality is that it’s quite hard to maintain more than one cloud environment, and this challenge becomes even more difficult when you use managed services. For example, if you are using Amazon DynamoDB, other cloud providers offer comparable databases but not a drop-in replacement with the same API, so the data layer would have to be adapted for each cloud.

As a result, the industry trend is to run each workload on one specific cloud (on one multi-AZ or multiregion architecture) but to run different workloads on different clouds, meaning you split the service between two cloud providers. In such a scenario, not all your systems go down when one provider has an outage.

Another common practice is to have a DR site on premises. When customers migrate their workloads to the cloud, they tend to keep the on-premises environment as a DR site so that, if something goes wrong, they can run some of the critical services locally. This is a good practice when your company’s cloud maturity is in its initial stages and you are not using managed services.

Create Backups

While all the solutions above are ideal for BCP, they won’t cover you in a situation where your data is encrypted or deleted, which requires a solid DR strategy. So in addition to having more than one deployment of your service, you need a backup plan that meets the company’s RPO and RTO requirements.

There are many third-party backup solutions and cloud backup services available, for example, AWS Backup, that can assist you with automating backups and saving them in a separate account and region. You should guard the backup account like you’re protecting a vault so it can withstand a situation in which you’re being attacked and your data is encrypted.

If such an attack occurs, the data on the backup account will be used to restore your services so you can continue servicing customers.
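
As a rough illustration of restoring from the backup account, the sketch below picks the newest backup taken before the data was compromised and reports whether using it stays within the RPO. The function name and the timestamps are hypothetical. It assumes the time of compromise is known, which in practice takes investigation.

```python
from datetime import datetime, timedelta
from typing import Optional


def pick_restore_point(backups: list[datetime], compromised_at: datetime,
                       rpo: timedelta) -> tuple[Optional[datetime], bool]:
    """Return the newest backup taken before the compromise, and whether it meets the RPO."""
    # Backups taken at or after the compromise may already contain encrypted data.
    clean = [b for b in backups if b < compromised_at]
    if not clean:
        return None, False
    newest = max(clean)
    return newest, (compromised_at - newest) <= rpo


backups = [datetime(2022, 2, 13), datetime(2022, 2, 14), datetime(2022, 2, 15)]
incident = datetime(2022, 2, 14, 18, 0)
restore_point, within_rpo = pick_restore_point(backups, incident, rpo=timedelta(hours=24))
print(restore_point, within_rpo)  # 2022-02-14 00:00:00 True
```

Here the backup from Feb. 15 is skipped because it was taken after the incident, and restoring the Feb. 14 backup loses 18 hours of data, which is inside a 24-hour RPO.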

Now that we’ve covered the most common methods for constructing a BCP and DR plan for cloud environments, let’s discuss the tools that can be used to implement these methods:

Leverage Infrastructure as Code

Infrastructure as code (IaC) enables the automated configuration of your environment. Once you define the parameters you want to use, they are saved in a definition file, often called a manifest. From there, your environment can be automatically recreated for testing, disaster recovery or a variety of other situations.
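
The idea can be shown with a toy example: one manifest, expanded into the same set of resources for whichever region you point it at. This is not a real IaC tool, and the manifest keys, the `render` function and the zone suffixes are assumptions made for illustration. In practice, a dedicated IaC tool would play this role.

```python
import json

# Hypothetical manifest; the keys are assumptions made for this sketch.
MANIFEST = json.loads("""
{
  "service": "web",
  "instance_type": "m5.large",
  "instance_count": 3,
  "availability_zones": 2
}
""")


def render(manifest: dict, region: str) -> list[dict]:
    """Expand the manifest into the concrete resources for one region."""
    zones = [f"{region}{suffix}" for suffix in "abcdef"[: manifest["availability_zones"]]]
    return [
        {
            "name": f"{manifest['service']}-{i}",
            "type": manifest["instance_type"],
            # Spread instances across zones round-robin.
            "zone": zones[i % len(zones)],
        }
        for i in range(manifest["instance_count"])
    ]


# The same manifest recreates an identical layout in a different region.
for resource in render(MANIFEST, "eu-west-1"):
    print(resource)
```

Because the output depends only on the manifest and the region name, rerunning it during a recovery produces the same environment every time.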

Use Scaling Rules for Containers

If you’re using containers, implementing scaling rules based on various metrics can be tremendously helpful. You can scale up, which gives each existing container or instance more resources, or scale out, which adds duplicate instances. With scaling rules in place, the locations that remain healthy during an outage can automatically grow to absorb the workloads of the one that failed. Ideally, you’d scale both up and out at the container and instance level for this to be most effective.
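
A scale-out rule can be as simple as a proportional formula. The sketch below follows the general form of the rule documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric)). The function name, the default limits and the sample numbers are assumptions, and your orchestrator's actual behavior may differ.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Scale out proportionally to how far the metric is from its target."""
    wanted = math.ceil(current * metric / target)
    # Clamp to the configured bounds so a metric spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, wanted))


# Four replicas at 90% CPU against a 60% target: scale out to six.
print(desired_replicas(current=4, metric=90.0, target=60.0))  # 6
# Load drops to 30%: scale back in to two.
print(desired_replicas(current=4, metric=30.0, target=60.0))  # 2
```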

Reroute DNS Requests

If servers in one location are down, you can reroute all requests to various other locations where you’re running your services. DNS providers such as Cloudflare can be configured to detect when a system is down and automatically perform geolocation-based rerouting.

Likewise, you can set up a container orchestrator such as Amazon ECS to define and automate rules for rerouting requests. We also recommend implementing autoscaling so that the locations receiving the rerouted traffic have enough capacity.
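
The detection-and-reroute logic described in this section can be sketched as follows. This does not use any DNS provider's or orchestrator's API. The `FailoverRouter` class, the threshold of three consecutive failed checks and the location names are all assumptions for illustration.

```python
from typing import Optional


class FailoverRouter:
    """Route to the first location that has not failed too many consecutive checks."""

    def __init__(self, locations: list[str], failure_threshold: int = 3):
        self.locations = locations  # listed in priority order
        self.failure_threshold = failure_threshold
        self.failures = {loc: 0 for loc in locations}

    def record_check(self, location: str, ok: bool) -> None:
        # A successful check resets the counter; a failed one increments it.
        self.failures[location] = 0 if ok else self.failures[location] + 1

    def target(self) -> Optional[str]:
        for loc in self.locations:
            if self.failures[loc] < self.failure_threshold:
                return loc
        return None  # every location is considered down


router = FailoverRouter(["us-east-1", "eu-west-1"])
for _ in range(3):
    router.record_check("us-east-1", ok=False)
print(router.target())  # eu-west-1
router.record_check("us-east-1", ok=True)
print(router.target())  # us-east-1
```

Requiring several consecutive failures before rerouting is a common way to avoid flapping on a single dropped health check, at the cost of a slightly slower failover.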

Set Up a Pilot Light to Run in Multiple Locations

Another recommended business continuity strategy is to run a pilot light: a minimal, replicated version of your workload kept on standby in a different region. If a disaster occurs, all your data will be sitting there, ready to use. Deploy the rest of your infrastructure and scale your resources after an incident, and your product should be up and running without too much delay.

If you can afford only minimal downtime, you may want to consider running a warm standby instead. According to AWS, “The warm standby approach involves ensuring that there is a scaled-down, but fully functional, copy of your production environment in another Region. This approach extends the pilot light concept and decreases the time to recovery because your workload is always-on in another Region.”

While significantly more expensive, a warm standby will be up and running faster than a pilot light, as no infrastructure setup is needed.

Zesty is the first cloud management platform that leverages AI-driven automation to increase efficiency and reduce AWS spend by over 50%. The platform automatically adjusts compute and storage resources to match real-time application needs, with no human input.

Summing Up

“Hope for the best and prepare for the worst” is one of the most important rules to live by when it comes to disaster recovery. There is no such thing as being too prepared, as cloud outages can happen to the best of us with no notice whatsoever.

By following the above strategies, you can help ensure your service is available in multiple locations, can easily divert traffic to unaffected regions, and is backed up and ready for action should disaster strike. Best of all, you can breathe easier knowing you are far less likely to be woken in the middle of the night for emergency configuration changes, and your business’s functionality and credibility are better protected.

TNS owner Insight Partners is an investor in: Pragma, Simply.