Disaster recovery

Disaster recovery (DR) for Azure Databricks replicates workspaces, data, and configurations across cloud regions so your teams keep working when a regional outage takes your primary deployment offline. A complete DR plan covers not only Azure Databricks but the data sources, ingestion tools, BI tools, and schedulers it connects to.

This page covers the concepts, strategies, tooling, and test procedures you need to design and run a cross-region DR solution.

New to DR planning? Start with Disaster recovery industry terminology for definitions of RPO and RTO.

Important

Use managed disaster recovery. Azure Databricks recommends managed disaster recovery for cross-region DR on AWS and Azure. It replicates Unity Catalog metadata, managed table data, and workspace assets on a continuous schedule, provides a stable URL that survives failover, and lets you trigger failover from the account console. You don't need to write or maintain replication scripts. Use the DIY guidance on this page only for resources that managed DR doesn't replicate, or if you require active-active topologies, cross-cloud replication, or fine-grained control over the replication pipeline.

Intra-region high availability guarantees

The rest of this page covers cross-region DR, but Azure Databricks also provides high availability (HA) inside a single region. Understand these guarantees first. They determine whether you need a separate DR strategy.

HA and DR solve different problems:

  • HA uses availability zone (AZ) redundancy inside a region. If one zone fails, services keep running in the others.
  • DR uses inter-region replication. You run secondary Azure Databricks workspaces in another region and replicate data and configurations to them, then fail over during a regional outage.

If you don't need multi-region DR, Azure Databricks HA might be enough. HA avoids cross-region complexity but doesn't protect against a full-region outage. If you rely on HA alone for DR, verify how your cloud region separates its availability zones and what redundancy it provides.

Intra-region HA guarantees cover the control plane and the compute plane.

Terminology

Use these definitions consistently when discussing DR with your team.

Typical recovery workflow

An Azure Databricks DR scenario typically plays out as follows:

  1. A failure hits a critical service in your primary region: a data source, a network, or another dependency the Azure Databricks deployment relies on.
  2. You investigate with your cloud provider.
  3. If the wait is unacceptable, you decide to fail over to your secondary region.
  4. Confirm the same problem doesn't affect your secondary region.
  5. Fail over (for detailed steps, see Test failover):
    1. Stop all workspace activity. Users stop workloads and back up recent changes where possible. Shut down jobs, unless the outage has already caused them to fail.
    2. Run the secondary-region recovery procedure to update routing and redirect connections and network traffic.
    3. Repoint downstream systems (BI tools, schedulers, third-party integrations) to the secondary workspace and resume their connections.
    4. After testing, declare the secondary region operational. Users log in to the now-active deployment, and you retrigger scheduled or delayed jobs.
  6. After the primary-region issue is mitigated, confirm the fix.
  7. Fail back (for details, see Test restore (failback)):
    1. Stop all work in the secondary region.
    2. Run the primary-region recovery procedure to redirect routing back.
    3. Replicate any new data back to the primary region. Minimize what needs to replicate. For example, read-only jobs that ran in the secondary deployment might not require write-back.
    4. Test the primary-region deployment.
    5. Declare the primary region active and resume production workloads.

Important

Some data loss can occur during these steps. Define how much loss is acceptable for your organization, and how you mitigate it.
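
The failover steps in the workflow above run in a fixed order, and a failed step should stop the sequence so that later steps don't run against a half-switched deployment. The following is a minimal sketch of that idea, not a definitive implementation. The function `run_runbook`, the step names, and the placeholder actions are all hypothetical; in a real runbook each action would call your own tooling.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first step that reports failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # Do not continue past a failed step.
        completed.append(name)
    return completed


# Hypothetical placeholder actions; each lambda stands in for real tooling.
failover_steps: List[Step] = [
    ("confirm secondary region is unaffected", lambda: True),
    ("stop workspace activity", lambda: True),
    ("update routing to secondary region", lambda: True),
    ("repoint downstream systems", lambda: True),
    ("test and declare secondary operational", lambda: True),
]

completed_steps = run_runbook(failover_steps)
```

The returned list gives you a record of how far the procedure got, which is useful when you resume after fixing the step that failed.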

Step 1: Understand your business needs

Identify which data services are critical and define their target RPO and RTO. Research each system's real-world tolerance.
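
As an illustration of how RPO and RTO targets translate into checks, the sketch below treats RPO as the maximum acceptable window of lost data (time between the last successful replication and the outage) and RTO as the maximum acceptable downtime. The function names and the timestamps are hypothetical and only show the arithmetic.

```python
from datetime import datetime, timedelta


def meets_rpo(last_sync: datetime, outage_start: datetime, rpo: timedelta) -> bool:
    """True if the data written since the last sync fits inside the RPO window."""
    return outage_start - last_sync <= rpo


def meets_rto(outage_start: datetime, service_restored: datetime, rto: timedelta) -> bool:
    """True if the service came back within the RTO window."""
    return service_restored - outage_start <= rto


# Hypothetical example: hourly replication, 15-minute RPO, 4-hour RTO.
last_sync = datetime(2024, 1, 1, 12, 0)
outage_start = datetime(2024, 1, 1, 12, 40)
service_restored = datetime(2024, 1, 1, 15, 0)

rpo_ok = meets_rpo(last_sync, outage_start, timedelta(minutes=15))
rto_ok = meets_rto(outage_start, service_restored, timedelta(hours=4))
```

In this example the outage starts 40 minutes after the last sync, so a 15-minute RPO is missed even though the 4-hour RTO is met. That gap suggests replicating more often rather than speeding up failover.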

DR, failover, and failback carry real costs and risks, including data corruption, data duplication (writing to the wrong storage location), and users making changes in the wrong region.

Map every Azure Databricks integration point that affects your business, and choose the tools and communication channels your plan uses.

Step 2: Choose a process that meets your business needs

Default to managed disaster recovery. It handles workspace replication, Unity Catalog metadata, managed-table data, and failover orchestration without custom scripts. Use the DIY guidance below only if you fall outside its scope, for example, resources managed DR doesn't replicate, active-active topologies, cross-cloud replication, or fine-grained control over the replication pipeline.

A DIY solution must replicate the correct data across the control plane, compute plane, and data sources. Redundant workspaces map to different control planes in different regions, so you keep them in sync with a script-based solution, either a synchronization tool or a CI/CD workflow. For the data itself, most teams use Azure Databricks jobs (often scheduled) or Delta Deep Clone to copy tables between regions. You don't need to sync data from within the compute plane (such as from Databricks Runtime workers).
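
For the Delta Deep Clone approach mentioned above, a scheduled job can issue one clone statement per table. The sketch below only builds the SQL text. The helper `deep_clone_statement` and the table names are hypothetical, and it assumes the target catalog in the secondary region is already reachable from the job.

```python
def deep_clone_statement(source_table: str, target_table: str) -> str:
    """Build a Delta deep clone statement that copies source into target.

    CREATE OR REPLACE makes the statement safe to rerun on a schedule.
    """
    return f"CREATE OR REPLACE TABLE {target_table} DEEP CLONE {source_table}"


# Hypothetical table names, for illustration only.
tables_to_replicate = {
    "prod.sales.orders": "dr.sales.orders",
    "prod.sales.customers": "dr.sales.customers",
}

statements = [
    deep_clone_statement(source, target)
    for source, target in tables_to_replicate.items()
]

# In an Azure Databricks notebook or job, each statement could be run with:
#   spark.sql(statement)
```

Keeping the table mapping in one place makes it easier to review which tables are in scope for replication when you test the plan.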

If you use the VNet injection feature (not available with all subscription and deployment types), deploy networks consistently in both regions using template-based tooling like Terraform.

Replicate your data sources across regions as needed.

DR solutions typically involve two (or more) workspaces. Choose between the following strategies based on the disruption length you must tolerate, operational effort, and the cost to fail back to the primary region.

Step 3: Prep workspaces and do a one-time copy

First, stand up a secondary Azure Databricks workspace (or workspaces) and its supporting metastore in your chosen secondary region. The secondary workspace must mirror the primary's account, region, and identity configuration before you can replicate data or assets to it.

If you use managed disaster recovery, Azure Databricks handles the initial bootstrap of in-scope catalogs and workspace assets when you create a failover group. You don't need to run a one-time copy for those resources. Continue with the rest of this section for any data sources or assets that managed DR doesn't replicate.

For a production workspace running outside managed DR's scope, run a one-time copy to sync the passive deployment with the active deployment. This copy handles:

  • Data replication: Use a cloud replication solution or Delta Deep Clone.
  • Token generation: Automate replication and future workloads with generated tokens.
  • Workspace replication: Replicate using the methods in Step 4: Prepare your data sources. For comprehensive guidance on exporting workspace configuration, data, and AI/ML assets, see Export workspace data.
  • Workspace validation: Test the workspace and process to confirm they execute successfully and produce the expected results.

Subsequent syncs run faster than the initial copy, and your tooling logs record what changed and when.

Step 4: Prepare your data sources

Azure Databricks can process a large variety of data sources using batch processing or data streams.

Step 5: Implement and test your solution

If you use managed disaster recovery, you can trigger a planned failover from the account console to validate that your setup works end to end. The same procedure covers both DR tests and real outages. See Fail over and fail back.

Test your DR setup regularly. An untested DR plan often fails when you need it. Some teams switch active regions every few months on a schedule to validate assumptions, exercise processes, and keep the team familiar with the runbook.

Important

Test your DR solution in real-world conditions on a regular schedule.

If a test reveals a missing object or template, update your plan: remove the dependency, replicate it to the secondary workspace, or make it available another way.

Test the organizational and configuration changes too. Your DR plan affects your deployment pipeline, so the team must know what to keep in sync. After you set up DR workspaces, confirm that your infrastructure, jobs, notebooks, libraries, and other workspace objects are available in the secondary region.
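
One way to confirm that workspace objects are available in the secondary region is to compare inventories from both workspaces. The sketch below assumes you have already exported each workspace's object paths into a set by whatever tooling you use. The function name and the sample paths are hypothetical.

```python
from typing import List, Set


def missing_in_secondary(primary: Set[str], secondary: Set[str]) -> List[str]:
    """Return objects that exist in the primary inventory but not the secondary."""
    return sorted(primary - secondary)


# Hypothetical inventories of notebook, job, and library identifiers.
primary_objects = {
    "/Shared/etl/ingest",
    "/Shared/etl/transform",
    "job:nightly-load",
    "library:internal-utils",
}
secondary_objects = {
    "/Shared/etl/ingest",
    "job:nightly-load",
}

gaps = missing_in_secondary(primary_objects, secondary_objects)
```

Anything in `gaps` is a candidate for the actions described above: remove the dependency, replicate it to the secondary workspace, or make it available another way.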

Expand your standard work processes and configuration pipelines to deploy changes to all workspaces. Manage user identities across workspaces, and configure job automation and monitoring for the new workspaces.

Plan and test changes to your configuration tooling.

Automation scripts, samples, and prototypes

For AWS and Azure, managed disaster recovery handles workspace and managed-table replication without custom automation. The references below apply only if you're building a DIY solution outside managed DR's scope.

For DIY DR pipelines, use the Databricks Terraform provider to manage workspace assets as code and co-deploy to primary and secondary regions.

If you orchestrate Azure Databricks from Azure Data Factory, replicate the relevant ADF pipelines so they refer to a linked service mapped to the secondary workspace.
