VOOZH about

URL: https://thenewstack.io/how-doordash-migrated-from-aurora-postgres-to-cockroachdb/

⇱ How DoorDash Migrated from Aurora Postgres to CockroachDB - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2023-12-05 03:00:19
How DoorDash Migrated from Aurora Postgres to CockroachDB
sponsor-cockroach-labs,sponsored-topic,
Data

How DoorDash Migrated from Aurora Postgres to CockroachDB

At the start of the pandemic, the food-ordering app crashed under overwhelming demand. Moving to a distributed SQL database management system made it more resilient.
Dec 5th, 2023 3:00am by Heather Joslyn
👁 Featued image for: How DoorDash Migrated from Aurora Postgres to CockroachDB
Images by Heather Joslyn.
The author of this article moderated a panel at RoachFest23.

NEW YORK — If your customer-facing application fails, the only silver lining might be if your direct competitors suffer the same calamity at the same time.

That day arrived for DoorDash on Friday, April 17, 2020 — roughly a month into the global pandemic lockdown that caused demand for online food ordering services like DoorDash to explode.

The company’s main database cluster peaked at 1.636 million QPS. Buckling under the load of unprecedented demand, it “crashed and burned,” resulting in hours of downtime, Allessandro Salvatori, a principal engineer at DoorDash, told an audience at RoachFest23, Cockroach Labs’ user conference, held here in October.

The only bright side? “Our two main competitors also went down,” Salvatori said, to booming laughter from his audience.

But it wasn’t funny, he noted, to the establishments then relying on DoorDash and similar online services to keep them afloat during the pandemic: “Put yourselves in the shoes of the poor restaurant owners … They didn’t have any bookings, any on-premises business — and they didn’t have online orders, either.”

DoorDash took the lessons it learned from the incident as it transitioned from Amazon Aurora Postgres to CockroachDB. At RoachFest, Salvatori — and his colleague, Michael Czabator, a DoorDash staff engineer — told how their company made the move, and how it helped make the business’s flagship application more resilient and consistent, improving availability for its customers.

A ‘Poor Man’s Solution,’ Until a Crisis Hits

When DoorDash began a decade ago, it was built around a Django monolith that talked to a single, Aurora database cluster, which offered a single writer. By the time Salvatori joined the company five years ago, the company had broken some parts of the monolith into microservices and added a few more database clusters.

“But their destiny was already written,” Salvatori said. “Due to our exponential growth, they soon would be in the same spot as our main databases.”

Until the monolith was broken up, it offered a single view of the toll that demand for the application was taking on the databases. But once that monolith was broken into microservices, that visibility would disappear.

“Our biggest enemy was the single primary architecture of our database,” Salvatori said. “And our North Star would be to move to a solution that offered multiple writers.”

In the meantime, the DoorDash team adopted a “poor man’s solution,” approach to dealing with its overmatched database architecture, Salvatori told the Roachfest audience: building vertical federation of tables, while not blocking microservices extractions.

In this game of “whack-a-mole,” he said, “Different tables would be able to get their own single writer and therefore scale a little bit and allow us to keep the lights on for a little bit longer. But we needed to take steps toward limitless horizontal scalability.”

Cockroach, a distributed SQL database management system, seemed like the right answer. The April 17 disaster brought new urgency to the transition plans.

Building a Tool for Extracting Data

To solve the immediate crisis, DoorDash’s engineers made use of “escape hatches” in the core object-relational mapper (ORM) and hacks “deep down in the guts of the Django ORM,” as Salvatori put it, to gain some short-term headroom. Then, the team quickly built a prototype tool, ready by the end of that month, to extract table extractions.

As a guinea pig, he said, DoorDash used its critical identity table: “If that didn’t work, no consumer, no Dasher, no merchant would be able to log in to our website or our mobile apps. The name of the game was to be able to revert back to the previous source of truth if everything didn’t go as planned.”

Four unsuccessful attempts followed; on the fifth try, the cutover to a new cluster was successful. Over the next 33 days, the team extracted seven databases and 34 tables, moving them to seven new database clusters. The transition brought the total QPS on the main database cluster down to 100,000.

The motivation for investing in the purpose-built extraction tool — which flowered into a collection of tools — was clear from a business standpoint. Without it, “we would have had to throw several bodies at the problem,” Salvatori said, citing a previous example when it took six engineers six months to extract a single table from the legacy database.

DoorDash’s bespoke extraction tool, he said, was designed to not only minimize labor but also to be constantly improved and enhanced. It was also given a UI that gave users visibility and allowed them to edit the SQL that would be generated from extraction.

It also was designed to be operated safely across databases used in production, with an emergency revert function, that allowed the data to switch back to the legacy database in as few clicks as possible.

The extraction tool, he said, also had other use cases, such as local migrations, repacking a table in a suspendable way, performing “hitless” upgrades (without degrading service or performance) — and extracting tables to CockroachDB.

A Strategy for Database Migration

As DoorDash began to undertake the migration to CockroachDB, Salvatori said, it prepared a strategy to avoid the common pitfalls of a traditional database migration. Among the pillars of that strategy:

Tail changes directly from information in database tables, using the custom tool “We had a lot of clusters on Aurora 9.6 that had no logical replication,” Salvatori said. Even though logical replication was available on Aurora 10, “that comes with a 40% on our single primary, which was our biggest pain point.”

The team wanted the ability to suspend the replication if it needed to deal with a problem. Without tailing changes in this manner, he said. it would “run the risk of the OOM killer kicking in, preventing the vacuum from making progress and eventually running into transaction wraparound issues.”

Use pulls instead of pushes. The team wanted to be able to suspend a migration and resume it later.

Maintain a feedback loop. “We wanted to be able to slow down if the health of either the source of the destination database is showing some cracks,” Salvatori said.

Spread out batches. Use an auto-tunable batch size. Activity is spread out its activity over ranges of the tables.

Keep it simple. The team maintained a single source of truth: Before extraction, the source of truth would be the “old” database. After extraction, the source of truth would be the new database.

Also, it didn’t replay every single change, but rather captured only what had been inserted, updated or deleted, fast-forwarding to the latest value.

When the cutover to CockroachDB occurred, it took three attempts to succeed. Salvatori showed the Roachfest audience a video screen capture of the moments leading up to the successful migration — and the triumphant aftermath.

“This is literally a lossless cutover, as far as I can tell,” one engineer said in the video, after the migration worked.

Handling 1.9 Petabytes of Data

So how is it working? Czabator, who joined DoorDash in February (from Cockroach) and works on its storage infrastructure team, gave the RoachFest audience an update.

DoorDash currently runs Cockroach in a single AWS region, running three AZs in that region. Since the fall of 2022, it has seen the number of nodes it runs increase by 55% (from about 1,500 to more than 2,300). However, the amount of data it handles has more than doubled, from just under 1 to 1.9 petabytes.

👁 Photo of Michael Czabator of DoorDash at Roachfest23

Michael Czabator of DoorDash told the RoachFest23 audience that his company has seen a decline in the volume of alerts about incidents, despite running dozens more clusters than it did before migrating to CockroachDB.

The online food ordering company’s teams use a self-service console to manage CockroachDB users and to handle schema management, index operations and changefeeds.

DoorDash has also developed a control plane for automation and operational tasks within CockroachDB and other technologies, Czabator said. The control plan leverages Argo workflows for things like repaving (replacing all the nodes in a cluster with new ones), in-place upgrades, scaling up or down, operations monitoring, managing settings and users, and running health checks.

“This has been a really big win for us,” he said. “Historically some of these operations have been vectors for incidents. When you have a human involved, things can get messed up. Sometimes there’s too much data, too much to interpret.”

The time saved has been significant, he said. For instance, manually repaving all nodes in production would take 100 days of work. With the automation, it takes only 10.

Among other results:

  • Scaling, using Amazon Web Services autoscale groups, has also become easier. When a certain level of CPU is used, three new nodes — one for each AZ — are autogenerated within two minutes, with traffic flowing to the new nodes.
  • Over the last year, despite the addition of about 70 clusters, the volume of alerts the teams received about incidents declined.
  • DoorDash has piloted using AWS EC2 m7g incidences, powered by AWS Graviton processors, in production alongside its CockroachDB ARM build, seeing better query throughput and significant declines in latency: about 35% in average p999 latency, and 45% in average P99 latency. The cost of using EC2 m7g incidences also saves DoorDash about 15% in cost compared to AWS EC2 m6i incidences, which are powered by Intel processors.

One of DoorDash’s core values, Czabator said, is “1% better every day.”

That means constant iteration, he said: “One of the ways that manifests in our organization is by continually refining the alarms, alerts, the metrics we have around Cockroach to improve the service to our customers.”

Cockroach Labs makes CockroachDB, the most highly evolved cloud-native, distributed SQL database on the planet. It helps companies of all sizes — and the apps they develop — scale fast, survive disaster, and thrive everywhere.
Learn More
The latest from Cockroach Labs
TRENDING STORIES
Heather Joslyn is the former editor-in-chief of The New Stack. She previously worked as editor-in-chief of Container Solutions, a Cloud Native consulting company, and as an editor/reporter at The Chronicle of Philanthropy and the Baltimore City Paper.
Read more from Heather Joslyn
SHARE THIS STORY
TRENDING STORIES
Amazon Web Services and Cockroach Labs are sponsors of The New Stack.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.