Message queues have served us well for two decades. But as distributed systems grow more complex — and as AI agents start running for hours or even days — developers are discovering that queues were never really designed for the job we’ve been forcing them to do.
If you’ve ever spent a late night debugging a payment pipeline that silently dropped transactions, chasing a message that got lost between services, or hand-rolling yet another database table to track saga state, this article is for you. We’re going to look at why queues fall short, what durable execution actually means, and why platforms like Temporal, Conductor, and Restate are quietly replacing entire categories of infrastructure glue code.
The Queue Trap We All Fell Into
Let’s be honest: message queues are brilliant tools. Amazon SQS, for instance, handled over 70 million messages per second at peak during Prime Day 2022. RabbitMQ and ActiveMQ have powered real-time systems reliably for years. For simple fire-and-forget tasks — sending an email, resizing an image — a queue is perfectly adequate and, frankly, probably the right choice.
However, queues were designed to move messages, not to manage workflows. That’s a subtle but crucial distinction. And as soon as a business process spans multiple services, involves conditional branching, or needs to run for more than a few seconds, you start bolting things onto your queue setup that were never part of its original design.
Specifically, you start adding:
- A database table to track “where are we in the process?”
- Dead-letter queues (DLQs) to catch failed messages
- Custom retry logic with exponential backoff
- Cron jobs to re-trigger stalled workflows
- Idempotency keys to avoid double-processing
- Observability tooling because you can’t see inside a queue
Before long, your “simple queue setup” is a distributed state machine held together with duct tape and hope. And when something breaks at 2:47 AM — and it will — you’re left manually reconciling state across six services.
Message queues provide the wrong level of abstraction. They focus on individual events rather than the complete, end-to-end business process. Every time you reach for a DLQ or a state-tracking table, you’re patching around a missing abstraction, not solving the real problem.
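To make that patchwork concrete, here is a minimal sketch of the glue a queue consumer typically accumulates: an idempotency check plus retries with exponential backoff. Every name in it (`handle_message`, `processed_keys`, `flaky_charge`) is hypothetical, and an in-memory set stands in for the idempotency table a real system would keep in a database.

```python
import time

processed_keys = set()  # stand-in for an idempotency table in a database


def backoff_delays(base=0.01, factor=2, attempts=4):
    """Yield exponentially growing delays: base, base*factor, and so on."""
    for attempt in range(attempts):
        yield base * (factor ** attempt)


def handle_message(message, process, sleep=time.sleep):
    """Process a message at most once, retrying transient failures."""
    key = message["idempotency_key"]
    if key in processed_keys:
        return "duplicate"  # already handled; skip to avoid double-processing
    for delay in backoff_delays():
        try:
            process(message)
            processed_keys.add(key)
            return "ok"
        except ConnectionError:
            sleep(delay)  # wait longer after each failed attempt
    return "dead-letter"  # retries exhausted; a caller would route this to a DLQ


calls = {"count": 0}


def flaky_charge(message):
    """Hypothetical handler that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("payment service unavailable")


msg = {"idempotency_key": "order-42", "amount": 1999}
first = handle_message(msg, flaky_charge, sleep=lambda _: None)
second = handle_message(msg, flaky_charge, sleep=lambda _: None)
print(first, second)
```

Even this small sketch leaves open where the process is after a crash mid-retry, which is exactly the state a queue does not track for you.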
Enter the Saga Pattern — And Its Own Complications
The software industry’s answer to distributed transactions was the Saga pattern. Instead of one big atomic transaction (which falls apart across microservices), you break the work into a sequence of smaller steps. Each step has a corresponding “compensating action” that can undo it if something fails later.
Conceptually, sagas are elegant. In practice, however, they introduce a whole new layer of complexity. Consider what you actually need to implement and maintain: compensating transactions for every step, idempotency guarantees so retries don’t double-charge customers, monitoring and tracing across services, and robust handling of “partial execution” states where, say, the stock has been reserved but the payment hasn’t cleared.
As Microsoft’s Azure Architecture documentation notes, debugging sagas becomes markedly harder as the number of participating services increases. Compensating transactions don’t always succeed, which can leave the system in an inconsistent intermediate state that requires manual intervention.
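The core of the pattern fits in a few lines. The sketch below is a toy, in-process saga runner, not any library's API: `run_saga` and the step names are hypothetical, and it assumes every compensation succeeds, which real systems cannot take for granted.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps on failure."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception as error:
            # Undo in reverse order, newest completed step first.
            for undo in reversed(completed):
                undo()
            return f"rolled back: {error}"
    return "committed"


log = []


def fail_payment():
    """Hypothetical payment step that always fails."""
    raise RuntimeError("payment declined")


steps = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (fail_payment, lambda: log.append("refund payment")),
]
result = run_saga(steps)
print(result, log)
```

Note that the failed step's own compensation is never run; only steps that completed are undone. Spreading this loop across several services and queues is where the extra complexity comes from.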
Think of it this way: With a queue-based saga, you’re essentially building a workflow engine from scratch — one step at a time, scattered across multiple services, with no central view of what’s happening. Durable execution gives you that engine off the shelf.
Queues vs. Durable Execution: A Direct Comparison
Before diving deeper into how durable execution works, it’s worth laying out the differences side by side. This comparison covers a multi-step workflow — say, an order fulfillment process that touches inventory, payment, and shipping services.
| Capability | Message Queue (SQS/RabbitMQ) | Durable Execution (Temporal/Conductor) |
|---|---|---|
| State persistence across crashes | Must build yourself | Automatic, built-in |
| Workflow visibility / observability | Requires external tooling | Native execution history |
| Long-running workflows (days/weeks) | Awkward — needs DB state table | First-class support |
| Automatic retries with backoff | Partial — DLQ + custom logic | Configurable per activity |
| Saga / compensating transactions | Manual implementation | Native, straightforward |
| Replay from point of failure | Not supported | Core feature (event replay) |
| Scheduling / timers | External cron jobs | Built-in durable timers |
| Operational complexity | Low (for simple tasks) | Higher initial setup |
What “Durable Execution” Actually Means
The term sounds abstract, so let’s ground it. Durable execution means your code is crash-proof by design. You write a workflow function in your normal programming language — Python, Go, Java, TypeScript — and the platform guarantees it runs to completion, even if servers crash, networks fail, or deployments happen mid-execution.
The key mechanism is event history replay. Every step your workflow takes gets persisted as an event. If a worker process dies halfway through a ten-step workflow, the system replays those events on a new worker and resumes exactly where it left off — with no re-execution of already-completed steps and no lost state. As Temporal’s co-founder Maxim Fateev described it, the goal is a “fault-oblivious stateful execution environment”: you write code as if failures don’t exist, and the platform handles the rest.
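A toy model helps show what replay means. The sketch below is an illustration under simplifying assumptions, not how Temporal or any other platform is implemented: `ReplayContext`, `Crash`, and the step names are hypothetical, and a plain Python list stands in for the persisted event history.

```python
class Crash(Exception):
    """Simulates a worker process dying mid-workflow."""


class ReplayContext:
    def __init__(self, history):
        self.history = history  # persisted list of (step name, result)
        self.position = 0       # how far this run has progressed

    def step(self, name, fn):
        if self.position < len(self.history):
            # Step already completed in an earlier run: reuse its result.
            recorded_name, result = self.history[self.position]
            if recorded_name != name:
                raise RuntimeError("workflow diverged from recorded history")
        else:
            # New step: execute it and persist the outcome.
            result = fn()
            self.history.append((name, result))
        self.position += 1
        return result


executed = []


def activity(name, crash=False):
    def run():
        if crash:
            raise Crash(name)
        executed.append(name)
        return f"{name}: done"
    return run


def workflow(ctx, crash_at_ship):
    ctx.step("charge", activity("charge"))
    ctx.step("reserve", activity("reserve"))
    return ctx.step("ship", activity("ship", crash=crash_at_ship))


history = []
try:
    workflow(ReplayContext(history), crash_at_ship=True)
except Crash:
    pass  # the worker died after two steps were persisted

result = workflow(ReplayContext(history), crash_at_ship=False)
print(result, executed)
```

The second run walks through the same workflow function from the top, but `charge` and `reserve` are answered from history rather than executed again, so each activity runs exactly once.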
[Figure: Temporal Growth Metrics, Series B to Series D]
This is fundamentally different from a queue. A queue tells you “this message was received.” A durable execution platform tells you “this step completed, here’s what it returned, here’s what happened next, and if anything failed, here’s exactly where and why.” That difference matters enormously when something goes wrong in production.
The Three Main Players
Temporal — The Battle-Hardened Pioneer
Temporal is the most mature player in this space, born from Uber’s internal Cadence project and spun out as an independent company in 2019. You write workflows as code — actual functions in your language of choice — and Temporal handles persistence, retries, timeouts, and state management transparently.
In February 2026, Temporal raised $300M at a $5B valuation, led by Andreessen Horowitz, with participation from Sequoia, Lightspeed, and others. OpenAI, Netflix, Snap, Datadog, and Nordstrom are among its notable customers. Its platform has processed 9.1 trillion lifetime action executions.
One trade-off worth knowing: Temporal embeds orchestration logic directly in code. This means developers need to be careful to avoid non-deterministic operations — things like reading the current time, using random values, or making uncontrolled external calls — inside workflow functions. Break this rule and you risk subtle replay failures that are genuinely hard to debug.
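The reason is easiest to see in miniature. In the sketch below, a hypothetical clock returns a different hour on each read, standing in for "the original run happened in the morning, the replay happens in the evening". The helper names (`side_effect`, `make_clock`) are assumptions made for illustration and are not Temporal's API; Temporal's SDKs provide their own workflow-safe alternatives for these operations.

```python
def side_effect(history, position, fn):
    """Run fn once and record its value; on replay return the recorded value."""
    if position < len(history):
        return history[position]
    value = fn()
    history.append(value)
    return value


def make_clock(hours):
    """Hypothetical clock returning a different hour on each read."""
    readings = iter(hours)
    return lambda: next(readings)


def unsafe_workflow(now):
    # Reads the clock directly, so a replay can take a different branch.
    return "same-day" if now() < 12 else "next-day"


def safe_workflow(now, history):
    # Reads the clock through the recorded side effect instead.
    hour = side_effect(history, 0, now)
    return "same-day" if hour < 12 else "next-day"


clock = make_clock([9, 17])
unsafe_first = unsafe_workflow(clock)   # original run at 09:00
unsafe_replay = unsafe_workflow(clock)  # replay happens at 17:00

clock = make_clock([9, 17])
history = []
safe_first = safe_workflow(clock, history)
safe_replay = safe_workflow(clock, history)

print(unsafe_first, unsafe_replay, safe_first, safe_replay)
```

The unsafe version takes a different branch on replay than it did originally, which no longer matches the recorded history. The safe version records the value once and reuses it.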
Conductor — JSON-First and LLM-Ready
Originally built at Netflix and now maintained as Conductor OSS (Apache 2.0), Conductor takes a different approach: workflows are defined in JSON rather than code. This separation of orchestration logic from implementation makes workflows deterministic by construction — there are no non-determinism bugs to debug because the definition language itself doesn’t allow them.
In practice, this also makes Conductor particularly well-suited for AI-driven workflows. Because JSON definitions can be generated and modified at runtime by LLMs or APIs without a compile-and-deploy cycle, Conductor has become a natural choice for teams building dynamic, model-driven pipelines. It ships with native support for 14+ LLM providers and built-in vector database integration.
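The idea of a workflow as data can be sketched with a toy interpreter. The JSON schema below is deliberately simplified and hypothetical; it is not Conductor's actual definition format, and `run_definition` and the worker names are assumptions made for illustration.

```python
import json

definition_json = """
{
  "name": "order_fulfillment",
  "tasks": [
    {"name": "charge_customer", "input": "order_id"},
    {"name": "reserve_inventory", "input": "order_id"}
  ]
}
"""


def run_definition(definition, workers, workflow_input):
    """Execute tasks strictly in the order the definition lists them."""
    results = {}
    for task in definition["tasks"]:
        worker = workers[task["name"]]
        results[task["name"]] = worker(workflow_input[task["input"]])
    return results


# Workers hold the implementation; the definition only names them.
workers = {
    "charge_customer": lambda order_id: f"charged {order_id}",
    "reserve_inventory": lambda order_id: f"reserved {order_id}",
    "send_confirmation_email": lambda order_id: f"emailed {order_id}",
}

definition = json.loads(definition_json)

# The definition is plain data, so it can be changed at runtime
# (for example by an API call or an LLM) without redeploying workers.
definition["tasks"].append({"name": "send_confirmation_email", "input": "order_id"})

results = run_definition(definition, workers, {"order_id": "A-1"})
print(results)
```

Because the interpreter only walks a list of task names, the orchestration layer has nowhere to read a clock or call a random number generator, which is the sense in which such definitions are deterministic by construction.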
Restate — Lightweight and Serverless-Friendly
Restate uses the same journal/replay mechanism as Temporal but with a significantly lighter footprint. It integrates natively with serverless platforms like AWS Lambda and Cloudflare Workers, making it particularly appealing for teams that need durable execution without the operational overhead of running a full Temporal cluster. It opened its cloud product publicly in 2025 with usage-based pricing.
[Figure: Developer Effort: Queue-Based Saga vs. Durable Execution]
A Concrete Example: Order Fulfillment
Let’s make this tangible. Imagine you’re processing an e-commerce order that needs to: charge the customer, reserve inventory, notify the warehouse, and send a confirmation email — in that order. If the warehouse notification fails after the payment succeeds, you need to either retry the notification or refund the charge.
With a queue-based approach, you’d typically have four separate services, each consuming from a queue, a database table tracking the current state of each order, retry queues for each step, and compensating logic scattered across multiple codebases. Adding a new step means touching multiple systems and hoping the state machine still holds.
With Temporal, the entire workflow is expressed as a single function. Here’s a simplified illustration of what that structure looks like:

    # Pseudocode: illustrating workflow structure (not runnable)
    workflow: OrderFulfillment(order_id)
        step 1: charge_customer(order_id)
            on_failure: stop and surface error
        step 2: reserve_inventory(order_id)
            on_failure: compensate → refund_customer(order_id)
        step 3: notify_warehouse(order_id)
            retry: up to 5 times with exponential backoff
            on_failure: compensate → release_inventory + refund_customer
        step 4: send_confirmation_email(order_id)
            retry: up to 3 times

If the server crashes between step 2 and step 3, the workflow resumes at step 3 on a new worker. No custom state table. No manual reconciliation. No lost orders. The compensation logic is co-located with the workflow definition — not hidden in a DLQ consumer three repositories away.
Key insight: You write the happy path. You declare the compensations. The platform handles the rest — persistence, retries, timeouts, replay, and state visibility are all automatic.
Why This Matters Even More for AI Agents
The demand for durable execution has accelerated dramatically with the rise of agentic AI. Traditional LLM interactions are stateless — you send a prompt, you get a response, done. But AI agents that actually do things in the world — booking appointments, writing and executing code, processing documents across multiple APIs — run for minutes, hours, or even days.
As Temporal’s CEO Samar Abbas put it: “Agentic AI doesn’t fail because the models aren’t good enough. It fails because the systems around them can’t handle real-world execution.” Most agentic AI pilot projects stall precisely because teams underestimate the infrastructure complexity of keeping a stateful, multi-step process alive and observable across real-world chaos.
This is also why OpenAI, Replit, and Lovable use Temporal in production, and why the OpenAI Agents SDK now integrates durable execution as a first-class feature. When your agent needs to pause for human approval, wait for a webhook, or retry a failed tool call without re-running everything before it, durable execution is no longer optional — it’s the foundation.
When Should You Stick With Queues?
It’s worth being honest: queues are still the right tool for a meaningful class of problems. If your workload is genuinely fire-and-forget — sending transactional emails, processing image uploads, fanning out notifications — a simple queue is faster to set up, easier to operate, and more than sufficient. You don’t need a durable execution platform to send a welcome email.
The signal that you’ve outgrown a queue is almost always one of these: you’ve added a state-tracking table, you’re building custom retry logic, you have DLQ consumers with business logic in them, or you’re manually reconciling failed transactions. At that point, you’re already building a workflow engine — you’re just doing it the hard way.
| Use Case | Best Tool | Why |
|---|---|---|
| Send a notification email | Queue (SQS, RabbitMQ) | Simple, stateless, fire-and-forget |
| Resize uploaded images | Queue or serverless function | Single-step, idempotent, low complexity |
| Order fulfillment (multi-service) | Durable execution | Multi-step, stateful, needs compensation |
| Customer onboarding flow | Durable execution | Long-running, human-in-the-loop steps |
| AI agent with tool calls | Durable execution | Stateful, long-running, failure recovery critical |
| Compliance / audit pipelines | Durable execution | Needs full execution history and replay |
The Ecosystem Is Converging
One reliable sign that an idea has gone mainstream is when the major cloud platforms and frameworks start adopting it. And indeed, that convergence is well underway. Microsoft shipped its Azure Durable Task Extension for multi-day human-in-the-loop pauses in late 2025. Cloudflare Workflows reached general availability in 2025 with step-based durable execution running on Workers. LangGraph, Pydantic AI, and the OpenAI Agents SDK have all adopted durable execution as a core primitive.
Furthermore, the investment signal is hard to ignore. Temporal’s valuation nearly tripled in under a year, from $1.72B at Series C in March 2025 to $5B at Series D in February 2026 (roughly 2.9x). That trajectory doesn’t happen without serious enterprise adoption and a clear product-market fit.
Meanwhile, Conductor’s Apache 2.0 licensing and JSON-native design are attracting teams that want the benefits of durable orchestration without vendor lock-in. And Restate is carving out a niche for serverless and edge environments where Temporal’s operational footprint is overkill.
In short, the question is no longer whether durable execution is ready for production. It already is, at OpenAI-scale. The more relevant question is: which flavour fits your team?
What We Have Learned
A brief summary of the key takeaways from this deep-dive:
- Message queues are excellent for simple, stateless tasks but the wrong abstraction for multi-step, stateful workflows — they push the complexity onto you rather than handling it.
- The Saga pattern solves distributed transaction consistency but introduces its own maintenance burden: compensating actions, idempotency logic, and cross-service state tracking scattered across codebases.
- Durable execution platforms like Temporal, Conductor, and Restate solve this at the infrastructure level — your workflow code resumes automatically after any failure, with full execution history and built-in retry logic.
- The key mechanism is event history replay: every completed step is persisted; on crash, the runtime replays those steps without re-executing them and resumes from the point of failure.
- Temporal has emerged as the dominant platform (9.1 trillion executions, $5B valuation), while Conductor offers a JSON-first, LLM-friendly alternative and Restate targets lightweight and serverless deployments.
- AI agents are accelerating adoption because long-running, multi-step autonomous systems are essentially workflows — and workflows that crash halfway through are not acceptable in production.
- The right time to migrate from a queue is when you find yourself building state tables, custom retry logic, or DLQ consumers with business logic — you’re already building a workflow engine; durable execution just does it properly.
Eleftheria Drosopoulou | May 11th, 2026 | Last Updated: May 7th, 2026
