
Engineering

How We Use Durable Execution to Power Agentic Checkout

Sophia Willows

Head of Engineering @ Rye

6 minute read

Rye's checkout API looks like three calls from the outside. On the inside, each call kicks off a multi-step, long-running, partial-failure-prone workflow. That workflow runs on durable execution, with retries, state persistence, and pause-and-resume treated as first-class runtime primitives.

TL;DR / Key Takeaways

  • Agentic checkout is a distributed workflow problem dressed up as a REST API. Behind each confirm call sits a sequence of steps — offer retrieval, payment resolution, charge creation, order placement, capture — any of which can fail independently, some of which pause while a browser agent takes over.

  • The workflow runs on a durable execution runtime. Each step is a self-contained ctx.step() call with automatic retries and state persistence. Agent fallback uses ctx.waitForCallback() to suspend the workflow until the agent completes.

  • We run this on AWS Lambda, using the @aws/durable-execution-sdk-js. Three things made it the right fit: our existing AWS footprint, pay-per-invocation economics at our volume, and no control plane to self-host.

  • Get started with the API →

Agentic Checkout Is a Distributed Workflow Problem

Agentic checkout looks like a REST API from the outside and behaves like a distributed workflow from the inside.

When an agent sends a product URL to Rye, we have to resolve the URL to a specific marketplace, fetch a live offer with real pricing and shipping and tax from that merchant, authorize a charge, place the order on the merchant's site, capture the charge, and hand back a confirmed order. If we have a deterministic workflow for the merchant, we run it; if not, we hand control to a browser agent. The agent might finish quickly. It might not.

Each step can fail independently. Some steps run in milliseconds, others run for minutes. Any step can roll back the ones that came before it — a failed order placement has to refund an already-created charge. The whole thing has to be idempotent end-to-end, because agents retry aggressively. And the public API contract is just the three calls from product URL to confirmed order.

This is the shape that durable execution was designed for.

What Durable Execution Gives You

Durable execution is a programming model where a workflow is broken into discrete steps and the runtime takes care of persistence, retries, and suspend-resume between them.

Three primitives matter most:

Steps — atomic, retriable units of work. You wrap each logical chunk of work in a ctx.step() call. The runtime captures the result when the step succeeds and replays it on retry — so if a downstream step fails and the workflow restarts, earlier steps don't re-execute. Each step gets its own retry policy.
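A minimal in-memory sketch of the replay semantics. `ToyDurableCtx` is a stand-in, not the real SDK: the actual runtime persists results durably and handles async steps, while this is synchronous purely for illustration.

```typescript
// Toy stand-in for a durable context: completed step results are memoized,
// so a replayed workflow skips work that already succeeded.
class ToyDurableCtx {
  private results = new Map<string, unknown>();

  step<T>(name: string, fn: () => T): T {
    if (this.results.has(name)) {
      return this.results.get(name) as T; // replay: no re-execution
    }
    const result = fn();
    this.results.set(name, result); // durably recorded in the real runtime
    return result;
  }
}

let fetchCount = 0;
const ctx = new ToyDurableCtx();

function runWorkflow() {
  return ctx.step('fetch-offer', () => {
    fetchCount++;
    return { price: 42 };
  });
}

runWorkflow();               // first execution
const offer = runWorkflow(); // simulated restart + replay
// fetchCount stays at 1; the offer comes from the memoized result
```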

Callbacks — built-in pause/resume. You can tell the workflow to suspend and wait for an external event (a message, a webhook, a human-in-the-loop approval) for hours or days. The runtime persists the workflow's state and resumes execution when the callback fires.
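The control flow of a suspended wait can be sketched with a promise keyed by a callback id. Unlike the real runtime, nothing here survives a process restart — `waitForCallback` and `completeCallback` below are toys that only model the pause/resume shape, not the SDK's API.

```typescript
// Toy model of suspend/resume: the workflow awaits a promise keyed by a
// callback id, and an external event resolves it.
const pending = new Map<string, (result: string) => void>();

function waitForCallback(id: string, timeoutMs: number): Promise<string> {
  return new Promise((resolve, reject) => {
    pending.set(id, resolve);
    const timer = setTimeout(
      () => reject(new Error(`callback ${id} timed out`)),
      timeoutMs,
    );
    timer.unref(); // don't hold the process open waiting for the timeout
  });
}

function completeCallback(id: string, result: string) {
  pending.get(id)?.(result);
  pending.delete(id);
}

// Simulate the browser agent reporting back before the timeout fires.
setTimeout(() => completeCallback('await-agent', 'order-placed'), 10);
const agentResult = await waitForCallback('await-agent', 5_000);
```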

State persistence between steps. Step outputs, workflow inputs, and intermediate state are all durably recorded by the runtime. You don't write database rows between steps to persist "what happened so far." The runtime does it for you.

Together, these three primitives cover the core of what a reliable checkout workflow needs: idempotent step-level execution, suspended waits for long-running work, and durable state between steps — all as first-class runtime concerns.

Why We Built On AWS Lambda

Three things made AWS Lambda the right fit for this workload:

Our existing AWS footprint. We already had production infrastructure and institutional knowledge in AWS. Building on Lambda meant extending an existing relationship, not starting a new one.

Pay-per-invocation economics at our volume. Every checkout is a workflow execution, and checkout-heavy months compound fast. Lambda's pricing model lines up cleanly with how our traffic scales.

Nothing to self-host. The durable execution SDK ships as a Lambda runtime. There is no control plane to stand up, no cluster to operate, and no additional oncall surface.

The trade-off we accepted: our primary stack runs on a different cloud. Adopting AWS Lambda for the checkout workflow meant taking on a cross-cloud architecture — the checkout intent API invokes Lambda over the public internet, and Lambda calls back into our primary services through an authenticated proxy. That complexity is real, but it's concentrated at the boundary. Inside the workflow, everything is a standard durable-execution program.

The Checkout Workflow in Practice

The skeleton of the checkout workflow looks like this:

// Illustrative sketch of the workflow shape
export const handler = withDurableExecution(async (event, ctx) => {
  // 1. Resolve a live offer from the merchant
  const offer = await ctx.step('fetch-offer', fetchOffer);

  // 2. If no deterministic path exists, hand off to a browser agent and
  //    suspend until it reports back (shown unconditionally here for brevity)
  const agentResult = await ctx.waitForCallback(
    'await-agent',
    enqueueAgent,
    { timeout: { minutes: AGENT_TIMEOUT_MINUTES } }
  );

  // 3. Authorize payment and create a charge
  const payment = await ctx.step('authorize-payment', authorizePayment);
  const charge = await ctx.step('create-charge', createCharge);

  // 4. Place the order (deterministic first, agent fallback if needed)
  const order = await ctx.step('place-order', placeOrder);

  // 5. Capture the charge and finalize
  await ctx.step('capture-charge', () => capture(charge.id));
  await ctx.step('finalize', finalizeOrder);

  return order;
});

Each ctx.step() is automatically retried on transient failure with exponential backoff. Step results are persisted by the runtime — if order placement fails and the workflow restarts, we don't re-fetch the offer or re-charge the card. The workflow picks up exactly where it left off.

If order placement truly fails — not transiently, but for real — the compensating refund runs as another ctx.step() in the same workflow function, and the workflow records a terminal failure state. No separate saga orchestrator, no outbox table, no state-machine bookkeeping. The refund is just another step the runtime persists, retries, and records alongside the rest.
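The compensation path can be sketched with stubbed steps: the charge exists, order placement fails non-transiently, and the refund runs in the same function. The names below are stand-ins for illustration, not Rye's real code.

```typescript
// Sketch of compensation as "just another step": in the durable runtime the
// refund would be another ctx.step() with its own retries; here the calls are
// plain functions that record what ran.
const log: string[] = [];

function createCharge() {
  log.push('create-charge');
  return { id: 'ch_1' };
}

function placeOrder(): never {
  log.push('place-order');
  throw new Error('merchant rejected the order'); // terminal, non-transient
}

function refund(chargeId: string) {
  log.push(`refund:${chargeId}`);
}

function checkout() {
  const charge = createCharge();
  try {
    placeOrder();
  } catch {
    // Compensating step: same position in the code as any other step.
    refund(charge.id);
    return { status: 'failed' as const };
  }
  return { status: 'confirmed' as const };
}

const result = checkout();
// result.status is 'failed'; the log shows charge → place-order → refund
```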


What This Gives Us in Practice

Three capabilities carry most of the weight.

Retries, with semantics that match the step

Each step has its own retry policy. A read against our internal catalog can be retried aggressively; a charge-creation step should retry with caution. Retry behavior lives as a parameter on the step itself, which keeps the reasoning local — the code that does the work and the code that decides when to retry it sit in the same place.
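A toy version of a per-step policy shows the locality argument: the policy travels with the step, so the code that does the work and the code that decides when to retry sit together. This is synchronous and skips backoff entirely — an illustration of the shape, not the SDK's retry implementation.

```typescript
// Toy per-step retry: each step carries its own policy.
interface RetryPolicy {
  maxAttempts: number;
}

function stepWithRetry<T>(name: string, policy: RetryPolicy, fn: () => T): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err; // a real policy would also sleep with backoff here
    }
  }
  throw lastError;
}

// A flaky catalog read retried aggressively: fails twice, succeeds third try.
let reads = 0;
const offer = stepWithRetry('fetch-offer', { maxAttempts: 5 }, () => {
  reads++;
  if (reads < 3) throw new Error('transient');
  return { price: 42 };
});

// A charge-creation step would carry a cautious policy instead,
// e.g. { maxAttempts: 1 }, and lean on idempotency keys for safety.
```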

Pause and resume for agent fallback

This is the one that changed the most.

Agentic checkout sometimes involves handing control to the browser agent that navigates live merchant checkouts — for offer retrieval on a merchant we don't yet have a deterministic workflow for, or for order placement when the merchant's checkout flow has drifted since we last generated one. The agent can take anywhere from seconds to minutes to complete, depending on the merchant.

Expressing "wait for a browser agent to finish its task" in the durable-execution model is a single ctx.waitForCallback() call with a bounded timeout. The workflow suspends. No process is held open. When the agent reports back — through a callback identifier the runtime hands out when the wait begins — the workflow resumes at exactly that line, with the agent's result in hand. For a system built around falling back to an agent when no deterministic path exists, this primitive is the thing that makes the architecture honest.

State without a state machine

The workflow's execution position is the state. The runtime persists it. There's no separate status field to update between steps — you call the next step, and the runtime records what happened. Recovery from a crash between steps 4 and 5 is handled by the runtime: when the workflow resumes, it picks up at the step that was executing, with all prior step outputs already materialized.

What We Learned

A few things we'd tell our past selves.

Cross-cloud latency is real, but survivable. Invoking Lambda from our primary cloud adds a few hundred milliseconds of overhead on the outbound call, plus whatever a Lambda cold start costs. For checkout, where end-to-end latency is dominated by merchant-site interaction, the cross-cloud hop is a rounding error. For shorter-running workflows, we'd think harder.

Treat steps as the unit of idempotency. Every mutating step in our workflow writes with an idempotency key derived from the workflow ID and the step name. The runtime will retry your step transparently; if your step has a non-idempotent side effect, you'll learn about it in production.
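One way to derive such a key is to hash the workflow id and step name, so every retry of the same step presents the same key and the downstream service can deduplicate the side effect. The scheme below is an illustration of the idea, not necessarily the exact derivation Rye uses.

```typescript
import { createHash } from 'node:crypto';

// Stable key per (workflow, step): retries always reproduce the same key.
function idempotencyKey(workflowId: string, stepName: string): string {
  return createHash('sha256')
    .update(`${workflowId}:${stepName}`)
    .digest('hex');
}

// Toy downstream charge service that dedupes on the key.
const charges = new Map<string, { id: string }>();

function createChargeOnce(key: string): { id: string } {
  const existing = charges.get(key);
  if (existing) return existing; // retry observed: no duplicate charge
  const charge = { id: `ch_${charges.size + 1}` };
  charges.set(key, charge);
  return charge;
}

const key = idempotencyKey('wf_42', 'create-charge');
const first = createChargeOnce(key);
const retried = createChargeOnce(key); // transparent runtime retry
// first.id === retried.id — the card is charged once
```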

waitForCallback timeouts are a product decision, not just an infra one. We set the agent callback timeout on the order of minutes because that's the outer bound of how long a browser agent should reasonably take for a single task. If you're adopting durable execution, spend time thinking about the timeout on every suspended step — it becomes part of your user-visible SLA.

Keep the application code oblivious. Developers calling Rye's API don't see any of this. The SDK still exposes the same create → poll → confirm contract it always has. The durable workflow lives on our side of the API; on their side, nothing changed.

Frequently Asked Questions

Do developers using Rye's API need to know any of this?

No. The Rye SDK exposes a standard create → poll → confirm contract, and all of the durable-execution machinery runs on our side of the API. This post exists for engineers who want to understand how the reliability they rely on actually works — it's not required reading to integrate.

What happens if a merchant's checkout flow changes mid-order?

The workflow treats each step independently. If order placement fails because a merchant's checkout has drifted, the durable runtime can hand off to a browser agent as a fallback — the workflow suspends via ctx.waitForCallback(), the agent navigates the updated flow, and the workflow resumes exactly where it left off. No state is lost.

How does this handle partial failures — say, payment succeeds but order placement doesn't?

Every mutating step runs with its own retry policy and idempotency key. If order placement fails after a charge has been created, the compensating refund runs as another ctx.step() in the same workflow — same retry guarantees, same durable state. There's no separate saga orchestrator or outbox table; the rollback is just another step.

How does this relate to the browser-based checkout agent?

The browser agent is a worker that the durable workflow hands control to when a deterministic path isn't available. The workflow doesn't care whether a step is resolved by a deterministic function or by a browser agent — both surface as ctx.step() or ctx.waitForCallback() from inside the workflow. The runtime persists the result either way.

Does this add latency to checkout?

The durable execution runtime itself adds minimal overhead — the cross-cloud hop costs a few hundred milliseconds, and latency in any given checkout is dominated by the merchant interaction, not the orchestration layer. We've also optimized the infrastructure to keep cold-start latency predictable for the checkout workflow.

What's Next

Durable execution is how we run checkout today. Over the next few quarters we're migrating more flows onto the same model — returns, subscription billing, long-running reconciliation jobs, and some of the pre-checkout validation work that currently lives in ad-hoc serverless functions. The consolidation of "background work we need to be reliable" onto one programming model is probably the biggest architectural lever we have this year.

The shorter version: ecommerce is a distributed-systems problem wearing a REST API. Durable execution is what that problem actually looks like in code.

Get started with the Rye API →

Stop the redirect.
Start the revenue.
