What is Durable Execution or Workflows-as-Code?

Durable Execution is the practice of making code execution persistent, so that services recover automatically from crashes and restore the results of already completed operations and code blocks without re-executing them.

It effectively gives code a level of reliability and recoverability that is typically provided by workflow systems, hence also the name “workflows-as-code”.

A Durable Execution Engine records the progress of connected application code as it executes in your infrastructure. Whenever there is a failure, the code can use that persisted progress information to recover itself to where it was before the crash.

Why is resiliency important?

Writing resilient applications is hard. Often, it involves many components interacting with each other: services, databases, message queues, workflows, etc. Infrastructure and process failures can happen at any point during code execution. For more complex applications, it’s nearly impossible to test and handle every permutation of how multistep code and services can fail.

Imagine this pseudocode function that processes orders:

process_order() {
   do_payment() 
   write_order_to_db()         
   send_confirmation_email()
}

Now imagine all the ways in which this can go wrong. There could be a network glitch right after we trigger the payment, and we would need to make sure that we don’t charge the customer twice. What if the payment succeeds but the database is down, and we can’t write the order to it? And how do we know, during recovery, which steps were already executed? And this is just one function… Usually there are multiple services interacting with each other. This opens up an entirely new box of potential issues: race conditions where multiple requests try to reserve the same product, a customer adding another product to the cart during checkout and getting it without paying for it, or issues like cascading failures and timeouts.

As a result, applications contain many components whose main purpose is providing resiliency: workflow orchestrators for retries, message queues for reliable async communication, K/V stores for application state, scheduling infrastructure for delayed tasks, etc. The business logic gets flooded by complex retry and recovery logic to coordinate across all the different point solutions. And even then, it’s hard to make sure you cover all the corner cases.
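To make this concrete, here is a minimal sketch of what hand-rolled protection against the double-charge problem can look like for the payment step alone. Everything in it is hypothetical and in-memory: FakePaymentProvider, withRetries, and the idempotency-key format are illustrative stand-ins, not a real payment API.

```typescript
// Hypothetical payment provider that deduplicates charges by idempotency key.
class FakePaymentProvider {
  public chargeCount = 0;
  private readonly seen = new Map<string, string>();
  private dropNextResponse = true; // simulate one lost response

  charge(idempotencyKey: string, amountCents: number): string {
    let receipt = this.seen.get(idempotencyKey);
    if (receipt === undefined) {
      // First time we see this key: actually charge the customer.
      this.chargeCount += 1;
      receipt = `receipt-${idempotencyKey}-${amountCents}`;
      this.seen.set(idempotencyKey, receipt);
    }
    if (this.dropNextResponse) {
      // The charge went through, but the caller never learns about it.
      this.dropNextResponse = false;
      throw new Error("network glitch: response lost");
    }
    return receipt;
  }
}

// Hand-written retry loop (no backoff, to keep the sketch short).
function withRetries<T>(maxAttempts: number, action: () => T): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return action();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

function processOrder(
  provider: FakePaymentProvider,
  orderId: string,
  amountCents: number
): string {
  // The key must be stable across retries, so it is derived from the order ID.
  const key = `payment-${orderId}`;
  return withRetries(3, () => provider.charge(key, amountCents));
}
```

Even in this toy version, the retry loop and key management already outweigh the business logic, and the database write and the confirmation email are not yet covered.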

That’s exactly what Durable Execution solves!

How does Durable Execution work?

Restate implements Durable Execution by recording progress in a persistent log. The log is managed by the Restate Server, which acts as a proxy or API gateway to the durably executed services.

Here is an example of a handler (function) that uses Restate’s Durable Execution. It updates a user’s roles in a system. It first applies the role in one system, and if that is successful, it applies a list of permissions.

async function applyRoleUpdate(ctx, update) {
    const { userId, roleName, permissions } = update;

    // apply a change to an external system (e.g., DB update).
    const roleId = await ctx.run(() => createNewRole(roleName));

    // simply loop over the array of permission settings,
    // each step is journaled.
    for (const permission of permissions) {
        await ctx.run(() => applyPermission(roleId, permission));
    }

    await ctx.run(() => applyRole(userId, roleId));
}

The equivalent handler in Java:

@Service
public class RoleUpdateService {

  @Handler
  public void applyRoleUpdate(Context ctx, Update update) {

    // apply a change to external system (e.g., DB update).
    String roleId = ctx.run(() ->
        createNewRole(update.getUserId(), update.getRole()));

    // simply loop over the array of permission settings,
    // each step is journaled.
    for (Permission permission : update.getPermissions()) {
      ctx.run(
          JacksonSerdes.of(Permission.class),
          () -> applyPermission(roleId, permission)
      );
    }

    ctx.run(() -> applyRole(update.getUserId(), roleId));
  }
}

The code communicates its progress to the server by using the Restate context of the embedded SDK. Every context call (ctx.something()) leads to a progress event sent to Restate. Restate adds these as entries to a log, which then becomes the ground truth for what the service has done. You can think of it as a database for code execution progress. The code itself runs like any other service in your infrastructure, for example on Kubernetes or AWS Lambda. Here is a visual representation of this process:

[Figure: Durable Execution]

Durable Execution helps in several ways here. Any failures are automatically retried. Work that has already been completed is not repeated during retries. Instead, the previously recorded results are sent to the service and replayed, giving us stable, deterministic values during retries. As you can see, the handler uses regular code and control flow, but is fully resilient against failures, race conditions, and timeouts.
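The replay mechanism can be illustrated with a toy, in-memory journal. This is a simplified sketch and not how Restate is implemented: ToyContext, invokeWithRetries, and the stubbed createNewRole and applyRole are hypothetical names, the journal is a plain array instead of a persistent log, and everything runs synchronously.

```typescript
// Toy journal-backed context: replays recorded results, executes only new steps.
class ToyContext {
  private position = 0;
  private readonly journal: unknown[];

  constructor(journal: unknown[]) {
    this.journal = journal;
  }

  run<T>(action: () => T): T {
    if (this.position < this.journal.length) {
      // This step completed in an earlier attempt: return the recorded result.
      const recorded = this.journal[this.position] as T;
      this.position += 1;
      return recorded;
    }
    // New step: execute it and record the result before moving on.
    const result = action();
    this.journal.push(result);
    this.position += 1;
    return result;
  }
}

// Stubbed side effects (hypothetical), with a counter to observe re-execution.
let rolesCreated = 0;
let crashOnce = true;

function createNewRole(roleName: string): string {
  rolesCreated += 1;
  return `role-${roleName}-${rolesCreated}`;
}

function applyRole(userId: string, roleId: string): string {
  if (crashOnce) {
    crashOnce = false;
    throw new Error("simulated crash");
  }
  return `${userId}:${roleId}`;
}

function handler(ctx: ToyContext): string {
  const roleId = ctx.run(() => createNewRole("admin"));
  return ctx.run(() => applyRole("user-1", roleId));
}

function invokeWithRetries(maxAttempts: number): string {
  const journal: unknown[] = []; // a real engine would keep this in a durable log
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return handler(new ToyContext(journal));
    } catch {
      // Retry from the top; completed steps are replayed from the journal.
    }
  }
  throw new Error("gave up after " + maxAttempts + " attempts");
}
```

In this sketch the first attempt fails in applyRole, and the second attempt reuses the journaled role ID instead of calling createNewRole again. Replay by position only works if the handler issues its steps in the same order on every attempt.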

When to use Durable Execution?

Keeping multiple systems in sync is just one application of Durable Execution. Durable Execution can be a useful primitive for many use cases.

[Figure: Use cases diagram]

For example:

  • Workflows: Workflows are often used in microservices architectures to coordinate the execution of complex processes that span multiple services and systems. You define the steps of your workflow in code, and they get executed resiliently by the Durable Execution Engine with the handling of retries, compensations, state, etc. Workflows can be long-running (e.g. waiting on human approval), but also fast, latency-sensitive functions can benefit from workflow-like semantics (e.g. user interaction flows).
  • Microservice orchestration: Developing applications in a microservice architecture is hard. They expose developers to all sorts of tough distributed systems problems, making it non-trivial to build applications that are consistent, scalable, and resilient. Durable Execution helps with coordinating and persisting communication between microservices, and with recording intermediate actions and interactions with other systems and APIs. This makes writing robust applications much easier. Restate takes this even further and also helps with providing idempotency for any request, limiting concurrency while aiding scalability, and keeping application state consistent.
  • Async tasks: Reliably scheduling async tasks for now or for the future usually requires extra infrastructure components to provide resiliency: a message queue, cron jobs, or a workflow orchestrator. With Durable Execution, these async tasks become just another invocation or log entry, without the need for additional setup. The Durable Execution Engine registers the task and makes sure that it runs to completion, either immediately or whenever it should. On top of that, Restate’s SDKs make it easy to implement tasks through features like durable webhooks and timers, the flexibility to switch between asynchronous and synchronous modes by re-attaching to ongoing tasks, the capability to fan out tasks and then collect their results, and more.
  • Event processing: Durable Execution helps with transactional event processing. Examples include running workflows based on Kafka, where you have a set of transactional steps that you want to execute one after the other, or complex, dynamic control flow, where a Kafka event spawns multiple calls to other services and APIs. As opposed to many stream processing systems, Restate does not put any restrictions on the control flow (e.g. DAG). Each event crafts its own path through the code and builds up its own recovery log. Restate manages the complexities of reading from Kafka to make sure each event gets processed exactly once. Restate handles retries and recovery of your event handlers to the exact point before the crash.
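As a rough illustration of the async-task case above, where a task becomes "just another log entry", here is a toy sketch. ToyTaskLog and its methods are hypothetical, time is a plain number passed in by the caller, and the log is an in-memory array rather than a durable one.

```typescript
interface TaskEntry {
  id: string;
  dueAtMs: number; // logical time at which the task should run
  completed: boolean;
}

class ToyTaskLog {
  private readonly entries: TaskEntry[] = [];

  // Registering a task is just appending an entry; dueAtMs <= now means "run immediately".
  schedule(id: string, dueAtMs: number): void {
    this.entries.push({ id, dueAtMs, completed: false });
  }

  // Runs every due task that has not completed yet and returns the IDs that succeeded.
  runDue(nowMs: number, execute: (id: string) => void): string[] {
    const ran: string[] = [];
    for (const entry of this.entries) {
      if (entry.completed || entry.dueAtMs > nowMs) continue;
      try {
        execute(entry.id);
        entry.completed = true; // mark as done only after success
        ran.push(entry.id);
      } catch {
        // Leave the entry pending so that a later pass retries it.
      }
    }
    return ran;
  }
}
```

Because a task is only marked as completed after it succeeds, a failed or interrupted run leaves the entry in the log, and a later pass picks it up again.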

The future is durable

At Restate, we believe that Durable Execution will become one of the main building blocks of the applications of the future. It’s very hard to obtain the same levels of consistency, simplicity, and resiliency without it.

We also believe that reliably executing a single workflow or function is just the beginning. Restate expands the concept of Durable Execution to other parts of applications, to give you resilience and durability across the full spectrum of your application logic: communication, execution and state. You can think of this as follows. Durable Execution gives your functions and services retries, recovery and idempotency. But this mechanism can be extended to communication and promises across services, processes and time, where Restate acts like an event broker that supports not only sending messages, but also request-response, webhooks, and delayed calls. And finally, with its central position, Restate can serve as a concurrency guard for applications that modify application state, making sure that state remains consistent at all times, and that only a single function is changing the state at a single point in time. We will dive deeper into this in one of the follow-up blog posts. In the meantime, have a look at our examples and documentation to learn more.
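The concurrency-guard idea can be sketched as a per-key queue: calls that target the same key run one at a time, while different keys are not blocked by each other. This is only an illustration under simplifying assumptions, not Restate's implementation. KeyedSerializer, addToCart, and demo are hypothetical names, and all state lives in memory.

```typescript
// Toy per-key serializer: at most one task per key runs at a time.
class KeyedSerializer {
  // Last scheduled task per key. Never cleaned up here, to keep the sketch short.
  private readonly tails = new Map<string, Promise<unknown>>();

  run<T>(key: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    // Start only after the previous task for this key has settled,
    // whether it succeeded or failed.
    const next = previous.then(() => task(), () => task());
    this.tails.set(key, next);
    return next;
  }
}

// Hypothetical read-modify-write handler with an await between read and write.
async function addToCart(
  cart: Map<string, number>,
  cartId: string,
  quantity: number
): Promise<void> {
  const current = cart.get(cartId) ?? 0;
  await Promise.resolve(); // simulated I/O: lets other tasks interleave
  cart.set(cartId, current + quantity);
}

// Submits two concurrent updates to the same cart, with or without the guard.
async function demo(useGuard: boolean): Promise<number> {
  const cart = new Map<string, number>();
  const guard = new KeyedSerializer();
  const submit = (quantity: number): Promise<void> =>
    useGuard
      ? guard.run("cart-1", () => addToCart(cart, "cart-1", quantity))
      : addToCart(cart, "cart-1", quantity);
  await Promise.all([submit(1), submit(2)]);
  return cart.get("cart-1") ?? 0;
}
```

Without the guard, both updates read the cart before either one writes, so one update is lost. With the guard, the second update only starts after the first has finished.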

Join our Discord channel or Slack Community to stay up to date and get notified about our upcoming blog posts.

What’s next?