The Anatomy of a Durable Execution Stack from First Principles

How Restate works, Part 2

Posted February 20, 2025 by Stephan Ewen, Ahmed Farghal, and Till Rohrmann ‐ 20 min read

We dive into the architecture details of Restate, a Durable Execution engine we built from the ground up. Restate requires no external database, log, or other system; it implements a full stack that competes with the best logs in terms of durability and operations.

This is the second article in our series on building a durable execution system from first principles. The first blog post in this series, Every System is a Log, looks at this from the application side, and shows how a unified log architecture results in a tremendous simplification of distributed coordination logic. This post discusses the details of how we built the log-based runtime for that paradigm.

Modelling locking and database updates through an application log,
taken from Every System is a Log

To build this runtime, we asked ourselves what such a system would look like when designed from first principles. We built a precursor to this with Stateful Functions, and from all the lessons learned there, we arrived at a design with a self-contained, complete stack, centered around a command log and an event processor, shipping as a single Rust binary. To get an idea of the user experience we arrived at, check the videos in the announcement post.

This stands somewhat in contrast to the common wisdom “don’t build a new stateful system, just use Postgres”. But we saw a clear case to build a new stack, for multiple reasons: First, the interactions and access patterns are different enough from existing systems that we can offer both better performance and operational behavior, similar to why message queues exist, even though you can queue with a SQL table. Second, the architecture of event logs has made significant advancements in recent years, but the advanced implementations are exclusive to proprietary stacks and managed offerings - the open source logs and queues still follow architectures from the on-prem era. Third, we saw how a converged stack lets us provide a much better end-to-end developer experience, from first experiments on your laptop to multi-region production deployments.

Recap: Server and Services

A Restate application stack consists of two components: Restate Server, which sits at a similar place in your stack as a message broker, and the application services, which are durable functions/handlers containing the application logic. The server receives invocation events, persists them, and pushes them to the services, similar to an event broker. The services run the code that corresponds to RPC- or event-handlers, workflows, activities, or actors. Services may run as processes, containers, or even serverless functions.

But Restate doesn’t just push invocations, it maintains a bidirectional connection with the executing service handler and lets the code perform durable actions as part of the invocation, including journaling steps, sending events to other handlers, accessing/modifying state, creating persistent futures (callbacks) and timers. The services use a thin SDK library which communicates the actions to the server - somewhat comparable to a KafkaConsumer or JDBC client, but higher-level. See Restate’s examples for sample code and details.
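To make that interaction more concrete, here is a minimal sketch of the kinds of durable actions a handler might stream to the server over that connection. The enum and its fields are hypothetical, purely for illustration - they are not Restate's SDK API or wire protocol.

```rust
// Illustrative only: a simplified model of the durable actions an SDK could
// stream to the server over the bidirectional invocation connection.
// Names and fields are hypothetical, not Restate's actual protocol.
#[allow(dead_code)]
#[derive(Debug)]
enum DurableAction {
    /// Record the result of a side-effecting step (e.g., ctx.run) in the journal.
    JournalStepResult { index: u32, result: Vec<u8> },
    /// Send an event/RPC to another handler, routed by its key.
    SendInvocation { target_service: String, key: String, payload: Vec<u8> },
    /// Write a piece of K/V state owned by this object/workflow key.
    SetState { state_key: String, value: Vec<u8> },
    /// Create a durable promise that another handler can resolve later.
    CreatePromise { promise_id: String },
    /// Register a durable timer that fires even if this process is gone.
    SetTimer { fire_at_unix_ms: u64 },
    /// Report the final result of the invocation.
    Complete { result: Vec<u8> },
}

fn main() {
    // A handler run might produce a sequence like this; the server appends each
    // action to the invocation's journal before acknowledging it.
    let actions = vec![
        DurableAction::JournalStepResult { index: 1, result: b"charge-ok".to_vec() },
        DurableAction::SetState { state_key: "status".into(), value: b"paid".to_vec() },
        DurableAction::Complete { result: b"done".to_vec() },
    ];
    for a in &actions {
        println!("{a:?}");
    }
}
```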

The server handles all coordination and durability for the invocation life cycle, journals, and embedded K/V state, and manages failover, leader election, and fencing. The server’s view of an invocation and its journal is the ground truth; the services follow the server’s view, and function executions may be cancelled/reset/retried as needed.

That approach makes the services completely stateless and simple to operate. They scale rapidly and run on serverless infrastructure like AWS Lambda, Cloudflare Workers, etc.
This characteristic also lets us build SDKs in many languages fairly easily, including TypeScript, Java, Kotlin, Python, Go, and Rust.

Clusters, Object Stores, and the Latency Gap

A distributed Restate deployment is a cluster of interconnected nodes. Invocations and events can be sent to any node, and all nodes participate in the storage of events and dispatching of invocations to services/functions. Restate is active/active from a cluster perspective, but has leader/follower roles at the granularity of individual partitions, similar to systems like Kafka or TiKV.

Restate stores data using two mechanisms: New events (invocations, journal entries, state updates, …) are persisted in an embedded replicated log (called Bifrost). From there, events move to state indexes in RocksDB, which are periodically snapshotted to an object store. So at any point in time, the majority of the data is durable in the object store (the nodes maintain a copy as cache) while a smaller part of the data is durable in the log replicated across the nodes.

This is a form of storage tiering, though not the classical tiering like in modern logs. It is more similar to a database management system, where the write-ahead log (WAL) would be replicated across nodes, while the table data files and indexes would be persisted on S3 (and cached on the nodes).

Object store + latency gap

Architectures that keep most or all of their data on object stores have become popular for many reasons: Object stores are unbeatable in terms of combined scalability, durability, and cost (AWS S3 cites eleven 9s of durability, stores more than 100 trillion objects, and is cheaper than persistent disks). Plus, the storage is disaggregated from the compute nodes, making the nodes stateless (or owning little state), which is highly desirable for efficient operations.

Object stores are also available in most on-prem setups we’ve encountered. It was natural for us to design Restate such that object storage would be the primary durability for the majority of the data.

The reason Restate additionally has a replication layer that persists new events (rather than writing events straight to object storage) is low latency. Pure object store approaches have latencies that average around 100ms to make data durable, with tail latencies a multiple of that. While that is feasible for analytical systems (e.g., Apache Flink) and data pipelines (like WarpStream), such latencies quickly become prohibitive for many applications.

Restate’s replication bridges the latency gap between the requirements of fast durable functions and the capabilities of object storage.

The setup described above is what we ship first, in Restate 1.2: a fast log replicated between Restate server nodes. However, Restate uses a virtual log abstraction to easily support other log implementations as well, without having to build full consensus machinery each time. This is a defining feature of Restate’s runtime implementation that we’ll dive deeper into in the next article in this series. We are currently using that mechanism to build object-store support into the log as well, which is a powerful feature for shrinking the amount of data persisted on nodes to very little, or even zero.

There is no single best configuration for that setup - only a spectrum of trade-offs to pick from. In his Materialized View newsletter, Chris Riccomini describes it as a CAP-theorem-like choice.

In our context (a durable execution runtime), durability must be a given, but we have the additional dimension of how much replicated data is kept on the nodes. Restate’s triad is thus latency-cost-disk:

  • ➕ Low latency, ➕ low cost, ➖ some data on disks: Quorum replication to nodes with async batch writes to S3.
    The nodes provide fast durability through replication and keep the data for anything between a few hundred milliseconds and minutes before moving the events to object storage. Restate 1.2 can be seen as a variant of this with a long flush interval.

  • ➕ Low latency, ➖ high cost, ➕ no data on disks: Quorum replication directly to S3 Express One Zone.
    Restate’s replication mechanism deals only with ordering, quorum, and repairs upon loss of a zone, but doesn’t keep data on the nodes (except caches).

  • ➖ High latency, ➕ low cost, ➕ no data on disks: Synchronous batch writes to S3.
    Restate’s replication mechanism isn’t used.

Naturally, there are nuances: direct replication has even lower latency than S3 Express One Zone quorums. Synchronous batch writes to S3 can be cheaper than anything else, because that approach may avoid cross-AZ bandwidth costs. Disks still exist as caches in all configurations. And there is the option of quorum replication to S3 Express One Zone buckets in different regions, to support multi-region deployments without relying on disks. But the point stands: there is a spectrum of options, with which we aim to enable developers to use Restate in diverse setups across cloud and on-prem, while maintaining a simple dependency: just an object store.
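As a rough illustration of how this spectrum could surface to operators, here is a hypothetical sketch of the three options as a configuration enum - the names and fields are invented for illustration and are not Restate's actual configuration surface:

```rust
// Hypothetical sketch of the latency/cost/disk spectrum as a config choice.
// Names and fields are illustrative, not Restate's actual configuration.
#[allow(dead_code)]
enum LogDurability {
    /// Quorum replication to node disks, async batch offload to an object store.
    /// + low latency, + low cost, - some data lives on disks until offloaded.
    ReplicatedWithAsyncOffload { offload_interval_secs: u64 },
    /// Quorum replication directly to zonal object store buckets (e.g. S3 Express One Zone).
    /// + low latency, + no data on disks, - higher storage cost.
    QuorumOnZonalObjectStore { buckets: Vec<String> },
    /// Synchronous batch writes to standard S3.
    /// + low cost, + no data on disks, - latency in the ~100ms range per batch.
    SyncBatchedObjectStore { max_batch_delay_ms: u64 },
}

fn main() {
    // Restate 1.2 corresponds to the first variant, with a long offload interval.
    let _mode = LogDurability::ReplicatedWithAsyncOffload { offload_interval_secs: 300 };
}
```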

Restate 1.2 ships with all the virtual log infrastructure, and a low-latency replicated log implementation. We are currently working on the other configurations - if you are interested in being an early tester or design partner, please reach out to us.

As a final thought, being able to adjust to different trade-offs also helps Restate and its users adapt to changing cloud pricing models.

Partitioned Scale-Out

Restate follows the partitioned scale-out model: A cluster has a set of partitions, each with a log partition and an event processor instance. Partitions operate independently and allow the system to scale both across cores and nodes.

Everything related to an invocation happens within a single partition: invocation, idempotency & deduplication, journal entries, state, promises/futures - avoiding the need to synchronize or coordinate with any other partition. The target partition for an invocation is determined by hashing the virtual object key, workflow ID, or idempotency key, if applicable - otherwise, the partition is freely chosen.
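As a rough sketch of that routing decision (the standard-library hasher and the fixed partition count are assumptions for illustration, not Restate's actual partitioning scheme):

```rust
// Illustrative partition routing: hash the logical key to pick a partition,
// or pick one freely when the invocation carries no key.
// Hashing scheme and partition count are assumptions, not Restate's actual ones.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_PARTITIONS: u64 = 24;

#[allow(dead_code)]
enum InvocationKey<'a> {
    VirtualObject(&'a str),
    WorkflowId(&'a str),
    Idempotency(&'a str),
    None, // keyless service call: any partition will do
}

fn target_partition(key: &InvocationKey) -> u64 {
    let routing_key = match key {
        InvocationKey::VirtualObject(k)
        | InvocationKey::WorkflowId(k)
        | InvocationKey::Idempotency(k) => k,
        // No key: pick freely, e.g. round-robin or random. Fixed here for brevity.
        InvocationKey::None => return 0,
    };
    let mut hasher = DefaultHasher::new();
    routing_key.hash(&mut hasher);
    hasher.finish() % NUM_PARTITIONS
}

fn main() {
    // All events for idempotency key "K" land on the same partition, so dedup,
    // journal, and state never need cross-partition coordination.
    let p = target_partition(&InvocationKey::Idempotency("K"));
    println!("invocation routed to partition {p}");
}
```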

In some cases, a function execution produces an event for another partition, for example RPC events or completions. In that case, events are still written to the local partition, and the server has an out-of-band exactly-once event shuffler to move events to the right target partition.

The partitions are not exposed to applications (though you see them when using restatectl) - only keys are directly addressable (virtual object id, workflow id, idempotency key), to allow changing the number of partitions without losing consistency.

From here on, we look only at what happens inside a single partition.

Event Log and Processor

The work that Restate Server does inside one partition happens in two components: the distributed durable log (called Bifrost) and the processors. The log is the fast primary durability for events (e.g., make an invocation, add a journal entry, update state, create a durable promise, …); the processor acts on events (e.g., invokes handlers) and materializes state from them. Log and processor are co-partitioned, meaning a partition processor connects to one log partition. They are independent, but frequently co-located in the same process.

Compared to databases, you can think of Bifrost as the transaction WAL, and the processor as the query engine and table storage. Compared to stream processing, you can think of Restate’s log as Kafka and the processor as a stream processing application (like KStreams or Flink).

Log and processor form a tight loop: the processor continuously tails the log and acts on the events (e.g., making an invocation). That may produce more events (journal entries, state updates, …), which are written to the log and again processed by the processor.
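In heavily simplified form, that loop could look like the following Rust sketch, with toy stand-ins for the log, the events, and the processor state - illustrative only, not Restate's actual implementation:

```rust
// Simplified single-partition loop: tail the log, apply events, append follow-ups.
// All types here are illustrative stand-ins for Restate's internal ones.
use std::collections::{HashMap, VecDeque};

#[derive(Clone, Debug)]
enum Event {
    Invoke { invocation_id: u64 },
    JournalEntry { invocation_id: u64, entry: String },
    Complete { invocation_id: u64 },
}

#[derive(Default)]
struct Log {
    records: Vec<Event>, // stand-in for the replicated Bifrost partition
}

impl Log {
    fn append(&mut self, e: Event) -> usize {
        self.records.push(e);
        self.records.len() - 1 // log sequence number
    }
}

#[derive(Default)]
struct Processor {
    applied_lsn: usize, // how far we have materialized the log
    journals: HashMap<u64, Vec<String>>,
}

impl Processor {
    /// Apply one event; may produce follow-up events that go back into the log.
    fn apply(&mut self, e: &Event) -> Vec<Event> {
        match e {
            Event::Invoke { invocation_id } => {
                self.journals.entry(*invocation_id).or_default();
                // In reality: push the invocation to the service endpoint here.
                vec![Event::JournalEntry { invocation_id: *invocation_id, entry: "step-1".into() }]
            }
            Event::JournalEntry { invocation_id, entry } => {
                self.journals.get_mut(invocation_id).unwrap().push(entry.clone());
                // In reality: ack the step back to the handler at this point.
                vec![Event::Complete { invocation_id: *invocation_id }]
            }
            Event::Complete { .. } => vec![], // In reality: reply to the client.
        }
    }
}

fn main() {
    let mut log = Log::default();
    let mut proc = Processor::default();
    let mut pending = VecDeque::from([Event::Invoke { invocation_id: 42 }]);

    // Tight loop: everything is durable in the log before it is acted upon.
    while let Some(e) = pending.pop_front() {
        log.append(e);
        while proc.applied_lsn < log.records.len() {
            let ev = log.records[proc.applied_lsn].clone();
            proc.applied_lsn += 1;
            pending.extend(proc.apply(&ev));
        }
    }
    println!("journal for 42: {:?}", proc.journals[&42]);
}
```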

Let’s go through an example to illustrate this:

  1. A client invokes service handler processPayment with idempotency key K through Restate. The ingress enqueues the invocation to the log partition, as determined by hashing K.
  2. The leader processor for the partition receives that event and checks its local idempotency-key state. K is not yet present there for processPayment, so the processor atomically adds K to the state and transitions the invocation to running, then builds a connection to the target service endpoint and pushes the invoke journal entry.
  3. The service streams back a step result event (ctx.run({...})), and the processor enqueues that journal entry event in the log. Persistence in the log is the point at which “the step happened” - from there on, it will always be recovered.
  4. When the processor receives that event back from the log (which means no other processor has taken over leadership in the meantime), it adds the event to the invocation’s journal state and sends an ack to the service.
  5. Similar steps happen when the service sends a state update, a timer, an RPC event, or creates a durable promise. Events are always added to the log first; once the processor receives them back, it acts on them (e.g., adds them to the journal, routes them as an invocation to another service, etc.).
  6. Once the invocation completes, the processor adds the result event to the log. Upon receiving that event from the log, it marks the invocation as complete and sends the result back to the client.

When a function execution fails (e.g., crash, loss of connection, user-defined error), the processor dispatches a new execution attempt, attaching all journal events recorded for the invocation so far. To avoid split-brain scenarios between services, the processor tracks invocation execution attempts (retries) and rejects events sent from an attempt if a newer attempt has started. This can be tracked with simple in-memory state, because invocations are sticky to partitions, and partitions have strong leaders.
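A small sketch of that attempt tracking, with hypothetical types (the real bookkeeping lives in the leader processor's in-memory state):

```rust
// Illustrative attempt fencing: the leader remembers the latest attempt number
// per in-flight invocation and drops events from superseded attempts.
use std::collections::HashMap;

#[derive(Default)]
struct AttemptTracker {
    latest_attempt: HashMap<u64, u32>, // invocation id -> newest attempt started
}

impl AttemptTracker {
    /// Called when the processor (re)dispatches an invocation to a service.
    fn start_new_attempt(&mut self, invocation_id: u64) -> u32 {
        let a = self.latest_attempt.entry(invocation_id).or_insert(0);
        *a += 1;
        *a
    }

    /// Called for every action an executing handler sends back.
    /// Events from an older attempt are rejected to avoid split-brain journals.
    fn accept(&self, invocation_id: u64, attempt: u32) -> bool {
        self.latest_attempt.get(&invocation_id) == Some(&attempt)
    }
}

fn main() {
    let mut tracker = AttemptTracker::default();
    let first = tracker.start_new_attempt(7);
    // The first attempt loses its connection; the processor retries.
    let second = tracker.start_new_attempt(7);
    assert!(!tracker.accept(7, first));  // late event from the zombie attempt: rejected
    assert!(tracker.accept(7, second));  // current attempt: accepted
    println!("attempt {second} is the only one allowed to journal new steps");
}
```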

State storage

The processor stores all its non-transient state in an embedded RocksDB instance. Operations on that embedded store are very fast, but the state is lost when the node is lost. However, all state of the processor is deterministically derived from the durable log and can always be rebuilt from it during recovery. To avoid arbitrarily long rebuild phases, the RocksDB database is periodically snapshotted to the object store and the log is trimmed up to the point of the snapshot. Processors can be restored by fetching the latest snapshot and attaching to the log at the sequence number at which the snapshot was taken.
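In pseudocode, processor recovery then looks roughly like this - a sketch assuming the snapshot records the log sequence number it is consistent with; all types and function names are hypothetical stand-ins:

```rust
// Sketch of processor recovery: restore the latest snapshot from the object
// store, then replay the log from the snapshot's sequence number.
// All types and functions below are hypothetical stand-ins, not Restate APIs.
#[allow(dead_code)]
struct Snapshot {
    lsn: u64,       // log sequence number the snapshot is consistent with
    state: Vec<u8>, // stand-in for the snapshotted RocksDB column families
}

#[allow(dead_code)]
struct ProcessorState {
    applied_lsn: u64,
    state: Vec<u8>,
}

fn fetch_latest_snapshot(_partition: u32) -> Option<Snapshot> {
    // In reality: list and download the newest snapshot from the object store.
    Some(Snapshot { lsn: 1_000, state: Vec::new() })
}

fn read_log_from(_partition: u32, from_lsn: u64) -> Vec<(u64, Vec<u8>)> {
    // In reality: tail Bifrost starting at `from_lsn`; here, three fake events.
    (from_lsn..from_lsn + 3).map(|lsn| (lsn, Vec::new())).collect()
}

fn recover(partition: u32) -> ProcessorState {
    // Start from the snapshot if one exists, otherwise from an empty state at LSN 0.
    let (start_lsn, state) = match fetch_latest_snapshot(partition) {
        Some(s) => (s.lsn, s.state),
        None => (0, Vec::new()),
    };
    let mut proc = ProcessorState { applied_lsn: start_lsn, state };
    // Deterministically re-apply every event the snapshot does not yet contain.
    for (lsn, _event) in read_log_from(partition, start_lsn) {
        // apply(_event) would update the materialized RocksDB state here.
        proc.applied_lsn = lsn + 1;
    }
    proc
}

fn main() {
    let p = recover(3);
    println!("partition 3 recovered up to LSN {}", p.applied_lsn);
}
```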

The implementation of the partition processor is a tight event loop in Rust’s Tokio runtime. Partition Processors operate independently from each other and access exclusively local data structures (in memory, RocksDB, streams to ongoing invocations). The partition-local handling of invocations is easy in a log-first design, and would be much harder to achieve if we built this on a general purpose database.

That property also makes the design both simple and fast: committing an event (e.g., a step / activity) means appending the event to the log (obtaining a write quorum). As soon as that happens, the event is pushed from the log leader in memory to the attached processor and acked to the handler/workflow. This takes a single network roundtrip for a replication quorum and no distributed reads. The durability of RocksDB happens completely asynchronously in the background.

Leaders and Followers

Both log and processors have one leader and optional followers. In the case of the log, followers increase durability for events through additional copies. In the case of processors, followers are hot standbys that have a copy of the state (deterministically derived from the log) and can quickly take over upon failure. Only the processor leader actually dispatches invocations for functions and workflows to the services, and only the leader writes snapshots to object store.


High-level architecture and request flow.

Control Plane, Data Plane, External Consensus

So far, everything we have looked at has been the data plane of the system: the log and the partition processor.

Everything is coordinated by a control plane, which is responsible for failure detection, failover coordination, and re-configuration. The control plane stores metadata for the cluster (like configurations) and runs the cluster controller that handles partition placement and leader election.

Control Plane and Data Plane

Because Restate has one control plane for both log and processors, it can coordinate both jointly - for example, ensuring that the leader processor is always co-located with the log partition leader, to reduce network hops and read from local memory caches. In contrast, if we built this transparently on an external log like Kafka, this co-location would be harder to achieve. The benefits of this joint control plane show in many parts of the system and are one of the reasons Restate is simpler to set up, scale, and operate.

Besides managing re-configurations and failover, the control plane also provides the external consensus for the data plane, allowing the data plane to operate more efficiently and with simpler properties than full consensus. We’ll go into the details of Restate’s log implementation in the next blog post - for now, a useful high-level way to think about this is that the control plane moves the data plane from one steady configuration to another, whenever the previous configuration is no longer functional (a failure happened) or desired (e.g., re-balancing). This blog post from Jack Vanlightly gives a nice introduction to that concept.

The Control Plane reconfigures the Data Plane (Figure from “An Introduction to Virtual Consensus in Delos” by Jack Vanlightly)

Another benefit of this design is that it allows Restate to use a simpler (and slower) implementation of consensus on the control plane, because it is rarely invoked. Restate abstracts its consensus to just an atomic compare-and-swap (CAS) metadata operation, which the built-in metadata store backs with an implementation of the Raft consensus algorithm. But this can easily be extended to plug in different storage systems, as long as they support an atomic CAS.
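Conceptually, the interface the data plane needs from the metadata store is tiny: a versioned compare-and-swap. A hedged sketch of that abstraction (an illustrative trait and in-memory backing, not Restate's actual interface):

```rust
// Sketch of "external consensus as a versioned CAS": the control plane stores
// small metadata records (partition placement, log configuration, epochs) and
// only ever advances them with a compare-and-swap on the expected version.
// The trait is illustrative; Restate's built-in store backs this with Raft,
// but any storage offering an atomic CAS could be plugged in.
use std::collections::HashMap;

trait MetadataStore {
    /// Write `value` under `key` only if the current version equals `expected_version`.
    /// Returns the new version on success, or the conflicting version on failure.
    fn compare_and_swap(
        &mut self,
        key: &str,
        expected_version: u64,
        value: Vec<u8>,
    ) -> Result<u64, u64>;

    fn get(&self, key: &str) -> Option<(u64, Vec<u8>)>;
}

#[derive(Default)]
struct InMemoryStore {
    entries: HashMap<String, (u64, Vec<u8>)>,
}

impl MetadataStore for InMemoryStore {
    fn compare_and_swap(&mut self, key: &str, expected: u64, value: Vec<u8>) -> Result<u64, u64> {
        let (current, _) = self.entries.get(key).cloned().unwrap_or((0, Vec::new()));
        if current != expected {
            return Err(current); // someone else reconfigured first
        }
        self.entries.insert(key.to_string(), (current + 1, value));
        Ok(current + 1)
    }

    fn get(&self, key: &str) -> Option<(u64, Vec<u8>)> {
        self.entries.get(key).cloned()
    }
}

fn main() {
    let mut store = InMemoryStore::default();
    // Two controllers race to bump the leader epoch of partition 3; only one wins.
    assert_eq!(store.compare_and_swap("partition-3/epoch", 0, b"epoch=1".to_vec()), Ok(1));
    assert_eq!(store.compare_and_swap("partition-3/epoch", 0, b"epoch=1".to_vec()), Err(1));
}
```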

Failover & Reconfiguration

Though the control plane jointly coordinates log and processor reconfiguration, each has its own mechanism to ensure consistency.


A segmented log (Figure from “An Introduction to Virtual Consensus in Delos” by Jack Vanlightly)

Bifrost’s (the log’s) mechanism is based on a mix of Delos (Virtual Consensus) and LogDevice. At a high level, Bifrost is segmented, and failover or reconfiguration seals the active segment and creates a new segment, possibly with a new leader and a different set of nodes that store replicas. To the outside and to the partition processors, everything looks like a single contiguous log.

When a partition processor fails, the control plane selects a new leader for that partition. The failover procedure relies on the external consensus provided by the control plane: new leaders obtain the next epoch in a strictly monotonic sequence (so newer leaders have higher epochs). The new leader appends a message to the log to signal that its epoch is now active, and then simply starts appending events from its operations. The old leader (which might still be following the log) will receive that epoch-bump message and step down at exactly that point - it will keep materializing state (as a follower) but no longer dispatch invocations. The old leader also aborts ongoing function executions and lets the new leader recover those.


Leader handover via messages in the log

Any messages carrying a lower epoch than the latest epoch-bumping message are ignored, which filters out messages that the old leader might still have been appending to the log before it found out that it was superseded by another leader. If the old leader was attempting to commit a journal entry, but the message was appended after the epoch-bump message, that commit cannot happen: the new leader will recover (or might already have recovered) the invocation without that journal entry and execute and commit that step itself. This mechanism ensures that any step / activity result is committed exactly once. No split-brain view is possible.

This mechanism also automatically resolves concurrent competition for leadership - the highest epoch will win, and late events are consistently ignored.
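A condensed sketch of that fencing rule, with illustrative record types (the real filtering happens on Bifrost records inside the partition processor):

```rust
// Illustrative epoch fencing: the processor tracks the highest epoch announced
// in the log and ignores any record written under a lower (stale) epoch.
#[allow(dead_code)]
enum Record {
    EpochBump { epoch: u64 },                     // appended by a newly elected leader
    JournalEntry { epoch: u64, data: String },    // appended by the leader of `epoch`
}

struct EpochFilter {
    active_epoch: u64,
}

impl EpochFilter {
    /// Returns the record only if it comes from the currently active (or newer) leader.
    fn admit(&mut self, record: Record) -> Option<Record> {
        let epoch = match &record {
            Record::EpochBump { epoch } | Record::JournalEntry { epoch, .. } => *epoch,
        };
        if epoch < self.active_epoch {
            return None; // stale append from a superseded leader: consistently ignored
        }
        match record {
            Record::EpochBump { epoch } => {
                self.active_epoch = epoch; // the old leader steps down exactly here
                None
            }
            entry => Some(entry), // journal entry from the active (or newer) leader
        }
    }
}

fn main() {
    let mut filter = EpochFilter { active_epoch: 1 };
    // The new leader announces epoch 2; a late entry from epoch 1 arrives afterwards.
    assert!(filter.admit(Record::EpochBump { epoch: 2 }).is_none());
    assert!(filter.admit(Record::JournalEntry { epoch: 1, data: "step".into() }).is_none());
    // The same step, re-executed and committed under the new leader's epoch, is admitted.
    assert!(filter.admit(Record::JournalEntry { epoch: 2, data: "step".into() }).is_some());
    println!("active epoch: {}", filter.active_epoch);
}
```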

Converged Single Binary

Restate is architected as a set of individual components that communicate with each other and make no assumptions about the whereabouts of their peers. A Restate binary can run every component or a subset of them; a set of components is described by a role.

The default configuration is the converged mode, where every binary runs every role. In that case, you get a distributed architecture in a single binary, and you can start more instances of the binary to form clusters. This mode is easy to use and efficient, because it lets different components communicate through in-memory channels and caches whenever possible (e.g., log to processor).


The roles and components of Restate

However, you can of course also run it as a disaggregated setup, where different sets of nodes run different roles. That lets you separate control plane from data plane and pick your best trade-offs in terms of cost/durability/availability for metadata and data. For example:

  • Deploy three nodes with the admin role across three different regions, ensuring application and consensus metadata are disaster-proof.
  • Deploy six nodes with the Log-Server role across three availability zones, making sure data is replicated to tolerate zone outages.
  • Deploy Ingress and Worker roles in one availability zone to strictly co-locate them with zone-local services.

Restate’s architecture gives you a great developer experience from the start (launching a single binary on your laptop) all the way to sophisticated distributed deployments (disaggregated distributed setups).

Some Performance Numbers

How fast can a system like this be? The answer is: we don’t really know yet - there is plenty more optimization we can do. Our main focus for this release was durability, resilience, and operational tooling.

But even before any deep performance optimization, the system already pushes some pretty cool numbers, both latency- and throughput-wise. Below are numbers from Restate 1.2. We measure the throughput and latency of the server against mock functions and activities, to put maximum stress on the server.

Latency

Restate aims to keep latencies low, despite giving strong guarantees on durability (replication) and consistency (strong consensus on locking keys, workflow ids, etc.).

Below are the numbers for durable functions with an increasing number of intermediate durable steps, running on a 3-way replicated cluster. Under low load, things are generally fast, and each additional step adds a median latency of about 3ms. Under high load, steps still have a median latency of 10ms after the initial workflow setup. Tail latencies under low load are a bit higher than we would like (possibly caused by some Tokio / RocksDB issues), and we believe we can get these down in the future.

| Load                           | Intermediate Steps | Latency p50 | Latency p90 | Latency p99 |
|--------------------------------|--------------------|-------------|-------------|-------------|
| Low (10 concurrent clients)    | 0                  | 5ms         | 34ms        | 54ms        |
|                                | 3                  | 15ms        | 42ms        | 69ms        |
|                                | 9                  | 31ms        | 57ms        | 93ms        |
|                                | 27                 | 61ms        | 106ms       | 155ms       |
| High (1200 concurrent clients) | 0                  | 28ms        | 41ms        | 58ms        |
|                                | 3                  | 58ms        | 76ms        | 98ms        |
|                                | 9                  | 116ms       | 138ms       | 163ms       |
|                                | 27                 | 283ms       | 320ms       | 356ms       |

One thing you can observe here is that Restate is built to make durable steps cheap. An initial function or workflow invocation needs to check whether the workflow ID, idempotency key, or object key already exists and whether it is under execution. But once an invocation has been made, adding steps is just the equivalent of a conditional append to the log.

Throughput

We ran a workflow of 9 intermediate durable steps (totalling 11 actions, including invoke/result), using 1200 concurrent clients to submit workflows.

Restate pushes 94,286 actions (steps) per second, equalling 8,571 full workflows each second. The system maintains a p50 of 116.36ms and a p99 of 163.33ms for the full workflow! The p50 per step is 10ms.

The experiment ran on AWS c6id.8xlarge nodes, which are admittedly beefy, but this also delivers a throughput that most companies will never reach in their transactional load. And if they do exceed it, Restate still scales to more nodes through more partitions.

Up next: a fast, flexible, state-of-the-art log

We mentioned that we built our own implementation of a distributed replicated log (called Bifrost), because we didn’t find any of the existing logs suitable in terms of latency (single roundtrip, quorum replication with external consensus), durability (active-active with flexible quorums), and flexibility (a segmented log that can be dynamically reconfigured).

The next post in this series will dig into all the nitty gritty details of how we built that log. This log itself is a marvel of engineering and one of the reasons Restate is as powerful as it is. If you are into distributed systems, you will enjoy that one for sure!

Try it out!

We hope you find this work as exciting as we do. If you want to try it out, the quickstart helps get you to your first demo app (and give us a star on GitHub).
To get right into the distributed fun, check this guide to running a cluster locally.

Let us know what you think or ask questions in our community on Discord or Slack and get more deep content like this from us on X, LinkedIn, Bluesky, or via email.

Restate is also available as a fully managed cloud service, if all you want is to use it and let us operate it.