Persistent serverless state machines with XState and Restate
Posted October 8, 2024 by Jack Kleeman ‐ 8 min read
XState is a fantastic JavaScript framework for building state machines. You can do this with a DSL that describes states and state transitions, or with Stately Studio, which allows you to build visually and export to code. XState is particularly popular for modelling frontend state, where each page load starts out with a fresh state machine that is updated by user input and HTTP responses. To make it possible to create state machines that span time, XState integrates directly into the JS event loop, allowing you to have delayed transitions, or even react to a Promise (itself a form of asynchronous state machine). However, this requires your code to run continuously and keep its state in memory, which means it can be tricky to move state machines to the backend.
Restate is a new kind of durable execution engine that has the goal of making it easy to turn straightforward imperative code into fault-tolerant code that runs as part of a distributed system. For JS users, it might seem a little like a distributed and persistent event loop - implementing its own ways to reliably schedule work and react to events, without your code needing to run all the time. That got us thinking - could we run XState on top of the Restate event loop? Our goal is to create ‘virtual state machines’ that wake up, react to events, and then go back to sleep, storing their state durably so they can operate indefinitely and across transient failures.
All business logic can fit into a state machine abstraction, often in a simpler and more concise form. Long-lived virtual state machines allow you to model business objects like accounts and users as state machines. Perhaps you want to send a user a follow-up email a month after registration. You could run some sort of nightly cronjob to check who needs that email, splitting that logic out from other user code. But it would be much simpler to model the user onboarding flow as a state machine that incorporates the delayed action. Your state machine doesn’t need to be running until Restate wakes it up to process the action. And you can instantiate a state machine for every signup because, while idle, they are just some bytes stored in Restate.
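To make the idea of an idle machine being "just some bytes" concrete, here is a minimal sketch in plain TypeScript. It uses neither XState nor Restate: the `store` map stands in for durable storage, and the state names, the `followUpDue` event and the `register` and `wake` helpers are all hypothetical names chosen for illustration.

```typescript
// Hypothetical onboarding states and events, for illustration only.
type OnboardingState = "registered" | "followedUp";
type OnboardingEvent = { type: "followUpDue" };

// Stand-in for a durable store: one serialized snapshot per user ID.
const store = new Map<string, string>();

// Pure transition function: current state + event -> next state.
function transition(state: OnboardingState, event: OnboardingEvent): OnboardingState {
  if (state === "registered" && event.type === "followUpDue") {
    return "followedUp";
  }
  return state; // events that do not apply in this state are ignored
}

// Create a machine instance; afterwards it exists only as a stored string.
function register(userId: string): void {
  store.set(userId, JSON.stringify({ state: "registered" }));
}

// Wake the machine for a single event, then put it back to sleep.
function wake(userId: string, event: OnboardingEvent): OnboardingState {
  const saved = store.get(userId);
  if (saved === undefined) {
    throw new Error(`no state machine for ${userId}`);
  }
  const current = (JSON.parse(saved) as { state: OnboardingState }).state;
  const next = transition(current, event);
  store.set(userId, JSON.stringify({ state: next })); // persist before returning
  return next;
}
```

Between calls to `wake`, nothing is running; the only trace of the machine is the serialized snapshot in the store.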
Let’s walk through an example of a state machine. We’ll pretend we’re a payments company and we want to implement transfers between two users. There are a few elements to consider:
- We want to decrease the balance of (debit) the sender, and increase the balance of (credit) the recipient. Failing in between these two tasks is possible and needs to be handled carefully.
- We need to run some sort of financial crime checks. In practice this looks like an approval step; some external system will be asked if the payment is allowed. This approval might come back very quickly if it’s low risk, but it might escalate to a human if not, in which case it could take days. In the slower case, we probably want to return a message to the user saying that their payment is in progress and that they’ll be emailed when it’s finished.
Let’s hop over to Stately Studio and model this:
Stately gives us most of the code we need directly from the flowchart; we really only need to implement the Promise actor updateBalance and the actions sendEmail and requestApproval. We use Promise actors where something asynchronous and fallible is happening that we must react to; actions are used to model fire-and-forget tasks. We’ll assume that the approval process, however it works, will send an ‘approved’ or ‘rejected’ message back to the state machine when it’s completed.
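The approval stage described above can be sketched as a transition table. This is plain TypeScript rather than the code Stately exports; the `timeout` event name and the `Rejected` state are assumptions made for illustration, and actions are represented only by their names.

```typescript
type State =
  | "Awaiting approval"
  | "Awaiting manual approval"
  | "Approved"
  | "Rejected"; // assumed name for the outcome of a 'rejected' message

type EventType = "approved" | "rejected" | "timeout"; // 'timeout' is an assumed name

interface Step {
  target: State;
  actions: string[]; // names of fire-and-forget actions to run on this transition
}

// Transition table for the approval stage only; the balance updates are omitted.
const transitions: Record<State, Partial<Record<EventType, Step>>> = {
  "Awaiting approval": {
    approved: { target: "Approved", actions: [] },
    rejected: { target: "Rejected", actions: [] },
    // Approval is taking a while: tell the sender that the payment is in progress.
    timeout: { target: "Awaiting manual approval", actions: ["sendEmail"] },
  },
  "Awaiting manual approval": {
    approved: { target: "Approved", actions: [] },
    rejected: { target: "Rejected", actions: [] },
  },
  Approved: {},
  Rejected: {},
};

// Look up the next step; events with no entry leave the state unchanged.
function next(state: State, event: EventType): Step {
  return transitions[state][event] ?? { target: state, actions: [] };
}
```

Because `next` is a pure lookup, the side effects (sending the email, requesting approval) stay outside the table, which mirrors the split between the flowchart and the functions that implement actions and actors.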
Let’s not worry too much about the implementation specifics of the actions and actors, though - we’ll just log when they execute. What problems would we run into if we want to deploy this state machine in the backend?
- We’d have to run it in a container that lives for as long as the payment takes to be approved, which might be days. The state would be in memory in that container, so if it crashes we have a big problem. We can’t run this in a serverless function without persisting that state, and even then the email-sending delay would be very expensive, because the state machine has to stay alive for the whole duration of the delay.
- If the container crashes right after the Debited stage finishes, the sender will have had their money taken without the receiver receiving it. We’d have to run some kind of regular scan against a database looking for payments stuck in this state and re-executing the state machine.
- For the approval workflow to get a message back to the state machine, it somehow needs to be able to communicate with the state machine that fired it off. This requires some sort of clever message routing back to the original process, or we’ll have to communicate through some sort of key-value store keyed by the payment ID.
Let’s look at how to implement this with Restate. Restate will act as the orchestrator; external services (e.g., the frontend) will submit work to it, and Restate will ensure that the work gets done, even through transient failures, by calling some service code that includes the XState library and the @restatedev/xstate library, which teaches XState how to use Restate as its event loop and state store.
The only real changes we need to make to the state machine are to use the fromPromise import from the Restate library when creating Promise actors, and to use the xstate function to convert an XState state machine to a Restate service. You can see the full implementation on GitHub, and run it with npm run payment-example.
This example runs as a Node process, but it can equally be deployed as a Lambda, a Next.js server action, or anywhere else. The Restate runtime will call out to the service over HTTP when there is work to be done. The state machine doesn’t rely on your service being healthy; every transition is a durable event within Restate, and the state machine will not move on until that event has been processed to completion. Let’s give it a try.
First, we’ll register our locally running service against a Restate instance:
> restate-server
> restate dep register http://localhost:9080
Deployment ID: dp_14F1TkDXzh0W87yLqa8yGK5
❯ SERVICES THAT WILL BE ADDED:
- payment
Type: VirtualObject ⬅️ 🚶🚶🚶
HANDLER        INPUT                                     OUTPUT
snapshot       value of content-type 'application/json'  value of content-type 'application/json'
create         value of content-type 'application/json'  value of content-type 'application/json'
send           value of content-type 'application/json'  value of content-type 'application/json'
invokePromise  value of content-type 'application/json'  value of content-type 'application/json'
✔ Are you sure you want to apply those changes? · yes
✅ DEPLOYMENT:
SERVICE REV
payment 1
Restate has discovered some methods on a Virtual Object - a keyed and stateful Restate service. We have snapshot for reading the current state (similar to actor.getSnapshot()), create to create a new state machine instance (like createMachine), send to handle input events from other systems (like actor.send()), and invokePromise, which is an internal utility method. Let’s kick off our first state machine at the virtual object key myPayment.
> curl http://localhost:8080/payment/myPayment/create --json \
'{"input": {"senderUserID": "alice", "recipientUserID": "bob", "amount": 100}}'
{"status":"active", "value":"Awaiting approval", ..}
We get an immediate response saying that the state machine has entered the ‘Awaiting approval’ state. Let’s take a look at the service logs:
[payment/create] INFO: Invoking function.
[payment/create] INFO: Requesting approval for myPayment
[payment/create] INFO: Scheduling event from myPayment to myPayment with id xstate.after.10000.Payment.Awaiting approval and delay 10000
[payment/create] INFO: Function completed successfully.
We can see that the requestApproval action ran, and a new event was scheduled. This is the Restate event loop in action; XState tells Restate to schedule a transition in 10 seconds - that will take us into the ‘Awaiting manual approval’ state. But the function doesn’t need to be running while it waits for the event to come, so it completes. If we hang around for a few more seconds, we will see that event come in:
[payment/send] INFO: Invoking function.
[payment/send] INFO: Relaying message from myPayment to myPayment : xstate.after.10000.Payment.Awaiting approval
[payment/send] INFO: Sending email to alice
[payment/send] INFO: Function completed successfully.
The event arrived and caused the sendEmail action to run. If we check the state again, we’ll see this led to a transition:
> curl http://localhost:8080/payment/myPayment/snapshot
{"status":"active", "value":"Awaiting manual approval", ..}
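The delayed transition seen in the logs above can be pictured as a timer queue that outlives the function call. The following is a simplified in-memory sketch, not Restate’s implementation: the `schedule` and `deliverDue` helpers and the explicit `now` parameter are hypothetical, and the `pending` array stands in for durable storage.

```typescript
interface ScheduledEvent {
  target: string;  // key of the state machine to wake up, e.g. "myPayment"
  eventId: string; // identifier of the event to deliver
  dueAt: number;   // timestamp in milliseconds at which to deliver it
}

// Stand-in for a durable timer store.
const pending: ScheduledEvent[] = [];

// Record an event for later delivery; the caller can return immediately afterwards.
function schedule(target: string, eventId: string, delayMs: number, now: number): void {
  pending.push({ target, eventId, dueAt: now + delayMs });
}

// Remove and return every event whose delay has elapsed at time `now`.
function deliverDue(now: number): ScheduledEvent[] {
  const due = pending.filter((e) => e.dueAt <= now);
  for (const e of due) {
    pending.splice(pending.indexOf(e), 1);
  }
  return due;
}
```

With `schedule("myPayment", "xstate.after.10000.Payment.Awaiting approval", 10000, now)`, the handler that scheduled the event can complete right away; whoever owns the queue delivers the event once the delay has passed, which is the role Restate plays in the walkthrough.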
Now we can pretend we are the manual approver and send an ‘approved’ event:
> curl http://localhost:8080/payment/myPayment/send --json '{"event": {"type": "approved"}, "source": {"id": "eve"}}'
{"status":"active", "value":"Approved", ..}
The event will move the state machine into the ‘Approved’ state. This will immediately kick off the updateBalance Promise actor, reducing the sender’s balance. This actor is its own state machine, and will execute under the invokePromise handler for as long as it needs to complete its asynchronous work (e.g., making an HTTP request to Stripe). Once complete, the actor will send a message back to the main state machine describing its success or failure state. That will progress the state machine onwards to either ‘Debited’ or ‘Refunding’, and from there to completion. Each transition is a separate Restate invocation, saving the state durably each time.
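The curl calls in this walkthrough all follow one URL pattern, so they are easy to issue from code as well. Here is a small sketch, assuming the ingress address `http://localhost:8080` and the `payment` service name used above; the `buildCall` helper is hypothetical and only prepares the request, leaving the actual `fetch` to the caller.

```typescript
type Handler = "create" | "send" | "snapshot";

interface Call {
  url: string;
  init: {
    method: "GET" | "POST";
    headers?: Record<string, string>;
    body?: string;
  };
}

// Build the request for one handler of the state machine stored under `key`.
function buildCall(
  key: string,
  handler: Handler,
  payload?: unknown,
  base: string = "http://localhost:8080", // assumed local ingress address
): Call {
  const url = `${base}/payment/${encodeURIComponent(key)}/${handler}`;
  if (payload === undefined) {
    // No input, as in the snapshot call above.
    return { url, init: { method: "GET" } };
  }
  return {
    url,
    init: {
      method: "POST",
      // Mirrors what `curl --json` sends.
      headers: { "content-type": "application/json", accept: "application/json" },
      body: JSON.stringify(payload),
    },
  };
}
```

For example, `buildCall("myPayment", "send", { event: { type: "approved" } })` yields a URL and options that can be passed to `fetch(url, init)`.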
With XState we’ve managed to take a complex, stateful business process and reduce it down to a simple flowchart and some functions to implement side effects. Restate allows us to convert this into a serverless function and deploy it anywhere we like without changing the code at all. It’s a powerful combination for writing resilient backend processes!