Code that sleeps for a month

Solving durable execution’s immutability problem: part two

Posted February 23, 2024 by Jack Kleeman ‐ 9 min read

In part one, we discussed how durable execution maintains program state in a journal, and how this can lead to complications when updating service code in such a way that the journal no longer makes sense. We observed that limiting the duration that handlers run makes this immutability problem a lot easier to deal with. But wait - doesn’t that lose one of the most interesting properties of durable execution - the ability to create business processes that operate over long durations (’code that sleeps for a month’)? At Restate, we think that with the right primitives, you don’t have to lose anything.

However, if you like to program code with long sleeps because it fits your mental model well, Restate will absolutely let you do that. You should program in a way that you find intuitive. But if you like durable execution yet are skeptical of very-long-running handlers and their versioning problems, here are a few ways you can achieve the same properties without actually making the handlers run that long, by blending in durable messaging and state.

Handler suspension

Something as simple as sending your users an email a month after they sign up is oddly difficult in general. You obviously can’t do it during the signup request, which needs to complete. You probably need to write down your intention to a database or queue and then come back to it later, which means writing some sort of cronjob or consumer. What you really want is to just call the email service but say ‘do this in a month’.

Tools like Temporal and Durable Functions give you a ‘handler suspension’ abstraction. You can write handlers that sleep for a month before resuming and doing some more work. That means for the email use case, you just need to kick off the email handler during the signup request, with a delay as a parameter, and have strong guarantees that if the execution started, then in a month that email gets sent. In Restate, this pattern would look like this:

const emailService = restate.router({
  email: async (ctx: restate.RpcContext, request: { email: string, delay: number }) => {
    // adding new steps before the sleep would be a problem
    await ctx.sleep(request.delay)
    await ctx.sideEffect(() => ses.sendEmail(...))
  }
});
const emailApi: restate.ServiceApi<typeof emailService> = { path: "email" };

Having a request in-flight for a month isn’t ideal; it means that if we want to change the email handler in a way that alters the code before the ctx.sleep call - for example, by adding a check that the email is valid - we will need to keep the old version of the code around for a full month, until existing requests on that version complete, as they would fail on the new version. That means whatever functionality we wanted to add before the sleep won’t fully take effect for a month, and it also makes it very hard to apply urgent security patches, because you need to backport them to every version that might still be invoked.
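To make that concrete, here’s a sketch of that kind of change (the isValidEmail helper is hypothetical): the new step records an extra journal entry before the sleep, so journals from in-flight invocations - whose first entry is the sleep - no longer line up with the new code on replay.

const emailService = restate.router({
  email: async (ctx: restate.RpcContext, request: { email: string, delay: number }) => {
    // hypothetical new step, e.g. checking the address against an external validator;
    // existing journals recorded the sleep as their first entry, so replaying them
    // against this version no longer matches
    const valid = await ctx.sideEffect(async () => isValidEmail(request.email))
    if (!valid) {
      return
    }
    await ctx.sleep(request.delay)
    await ctx.sideEffect(() => ses.sendEmail(...))
  }
});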

So, what’s the property we actually want here? Well, like we said, we just want to send the email service a request that we want to be processed in a month. The thing that hangs around ‘in-flight’ wouldn’t be a journal of a partially-completed workflow, with potentially many steps, but instead a single request message. And request versioning isn’t actually so hard! Only introduce new parameters as optional, and don’t delete any of the existing ones, and you’re sorted. You can also use Protocol Buffers to assist with backwards compatibility validation. Here’s what it looks like:

const emailService = restate.router({
  delayedEmail: async (ctx: restate.RpcContext, request: { email: string, delay: number }) => {
    // adding new steps here will not affect in-flight requests
    ctx.sendDelayed(emailApi, request.delay).email({ email: request.email })
  },
  // short running handler
  email: async (ctx: restate.RpcContext, request: { email: string }) => {
    await ctx.sideEffect(() => ses.sendEmail(...))
  }
});
const emailApi: restate.ServiceApi<typeof emailService> = { path: "email" };
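To tie this back to the versioning rule above, here’s a sketch of how the email request could evolve later (the subject field is hypothetical): because the new parameter is optional, month-old delayedEmail messages that don’t carry it still deserialize and run fine.

const emailService = restate.router({
  // delayedEmail unchanged
  email: async (ctx: restate.RpcContext, request: { email: string, subject?: string }) => {
    // requests sent before this field existed simply won't carry it, so fall back to a default
    const subject = request.subject ?? "Welcome!"
    await ctx.sideEffect(() => ses.sendEmail(...)) // would pass subject along here
  }
});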

The trick is to do your big sleeps between invocations, using request versioning if necessary, instead of doing your big sleeps inside invocations. We’ve expanded from a suspendable handler abstraction to a suspendable RPC abstraction. Restate is playing the role of a durable event bus as well as a durable executor.

Control loops

What if I want to write a workflow that runs a control loop - doing a set of tasks every hour, maybe even forever (or until you hit some sort of history limit)? A lot of people use workflow engines in this way; it’s pretty easy to reason about long-lived control loops. To deal with versioning, you have to stop all the loops, do the update, and kick them off again, but this can be a chore if you have thousands of them in-flight. It looks like this:

const controlService = restate.router({
  loop: async (ctx: restate.RpcContext) => {
    // pre-loop setup - must not change while requests are in-flight	
    let state: State = { ... }

    while (true) {
      // loop work - must not change while requests are in-flight
      state = await ctx.sideEffect(() => mutateState(state))

      if (endCondition) {
        return
      }

      await ctx.sleep(1000 * 3600) // 1 hour
    }
  }
});
const controlApi: restate.ServiceApi<typeof controlService> = { path: "control" };

How can we solve this problem without having unboundedly long-running handlers? We can use the same principle: do your big sleeps between invocations. In practice, this is just tail recursion:

const controlService = restate.router({
  setup: async (ctx: restate.RpcContext) => {
    // pre-loop setup
    const state = { ... }

    ctx.send(controlApi).loop(state)
  },
  loop: async (ctx: restate.RpcContext, state: State) => {
    // loop work
    const nextState = await ctx.sideEffect(() => mutateState(state))

    if (!endCondition) {
      // schedule the next iteration
      ctx.sendDelayed(controlApi, 1000 * 3600).loop(nextState) // 1 hour
    }
  },
});
const controlApi: restate.ServiceApi<typeof controlService> = { path: "control" };

Here, again, we have tightly limited what needs to be versioned; it’s just the request body, in this case the loop state, which you need to ensure is consistent from one iteration to the next.

Virtual objects

Another common use case for long-running handlers is using them as a way to ‘own’ some complex state, which is rebuilt in memory from the journal every time the code is replayed. This simulates an experience not unlike Cloudflare’s Durable Objects. Essentially we are using the journal of a particular invocation as a sort of append-only log. For example, we might use a workflow to store the state of a chess game. This pattern actually isn’t currently a first-class citizen in Restate (although you can build a library for it yourself). We’ll discuss why in a second - but for now let’s pretend we have a stream primitive a bit like Temporal’s signal concept that allows in-flight handlers to receive information from other handlers.

const chessService = restate.router({
  game: async (ctx: restate.RpcContext, gameID: string) => {
    let boardState = { ... }

    // ctx.stream does not exist!
    for await (const move of ctx.stream("moves").receive()) {
      if (verifyMove(boardState, move)) {
        boardState = applyMove(boardState, move)
      }
    }
  },
  move: async (ctx: restate.RpcContext, move: Move) => {
    // ctx.stream does not exist!
    ctx.stream("moves").send(move)
  },
});
const chessApi: restate.ServiceApi<typeof chessService> = { path: "chess" };

Remember, this handler only appears to be long running - the runtime may suspend execution after every incoming move, which is why it can run in serverless environments. So what actually stores the previous moves? When the handler resumes because of a new move, it will replay every move it’s received up to now, reverify all of them, and build the board state, until it gets to the point where it’s ready to process the next move. This is a bit unfortunate; if we presume that verifyMove is O(1), then we are doing an O(N) operation on every move unnecessarily. That’s because the runtime has no information about what the state of the handler on suspension was; it only knows how to rebuild it from scratch. This is a fairly typical problem in event-sourcing approaches. Essentially, we are using the ever-growing journal of the handler as a database; a clever trick, but one with performance implications that make suspension more expensive.

That’s why Restate actually exposes a key value database, so that invocations on a particular key can communicate with future invocations on that key explicitly through state:

const chessService = restate.keyedRouter({
  move: async (ctx: restate.RpcContext, gameID: string, move: Move) => {
    const boardState = await ctx.get<BoardState>("boardState")
    if (!verifyMove(boardState, move)) {
      throw new TerminalError(`Invalid move ${move}`)
    }
    ctx.set("boardState", applyMove(boardState, move))
  },
});
const chessApi: restate.ServiceApi<typeof chessService> = { path: "chess" };

So, instead of streams that send messages to in-flight handlers which are accumulating state in the journal, we can just send RPCs to services that are accumulating state in Restate. This makes things a lot easier to reason about; RPC is always the mechanism that services use to communicate, and explicit state is always the mechanism they use to remember things. Zooming out, this looks no different than normal services which store state in a persistent store.
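For example, here’s a sketch of another service driving the game over RPC. The playerService wrapper is hypothetical, and we’re assuming the keyed call takes the game ID as its first argument:

const playerService = restate.router({
  play: async (ctx: restate.RpcContext, request: { gameID: string, move: Move }) => {
    // one-way: Restate durably queues the RPC to the chess service, keyed by gameID
    ctx.send(chessApi).move(request.gameID, request.move)
    // or request-response, if we want an invalid move to fail this call as well:
    // await ctx.rpc(chessApi).move(request.gameID, request.move)
  },
});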

But wait - why can’t we just use any KV store for this? Lots of people write workflows that interact with Postgres, for example. Well, then we introduce the dual-write problem; you need to keep the store in sync with your execution, regardless of failure. Durable execution alone won’t save you; you need some transactional and locking semantics from the store, and to integrate them correctly. With Restate, you get a state store that is guaranteed to update atomically with the execution progress of your handler, keeping things consistent across failures, race conditions, and network partitions.
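To spell out the hazard, here’s a sketch of the chess handler written against an external store instead (the db helpers are hypothetical): durable execution replays the handler faithfully, but nothing coordinates the external store with that replay.

const chessService = restate.router({
  move: async (ctx: restate.RpcContext, request: { gameID: string, move: Move }) => {
    // hypothetical helpers around an external store, e.g. Postgres
    const boardState = await ctx.sideEffect(() => db.getBoardState(request.gameID))
    // two concurrent moves on the same game can both read the same state here, and a
    // crash after the write commits but before it is journalled means the retry writes
    // again - so the store itself has to provide locking/transactions and idempotency
    await ctx.sideEffect(() => db.setBoardState(request.gameID, applyMove(boardState, request.move)))
  },
});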

Conclusion

We don’t know yet whether these sorts of tactics can replace every use case for long-running handlers, but we think they can cover a lot of them. And if we can make most handlers short-running, then versioning can be solved by immutable deployments, where old versions are kept around only long enough for invocations on those versions to complete, which will usually only be a few minutes. However, we’re not opinionated about this! Sometimes a long-running handler is a really convenient way to solve a problem, and the associated versioning concerns can sometimes be mitigated via version-aware control flow1, or by restarting handlers on updates.


  1. There are ways to upgrade in-progress workflows by putting version-specific branches into the code, but we believe this should only be a last-resort approach, because it leads to very hard-to-maintain code as these branches accumulate. ↩︎
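For completeness, here’s a sketch of what such a branch tends to look like, assuming the handler has journalled a version marker since its first deployment (the CODE_VERSION constant and the isValidEmail check are illustrative, not Restate features):

const CODE_VERSION = 2

const emailService = restate.router({
  email: async (ctx: restate.RpcContext, request: { email: string, delay: number }) => {
    // journalled on first execution and replayed afterwards, so every invocation
    // keeps taking the branch of the version it started on
    const version = await ctx.sideEffect(async () => CODE_VERSION)
    if (version >= 2) {
      // new pre-sleep validation, only for invocations started on v2 or later
      const valid = await ctx.sideEffect(async () => isValidEmail(request.email))
      if (!valid) {
        return
      }
    }
    await ctx.sleep(request.delay)
    await ctx.sideEffect(() => ses.sendEmail(...))
  }
});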