Agentic Workflows Are Just Code — Treat Them That Way
Resilience, suspendability, observability, human-in-the-loop, and multi-agent coordination, for any agent and SDK.

The AI agents space is rapidly evolving and maturing, and more and more developers are calling attention to productionization concerns like the resilience, reliability, scaling, and observability of AI agents. As engineers who have dedicated most of our professional lives to the resilience and scalability of various types of applications, we wholeheartedly agree with the sentiment.
However, we believe agent-specific workflow solutions are the wrong answer, for two reasons: (1) agentic workflows are ultimately just regular programs (yes!), and (2) agentic workflows are part of your broader infrastructure and interact with other services, so they should use a solution that works for the full stack. Vercel’s CTO said it well on their blog:
Building AI agents might seem like a new thing that calls for new abstractions, but it is just regular programming. Use if-statements, loops, or switches, whatever fits. Don’t overthink the structure.
A great approach to writing resilient general-purpose code is Durable Execution: an almost magical way to write code as if it could run forever without failing. To demonstrate just how well this fits AI agents, we show how to use Restate to give existing agents magic resilience, suspendability, observability, human-in-the-loop capabilities, multi-agent coordination, and more, in just a few lines of code.
Restate is a lightweight, open-source (single-binary) runtime for building innately resilient backend services, combining Durable Execution with state management and communication.
The approach here is independent of and orthogonal to existing SDKs: we use both the Vercel AI SDK (TypeScript) and the OpenAI Agents SDK (Python) for the examples. All the code snippets can be found in the Restate AI examples GitHub repository.
Durable Execution: workflow guarantees for existing agents and code
Durable Execution is a mechanism that gives you workflow guarantees for your regular code. It records relevant steps in a journal and uses that journal to recover the code execution after a crash (or, as we’ll see later, a long sleep or an external signal). Unlike traditional workflow engines (e.g., Airflow), Durable Execution works with dynamically composed flows and doesn’t require an up-front workflow graph. Putting Durable Execution underneath our agent gives us a whole set of almost magical properties right away: code that behaves as if it could run forever and never crash.
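To make the replay mechanism concrete, here is a minimal, illustrative sketch (a toy, not the Restate implementation): a journal records each completed step’s result, and a re-execution after a crash returns the recorded results instead of re-running the steps.

```python
class Journal:
    """Toy journal: records completed steps so a retry can replay them."""

    def __init__(self):
        self.entries = {}  # step name -> recorded result

    def run(self, name, fn):
        if name in self.entries:
            return self.entries[name]  # replay: reuse the recorded result
        result = fn()                  # first execution: do the work
        self.entries[name] = result
        return result

journal = Journal()
calls = []

def llm_call():
    calls.append("llm")
    return "draft answer"

first = journal.run("llm call", llm_call)
# A retry after a crash replays the journal instead of re-invoking the LLM:
second = journal.run("llm call", llm_call)
assert first == second == "draft answer"
assert calls == ["llm"]  # the expensive step ran exactly once
```

A real runtime persists this journal outside the process, which is what makes recovery survive crashes and restarts.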
To make use of Durable Execution, we need to wrap expensive and non-deterministic actions (LLM inference calls and tool invocations) into durable steps that can be restored after a crash. Many popular AI SDKs (e.g., Vercel AI SDK, OpenAI Agents SDK) let you plug in middleware for that purpose. Here is an example of how you can turn a tool into a durable tool, whose results are restored after failures (the Restate Context will be connected to Restate in the next step).
Vanilla Vercel AI SDK:
const model = openai("gpt-4o-2024-08-06");

const weather = tool({
  description: "Get the current weather for a given city.",
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => {
    const result = await fetchWeather(city);
    return await parseWeatherResponse(result);
  },
});
Durable Model / Tool (Restate):
const model = wrapLanguageModel({
  model: openai("gpt-4o-2024-08-06"),
  middleware: durableCalls(restate_ctx, { maxRetryAttempts: 3 }),
});

const weather = tool({
  description: "Get the current weather for a given city.",
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => {
    const result = await restate_ctx.run("get weather",
      async () => fetchWeather(city)
    );
    return await parseWeatherResponse(result);
  },
});
Vanilla OpenAI Agents SDK:
# agent entry point
result = await Runner.run(
    my_agent,
    input=message,
    run_config=RunConfig(model="gpt-4o"),
)

@function_tool
async def get_weather(
    wrapper: RunContextWrapper[restate.Context],
    req: WeatherRequest
) -> WeatherResponse:
    """Get the current weather for a given city."""
    result = await fetch_weather(req.city)
    return await parse_weather_data(result)
Durable Model / Tool (Restate):
# agent entry point
result = await Runner.run(
    my_agent,
    input=message,
    context=restate_ctx,
    run_config=RunConfig(
        model="gpt-4o",
        model_provider=DurableModelCalls(restate_ctx),
    ),
)

@function_tool
async def get_weather(
    wrapper: RunContextWrapper[restate.Context],
    req: WeatherRequest
) -> WeatherResponse:
    """Get the current weather for a given city."""
    restate_ctx = wrapper.context
    result = await restate_ctx.run("Get weather", fetch_weather, args=(req.city,))
    return await parse_weather_data(result)
These are all the code changes you need in your agent.
As a final step, we need to serve the agent and connect it to Restate. Restate’s server sits in front of your agent process, similar to a message broker or a reverse proxy. You call your agent through Restate, letting it take full control of the connection to transparently handle failure detection, retries, scaling, and concurrency control, while keeping the agent process lightweight.
This setup gives you a fully resilient agentic workflow that can run for a very long time and recover previous progress after failures! As a bonus, we also get idempotency for free and the ability to detach from and re-attach to agents, schedule calls, or trigger agents directly from Kafka.
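The idempotency part can be pictured with a small sketch (hypothetical names, not the Restate API): a retried request that carries the same idempotency key is deduplicated and attached to the original result instead of kicking off the expensive work twice.

```python
# Toy ingress-side deduplication by idempotency key (illustrative only).
results = {}     # idempotency key -> recorded result
executions = 0   # how many times the expensive work actually ran

def invoke(idempotency_key, task):
    global executions
    if idempotency_key in results:
        return results[idempotency_key]  # duplicate: attach to prior result
    executions += 1                      # first time: actually run the agent
    results[idempotency_key] = f"done: {task}"
    return results[idempotency_key]

first = invoke("req-123", "summarize report")
retry = invoke("req-123", "summarize report")  # client retry, same key
assert first == retry
assert executions == 1  # the agent ran once despite two requests
```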

Observability
Because Restate manages the calls to the agentic workflow function and tracks the progress journal, it knows a great deal about what your agent is doing. Without setting up any additional infrastructure, you can see all executions and steps of your agentic workflows. You can also inspect how agents interact with each other and their state (sorry for the spoiler; keep reading for the details).

This is a treasure trove of information to learn, debug, and audit what your agents did. You can find more examples of what the UI shows you in this blog post. Imagine how useful it would be to have this kind of information when deploying your agents to production!
Long-running tasks & human-in-the-loop
Sometimes you need to include a human evaluator, approval step, or another external signal in an agentic workflow. In a simple standalone app, you could model this by awaiting a promise/future that gets completed via a callback. Luckily, Durable Execution allows us to do the exact same thing, without worrying about failures or interruptions.
On top of that, since all the agent’s progress is stored in a journal, we can shut down the agent when it awaits the promise and restore it once the approval has come in (the promise is completed). This is particularly valuable for serverless platforms where you get billed by the millisecond: you pay for active work, not the wait time.
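As a rough sketch (made-up names, not the Restate API), a durable promise is just a named slot in the journal: the workflow parks on it, an external callback resolves it, and because the resolution is recorded, a restarted execution sees it as already completed.

```python
# Illustrative durable-promise sketch for human-in-the-loop approval.
class DurablePromises:
    def __init__(self):
        self.completed = {}  # promise name -> recorded result

    def resolve(self, name, value):
        self.completed[name] = value  # e.g., triggered by an approval webhook

    def poll(self, name):
        # Returns (done, value); a real runtime suspends instead of polling.
        return (name in self.completed, self.completed.get(name))

promises = DurablePromises()
done, _ = promises.poll("manager-approval")
assert not done  # the agent parks here; no process needs to stay running

promises.resolve("manager-approval", "approved")  # the human signs off
done, value = promises.poll("manager-approval")
assert done and value == "approved"  # a replay resumes with the result
```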


Depending on the deployment, you might want to suspend your agent even while it awaits the result of a long LLM inference call. You can do that simply by moving the inference calls to a different process and letting it call back into the agent by completing a durable promise with the inference result. Some Durable Execution platforms (for example Restate) let you mix and match FaaS and long-running processes/containers, so you can run the AI agent code on FaaS (like Lambda) and move the inference calls to a container (à la Fargate).
Beyond Durable Execution: Sessions and memory
So far, we’ve focused on handling a single request and a single conversation. But in many scenarios, you have long-running multi-turn conversations with agents. A user might start a conversation now, respond hours later, and return again after a few days. Multiple users may have separate conversations running at the same time, and a single conversation may be open in multiple browser windows.
While Durable Execution doesn’t handle this directly, Restate extends it with a feature called Virtual Objects: durable functions with an identity and the ability to store state. A single object (identified by a key, such as user_id or session_id) would represent a specific multi-step conversation, allowing for long-lived stateful interactions.
Virtual Objects guarantee that a single instance exists per key, queue interactions, and store transactional state. They offer a convenient way to store the message history and other data, such as the last agent the user talked to.
// Keyed by session ID
const agent = restate.object({
  name: "Agent",
  handlers: {
    run: async (restate_ctx: restate.ObjectContext, message: string) => {
      // Load the session context
      const messages = (await restate_ctx.get<Message[]>("messages")) ?? [];
      messages.push({ role: "user", content: message });
      const result = await runVercelAIAgent(restate_ctx, messages);
      // Store the session context
      messages.push({ role: "assistant", content: result });
      restate_ctx.set("messages", messages);
      return result;
    },
  },
});
# Keyed by session ID
agent = restate.VirtualObject("Agent")

@agent.handler()
async def run(restate_ctx: restate.ObjectContext, message: str) -> str:
    # Load the session context
    messages = await restate_ctx.get("messages") or []
    messages.append({"role": "user", "content": message})
    result = await runOpenAIAgent(restate_ctx, messages)
    # Store the session context
    messages.append({"role": "assistant", "content": result})
    restate_ctx.set("messages", messages)
    return result
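The single-instance-per-key guarantee can be sketched with plain asyncio (purely illustrative: Restate enforces this across processes and restarts, not just within one event loop). Calls to the same key queue behind each other, while different keys run concurrently.

```python
import asyncio

locks = {}   # one queue (lock) per virtual-object key
order = []   # observed execution order of handler bodies

async def run(key, message):
    lock = locks.setdefault(key, asyncio.Lock())
    async with lock:  # at most one in-flight call per key
        order.append((key, message))
        await asyncio.sleep(0.01)  # stand-in for the agent loop

async def main():
    await asyncio.gather(
        run("session-a", "hi"),
        run("session-a", "again"),   # queues behind "hi"
        run("session-b", "hello"),   # independent key, runs concurrently
    )

asyncio.run(main())
# Per-key ordering is preserved even under concurrent calls:
assert [m for k, m in order if k == "session-a"] == ["hi", "again"]
```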
The Restate UI gives us a nice overview of the state stored in the Agent objects:

This approach is complementary to AI memory solutions like mem0 or Graphiti. You can use Virtual Objects to enforce session concurrency and queueing (and optionally remember session context) while storing the agent’s memory in mem0.
Resilient multi-agent systems
We now have stateful, resilient, long-running agents; this includes multi-agent applications as modeled by the OpenAI Agents SDK, where all agents share the same process and loop. In that setup, handing work over to another agent primarily means switching the prompt, tool set, and some context/history for the next loop iteration.
For true distributed multi-agent setups, where agents run concurrently as separate processes (to execute and scale independently), the final missing piece is reliable asynchronous communication:
- Communication channels that recover from failures
- End-to-end idempotency to avoid kicking off expensive work twice
- Suspending the calling agent while the callee agents are doing work
- Reliable scheduling of agent invocations, for periodic work
Restate extends Durable Execution with such messaging and RPC between durable functions, so handing over work to another agent looks just like RPC-ing them. The examples below expose remote agents via tools:
tool({
  description: "Handoff to BlueSky agent for research.",
  parameters: z.object({ prompt: z.string() }),
  execute: async ({ prompt }) =>
    await restate_ctx.serviceClient(blueSkyAgent).run_agent(prompt),
});
@function_tool
async def handoff_to_bluesky_agent(
    wrapper: RunContextWrapper[restate.Context], prompt: str
) -> str:
    """Handoff to a BlueSky agent for answering general questions."""
    restate_ctx = wrapper.context
    return await restate_ctx.service_call(bluesky_agent.research, arg=prompt)
While this looks like a simple RPC client making a call, the invocation of the target agent is asynchronous and durable (like a queue): the caller can suspend while awaiting a response, the call can be detached, re-attached, or canceled, and you can kick off and await multiple remote agents in parallel. Because Restate acts as both the message/RPC broker and the Durable Execution orchestrator on the caller and callee sides, it can transparently guarantee end-to-end idempotency and resilience. The same mechanism also lets us reliably schedule invocations, for example, to run an agentic task later.
tool({
  description: "Schedule a task to be executed by the agent after a delay.",
  parameters: z.object({ task: z.string(), delay: z.number() }),
  execute: async ({ task, delay }) => {
    restate_ctx
      .serviceSendClient(taskAgent)
      .doTask(task, restate.rpc.sendOpts({ delay }));
  },
});
@function_tool
async def schedule_task(
    wrapper: RunContextWrapper[restate.Context], task: str, delay: timedelta
):
    """Schedule a task to be executed by the agent after a delay."""
    restate_ctx = wrapper.context
    restate_ctx.service_send(task_agent, arg=f"Execute task: {task}", send_delay=delay)
If this feels reminiscent of A2A, that is no coincidence: Restate can be thought of as a general-purpose stateful task-orchestration framework, and A2A, an orchestration protocol for agents, can be implemented easily on top of it. If you are adopting A2A, here is an implementation of a fully resilient A2A server using Restate that can be self-hosted and scales from a laptop to a multi-zone cluster.
Build Agentic Workflows like any other code
At the end of the day, agents are just programs. The same principles that make any system or backend service resilient and scalable apply to them.
Durable Execution, paired with your existing SDKs, gives your agents a powerful upgrade: resilience to failure, observability by default, suspendability, memory, and multi-agent coordination, without locking you into a specific AI framework or cloud service.
And when agents look and run like any other code, it also becomes easy to move between deterministic workflows that use AI tools and full-blown agentic sections that employ LLM/tool loops for autonomous problem solving. That is ultimately the sweet spot for AI applications today.
If this resonates with you, here are some ways to get started:
- 🚀 Start with our Vercel AI or OpenAI templates
- 🔧 Dive deeper into how Durable Execution works under the hood
- ☁️ Try Restate Cloud or self-host
- ✨ Star us on GitHub and join the conversation on Discord or Slack — we’d love to hear what you’re building.
Note: Parallel tool calls aren’t supported out of the box because their replay would be non-deterministic. For that, use Restate’s promise combinators (TS/Python) and wrap the logic in a single tool.
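The underlying issue can be illustrated with plain asyncio (an analogy, not Restate’s combinator API): a combinator like gather fixes the order in which parallel results are observed, so the journaled outcome is deterministic even when completion order varies between runs.

```python
import asyncio

async def tool_call(name, delay):
    await asyncio.sleep(delay)  # stand-in for a tool or LLM call
    return name

async def parallel_tool():
    # gather returns results in argument order, not completion order,
    # so a replay observes the same result ordering every time
    return await asyncio.gather(
        tool_call("weather", 0.02),
        tool_call("news", 0.01),
    )

results = asyncio.run(parallel_tool())
assert results == ["weather", "news"]  # "news" finished first, order still fixed
```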