
Aient builds a production-aware AI DevOps agent on Restate
“Aient turns runtime telemetry into merged code fixes, by detecting production problems, finding the root cause, and opening pull requests automatically. Restate powers the agent harness, durable tool execution, and streaming.”
Henrik Feldt
Founder, Aient
Aient is a new kind of team member — one that detects problems in production, triages them, builds context from runtime telemetry, and delivers a fix as a GitHub pull request ready to review and merge. Setup takes minutes. Connect GitHub, merge an instrumentation PR, plug into Slack or Linear. The pipeline runs from runtime signal to reviewed code fix without engineering overhead.
Aient's entire backend runs on Restate, with Supabase as a read model.
I wanted to get the correctness, but not spend the time. In earlier projects, I spent more time building infrastructure than building product. And I wanted to avoid that this time.
-- Henrik Feldt, Founder at Aient
Before Restate: Architecture & Challenges
At the core of Aient is a custom agent harness built on the Vercel AI SDK and modelled on the surface area of Anthropic's TypeScript SDK. Agents running in the harness carry out complex work involving many LLM and tool calls. Getting this to production meant solving four problems:
- Durable tool dispatch: Vercel AI's native exec runs tool calls inline in the process. Aient needed tool calls intercepted, dispatched to a sandbox elsewhere, and individually resumable.
- Long-running sessions and compaction: agent sessions can run for hours or days, so context and token cost have to be managed without losing the thread.
- Per-agent inboxes and turn-taking: each agent needs serialized message handling so conversations behave predictably.
- Resumable and suspendable everything: every HTTP request and background task had to survive failures, and idle agents had to be cheap to keep around.
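To make the compaction concern from the list above concrete, here is a minimal sketch of one common approach: keep the most recent messages verbatim and fold older ones into a single summary once an estimated token budget is exceeded. The source does not describe how Aient's harness compacts context, so everything here is an assumption for illustration: the ChatMessage shape, the four-characters-per-token estimate, the placeholder summarize function (a real harness would more likely ask a model for the summary), and the budget values.

```typescript
// Hypothetical message shape; not the Vercel AI SDK's or Aient's actual type.
type ChatMessage = { role: "user" | "assistant" | "tool"; text: string };

// Rough heuristic, assumed for illustration: about 4 characters per token.
function estimateTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((sum, m) => sum + m.text.length, 0);
  return Math.ceil(chars / 4);
}

// Placeholder summarizer that only reports how many messages were folded away.
function summarize(messages: ChatMessage[]): ChatMessage {
  return { role: "assistant", text: `[summary of ${messages.length} earlier messages]` };
}

// Keeps the newest `keepRecent` messages as they are and replaces everything
// older with one summary message, but only once the estimate exceeds `budget`.
function compact(messages: ChatMessage[], budget: number, keepRecent: number): ChatMessage[] {
  if (estimateTokens(messages) <= budget || messages.length <= keepRecent) {
    return messages;
  }
  const cut = messages.length - keepRecent;
  return [summarize(messages.slice(0, cut)), ...messages.slice(cut)];
}

const transcript: ChatMessage[] = [
  { role: "user", text: "x".repeat(400) },
  { role: "tool", text: "y".repeat(400) },
  { role: "assistant", text: "z".repeat(400) },
  { role: "user", text: "Is the checkout error fixed?" },
];

// With a budget of 200 estimated tokens, the two oldest messages are folded.
const compacted = compact(transcript, 200, 2);
console.log(compacted.map((m) => m.text.slice(0, 40)));
```

The trade-off in a sketch like this is between token cost and fidelity: a larger keepRecent preserves more of the thread verbatim, while a smaller one saves tokens but leans harder on the quality of the summary.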
Why Restate?
In earlier work, the Aient team built similar systems from scratch, using event sourcing, snapshotting, and other workflow orchestrators for resiliency. The infrastructure layer consumed more engineering time than the product on top of it. That experience shaped what Aient is built to be: not a monitoring tool, but a permanent member of the engineering team. One that catches every error, builds context from the full telemetry picture, and contributes fixes directly to the codebase, so the rest of the team can keep shipping, knowing production is always being watched by someone who never sleeps and never misses a signal.
That experience is also what Aient is built to solve for others: the engineering overhead of keeping production running shouldn't fall on the team building the product.
After comparing some alternatives, they chose Restate as the basis of their agent harness because it gives them a trustworthy programming model for long-running/agentic workflows, with clear runtime behavior and good operational primitives:
- Persistence as a property of the runtime. Instead of bolting durability onto application code, Restate makes resumability and low cost-to-operate properties of the runtime itself.
- A programming model that fits a distributed-systems harness. Intercepting tool calls, dispatching them elsewhere, and making the whole loop resumable mapped naturally onto Restate's primitives, rather than requiring a streaming or DAG model on top.
- The Restate operator for Kubernetes. Continuous deployment and version tracking across long-running work is genuinely hard. The operator already handles those patterns out of the box.
- Clear durability/serialization boundaries. With Restate, the points where work is journaled are explicit. In particular, ctx.run makes it clear what crosses the durable boundary, which made the system easier to reason about operationally.
- A simple type model. In other workflow systems, code can look like it returns normal data while actually returning deferred values or handles. In Restate, ctx.run returns real data and makes the durable boundary explicit, so IDE types match runtime behavior much more closely, creating a nicer developer experience.
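The idea of an explicit durable boundary that returns real data can be illustrated with a toy journal. This is not the Restate SDK and not Aient's code: the real ctx.run is asynchronous and its journal is persisted by the Restate runtime, whereas the sketch below keeps a synchronous in-memory Map so it can run on its own. ToyContext, investigate, and the step names are hypothetical. What it shows is the replay behavior: a step that already completed is not executed again when the handler is retried, and the caller gets a plain value back rather than a handle.

```typescript
// Toy model of journaled execution. This is NOT the Restate SDK: the real
// ctx.run is asynchronous and its journal is persisted by the Restate runtime.
type Journal = Map<string, string>; // step name -> JSON-serialized result

class ToyContext {
  private journal: Journal;

  constructor(journal: Journal) {
    this.journal = journal;
  }

  // Executes `action` the first time a step name is seen and records the
  // result; on later attempts the recorded result is returned instead.
  run<T>(name: string, action: () => T): T {
    const recorded = this.journal.get(name);
    if (recorded !== undefined) {
      return JSON.parse(recorded) as T; // replay: the action is skipped
    }
    const result = action();
    this.journal.set(name, JSON.stringify(result)); // crossing the durable boundary
    return result;
  }
}

let modelCalls = 0;
let crashPending = true;

// Hypothetical handler: one "model" step, then one "tool" step.
function investigate(ctx: ToyContext): string {
  const plan = ctx.run("plan", () => {
    modelCalls += 1;
    return { tool: "fetch_logs", service: "checkout" };
  });
  if (crashPending) {
    crashPending = false;
    throw new Error("simulated crash after the plan step");
  }
  const logs = ctx.run(`tool:${plan.tool}`, () => [`timeout talking to ${plan.service}`]);
  return logs[0];
}

const journal: Journal = new Map();
let finding = "";
try {
  finding = investigate(new ToyContext(journal));
} catch {
  // Retry against the same journal: "plan" is replayed, not re-executed.
  finding = investigate(new ToyContext(journal));
}
console.log(finding, modelCalls);
```

Note that plan is an ordinary object inside the handler, so plan.tool type-checks and behaves the same way on the first attempt and on replay, which is the property the type-model bullet above describes.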
We're building something that sits in the middle of other people's production systems, 24/7. Reliability isn't a feature we'll add later — it's the whole point. Restate was the obvious choice: the same reasons that make it right for us today are the reasons it'll still be right when we scale.
-- Henrik Feldt, Founder at Aient
The Results
Aient's backend runs entirely on Restate today, with roughly 87 controller loops and services, including the agent harness, the MCP server, and the operational loops that monitor customer environments. Supabase sits alongside as a read model.
- Durable Execution: All services and agent loops run on Restate, with persisted journals so agents are resumable rather than restarting from scratch on failure.
- Virtual Objects: Each investigation runs as its own serialized agent with a dedicated inbox. Concurrent signals — a Slack reply, a new error spike, a developer query via MCP — are handled in order, so the agent always has a consistent view of what's happening.
- Durable remote tool calls: Tool calls are lifted out of the inline Vercel AI exec path and dispatched as explicit Restate calls, so each tool invocation is durable and recoverable on its own.
- Pub/sub for streaming: Model output streams through Restate's pub/sub in real time — including the agent's reasoning steps. Engineers can follow what Aient is thinking as it works, not just see the result when it's done.
- Awakeables for human-in-the-loop: Aient meets engineers where they already work (Slack, Linear, and MCP inside their coding tools). Every interaction, whether it's a Slack reply, a Linear comment, or a query from an AI assistant, feeds into a single persistent thread per investigation. Because that thread lives in Restate, Aient always has complete context regardless of where the engineer responded from. Agents suspend on an awakeable when waiting for human input and resume the moment a signal arrives. No resources are held open while waiting and no context gets lost.
- Single-thread context across channels: Because each investigation runs as one persistent thread in Restate, Aient maintains full context whether an engineer responds via Slack, comments in Linear, or queries through MCP in their IDE.
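Two of the patterns in the list above, the serialized per-investigation inbox and the suspend-until-a-human-answers step, can be pictured with a small toy model. This is not the Restate SDK: in Restate, Virtual Objects are keyed and process exclusive handlers one at a time per key, and an awakeable is a durable promise that an external caller resolves by its ID. The sketch below only imitates that shape in memory. The Investigation class, the InboxSignal type, and the awk- ID format are hypothetical names chosen for illustration.

```typescript
// Toy model only; names and shapes are assumptions, not Restate's API.
type InboxSignal = {
  source: "slack" | "linear" | "mcp" | "telemetry";
  text: string;
  awakeableId?: string; // set when the signal answers a pending question
};

class Investigation {
  readonly thread: string[] = []; // the single ordered context for this investigation
  private inbox: InboxSignal[] = [];
  private waitingOn: string | null = null;
  private nextId = 0;

  // The agent asks a question and suspends; it does no work until resolved.
  askHuman(question: string): string {
    this.nextId += 1;
    this.waitingOn = `awk-${this.nextId}`;
    this.thread.push(`agent asks: ${question}`);
    return this.waitingOn;
  }

  enqueue(signal: InboxSignal): void {
    this.inbox.push(signal);
  }

  // Handles queued signals strictly one at a time, in arrival order.
  drain(): void {
    let signal = this.inbox.shift();
    while (signal !== undefined) {
      this.thread.push(`${signal.source}: ${signal.text}`);
      if (signal.awakeableId !== undefined && signal.awakeableId === this.waitingOn) {
        this.waitingOn = null; // the pending question is answered
        this.thread.push("agent resumes");
      }
      signal = this.inbox.shift();
    }
  }

  get suspended(): boolean {
    return this.waitingOn !== null;
  }
}

const inv = new Investigation();
const questionId = inv.askHuman("Safe to roll back the checkout deploy?");
inv.enqueue({ source: "telemetry", text: "error spike in checkout" });
inv.enqueue({ source: "slack", text: "yes, roll back", awakeableId: questionId });
inv.enqueue({ source: "mcp", text: "what is the status?" });
const wasSuspended = inv.suspended;
inv.drain();
console.log(inv.thread);
```

Because every channel writes into the same ordered thread, the MCP query in this example is handled after the Slack reply has already resumed the agent, so it sees the up-to-date state.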
The result is an agent that behaves less like a tool and more like a senior engineer who's always on call — one that spots the problem, does the diagnostic work, and shows up with a PR. The team still owns every decision. Aient just makes sure there's always something ready to decide on.