Aient builds a production-aware AI DevOps agent on Restate

AI / Developer Tools · TypeScript · Vercel AI SDK · Kubernetes · Supabase
Aient turns runtime telemetry into merged code fixes by detecting production problems, finding the root cause, and opening pull requests automatically. Restate powers the agent harness, durable tool execution, and streaming.

Henrik Feldt

Founder, Aient

Aient is a new kind of team member — one that detects problems in production, triages them, builds context from runtime telemetry, and delivers a fix as a GitHub pull request ready to review and merge. Setup takes minutes. Connect GitHub, merge an instrumentation PR, plug into Slack or Linear. The pipeline runs from runtime signal to reviewed code fix without engineering overhead.

Aient's entire backend runs on Restate, with Supabase as a read model.

I wanted to get the correctness, but not spend the time. In earlier projects, I spent more time building infrastructure than building product. And I wanted to avoid that this time.

-- Henrik Feldt, Founder at Aient

Before Restate: Architecture & Challenges

At the core of Aient is a custom agent harness built on the Vercel AI SDK and modelled on the surface area of Anthropic's TypeScript SDK. Agents running in the harness carry out complex work involving many LLM and tool calls. Getting this to production meant solving:

  • Durable tool dispatch: Vercel AI's native exec runs tool calls inline in the process. Aient needed tool calls intercepted, dispatched to a sandbox elsewhere, and individually resumable.
  • Long-running sessions and compaction: agent sessions can run for hours or days, so context and token cost have to be managed without losing the thread.
  • Per-agent inboxes and turn-taking: each agent needs serialized message handling so conversations behave predictably.
  • Resumable and suspendable everything: every HTTP request and background task had to survive failures, and idle agents had to be cheap to keep around.
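
The turn-taking requirement can be pictured with a small, self-contained TypeScript sketch. It is a toy model under stated assumptions, not Aient's code and not the Restate SDK: `AgentInbox`, `InboxMessage`, and the handler are hypothetical names, and everything lives in memory. It only shows the behavior the list asks for, namely that a message arriving mid-turn waits until the current turn has finished.

```typescript
// Toy model of per-agent turn-taking. All names are hypothetical;
// nothing here is durable and none of it is the Restate SDK.
type InboxMessage = { source: "slack" | "linear" | "mcp"; text: string };

class AgentInbox {
  private queue: InboxMessage[] = [];
  private busy = false;

  constructor(
    private readonly handle: (message: InboxMessage, inbox: AgentInbox) => void,
  ) {}

  // Accept a message; process it now only if no turn is in progress.
  deliver(message: InboxMessage): void {
    this.queue.push(message);
    if (this.busy) return; // the current turn finishes first
    this.busy = true;
    try {
      // Drain the queue in arrival order, one turn at a time.
      for (let next = this.queue.shift(); next !== undefined; next = this.queue.shift()) {
        this.handle(next, this);
      }
    } finally {
      this.busy = false;
    }
  }
}

// Record of turn boundaries, to show that turns never interleave.
const turns: string[] = [];

const inbox = new AgentInbox((message, box) => {
  turns.push(`start:${message.text}`);
  if (message.text === "error spike") {
    // A reply arrives while this turn is still running.
    box.deliver({ source: "slack", text: "ack" });
  }
  turns.push(`end:${message.text}`);
});

inbox.deliver({ source: "mcp", text: "error spike" });
inbox.deliver({ source: "linear", text: "status?" });
```

After these two deliveries, `turns` holds a start/end pair for "error spike", then "ack", then "status?", with no pair nested inside another. A production harness would also need the queue to survive restarts, which this sketch does not attempt.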

Why Restate?

In earlier work, the Aient team built similar systems from scratch, using event sourcing, snapshotting, and various workflow orchestrators for resiliency. The infrastructure layer consumed more engineering time than the product on top of it.

That experience shaped what Aient is built to be: not a monitoring tool, but a permanent member of the engineering team. It catches every error, builds context from the full telemetry picture, and contributes fixes directly to the codebase, so the rest of the team can keep shipping, knowing production is always being watched by someone who never sleeps and never misses a signal. It is also what Aient is built to solve for others: the engineering overhead of keeping production running shouldn't fall on the team building the product.

After comparing alternatives, the team chose Restate as the basis of the agent harness because it offers a trustworthy programming model for long-running, agentic workflows, with clear runtime behavior and good operational primitives:

  • Persistence as a property of the runtime. Instead of bolting durability onto application code, Restate makes resumability and low cost-to-operate properties of the runtime itself.
  • A programming model that fits a distributed-systems harness. Intercepting tool calls, dispatching them elsewhere, and making the whole loop resumable mapped naturally onto Restate's primitives, rather than requiring a streaming or DAG model on top.
  • The Restate operator for Kubernetes. Continuous deployment and version tracking across long-running work is genuinely hard. The operator already handles those patterns out of the box.
  • Clear durability/serialization boundaries. With Restate, the points where work is journaled are explicit. In particular, ctx.run makes it clear what crosses the durable boundary, which made the system easier to reason about operationally.
  • A simple type model. In other workflow systems, code can look like it returns normal data while actually returning deferred values or handles. In Restate, ctx.run returns real data and makes the durable boundary explicit, so IDE types match runtime behavior much more closely, creating a nicer developer experience.
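
The idea of an explicit durable boundary that returns real data can be illustrated with a toy journal in TypeScript. This is a sketch under stated assumptions, loosely inspired by the role `ctx.run` plays in the list above: `ToyContext`, `investigate`, and `runWithRetry` are hypothetical names, the journal is an in-memory array rather than durable storage, and the steps are synchronous for brevity. It is not how Restate is implemented.

```typescript
// Toy journal: one JSON-encoded result per completed step.
// Hypothetical names; an in-memory stand-in for a durable log.
type Journal = string[];

class ToyContext {
  private cursor = 0;
  constructor(private readonly journal: Journal) {}

  // Execute `action` the first time; on replay, return the recorded
  // result instead. The return value is plain data, not a handle.
  run<T>(action: () => T): T {
    const recorded = this.journal[this.cursor];
    if (recorded !== undefined) {
      this.cursor++;
      return JSON.parse(recorded) as T;
    }
    const result = action();
    this.journal.push(JSON.stringify(result));
    this.cursor++;
    return result;
  }
}

let sideEffects = 0; // counts how often the first step really executes
let failOnce = true; // simulates a crash between the two steps

function investigate(ctx: ToyContext): string {
  const trace = ctx.run(() => {
    sideEffects++;
    return { errorId: "E-1", frames: 3 };
  });
  if (failOnce) {
    failOnce = false;
    throw new Error("simulated crash");
  }
  return ctx.run(() => `error ${trace.errorId} with ${trace.frames} frames`);
}

function runWithRetry(journal: Journal, maxAttempts = 3): string {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return investigate(new ToyContext(journal));
    } catch (error) {
      lastError = error; // keep the journal and retry from the top
    }
  }
  throw lastError;
}

const journal: Journal = [];
const outcome = runWithRetry(journal);
```

The second attempt replays the first step from the journal, so its side effect runs only once even though the handler body runs twice. Because `run` returns the value itself, the type the IDE shows for `trace` is the type the code actually receives at runtime.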

We're building something that sits in the middle of other people's production systems, 24/7. Reliability isn't a feature we'll add later — it's the whole point. Restate was the obvious choice: the same reasons that make it right for us today are the reasons it'll still be right when we scale.

-- Henrik Feldt, Founder at Aient

The Results

Aient's backend runs entirely on Restate today, with roughly 87 controller loops and services, including the agent harness, the MCP server, and the operational loops that monitor customer environments. Supabase sits alongside as a read model.

  • Durable Execution: All services and agent loops run on Restate, with persisted journals so agents are resumable rather than restarting from scratch on failure.
  • Virtual Objects: Each investigation runs as its own serialized agent with a dedicated inbox. Concurrent signals — a Slack reply, a new error spike, a developer query via MCP — are handled in order, so the agent always has a consistent view of what's happening.
  • Durable remote tool calls: Tool calls are lifted out of the inline Vercel AI exec path and dispatched as explicit Restate calls, so each tool invocation is durable and recoverable on its own.
  • Pub/sub for streaming: Model output streams through Restate's pub/sub in real time — including the agent's reasoning steps. Engineers can follow what Aient is thinking as it works, not just see the result when it's done.
  • Awakeables for human-in-the-loop: Aient meets engineers where they already work, whether in Slack, in Linear, or via MCP inside their coding tools. Every interaction, whether it's a Slack reply, a Linear comment, or a query from an AI assistant, feeds into a single persistent thread per investigation. Because that thread lives in Restate, Aient always has complete context regardless of where the engineer responded from. Agents suspend on an awakeable when waiting for human input and resume the moment a signal arrives. No resources are held open while waiting and no context gets lost.
  • Single-thread context across channels: Because each investigation runs as one persistent thread in Restate, Aient maintains full context whether an engineer responds via Slack, comments in Linear, or queries through MCP in their IDE.
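
The suspend-and-resume pattern behind the human-in-the-loop item can be sketched in a few lines of TypeScript. This is a toy model under stated assumptions, not Restate's awakeable implementation: `ToyAwakeables`, `Review`, and the `awk-` id format are hypothetical, and pending continuations sit in an in-memory `Map`, whereas a durable runtime would persist them so the agent holds no resources while it waits.

```typescript
// Toy model of suspending on an id and resuming on an external signal.
// Hypothetical names; a Map is not durable, unlike a real runtime.
type Resume<T> = (payload: T) => void;

class ToyAwakeables<T> {
  private pending = new Map<string, Resume<T>>();
  private counter = 0;

  // Park a continuation and return an id that can be handed to
  // an external channel such as a chat message or an issue comment.
  create(resume: Resume<T>): string {
    const id = `awk-${this.counter++}`;
    this.pending.set(id, resume);
    return id;
  }

  // Complete the awakeable once; later calls with the same id are ignored.
  resolve(id: string, payload: T): boolean {
    const resume = this.pending.get(id);
    if (resume === undefined) return false;
    this.pending.delete(id);
    resume(payload);
    return true;
  }
}

type Review = { channel: "slack" | "linear" | "mcp"; approved: boolean };

const thread: string[] = []; // the investigation's single thread of events
const awakeables = new ToyAwakeables<Review>();

// The agent proposes a fix, then stops; nothing runs until a human answers.
thread.push("agent: proposed fix, waiting for review");
const reviewId = awakeables.create((review) => {
  thread.push(`human via ${review.channel}: ${review.approved ? "approved" : "rejected"}`);
  thread.push(review.approved ? "agent: opening pull request" : "agent: revising fix");
});

// Later, a signal from any channel resumes the agent by id.
const firstResolve = awakeables.resolve(reviewId, { channel: "linear", approved: true });
const duplicate = awakeables.resolve(reviewId, { channel: "slack", approved: false });
```

The first signal resumes the investigation and its outcome is appended to the same thread, whichever channel it came from. The duplicate signal is rejected because the id has already been completed.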

The result is an agent that behaves less like a tool and more like a senior engineer who's always on call — one that spots the problem, does the diagnostic work, and shows up with a PR. The team still owns every decision. Aient just makes sure there's always something ready to decide on.
