Multi-LLM Agent Runtime
OpenAI · Claude · Gemini · Grok
A lightweight runtime for tool-using LLM agents with model fan-out, retries, and structured-output evals — written after we got tired of heavyweight frameworks making simple problems complicated.
The diagram, walked through in plain language
1. Operators design workflows in n8n
n8n is a drag-and-drop workflow builder. Ops engineers build flows like 'new lead arrives → enrich it → draft a reply → send to approval queue' without writing any Python.
2. n8n calls the runtime over HTTP
When a workflow needs an AI step, it hits a small FastAPI runtime — under 2,000 lines of code total, readable in a sitting.
3. Each 'agent turn' is a pure function
Given the current state and an inbound message, the runtime returns the next state, outbound messages, and any tool calls (e.g. 'look this up', 'send an email'). Tools run via a Redis queue.
4. The right model gets picked for each task
A policy router sends tool-heavy work to Claude, fast/cheap work to Gemini Flash, hard reasoning to OpenAI's o-series. If the chosen model fails or hits its budget, failover takes seconds.
5. All outputs are validated before reaching business logic
Models reply in JSON matching a Pydantic schema. Invalid JSON gets retried up to twice with a repair prompt; if it still fails, the error is logged loudly rather than swallowed.
6. Every step is replayable
Each tool call and model call goes to a Postgres run log. Months later, we can reproduce the exact same agent run for debugging or audit — including the exact prompt and the exact tool payload.
The brief
The team had been on a popular agent framework for six months and had reached that familiar inflection point: the abstractions they were fighting to bend were larger than the problem they were solving. Adding a new model took days, not hours. Observability was a flaming Jupyter notebook.
The ask was for something smaller. Not a framework. A runtime.
The constraints
- Under 2,000 lines of code, end to end. Readable in a sitting.
- Every tool call and every model call is logged, replayable, and evaluable. No exceptions.
- Model fan-out across four providers with per-provider quotas, budgets, and fallback policies.
- Structured outputs validated against Pydantic at the boundary. Malformed output is a logged failure, not an uncaught exception.
- Operator-facing workflows authored in n8n so the ops team can wire things up without shipping Python.
- Statelessness: the runtime holds nothing between turns. All state is in Postgres or Redis.
The shape we built
An agent turn is a pure function: (state, inbound message) → (new state, outbound messages, tool calls). The runtime is a scheduler that executes these turns, dispatches tool calls through a Redis-backed queue, and writes every step to a Postgres run log.
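The turn signature can be sketched like this. It's a toy illustration: `TurnResult`, `ToolCall`, and the echo logic are made up, not the runtime's real API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict

@dataclass(frozen=True)
class TurnResult:
    new_state: dict
    outbound: list[str]
    tool_calls: list[ToolCall]

def run_turn(state: dict, inbound: str) -> TurnResult:
    """Pure: no I/O in here. The scheduler persists new_state and
    pushes tool_calls onto the Redis queue after the turn returns."""
    new_state = {**state, "turns": state.get("turns", 0) + 1}
    # Toy policy: a real turn would call a model to decide on tools.
    calls = [ToolCall("lookup", {"query": inbound})] if "?" in inbound else []
    return TurnResult(new_state, [f"ack: {inbound}"], calls)
```

Because the turn touches nothing outside its arguments, two runs with the same (state, message) produce identical results, which is exactly what makes the run log replayable.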
Tools are declared with Pydantic models on input and output. The runtime generates the JSON-schema for the model provider and validates the return value before it ever reaches business logic. If the model produces invalid output, we retry with a schema-repair prompt exactly twice, then fail loud.
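In Pydantic v2 terms, the validate-then-repair loop might look roughly like this. The schema, the repair prompt wording, and the `call_model` stand-in are all illustrative.

```python
from pydantic import BaseModel, ValidationError

class DraftReply(BaseModel):
    subject: str
    body: str

def parse_with_repair(call_model, prompt: str, schema=DraftReply, max_repairs: int = 2):
    attempt_prompt = prompt
    for _ in range(1 + max_repairs):  # one try plus exactly two repairs
        raw = call_model(attempt_prompt)
        try:
            return schema.model_validate_json(raw)
        except ValidationError as exc:
            # Re-ask with the validation error and the schema attached.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply did not validate:\n{exc}\n"
                f"Respond with JSON matching this schema:\n{schema.model_json_schema()}"
            )
    raise RuntimeError("output still schema-invalid after repairs")  # fail loud
```

The business-logic side only ever sees a validated `DraftReply` instance or a loud exception, never a half-parsed dict.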
Model selection is policy-driven: “tool-calling intensive → Claude”, “fast + cheap → Gemini Flash”, “reasoning-hard → o-series”, with a per-customer override. Failover is seconds, budget-aware, and observable.
n8n sits on top, not underneath. It calls the runtime via a small HTTP surface. Ops engineers wire up workflows — “new lead → enrich → draft reply → approval queue” — without touching the runtime.
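A plausible shape for that HTTP surface, as seen from an n8n HTTP node. Field names and values here are assumptions for illustration, not the actual API.

```text
POST /turn
{
  "run_id": "run-1",
  "state": { "turns": 3 },
  "message": "New lead: jane@example.com"
}

200 OK
{
  "new_state": { "turns": 4 },
  "outbound": ["Draft reply queued for approval"],
  "tool_calls": [{ "tool": "enrich_lead", "args": { "email": "jane@example.com" } }]
}
```

A surface this small is the reason ops engineers can wire workflows without reading the runtime's source.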
What was hard
- Structured-output parity across providers. Each provider has a different dialect of “here's a JSON schema.” Normalizing them behind a single Pydantic-first API took longer than the rest of the runtime.
- Replay. A run's log should be sufficient to reproduce it deterministically. That required pinning temperatures, capturing full tool payloads, and, in one case, freezing a provider's system prompt against their silent update.
- Budget enforcement. Token counting differs per provider and per model family. We settled on budget in dollars, not tokens, and pre-compute a pessimistic cost per turn before submitting it.
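The dollar-denominated gate described above can be sketched as follows. The per-1K-token rates are placeholders, not real provider prices.

```python
# Placeholder $/1K-token rates; real rates differ per provider and change.
PRICE_PER_1K = {"claude": 0.015, "gemini-flash": 0.0003, "o-series": 0.06}

def pessimistic_cost(model: str, prompt_tokens: int, max_output_tokens: int) -> float:
    # Worst case: assume the model spends its entire output budget.
    return (prompt_tokens + max_output_tokens) / 1000 * PRICE_PER_1K[model]

def admit_turn(spent_usd: float, budget_usd: float, model: str,
               prompt_tokens: int, max_output_tokens: int) -> bool:
    # Gate BEFORE submitting, so an overrun is never discovered after the fact.
    return spent_usd + pessimistic_cost(model, prompt_tokens, max_output_tokens) <= budget_usd
```

Pricing in dollars sidesteps the per-provider tokenizer differences entirely: the only thing each provider adapter must report is its rate card.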
What it does today
1.2 million tool calls per day at peak across four providers. P50 agent turn is 1.4 seconds including the tool round-trip. 99.6% of model outputs are schema-valid on first attempt; the remaining 0.4% are corrected by the schema-repair retry. Operator workflows are written by ops engineers, not software engineers. The runtime has shipped three production releases since GA without a single rollback.
What I'd do differently
I'd model tools as versioned contracts from commit zero. Tool signatures evolve; replayability requires you to know which version of a tool was in play at a given time. We added this in month four and backfilled it, which was exactly as much fun as it sounds.
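A sketch of what "versioned contracts from commit zero" could look like: a registry keyed on (name, version), so a replay resolves the exact contract that was live when the run happened. All names here are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ToolContract:
    name: str
    version: int
    handler: Callable[[dict], dict]

_REGISTRY: dict[tuple[str, int], ToolContract] = {}

def register(name: str, version: int):
    """Decorator: pin a handler under an explicit (name, version) key."""
    def deco(fn):
        _REGISTRY[(name, version)] = ToolContract(name, version, fn)
        return fn
    return deco

def resolve(name: str, version: int) -> ToolContract:
    # Replay looks up the version recorded in the run log, never "latest".
    return _REGISTRY[(name, version)]

@register("enrich_lead", 1)
def enrich_lead_v1(args: dict) -> dict:
    return {"email": args["email"].lower()}
```

With the version recorded in each run-log row, evolving a tool means registering v2 alongside v1 rather than rewriting history.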
The stack
- Python 3.12 · FastAPI
- Pydantic (tool / output schemas)
- OpenAI · Anthropic · Gemini · Grok
- n8n (operator-facing workflows)
- Redis Streams (tool queue)
- Postgres (run log + replay)
Have a similar problem?
If this shape of engagement fits what you're working on, I'd be happy to scope it.