Multi-LLM Agent Runtime
OpenAI · Claude · Gemini · Grok
A lightweight runtime for tool-using LLM agents with model fan-out, retries, and structured-output evals — written after we got tired of heavyweight frameworks making simple problems complicated.
The diagram, walked through in plain language
1. Operators design workflows in n8n
n8n is a drag-and-drop workflow builder. Ops engineers build flows like 'new lead arrives → enrich it → draft a reply → send to approval queue' without writing any Python.
2. n8n calls the runtime over HTTP
When a workflow needs an AI step, it hits a small FastAPI runtime — under 2,000 lines of code total, readable in a sitting.
3. Each 'agent turn' is a pure function
Given the current state and an inbound message, the runtime returns the next state, outbound messages, and any tool calls (e.g. 'look this up', 'send an email'). Tools run via a Redis queue.
4. The right model gets picked for each task
A policy router sends tool-heavy work to Claude, fast/cheap work to Gemini Flash, hard reasoning to OpenAI's o-series. If the chosen model fails or hits its budget, failover takes seconds.
5. All outputs are validated before reaching business logic
Models reply in JSON matching a Pydantic schema. Invalid JSON gets retried up to twice with a repair prompt; if it still fails, the error is logged loudly rather than swallowed.
6. Every step is replayable
Each tool call and model call goes to a Postgres run log. Months later, we can reproduce the exact same agent run for debugging or audit — including the exact prompt and the exact tool payload.
The brief
The team had been on a popular agent framework for six months and had reached that familiar inflection point: the abstractions they were fighting to bend were larger than the problem they were solving. Adding a new model took days, not hours. Observability was a flaming Jupyter notebook.
The ask was for something smaller. Not a framework. A runtime.
The constraints
- Under 2,000 lines of code, end to end. Readable in a sitting.
- Every tool call and every model call is logged, replayable, and evaluable. No exceptions.
- Model fan-out across four providers with per-provider quotas, budgets, and fallback policies.
- Structured outputs validated against Pydantic at the boundary. Malformed output is a logged failure, not an uncaught exception.
- Operator-facing workflows authored in n8n so the ops team can wire things up without shipping Python.
- Statelessness: the runtime holds nothing between turns. All state is in Postgres or Redis.
The shape we built
An agent turn is a pure function: (state, inbound message) → (new state, outbound messages, tool calls). The runtime is a scheduler that executes these turns, dispatches tool calls through a Redis-backed queue, and writes every step to a Postgres run log.
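The turn signature can be sketched like this. It's a toy illustration: `TurnResult`, `ToolCall`, and the echo logic are made up, not the runtime's real API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict

@dataclass(frozen=True)
class TurnResult:
    new_state: dict
    outbound: list[str]
    tool_calls: list[ToolCall]

def run_turn(state: dict, inbound: str) -> TurnResult:
    """Pure: no I/O in here. The scheduler persists new_state and
    pushes tool_calls onto the Redis queue after the turn returns."""
    new_state = {**state, "turns": state.get("turns", 0) + 1}
    # Toy policy: a real turn would call a model to decide on tools.
    calls = [ToolCall("lookup", {"query": inbound})] if "?" in inbound else []
    return TurnResult(new_state, [f"ack: {inbound}"], calls)
```

Because the turn touches nothing outside its arguments, two runs with the same (state, message) produce identical results, which is exactly what makes the run log replayable.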
Tools are declared with Pydantic models on input and output. The runtime generates the JSON-schema for the model provider and validates the return value before it ever reaches business logic. If the model produces invalid output, we retry with a schema-repair prompt exactly twice, then fail loud.
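In Pydantic v2 terms, the validate-then-repair loop might look roughly like this. The schema, the repair prompt wording, and the `call_model` stand-in are all illustrative.

```python
from pydantic import BaseModel, ValidationError

class DraftReply(BaseModel):
    subject: str
    body: str

def parse_with_repair(call_model, prompt: str, schema=DraftReply, max_repairs: int = 2):
    attempt_prompt = prompt
    for _ in range(1 + max_repairs):  # one try plus exactly two repairs
        raw = call_model(attempt_prompt)
        try:
            return schema.model_validate_json(raw)
        except ValidationError as exc:
            # Re-ask with the validation error and the schema attached.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply did not validate:\n{exc}\n"
                f"Respond with JSON matching this schema:\n{schema.model_json_schema()}"
            )
    raise RuntimeError("output still schema-invalid after repairs")  # fail loud
```

The business-logic side only ever sees a validated `DraftReply` instance or a loud exception, never a half-parsed dict.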
Model selection is policy-driven: “tool-calling intensive → Claude”, “fast + cheap → Gemini Flash”, “reasoning-hard → o-series”, with a per-customer override. Failover is seconds, budget-aware, and observable.
n8n sits on top, not underneath. It calls the runtime via a small HTTP surface. Ops engineers wire up workflows — “new lead → enrich → draft reply → approval queue” — without touching the runtime.
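A plausible shape for that HTTP surface, as seen from an n8n HTTP node. Field names and values here are assumptions for illustration, not the actual API.

```text
POST /turn
{
  "run_id": "run-1",
  "state": { "turns": 3 },
  "message": "New lead: jane@example.com"
}

200 OK
{
  "new_state": { "turns": 4 },
  "outbound": ["Draft reply queued for approval"],
  "tool_calls": [{ "tool": "enrich_lead", "args": { "email": "jane@example.com" } }]
}
```

A surface this small is the reason ops engineers can wire workflows without reading the runtime's source.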
What was hard
- Structured-output parity across providers. Each provider has a different dialect of “here's a JSON schema.” Normalizing them behind a single Pydantic-first API took longer than the rest of the runtime.
- Replay. A run's log should be sufficient to reproduce it deterministically. That required pinning temperatures, capturing full tool payloads, and, in one case, freezing a provider's system prompt against their silent update.
- Budget enforcement. Token counting differs per provider and per model family. We settled on budget in dollars, not tokens, and pre-compute a pessimistic cost per turn before submitting it.
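The dollar-denominated gate described above can be sketched as follows. The per-1K-token rates are placeholders, not real provider prices.

```python
# Placeholder $/1K-token rates; real rates differ per provider and change.
PRICE_PER_1K = {"claude": 0.015, "gemini-flash": 0.0003, "o-series": 0.06}

def pessimistic_cost(model: str, prompt_tokens: int, max_output_tokens: int) -> float:
    # Worst case: assume the model spends its entire output budget.
    return (prompt_tokens + max_output_tokens) / 1000 * PRICE_PER_1K[model]

def admit_turn(spent_usd: float, budget_usd: float, model: str,
               prompt_tokens: int, max_output_tokens: int) -> bool:
    # Gate BEFORE submitting, so an overrun is never discovered after the fact.
    return spent_usd + pessimistic_cost(model, prompt_tokens, max_output_tokens) <= budget_usd
```

Pricing in dollars sidesteps the per-provider tokenizer differences entirely: the only thing each provider adapter must report is its rate card.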
What it does today
1.2 million tool calls per day at peak across four providers. P50 agent turn is 1.4 seconds including the tool round-trip. 99.6% of model outputs are schema-valid on first attempt; the remaining 0.4% are corrected by the schema-repair retry. Operator workflows are written by ops engineers, not software engineers. The runtime has shipped three production releases since GA without a single rollback.
What I'd do differently
I'd model tools as versioned contracts from commit zero. Tool signatures evolve; replayability requires you to know which version of a tool was in play at a given time. We added this in month four and backfilled it, which was exactly as much fun as it sounds.
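A sketch of what "versioned contracts from commit zero" could look like: a registry keyed on (name, version), so a replay resolves the exact contract that was live when the run happened. All names here are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ToolContract:
    name: str
    version: int
    handler: Callable[[dict], dict]

_REGISTRY: dict[tuple[str, int], ToolContract] = {}

def register(name: str, version: int):
    """Decorator: pin a handler under an explicit (name, version) key."""
    def deco(fn):
        _REGISTRY[(name, version)] = ToolContract(name, version, fn)
        return fn
    return deco

def resolve(name: str, version: int) -> ToolContract:
    # Replay looks up the version recorded in the run log, never "latest".
    return _REGISTRY[(name, version)]

@register("enrich_lead", 1)
def enrich_lead_v1(args: dict) -> dict:
    return {"email": args["email"].lower()}
```

With the version recorded in each run-log row, evolving a tool means registering v2 alongside v1 rather than rewriting history.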
The stack
- Python 3.12 · FastAPI
- Pydantic (tool / output schemas)
- OpenAI · Anthropic · Gemini · Grok
- n8n (operator-facing workflows)
- Redis Streams (tool queue)
- Postgres (run log + replay)
Have a similar problem?
If this shape of engagement fits what you're working on, I'd be happy to scope it.