AI Content Platform
10K daily users · 12 models · 35% lower cost
A multi-tenant AI content platform with model fan-out, evals, and a custom orchestration runtime that replaced a tangle of LangChain chains with something we could actually reason about in production.
The diagram, walked through in plain language
1. A user asks for content
They sign in to the SaaS through a Next.js website and ask for, say, an SEO article or a product description.
2. The orchestration runtime plans the work
A small custom engine (~1,200 lines of TypeScript) breaks the request into a few typed steps — research, draft, polish — each with its own budget and fallback rules.
3. The right model is picked for each step
Cheap-and-fast tasks go to a local Llama or Gemini Flash; harder tasks go to GPT or Claude. Twelve models across four providers are on standby, and the runtime picks per step based on cost and quality.
4. Cache first, generate only if needed
Identical or near-identical requests are served from a cache, which is why per-document cost has fallen 35% since launch despite traffic tripling.
5. Every result is checked
A structured-output check makes sure the model returned the JSON shape we asked for. A nightly evaluation harness runs sample traffic against a known-good baseline and fails the deploy if quality slips.
6. One audit log feeds everything
The same record of every model call powers Stripe billing (charged per generated document), the eval harness, and customer-support replays. One source of truth, used three ways.
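The per-step routing in step 3 can be sketched as a small table plus a picker. Everything here is illustrative — the tier names, model identifiers, and per-token costs are assumptions, not the platform's actual configuration:

```typescript
// Illustrative routing table: each step declares the tier it needs, and the
// runtime picks the cheapest currently-healthy model in that tier.
type Tier = "cheap" | "standard" | "frontier";

interface ModelChoice {
  provider: string;
  model: string;
  costPer1kTokens: number; // USD, blended input+output (made-up numbers)
}

const ROUTES: Record<Tier, ModelChoice[]> = {
  cheap: [
    { provider: "local", model: "llama-3-8b", costPer1kTokens: 0.0002 },
    { provider: "google", model: "gemini-flash", costPer1kTokens: 0.0004 },
  ],
  standard: [
    { provider: "openai", model: "gpt-4o-mini", costPer1kTokens: 0.001 },
  ],
  frontier: [
    { provider: "anthropic", model: "claude-sonnet", costPer1kTokens: 0.009 },
    { provider: "openai", model: "gpt-4o", costPer1kTokens: 0.01 },
  ],
};

// Pick the cheapest model in a tier whose provider is currently healthy.
function pickModel(tier: Tier, healthy: Set<string>): ModelChoice | undefined {
  return ROUTES[tier]
    .filter((m) => healthy.has(m.provider))
    .sort((a, b) => a.costPer1kTokens - b.costPer1kTokens)[0];
}
```

A health-check loop maintains the `healthy` set, which is also what makes failover a routing decision rather than a retry storm.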
The brief
A content SaaS had scaled their MVP from 200 users to 4,000 on a LangChain-JS prototype. It had carried them further than it should have and was now the source of nearly every production incident: surprise provider outages, opaque failure modes, unbounded costs, and a debug story that consisted of printing messages and crossing fingers.
They wanted to keep shipping features. We needed to replace the guts without anyone noticing.
The constraints
- Zero user-visible downtime during the swap. Feature velocity could not drop for more than a sprint.
- Cost per generated document had to come down — the growth curve was going to take them through $250K/month in OpenAI fees unless something changed.
- Every model call had to be replayable, evaluable, and auditable. “It worked in dev” was not acceptable.
- Tenant isolation at the data layer, not the application layer — enterprise customers were asking.
- Provider failover in seconds, not retries-and-prayers.
The shape we built
At the center: a small orchestration runtime, maybe 1,200 lines of TypeScript, that models a generation as a directed graph of typed steps. Each step is a pure function of its inputs plus a retry policy, a budget, and a fallback provider. Nothing clever. Everything inspectable.
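A minimal sketch of that step abstraction — field and function names here are assumptions, not the runtime's actual API:

```typescript
// Each step: a pure function of its inputs, plus retry/budget/fallback policy.
interface StepSpec<I, O> {
  name: string;
  run: (input: I) => Promise<O>;
  maxRetries: number;
  budgetUsd: number;          // hard spend ceiling for this step
  fallback?: StepSpec<I, O>;  // tried only after the primary exhausts retries
}

async function executeStep<I, O>(step: StepSpec<I, O>, input: I): Promise<O> {
  for (let attempt = 0; attempt <= step.maxRetries; attempt++) {
    try {
      return await step.run(input);
    } catch {
      // swallow and retry; real code would log the attempt to the audit trail
    }
  }
  if (step.fallback) return executeStep(step.fallback, input);
  throw new Error(`step "${step.name}" failed after ${step.maxRetries + 1} attempts`);
}
```

The generation graph is then just these specs wired output-to-input; because each step is pure, replaying a production request is a matter of re-running the graph with the logged inputs.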
Around it: a provider adapter layer (OpenAI, Anthropic, Gemini, a local Llama for the cheap-and-fast tier), a cache keyed on normalized prompts, and a structured eval harness that runs a sample of traffic against a golden set on every deploy. Regressions fail CI.
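The deploy gate can be reduced to a toy version like the one below. The scoring function (token overlap) is a stand-in for whatever metrics the real harness uses, and all names are hypothetical:

```typescript
// Score sampled outputs against a golden set; fail the deploy below threshold.
interface GoldenCase {
  prompt: string;
  expected: string;
}

// Crude similarity: fraction of expected tokens present in the output.
function tokenOverlap(expected: string, actual: string): number {
  const want = new Set(expected.toLowerCase().split(/\s+/));
  const got = new Set(actual.toLowerCase().split(/\s+/));
  if (want.size === 0) return 0;
  let hit = 0;
  for (const t of want) if (got.has(t)) hit++;
  return hit / want.size;
}

function evalGate(
  cases: GoldenCase[],
  generate: (prompt: string) => string,
  threshold = 0.8,
): { passed: boolean; meanScore: number } {
  const scores = cases.map((c) => tokenOverlap(c.expected, generate(c.prompt)));
  const meanScore = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return { passed: meanScore >= threshold, meanScore };
}
```

Wiring `evalGate` into CI is what turns "quality slipped" from a support ticket into a failed build.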
Tenant isolation lives in Postgres row-level security, not application code. Metered billing reads directly from the same audit log that feeds the eval harness — one source of truth, used three ways.
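The row-level-security pattern looks roughly like this. Table, policy, and setting names are assumptions for illustration; the helper just shows the transaction shape, not a real driver integration:

```typescript
// One-time DDL (run as a migration), shown here as a comment:
//   ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY tenant_isolation ON documents
//     USING (tenant_id = current_setting('app.tenant_id')::uuid);

// Per-request: pin the tenant inside a transaction so every query in it is
// filtered by the policy — no WHERE clauses in application code.
function tenantScopedStatements(tenantId: string, query: string): string[] {
  if (!/^[0-9a-f-]{36}$/i.test(tenantId)) {
    throw new Error("tenantId must be a UUID");
  }
  return [
    "BEGIN",
    // set_config(..., true) scopes the setting to this transaction only
    `SELECT set_config('app.tenant_id', '${tenantId}', true)`,
    query,
    "COMMIT",
  ];
}
```

The payoff: a forgotten tenant filter in application code becomes an empty result set instead of a cross-tenant leak.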
What was hard
- Streaming fallback. When the primary provider dies mid-stream, failing over without the user seeing a broken paragraph required buffering the last 200 tokens and surgically splicing.
- Cost modeling. OpenAI's token accounting does not always match your tiktoken estimate. Billing on our measurements rather than theirs took a month of reconciliation.
- The “obviously cached, actually not” bug. Two prompts that looked identical to a human hashed differently because of invisible unicode. Normalization is load-bearing.
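The load-bearing normalization from that last bug can be sketched like this — the real rules are broader, but the idea is: strip invisible code points, canonicalize, collapse whitespace, then hash:

```typescript
import { createHash } from "node:crypto";

// Normalize a prompt so visually-identical inputs produce identical cache keys.
function normalizePrompt(prompt: string): string {
  return prompt
    .normalize("NFC")                            // canonical composition
    .replace(/[\u200B-\u200D\uFEFF\u00AD]/g, "") // zero-width chars, BOM, soft hyphen
    .replace(/\s+/g, " ")                        // collapse runs of whitespace
    .trim()
    .toLowerCase();
}

function cacheKey(prompt: string): string {
  return createHash("sha256").update(normalizePrompt(prompt)).digest("hex");
}
```

Two prompts that differ only by an invisible zero-width space now hash to the same key instead of silently missing the cache.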
What it does today
10K+ daily actives, twelve production models across four providers, 35% lower inference cost per document than launch month despite the user base tripling. P99 streaming latency is 2.1s end-to-end. The eval harness catches a real regression roughly once a month before it ships. Exactly one provider has had a multi-hour outage; users saw nothing.
What I'd do differently
I'd lean harder on structured outputs from day one. We bolted JSON-schema validation on in month four and it immediately deleted a class of downstream parsing bugs we had been chasing for weeks. I'd also build the eval harness before the orchestration runtime, not alongside it — evals change how you design the pipeline, not just how you test it.
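A minimal version of that structured-output gate: parse the model's reply and verify the shape before anything downstream touches it. A real deployment would use a JSON Schema validator; this hand-rolled check, with a hypothetical `ArticleDraft` shape, just illustrates the class of bug it removes:

```typescript
interface ArticleDraft {
  title: string;
  sections: string[];
}

// Reject anything that isn't valid JSON in exactly the requested shape.
function parseArticleDraft(raw: string): ArticleDraft {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new Error("model did not return valid JSON");
  }
  const d = data as Record<string, unknown>;
  const title = d?.title;
  const sections = d?.sections;
  if (typeof title !== "string") throw new Error("missing string field: title");
  if (!Array.isArray(sections) || !sections.every((s) => typeof s === "string")) {
    throw new Error("missing string[] field: sections");
  }
  return { title, sections };
}
```

Everything past this function can assume well-typed data, which is exactly the class of downstream parsing bug it deletes.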
- Next.js 14 · TypeScript
- OpenAI · Anthropic · Gemini · local Llama
- Postgres + pgvector
- Redis (cache + rate limits)
- Stripe (metered billing)
- Custom orchestration runtime (TS)
- Vercel Edge + fly.io GPU pool
Have a similar problem?
If this shape of engagement fits what you're working on, I'd be happy to scope it.