Case study · AI / LLM · 2024

AI Content Platform
10K daily users · 12 models · 35% lower cost

A multi-tenant AI content platform with model fan-out, evals, and a custom orchestration runtime that replaced a tangle of LangChain chains with something we could actually reason about in production.

10K+ · Daily active users (month 9 peak)
12 · Models orchestrated (4 providers)
-35% · Inference cost (vs. launch month)
2.1s · P99 latency (end-to-end streaming)
AI content platform architecture — multi-tenant API, orchestration runtime as a typed DAG, provider fan-out across four LLM providers, eval harness, cache, and metered billing.
Orchestration runtime at the center · tenants, providers, evals around it
How it works · step by step

The diagram, walked through in plain language

  1. A user asks for content

     They sign in to the SaaS through a Next.js website and ask for, say, an SEO article or a product description.

  2. The orchestration runtime plans the work

     A small custom engine (~1,200 lines of TypeScript) breaks the request into a few typed steps — research, draft, polish — each with its own budget and fallback rules.

  3. The right model is picked for each step

     Cheap-and-fast tasks go to a local Llama or Gemini Flash; harder tasks go to GPT or Claude. Twelve models across four providers are on standby, and the runtime picks per step based on cost and quality.

  4. Cache first, generate only if needed

     Identical or near-identical requests are served from a cache, which is why per-document cost has fallen 35% since launch despite traffic tripling.

  5. Every result is checked

     A structured-output check makes sure the model returned the JSON shape we asked for. A nightly evaluation harness runs sample traffic against a known-good baseline and fails the deploy if quality slips.

  6. One audit log feeds everything

     The same record of every model call powers Stripe billing (charged per generated document), the eval harness, and customer-support replays. One source of truth, used three ways.
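Step 3 above — per-step model routing by cost and quality tier — can be sketched roughly like this. The model names, prices, and the `pickModel` helper are illustrative assumptions, not the platform's actual registry:

```typescript
// Sketch of per-step model routing: each tier lists candidate models in
// fallback order; the runtime picks the cheapest healthy one per step.
// Providers, model names, and prices here are hypothetical examples.

type Tier = "cheap" | "balanced" | "premium";

interface ModelChoice {
  provider: string;
  model: string;
  costPer1kTokens: number; // USD, illustrative numbers only
}

const registry: Record<Tier, ModelChoice[]> = {
  cheap: [
    { provider: "local", model: "llama-3-8b", costPer1kTokens: 0.0001 },
    { provider: "google", model: "gemini-flash", costPer1kTokens: 0.0003 },
  ],
  balanced: [
    { provider: "openai", model: "gpt-4o-mini", costPer1kTokens: 0.0006 },
  ],
  premium: [
    { provider: "anthropic", model: "claude-sonnet", costPer1kTokens: 0.003 },
    { provider: "openai", model: "gpt-4o", costPer1kTokens: 0.005 },
  ],
};

// Pick the cheapest model in the requested tier; a caller would walk down
// the sorted list if the first provider is unavailable.
function pickModel(tier: Tier): ModelChoice {
  return [...registry[tier]].sort(
    (a, b) => a.costPer1kTokens - b.costPer1kTokens
  )[0];
}
```

The point is that routing is plain data plus one pure function — easy to inspect, easy to unit-test, and cheap to change when a provider reprices.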

The brief

A content SaaS had scaled their MVP from 200 users to 4,000 on a LangChain-JS prototype. It had carried them further than it should have and was now the source of nearly every production incident: surprise provider outages, opaque failure modes, unbounded costs, and a debug story that consisted of printing messages and crossing fingers.

They wanted to keep shipping features. We needed to replace the guts without anyone noticing.

The constraints

  • Zero user-visible downtime during the swap. Feature velocity could not drop for more than a sprint.
  • Cost per generated document had to come down — the growth curve was going to take them through $250K/month in OpenAI fees unless something changed.
  • Every model call had to be replayable, evaluable, and auditable. “It worked in dev” was not acceptable.
  • Tenant isolation at the data layer, not the application layer — enterprise customers were asking.
  • Provider failover in seconds, not retries-and-prayers.

The shape we built

At the center: a small orchestration runtime, maybe 1,200 lines of TypeScript, that models a generation as a directed graph of typed steps. Each step is a pure function of its inputs plus a retry policy, a budget, and a fallback provider. Nothing clever. Everything inspectable.
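A minimal sketch of what "a pure function of its inputs plus a retry policy, a budget, and a fallback provider" can look like as types. The names (`Step`, `RetryPolicy`, `runStep`) are illustrative, not the runtime's real API:

```typescript
// Illustrative typed-step shape for a DAG-style orchestration runtime.
// A real runtime would also enforce the budget and route to the fallback
// provider; this sketch shows the retry loop only.

interface RetryPolicy {
  maxAttempts: number;
  backoffMs: number;
}

interface Step<In, Out> {
  name: string;
  budgetUsd: number;                // hard spend cap for this step
  retry: RetryPolicy;
  fallbackProvider?: string;        // tried once the primary is exhausted
  run: (input: In) => Promise<Out>; // pure function of its inputs
}

async function runStep<In, Out>(step: Step<In, Out>, input: In): Promise<Out> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= step.retry.maxAttempts; attempt++) {
    try {
      return await step.run(input);
    } catch (err) {
      lastError = err; // linear backoff between attempts
      await new Promise((r) => setTimeout(r, step.retry.backoffMs * attempt));
    }
  }
  throw lastError;
}
```

Because each step is just data plus a function, the whole graph can be serialized, replayed, and diffed — which is what makes the audit-log and replay story workable.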

Around it: a provider adapter layer (OpenAI, Anthropic, Gemini, a local Llama for the cheap-and-fast tier), a cache keyed on normalized prompts, and a structured eval harness that runs a sample of traffic against a golden set on every deploy. Regressions fail CI.
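The "regressions fail CI" gate reduces to a small comparison between the current eval scores and the stored baseline. Metric names and the tolerance value below are assumptions for illustration:

```typescript
// Sketch of the CI regression gate: compare eval scores on the golden set
// against the stored baseline and report any metric that slipped past a
// tolerance. The deploy fails if this list is non-empty.

interface EvalScores {
  [metric: string]: number; // 0..1, higher is better
}

function findRegressions(
  baseline: EvalScores,
  current: EvalScores,
  tolerance = 0.02
): string[] {
  return Object.keys(baseline).filter(
    (metric) => (current[metric] ?? 0) < baseline[metric] - tolerance
  );
}
```

A missing metric in the current run counts as a regression (scored 0), which catches the failure mode where an eval silently stops running.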

Tenant isolation lives in Postgres row-level security, not application code. Metered billing reads directly from the same audit log that feeds the eval harness — one source of truth, used three ways.

What was hard

  • Streaming fallback. When the primary provider dies mid-stream, failing over without the user seeing a broken paragraph required buffering the last 200 tokens and surgically splicing.
  • Cost modeling. OpenAI's token accounting does not always match your tiktoken estimate. Billing on our measurements rather than theirs took a month of reconciliation.
  • The “obviously cached, actually not” bug. Two prompts that looked identical to a human hashed differently because of invisible unicode. Normalization is load-bearing.
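The invisible-unicode cache bug above is worth a sketch. A minimal load-bearing normalizer: NFKC-fold, strip zero-width characters, collapse whitespace, then hash. The production rules are richer; this shows the idea only:

```typescript
import { createHash } from "node:crypto";

// Cache keys must be derived from a normalized prompt, or two prompts that
// look identical to a human (non-breaking spaces, zero-width joiners) will
// hash differently and silently miss the cache.
function cacheKey(prompt: string): string {
  const normalized = prompt
    .normalize("NFKC")                     // fold compatibility characters
    .replace(/[\u200B-\u200D\uFEFF]/g, "") // drop zero-width characters
    .replace(/\s+/g, " ")                  // collapse all whitespace runs
    .trim()
    .toLowerCase();
  return createHash("sha256").update(normalized).digest("hex");
}
```

In JavaScript, `\s` already matches the non-breaking space, so both the NFKC fold and the whitespace collapse defend against that particular class of invisible difference.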

What it does today

10K+ daily actives, twelve production models across four providers, 35% lower inference cost per document than launch month despite the user base tripling. P99 streaming latency is 2.1s end-to-end. The eval harness catches a real regression roughly once a month before it ships. Exactly one provider has had a multi-hour outage; users saw nothing.

What I'd do differently

I'd lean harder on structured outputs from day one. We bolted JSON-schema validation on in month four and it immediately deleted a class of downstream parsing bugs we had been chasing for weeks. I'd also build the eval harness before the orchestration runtime, not alongside it — evals change how you design the pipeline, not just how you test it.
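The structured-output check mentioned above can be sketched without a schema library: parse the model's reply as JSON and verify the expected shape before anything downstream touches it. The `DraftResult` field names are hypothetical:

```typescript
// Sketch of a structured-output gate: reject the model's reply unless it
// is valid JSON matching the expected shape. Field names are illustrative.

interface DraftResult {
  title: string;
  sections: string[];
}

function parseDraft(raw: string): DraftResult {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    throw new Error("model did not return valid JSON");
  }
  const v = value as { title?: unknown; sections?: unknown };
  const ok =
    typeof v.title === "string" &&
    Array.isArray(v.sections) &&
    v.sections.every((s) => typeof s === "string");
  if (!ok) throw new Error("model JSON did not match the expected shape");
  return value as DraftResult;
}
```

Failing loudly at this boundary is what "deleted a class of downstream parsing bugs" means in practice: malformed output becomes one retryable error instead of many scattered ones.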

Stack
  • Next.js 14 · TypeScript
  • OpenAI · Anthropic · Gemini · local Llama
  • Postgres + pgvector
  • Redis (cache + rate limits)
  • Stripe (metered billing)
  • Custom orchestration runtime (TS)
  • Vercel Edge + fly.io GPU pool

More work

Continue the tour

Algo Trading · 2025

Order Router & Execution Engine

$80M routed · 38ms p99 · zero downtime

A trading desk's chart fires a buy or sell signal; this system safely turns each signal into a real order at the right brokerage in milliseconds — while quietly making sure they never trade more than they meant to or place an order they can't afford.

Read case study
Fintech · 2024

Fintech Reporting Dashboard

200M rows · 60% faster · sub-second queries

A financial dashboard that used to take seven seconds to show 'this month's profit and loss' now takes half a second — because we moved the heavy reports off the live database without changing a single number the customer's accountant sees.

Read case study
SaaS · 2024

JobbyAI

resume scoring · job match · interview prep

A free web app that helps job seekers in three ways: it scores their resume, ranks how well they match a job posting, and prepares them for the interview — all using a single AI model behind the scenes, with no signup required to try it.

Read case study
Algo Trading · 2023

Quant Backtest Harness

50K parameter combos · 3 engines · one CLI

A single command-line tool that lets a quant team test trading strategies on three different simulation engines without rewriting any strategy code — and then compares the results in one shared format, so 'which strategy is actually better' becomes a question with a real answer.

Read case study
Fintech · 2023

Accounting API Sync

4 providers · one trait · zero drift

A behind-the-scenes service that keeps an accounting SaaS in sync with QuickBooks, Xero, Wave, and AccountEdge — when a customer edits an invoice on one side, the change shows up on the other within 30 seconds, without ever silently overwriting work.

Read case study
AI / LLM · 2025

Multi-LLM Agent Runtime

OpenAI · Claude · Gemini · Grok

A small, stateless service that lets non-engineers wire up AI 'agents' (which can call tools, look things up, and reply) — running across four AI providers so a single outage never takes a customer offline, and replay-able to the byte for debugging.

Read case study
Algo Trading · 2024

TradingView ↔ Plaid Bridge

webhook in · broker-native out · 4 signal types

A bridge that takes 'buy' or 'sell' alerts from TradingView charts, checks the user actually has the cash via their bank link (Plaid), then sends the order to their brokerage — all in under a fifth of a second, so the price they wanted is still the price they get.

Read case study
DevTools · 2023

Figma + Chrome Plugin Suite

design · engineering · less friction

Three small browser plugins that quietly fix the slow, fiddly hand-off between designers (working in Figma) and engineers (writing code) — saving each engineer about four hours a week of busywork that nobody was tracking, but everyone resented.

Read case study

Have a similar problem?

If this shape of engagement fits what you're working on, I'd be happy to scope it.