AI Content Platform
10K daily users · 12 models · 35% lower cost
A multi-tenant AI content platform with model fan-out, evals, and a custom orchestration runtime that replaced a tangle of LangChain chains with something we could actually reason about in production.
The diagram, walked through in plain language
1. A user asks for content
They sign in to the SaaS through a Next.js website and ask for, say, an SEO article or a product description.
2. The orchestration runtime plans the work
A small custom engine (~1,200 lines of TypeScript) breaks the request into a few typed steps — research, draft, polish — each with its own budget and fallback rules.
3. The right model is picked for each step
Cheap-and-fast tasks go to a local Llama or Gemini Flash; harder tasks go to GPT or Claude. Twelve models across four providers are on standby, and the runtime picks per step based on cost and quality.
4. Cache first, generate only if needed
Identical or near-identical requests are served from a cache, which is why per-document cost has fallen 35% since launch despite traffic tripling.
5. Every result is checked
A structured-output check makes sure the model returned the JSON shape we asked for. A nightly evaluation harness runs sample traffic against a known-good baseline and fails the deploy if quality slips.
6. One audit log feeds everything
The same record of every model call powers Stripe billing (charged per generated document), the eval harness, and customer-support replays. One source of truth, used three ways.
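The per-step routing in step 3 can be sketched as a small table plus a picker. Everything here is illustrative — the tier names, model identifiers, and per-token costs are assumptions, not the platform's actual configuration:

```typescript
// Illustrative routing table: each step declares the tier it needs, and the
// runtime picks the cheapest currently-healthy model in that tier.
type Tier = "cheap" | "standard" | "frontier";

interface ModelChoice {
  provider: string;
  model: string;
  costPer1kTokens: number; // USD, blended input+output (made-up numbers)
}

const ROUTES: Record<Tier, ModelChoice[]> = {
  cheap: [
    { provider: "local", model: "llama-3-8b", costPer1kTokens: 0.0002 },
    { provider: "google", model: "gemini-flash", costPer1kTokens: 0.0004 },
  ],
  standard: [
    { provider: "openai", model: "gpt-4o-mini", costPer1kTokens: 0.001 },
  ],
  frontier: [
    { provider: "anthropic", model: "claude-sonnet", costPer1kTokens: 0.009 },
    { provider: "openai", model: "gpt-4o", costPer1kTokens: 0.01 },
  ],
};

// Pick the cheapest model in a tier whose provider is currently healthy.
function pickModel(tier: Tier, healthy: Set<string>): ModelChoice | undefined {
  return ROUTES[tier]
    .filter((m) => healthy.has(m.provider))
    .sort((a, b) => a.costPer1kTokens - b.costPer1kTokens)[0];
}
```

A health-check loop maintains the `healthy` set, which is also what makes failover a routing decision rather than a retry storm.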
The brief
A content SaaS had scaled their MVP from 200 users to 4,000 on a LangChain-JS prototype. It had carried them further than it should have and was now the source of nearly every production incident: surprise provider outages, opaque failure modes, unbounded costs, and a debug story that consisted of printing messages and crossing fingers.
They wanted to keep shipping features. We needed to replace the guts without anyone noticing.
The constraints
- Zero user-visible downtime during the swap. Feature velocity could not drop for more than a sprint.
- Cost per generated document had to come down — the growth curve was going to take them through $250K/month in OpenAI fees unless something changed.
- Every model call had to be replayable, evaluable, and auditable. “It worked in dev” was not acceptable.
- Tenant isolation at the data layer, not the application layer — enterprise customers were asking.
- Provider failover in seconds, not retries-and-prayers.
The shape we built
At the center: a small orchestration runtime, maybe 1,200 lines of TypeScript, that models a generation as a directed graph of typed steps. Each step is a pure function of its inputs plus a retry policy, a budget, and a fallback provider. Nothing clever. Everything inspectable.
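A minimal sketch of that step abstraction — field and function names here are assumptions, not the runtime's actual API:

```typescript
// Each step: a pure function of its inputs, plus retry/budget/fallback policy.
interface StepSpec<I, O> {
  name: string;
  run: (input: I) => Promise<O>;
  maxRetries: number;
  budgetUsd: number;          // hard spend ceiling for this step
  fallback?: StepSpec<I, O>;  // tried only after the primary exhausts retries
}

async function executeStep<I, O>(step: StepSpec<I, O>, input: I): Promise<O> {
  for (let attempt = 0; attempt <= step.maxRetries; attempt++) {
    try {
      return await step.run(input);
    } catch {
      // swallow and retry; real code would log the attempt to the audit trail
    }
  }
  if (step.fallback) return executeStep(step.fallback, input);
  throw new Error(`step "${step.name}" failed after ${step.maxRetries + 1} attempts`);
}
```

The generation graph is then just these specs wired output-to-input; because each step is pure, replaying a production request is a matter of re-running the graph with the logged inputs.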
Around it: a provider adapter layer (OpenAI, Anthropic, Gemini, a local Llama for the cheap-and-fast tier), a cache keyed on normalized prompts, and a structured eval harness that runs a sample of traffic against a golden set on every deploy. Regressions fail CI.
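The deploy gate can be reduced to a toy version like the one below. The scoring function (token overlap) is a stand-in for whatever metrics the real harness uses, and all names are hypothetical:

```typescript
// Score sampled outputs against a golden set; fail the deploy below threshold.
interface GoldenCase {
  prompt: string;
  expected: string;
}

// Crude similarity: fraction of expected tokens present in the output.
function tokenOverlap(expected: string, actual: string): number {
  const want = new Set(expected.toLowerCase().split(/\s+/));
  const got = new Set(actual.toLowerCase().split(/\s+/));
  if (want.size === 0) return 0;
  let hit = 0;
  for (const t of want) if (got.has(t)) hit++;
  return hit / want.size;
}

function evalGate(
  cases: GoldenCase[],
  generate: (prompt: string) => string,
  threshold = 0.8,
): { passed: boolean; meanScore: number } {
  const scores = cases.map((c) => tokenOverlap(c.expected, generate(c.prompt)));
  const meanScore = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return { passed: meanScore >= threshold, meanScore };
}
```

Wiring `evalGate` into CI is what turns "quality slipped" from a support ticket into a failed build.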
Tenant isolation lives in Postgres row-level security, not application code. Metered billing reads directly from the same audit log that feeds the eval harness — one source of truth, used three ways.
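The row-level-security pattern looks roughly like this. Table, policy, and setting names are assumptions for illustration; the helper just shows the transaction shape, not a real driver integration:

```typescript
// One-time DDL (run as a migration), shown here as a comment:
//   ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY tenant_isolation ON documents
//     USING (tenant_id = current_setting('app.tenant_id')::uuid);

// Per-request: pin the tenant inside a transaction so every query in it is
// filtered by the policy — no WHERE clauses in application code.
function tenantScopedStatements(tenantId: string, query: string): string[] {
  if (!/^[0-9a-f-]{36}$/i.test(tenantId)) {
    throw new Error("tenantId must be a UUID");
  }
  return [
    "BEGIN",
    // set_config(..., true) scopes the setting to this transaction only
    `SELECT set_config('app.tenant_id', '${tenantId}', true)`,
    query,
    "COMMIT",
  ];
}
```

The payoff: a forgotten tenant filter in application code becomes an empty result set instead of a cross-tenant leak.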
What was hard
- Streaming fallback. When the primary provider dies mid-stream, failing over without the user seeing a broken paragraph required buffering the last 200 tokens and surgically splicing.
- Cost modeling. OpenAI's token accounting does not always match your tiktoken estimate. Billing on our measurements rather than theirs took a month of reconciliation.
- The “obviously cached, actually not” bug. Two prompts that looked identical to a human hashed differently because of invisible unicode. Normalization is load-bearing.
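The load-bearing normalization from that last bug can be sketched like this — the real rules are broader, but the idea is: strip invisible code points, canonicalize, collapse whitespace, then hash:

```typescript
import { createHash } from "node:crypto";

// Normalize a prompt so visually-identical inputs produce identical cache keys.
function normalizePrompt(prompt: string): string {
  return prompt
    .normalize("NFC")                            // canonical composition
    .replace(/[\u200B-\u200D\uFEFF\u00AD]/g, "") // zero-width chars, BOM, soft hyphen
    .replace(/\s+/g, " ")                        // collapse runs of whitespace
    .trim()
    .toLowerCase();
}

function cacheKey(prompt: string): string {
  return createHash("sha256").update(normalizePrompt(prompt)).digest("hex");
}
```

Two prompts that differ only by an invisible zero-width space now hash to the same key instead of silently missing the cache.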
What it does today
10K+ daily actives, twelve production models across four providers, 35% lower inference cost per document than launch month despite the user base tripling. P99 streaming latency is 2.1s end-to-end. The eval harness catches a real regression roughly once a month before it ships. Exactly one provider has had a multi-hour outage; users saw nothing.
What I'd do differently
I'd lean harder on structured outputs from day one. We bolted JSON-schema validation on in month four and it immediately deleted a class of downstream parsing bugs we had been chasing for weeks. I'd also build the eval harness before the orchestration runtime, not alongside it — evals change how you design the pipeline, not just how you test it.
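A minimal version of that structured-output gate: parse the model's reply and verify the shape before anything downstream touches it. A real deployment would use a JSON Schema validator; this hand-rolled check, with a hypothetical `ArticleDraft` shape, just illustrates the class of bug it removes:

```typescript
interface ArticleDraft {
  title: string;
  sections: string[];
}

// Reject anything that isn't valid JSON in exactly the requested shape.
function parseArticleDraft(raw: string): ArticleDraft {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new Error("model did not return valid JSON");
  }
  const d = data as Record<string, unknown>;
  const title = d?.title;
  const sections = d?.sections;
  if (typeof title !== "string") throw new Error("missing string field: title");
  if (!Array.isArray(sections) || !sections.every((s) => typeof s === "string")) {
    throw new Error("missing string[] field: sections");
  }
  return { title, sections };
}
```

Everything past this function can assume well-typed data, which is exactly the class of downstream parsing bug it deletes.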
- Next.js 14 · TypeScript
- OpenAI · Anthropic · Gemini · local Llama
- Postgres + pgvector
- Redis (cache + rate limits)
- Stripe (metered billing)
- Custom orchestration runtime (TS)
- Vercel Edge + fly.io GPU pool
Have a similar problem?
If this shape of engagement fits what you're working on, I'd be happy to scope it.