Order Router & Execution Engine
$80M routed · 38ms p99 · zero downtime
A Rust + FastAPI order routing service for a quant trading desk, shipped from a blank repository to live order flow in under twelve weeks — and quietly responsible for $80M of live trading volume in Q1–Q3 2025.
The diagram, walked through in plain language
1. A trading signal arrives
When the trader's chart hits a buy or sell condition, TradingView fires a small webhook (a message over the internet) at our system.
2. The front door checks who's knocking
A FastAPI service confirms the message is genuinely from the trader (signed with a secret key) and isn't a duplicate of one we just handled.
3. A risk check, in milliseconds
The signal goes through a Rust 'risk gate' that asks: is the trader within their daily loss limit? Their position size limit? If anything looks off, the order is refused before it ever reaches a broker.
4. The router picks the cheapest broker
Approved orders go to the order router, which compares fees, spread, and recent fill quality across Alpaca, Interactive Brokers, Binance, and MetaTrader 5 — then picks the best one for this specific trade.
5. Every step is recorded immutably
Position changes, fills, and any later corrections from the broker are written to an audit log nobody can edit. The dashboard reads the same log, so what the trader sees on screen always matches what really happened.
6. If anything fails mid-flight, replay catches it
Orders are queued in Redis before the broker call. If the network drops or the broker times out, the queue replays the order rather than losing it.
The brief
The client was a quant trading desk whose strategy code worked beautifully in backtests and just-okay in paper trading. In production, slippage was eating 30% of expected edge, fills were arriving out of order, and a single bad webhook had once flipped their net position the wrong way for forty seconds.
They needed a layer between strategy and broker that was honest about latency, aware of risk, and boring under load. They had twelve weeks before the next live trading window opened.
The constraints
- Risk gate had to add < 50ms p99 to the order path. Any slower and the strategy edge collapsed.
- Idempotency had to be guaranteed across retries, replays, and broker timeouts — the same TradingView webhook can fire twice within a second on a flaky network.
- Position keeping had to be event-sourced, not derived from mutable broker state — auditors wanted a single immutable log.
- Zero-downtime deploys, because some live windows ran across deploy slots.
- All venue adapters behind one interface, so adding the next broker was a 3-day job, not a 3-week one.
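The "one interface" constraint looks roughly like this. The real adapters are a Rust trait; this Python sketch with invented types (`Order`, `Ack`, `VenueAdapter`) just shows the contract every venue must satisfy.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Order:
    symbol: str
    side: str      # "buy" | "sell"
    qty: float

@dataclass
class Ack:
    venue: str
    order_id: str

class VenueAdapter(ABC):
    """The single interface the router sees, whatever the broker speaks underneath."""
    name: str

    @abstractmethod
    def quote(self, symbol: str) -> tuple[float, float]:
        """Return (bid, ask) for the symbol."""

    @abstractmethod
    def place(self, order: Order) -> Ack:
        """Submit the order; must be safe to retry with the same order."""

    @abstractmethod
    def cancel(self, order_id: str) -> None:
        """Best-effort cancel."""

class PaperVenue(VenueAdapter):
    """Toy in-memory venue, here only to demonstrate the adapter contract."""
    name = "paper"

    def __init__(self) -> None:
        self._n = 0

    def quote(self, symbol: str) -> tuple[float, float]:
        return (99.5, 100.5)

    def place(self, order: Order) -> Ack:
        self._n += 1
        return Ack(venue=self.name, order_id=f"paper-{self._n}")

    def cancel(self, order_id: str) -> None:
        pass
```

Adding a broker then means implementing three methods, not touching the router — which is what turns a 3-week integration into a 3-day one.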
The shape we built
Four clean layers: clients (TradingView webhooks, the strategy runtime, the ops dashboard), the edge (FastAPI + a Rust risk gate), the core (the Rust router, an event-sourced position keeper, P&L pipeline, audit log), and the venues (Alpaca, IBKR, Binance, MT5).
The hot path — from webhook to broker ack — never touches Python after the FastAPI auth check. The risk gate runs in Rust against an in-memory snapshot of positions and exposure, and emits a structured allow/deny in milliseconds. The router lives next door, picks the venue based on a simple cost model (spread, fees, recent fill quality), and writes the order to a Redis Stream before calling the broker. If the broker call dies mid-flight, replay picks it up.
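The write-ahead-then-ack pattern is the heart of the replay guarantee. In this sketch a Python list stands in for the Redis Stream (the real service uses `XADD` and pending-entry reclaim); all names are illustrative.

```python
import uuid

class ReplayQueue:
    """Stand-in for a Redis Stream with consumer-group acks."""

    def __init__(self) -> None:
        self._entries: list[dict] = []   # {"id", "order", "acked"}

    def enqueue(self, order: dict) -> str:
        entry_id = str(uuid.uuid4())
        self._entries.append({"id": entry_id, "order": order, "acked": False})
        return entry_id

    def ack(self, entry_id: str) -> None:
        for e in self._entries:
            if e["id"] == entry_id:
                e["acked"] = True

    def pending(self) -> list[dict]:
        """Orders enqueued but never acked — the candidates for replay."""
        return [e["order"] for e in self._entries if not e["acked"]]

def send_order(queue: ReplayQueue, broker_call, order: dict):
    entry_id = queue.enqueue(order)      # durable write BEFORE the broker call
    try:
        result = broker_call(order)
    except ConnectionError:
        return None                      # entry stays pending; replay retries it
    queue.ack(entry_id)                  # ack only after the broker confirms
    return result
```

Because the queue write happens before the network call and the ack happens after, a crash at any point leaves the order either pending (replayed) or acked (done) — never lost.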
Position keeping is event-sourced: every fill, every cancel, every reconciliation correction is an immutable row. Current position is a fold over those rows. The reporting layer reads the same events into Parquet via Arrow, and the ops dashboard queries that. There is exactly one source of truth.
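"Current position is a fold over those rows" can be shown in miniature. The event names and schema here are illustrative (the real rows live in Postgres); the key point is that a late broker correction is a new row, never an edit.

```python
from functools import reduce

EVENTS = [
    {"type": "fill", "symbol": "AAPL", "qty": 100},
    {"type": "fill", "symbol": "AAPL", "qty": -40},
    # a broker correction arriving hours later is appended, not rewritten:
    {"type": "correction", "symbol": "AAPL", "qty": -5},
]

def apply(positions: dict, event: dict) -> dict:
    """One step of the fold: apply a single immutable event to the state."""
    qty = positions.get(event["symbol"], 0) + event["qty"]
    return {**positions, event["symbol"]: qty}

def current_positions(events) -> dict:
    """Current position = fold(apply, events, empty state)."""
    return reduce(apply, events, {})
```

Since the dashboard, the P&L pipeline, and the auditors all fold the same log, they cannot disagree — that is the "exactly one source of truth" property.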
What was hard
- Broker idiosyncrasies. IBKR's socket protocol versus Binance's REST/websocket split versus Alpaca's clean REST — getting them behind one trait took longer than the entire risk gate.
- Reconciling intraday with broker statements. Brokers correct fills hours after the fact. The event store has to accept these corrections without rewriting history.
- Time. Every layer needed monotonic timestamps the venues didn't provide; we minted our own and stored both.
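The "minted our own and stored both" point on timestamps looks roughly like this sketch (names are assumptions): each event carries the venue's reported time for display and reconciliation, plus a locally minted monotonic stamp that is the only clock trusted for ordering.

```python
import time
from dataclasses import dataclass

@dataclass
class Stamped:
    venue_ts: float        # whatever the broker reported — kept, but not trusted for order
    local_mono_ns: int     # our monotonic stamp — trusted for ordering
    payload: dict

def stamp(venue_ts: float, payload: dict) -> Stamped:
    """Mint a local monotonic timestamp at ingest and store both clocks."""
    return Stamped(venue_ts=venue_ts,
                   local_mono_ns=time.monotonic_ns(),
                   payload=payload)
```

Sorting by `local_mono_ns` stays stable even when a venue's clock jumps backwards mid-session — which is exactly the failure that made fills "arrive out of order".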
What it does today
Live since late February 2025. Through Q3 it has routed $80M of notional across four venues. p99 latency on the risk gate has held at 38ms against a budget of 50ms, and the median sits around 11ms. Peak sustained throughput is 12.4k req/s. There have been zero reconciliation breaks since launch and one production incident, caused by a venue outage during which the system correctly halted trading.
What I'd do differently
I'd push more of the venue cost model out of code and into configuration earlier — the desk's preferences changed three times and each one was a small redeploy when it should have been a database write. I'd also add a shadow-route mode from day one, so new venue adapters can accept traffic in parallel with the live router for a week before going hot.
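What "cost model as configuration" could look like: the weights live in a table (a dict here; a database row in production), so a change of desk preference is a write, not a redeploy. Field names and weights are illustrative.

```python
# Hypothetical weights row — in production this would be read from the database.
COST_WEIGHTS = {"spread": 0.5, "fees": 0.3, "fill_quality": 0.2}

def venue_cost(weights: dict, spread: float, fees: float, fill_quality: float) -> float:
    """Lower is better; fill_quality in [0, 1], where 1.0 means perfect fills."""
    return (weights["spread"] * spread
            + weights["fees"] * fees
            + weights["fill_quality"] * (1.0 - fill_quality))

def pick_venue(weights: dict, quotes: dict) -> str:
    """quotes: venue name -> (spread, fees, fill_quality). Returns cheapest venue."""
    return min(quotes, key=lambda v: venue_cost(weights, *quotes[v]))
```

With the weights externalized, the shadow-route mode mentioned above becomes cheap too: run a second weight set against live quotes and log where it would have routed.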
The stack
- Rust (risk gate, router core)
- Python · FastAPI (edge, ops)
- Postgres (event-sourced positions)
- Redis Streams (replay queue)
- Parquet · Arrow (P&L pipeline)
- AWS (ECS, RDS, S3 WORM)
- GitHub Actions (CI + zero-downtime deploy)
Have a similar problem?
If this shape of engagement fits what you're working on, I'd be happy to scope it.