AI · April 21, 2026 · 23 min read

Why Algorithmic Trading and Machine Learning Are the Same Problem in Different Clothes

Both are parameter estimation under uncertainty. Both descend the same loss surface. A working tour of the shared mathematical principles — maximum likelihood, convex optimisation, covariance decomposition, regularisation, generalisation, stochastic calculus, Bayesian filtering, reinforcement learning — that make the two fields a single discipline with two vocabularies.

Hasan Javed
Senior Full-Stack & AI Engineer

On every good quant desk I have ever worked on there is a person who trains the model and a person who sizes the position, and they are the same person. Not because firms are cheap, though some are, but because it is the same job — the same loss surface, the same gradient step, the same overfit haunting the same walk-forward split. The vocabulary on the two sides of their monitor differs. The mathematics does not. This is the essay I wish someone had handed me the week I stopped treating machine learning and algorithmic trading as two disciplines.

The thesis is embarrassingly simple. Both fields are doing parameter estimation under uncertainty. Both have chosen their favourite loss functions for local reasons. Both have rediscovered the same regularisers, the same covariance decompositions, the same filtering recursions, and the same humility about out-of-sample performance — each under a different name. Put the textbooks side by side and half the theorems are transliterations of each other. You do not need to learn two fields. You need to learn the mathematics once, carefully, and notice which of its hats it is wearing in a given week.

I am going to argue this the only way it can be honestly argued: by writing the equations down in each vocabulary and pointing at the object in the middle. I will move through eight principles — maximum likelihood, convex optimisation, covariance decomposition, regularisation, generalisation, stochastic calculus, Bayesian filtering, and reinforcement learning — and in each one I will name the same mathematical creature on both sides of the discipline fence. At the end I will be honest about where the analogy breaks, because the markets are not the cat-or-dog dataset, and pretending otherwise costs money.

1 · Both problems are, underneath, maximum likelihood

Start from the bottom. In machine learning you are given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ and a parametric family $p_\theta(y \mid x)$. Training means choosing $\theta$ to make the observed data as likely as possible under the model — equivalently, minimising the negative log-likelihood, because logs turn products into sums and computers prefer sums.

$$\hat{\theta} \;=\; \arg\min_{\theta}\; -\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$$

In words: pick the parameters that make the data you actually saw the least surprising.

Maximum likelihood estimation. Every loss you have ever used is a special case of this.

Mean-squared-error loss is MLE under a Gaussian-noise assumption. Cross-entropy loss is MLE under a categorical model. The logistic regression cost function is MLE under a Bernoulli. These are not analogies. They are identities. The only moving part across supervised-learning problems is which probability distribution you assumed for the observations.
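The identity is easy to check numerically. A minimal numpy sketch on synthetic data (all names and numbers illustrative): the ordinary-least-squares fit — the MSE minimiser — sits exactly at the stationary point of the Gaussian negative log-likelihood.

```python
import numpy as np

# Sketch: minimising MSE and minimising the Gaussian negative log-likelihood
# pick out the same parameters. Synthetic data; names are illustrative.

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)
X = np.column_stack([x, np.ones_like(x)])   # design matrix with intercept

# MSE minimiser: ordinary least squares.
theta_mse, *_ = np.linalg.lstsq(X, y, rcond=None)

def gaussian_nll(theta, sigma=0.5):
    # -sum log N(y | X theta, sigma^2)
    # = n/2 * log(2 pi sigma^2) + ||y - X theta||^2 / (2 sigma^2):
    # the theta-dependent part is exactly the squared error.
    resid = y - X @ theta
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

# Central finite differences: the NLL gradient vanishes at the OLS solution.
eps = 1e-6
grad = np.array([
    (gaussian_nll(theta_mse + eps * e) - gaussian_nll(theta_mse - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(grad, 0.0, atol=1e-3))  # True: same optimum
```

The Gaussian assumption only shifts and scales the loss; it never moves the argmin, which is the whole point of the identity.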

Now move across the hallway. An algorithmic trading strategy allocates capital across assets with random returns $r_t$, producing a stochastic wealth process $W_t$. The trader who maximises the expected logarithm of terminal wealth — the Kelly criterion — is solving

$$w^{*} \;=\; \arg\max_{w}\; \mathbb{E}\!\left[\log\!\left(1 + w^{\top} r\right)\right]$$

In words: the portfolio that is most likely to survive and grow is the one that is the least surprised by the return distribution.

Kelly's log-optimal portfolio. Note the shape.

The shape is the same. Kelly is an MLE: the trader assumes a return distribution $p(r)$, and Kelly weights are the maximum-likelihood allocation under the log-utility model. Shannon saw this first, Kelly formalised it, Thorp made money with it, and every modern ML practitioner rewrites it weekly without noticing, because the cross-entropy loss is algebraically identical to the negative log-growth rate of a portfolio that bets proportionally to predicted probabilities.
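The claim is concrete enough to compute. A sketch with an illustrative even-money coin flip: grid-searching the expected log-growth recovers the classical Kelly fraction $f^{*} = p - q/b$.

```python
import numpy as np

# Sketch: the Kelly fraction as a maximum-likelihood-shaped optimisation --
# maximise expected log-growth under an assumed return distribution.
# The coin-flip numbers below are illustrative.

p_win, b = 0.55, 1.0                    # win probability, even-money payoff
fractions = np.linspace(0.0, 0.99, 1000)

# Expected log-growth of wealth per bet at each candidate fraction f.
growth = p_win * np.log1p(b * fractions) + (1 - p_win) * np.log1p(-fractions)

f_numeric = fractions[np.argmax(growth)]
f_closed = p_win - (1 - p_win) / b      # classical Kelly formula: f* = p - q/b

print(round(f_numeric, 3), round(f_closed, 3))  # both ≈ 0.10
```

The grid search is the point: nothing about the optimisation knows it is a betting problem rather than a likelihood problem.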

2 · Same optimiser, different loss

Once both problems are written as minimisation, both fields reach for the same algorithm. Gradient descent. Not by coincidence — because for high-dimensional smooth objectives there is essentially nothing else that scales, and both fields eventually drag their objective into high dimensions the moment the problem becomes interesting.

$$\theta_{t+1} \;=\; \theta_t \;-\; \eta\, \nabla_{\theta} L(\theta_t)$$

In words: move against the slope. Do it often. Stop when you stop making progress.

Gradient descent. The universal engine of both fields.

The ML literature is unreasonably rich on the optimiser layer above this update — Adam, AdamW, Lion, Shampoo, AdaFactor, Muon — each one a clever way of estimating, preconditioning, or decaying the gradient. The portfolio-optimisation literature is unreasonably rich on the loss layer — Sharpe ratio, Sortino, Calmar, Omega, conditional value-at-risk, drawdown-constrained growth. The two conversations do not intersect often. They should.

A loss from the trading side, written in the ML grammar:

$$L(\theta) \;=\; -\,\frac{\mathbb{E}[\,r_t(\theta)\,]}{\sqrt{\operatorname{Var}[\,r_t(\theta)\,]}}$$

Maximising Sharpe is minimising negative Sharpe. Same optimiser, different $L$.

Feed the negative-Sharpe loss to an Adam implementation and it will happily march down it for you. Feed the cross-entropy loss to a sequential quadratic programming solver that was originally written for portfolio optimisation and it will happily march down that. The two engines are interchangeable. What is not interchangeable, and what the practitioner spends their life learning, is which loss is appropriate for the problem they actually face.
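To make "same optimiser, different loss" literal, here is plain gradient descent marching down a negative-Sharpe loss on a synthetic three-asset return stream. All numbers are illustrative, and finite differences stand in for autodiff.

```python
import numpy as np

# Sketch: a trading loss (negative Sharpe) fed to an ordinary
# gradient-descent loop. Synthetic i.i.d. daily returns; illustrative only.

rng = np.random.default_rng(1)
T, n = 2000, 3
mu = np.array([0.002, 0.001, 0.0])               # daily expected returns
R = rng.normal(loc=mu, scale=0.01, size=(T, n))  # simulated return history

def neg_sharpe(w):
    port = R @ w
    return -port.mean() / port.std()

def num_grad(f, w, eps=1e-6):
    # Central finite differences stand in for autodiff here.
    return np.array([
        (f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(len(w))
    ])

w0 = np.ones(n) / n
w = w0.copy()
for _ in range(2000):
    w -= 0.2 * num_grad(neg_sharpe, w)
    w /= w.sum()   # Sharpe is scale-invariant, so renormalising is free

print(w.round(2))  # tilts toward the higher-mean assets
```

Swap `neg_sharpe` for a cross-entropy and nothing in the loop changes — which is exactly the point of the section.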

A subtler example. In ML, the natural gradient corrects the raw gradient by the inverse Fisher information matrix, $\tilde{\nabla} = F^{-1} \nabla_\theta L$. In portfolio theory, Markowitz's optimal weights correct the expected-return vector by the inverse covariance matrix, $w^{*} \propto \Sigma^{-1} \mu$. These are the same operation. The Fisher matrix is a covariance matrix — the covariance of the score function under the model. Both fields discovered, separately, that one should precondition a descent direction by the inverse of a second-moment matrix. Neither field tends to tell the other.

3 · The covariance matrix, named twice

If the gradient is the atom of the optimisation step, the covariance matrix is the atom of structure. In both disciplines it is the single object that does the most work, and in both disciplines its eigendecomposition is the single most-used tool for dimensionality reduction.

$$\Sigma \;=\; Q \Lambda Q^{\top}, \qquad Q^{\top} Q = I, \;\; \Lambda \text{ diagonal}$$

In words: describe the shape of your data's variation in a basis of uncorrelated directions.

The spectral theorem, the one piece of linear algebra that pays for itself every single day.

In machine learning, this decomposition is called principal component analysis. It compresses feature matrices, whitens inputs before a neural network, decorrelates channels in batch normalisation, and picks the directions of maximum variance for visualisation. Every preprocessing pipeline in every tabular ML project is secretly an eigendecomposition of a covariance matrix.

In quantitative finance, the same decomposition is called eigenportfolios. The first eigenvector of the asset return covariance matrix is, to a very good approximation, the market factor. The second is commonly a value/growth or size-related axis. The third downward is where the domain expertise earns its keep. The entire practice of factor investing is a story about which eigenvectors of $\Sigma$ earn compensated risk premia and which do not.
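A sketch of the eigenportfolio claim on synthetic returns driven by a single hypothetical market factor: the leading eigenvector of the sample covariance recovers the factor exposures.

```python
import numpy as np

# Sketch: the first eigenvector of a return covariance matrix recovers the
# common "market" factor. The one-factor return generator is illustrative.

rng = np.random.default_rng(2)
T, n = 2500, 8
market = rng.normal(scale=0.01, size=T)        # shared market factor
beta = rng.uniform(0.5, 1.5, size=n)           # per-asset exposures
R = np.outer(market, beta) + rng.normal(scale=0.003, size=(T, n))

Sigma = np.cov(R, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order

v1 = eigvecs[:, -1]                            # leading eigenportfolio
v1 *= np.sign(v1.sum())                        # fix the arbitrary sign
corr = np.corrcoef(v1, beta)[0, 1]
print(round(corr, 2))  # close to 1: PCA found the market factor
```

The same `eigh` call, pointed at a feature matrix instead of a return matrix, is PCA — no code changes required.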

The two communities even agree that the raw sample covariance is a bad estimator in high dimensions — and they agree, independently, on the fix. In ML it is called ridge regression: we add $\lambda I$ to $X^{\top} X$ before inverting. In quant finance it is called Ledoit-Wolf shrinkage: we shrink the sample covariance toward a structured target, typically the identity, with an analytically-derived intensity.

$$\hat{\Sigma} \;=\; (1 - \delta)\, S \;+\; \delta\, F$$

In words: average the noisy sample covariance $S$ with a clean guess $F$ — more shrinkage when the noise dominates.

Ledoit-Wolf. Also: ridge regularisation. Also: Bayesian posterior with a diffuse prior. One trick, three diplomas.

The ML regularisation coefficient $\lambda$ and the finance shrinkage intensity $\delta$ are cousins from the same family: both exist because unbiased estimators are often inadmissible in high dimensions, and a little bit of bias traded for a lot of variance is almost always the profitable trade. James-Stein (1961) knew this before either field had computers to misuse it on.
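A small experiment in the spirit of Ledoit-Wolf, with an illustrative fixed shrinkage intensity rather than the analytically optimal one: with 50 assets and only 60 observations, the shrunk estimator beats the raw sample covariance.

```python
import numpy as np

# Sketch: shrinking a sample covariance toward a scaled identity wins when
# observations are scarce. The intensity is a free parameter here;
# Ledoit-Wolf derives the optimal value analytically.

rng = np.random.default_rng(3)
n, T = 50, 60                        # 50 assets, only 60 observations
true_cov = np.eye(n)                 # ground truth for the experiment
R = rng.normal(size=(T, n))

S = np.cov(R, rowvar=False)          # noisy sample covariance
target = np.trace(S) / n * np.eye(n) # structured target: scaled identity

def error(est):
    return np.linalg.norm(est - true_cov)   # Frobenius distance to truth

delta = 0.5                          # illustrative shrinkage intensity
shrunk = (1 - delta) * S + delta * target

print(error(shrunk) < error(S))  # True: biased, but lower total error
```

The bias-variance trade in one line: the shrunk estimator is wrong on purpose, and closer to the truth for it.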

4 · Regularisation is humility, named twice over

Staying with the theme: both fields eventually discover that the raw optimum is a liar, and both fields impose the same family of penalties to stop the liar from winning.

$$\hat{\theta} \;=\; \arg\min_{\theta}\; L(\theta) \;+\; \lambda \lVert \theta \rVert_{p}^{p}$$

Penalised estimation. Different $p$, different story.

Set $p = 2$ and you get ridge regression on the ML side and, on the finance side, mean-variance optimisation with the weight vector shrunk toward zero — a technique used to prevent extreme long/short concentrations that look wonderful in-sample and detonate in production. The math is indistinguishable.

Set $p = 1$ and you get LASSO on the ML side and, on the finance side, sparse-portfolio construction: only hold a handful of names, pay no attention to the rest. The reason both work is the same: the $\ell_1$ ball has corners, and the corresponding optimum sits on a corner, zeroing out most coordinates. The domain-specific motivation (avoiding trading costs on tiny weights versus interpretable feature selection) is decoration around the same fact about convex geometry.
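The corner-sitting behaviour is easy to watch. A sketch of proximal gradient descent (ISTA) on a lasso problem whose true coefficients are sparse — the same soft-thresholding step that produces sparse portfolio weights. All sizes and penalties are illustrative.

```python
import numpy as np

# Sketch: the l1 penalty zeroing out coordinates, via proximal gradient
# descent (ISTA). A lasso regression with three true features; the same
# machinery applied to portfolio weights yields sparse portfolios.

rng = np.random.default_rng(4)
n_obs, n_feat = 200, 20
X = rng.normal(size=(n_obs, n_feat))
true_w = np.zeros(n_feat)
true_w[:3] = [2.0, -1.5, 1.0]        # only three features matter
y = X @ true_w + rng.normal(scale=0.5, size=n_obs)

lam, lr = 0.3, 0.1                   # penalty strength, step size
w = np.zeros(n_feat)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / n_obs          # gradient of the smooth part
    w = w - lr * grad
    # Soft-thresholding: the proximal operator of lam * ||w||_1.
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

print((np.abs(w) > 1e-8).sum())  # a small count: the true features survive
```

Every coordinate whose correlation with the residual never clears the penalty is pinned to exactly zero — the algebraic shadow of the corner on the $\ell_1$ ball.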

Other regularisers match up just as cleanly. Early stopping in neural-network training is a position limit: you stop updating weights before they overfit; the trader stops increasing exposure before the signal overfits. Dropout is an ensemble average over thinned networks; the desk running five half-correlated sub-strategies and averaging their signals is doing the same thing for the same statistical reason — reducing estimator variance at the price of a small amount of bias.

5 · In-sample vs out-of-sample — two fields, one mistake

Once you accept that both fields are doing empirical risk minimisation, the next problem becomes inevitable: the minimiser of empirical risk is not the minimiser of true risk. The generalisation gap is the quantity that separates the two, and both fields have spent decades discovering, under different names, how to measure and bound it.

In ML you hold out a validation set, then a test set, then perhaps a second test set to check that the first test set has not been overused by model selection, then you worry about label leakage, distribution shift, and the fact that academic benchmark numbers stopped improving out-of-distribution around 2017. In trading you run a backtest, then a walk-forward, then paper trading, then small-capital live, then scale cautiously — and you worry about survivorship bias, look-ahead bias, and the fact that your Sharpe ratio halves every time you look more carefully at it.

These are the same worry. A backtest is an in-sample fit. Paper trading is a held-out validation. Live trading is the true test set, and unlike ML's, it does not sit quietly; it fights back. The single most important number in both fields is not the in-sample score. It is the expected gap between in-sample and out-of-sample, and the literatures agree, surprisingly quantitatively, on how to estimate it.

In ML, the PAC-Bayes framework bounds generalisation error by a complexity term involving the KL divergence between posterior and prior over hypotheses. In trading, the deflated Sharpe ratio of Bailey and López de Prado adjusts an observed Sharpe for the number of strategies tried, their correlation, and the moments of the return distribution.

$$\widehat{\mathrm{DSR}} \;=\; \Phi\!\left(\frac{\left(\widehat{\mathrm{SR}} - \mathrm{SR}^{*}\right)\sqrt{T - 1}}{\sqrt{1 - \hat{\gamma}_3\, \widehat{\mathrm{SR}} + \frac{\hat{\gamma}_4 - 1}{4}\, \widehat{\mathrm{SR}}^{2}}}\right)$$

In words: how many of the hundred strategies you tested would have printed this Sharpe by chance alone? Subtract them out.

Deflated Sharpe Ratio, with $\mathrm{SR}^{*}$ the Sharpe expected from the best of the trials under pure chance. The trading world's PAC-Bayes correction.

The two formulas do not look alike on the page. They do the same thing: penalise the naive estimator by a correction that grows with the size of the hypothesis class explored, so that the quantity you report is the one you can actually reproduce out-of-sample. Every sufficiently mature quantitative field has eventually grown a theorem that says: the best of many tries looks better than it is; here is by how much. ML calls it the union bound over the hypothesis class. Trading calls it multiple-testing inflation. The statement is the same and so is the cure.
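The inflation is cheap to demonstrate. A sketch: one hundred strategies with no skill at all, one year of daily data each — and the best of them still prints an impressive Sharpe.

```python
import numpy as np

# Sketch: the best Sharpe among many skill-free strategies looks
# significant on its own. Numbers are illustrative; the deflated Sharpe
# ratio formalises the correction for exactly this effect.

rng = np.random.default_rng(5)
n_strategies, n_days = 100, 252
returns = rng.normal(scale=0.01, size=(n_strategies, n_days))  # zero mean: no skill

# Annualised Sharpe of each random strategy.
sharpes = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)

print(round(sharpes.max(), 1))  # the "best" zero-skill strategy prints Sharpe above 1
```

The average Sharpe across the hundred is near zero, as it should be; the maximum is the quantity that lies, and the quantity that gets pitched.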

6 · Stochastic calculus: one set of equations, two applications

Up to here everything has lived in discrete time, on finite samples, with familiar loss functions. Now the physics gets serious. Both quantitative finance and modern generative AI lean on the same branch of mathematics — Itô’s stochastic calculus — and on the same forward-backward duality between an SDE and the PDE that governs its marginal densities.

$$dX_t \;=\; \mu(X_t, t)\, dt \;+\; \sigma(X_t, t)\, dW_t$$

In words: a deterministic drift plus an independent random kick, in continuous time.

The Itô SDE. Picks out Brownian motion as the building block of continuous-time randomness.

On the finance side, set $\mu(S_t, t) = \mu S_t$ and $\sigma(S_t, t) = \sigma S_t$ — geometric Brownian motion — apply Itô’s lemma to a twice-differentiable function $V(S_t, t)$, enforce no-arbitrage via a replicating portfolio, and you obtain the Black-Scholes PDE that every derivatives desk in the world has solved, in one form or another, before breakfast since 1973.

$$\frac{\partial V}{\partial t} \;+\; r S \frac{\partial V}{\partial S} \;+\; \frac{1}{2} \sigma^{2} S^{2} \frac{\partial^{2} V}{\partial S^{2}} \;=\; r V$$

Black-Scholes. A backward parabolic PDE run from expiry to today.

On the ML side, take the same forward SDE, run it on data instead of prices, and you obtain a diffusion model. The forward process progressively corrupts the data with Gaussian noise; the model learns to reverse the process by estimating the score function, $\nabla_x \log p_t(x)$. Sampling is integration of a reverse SDE derived from Anderson (1982), a forty-year-old result the generative AI literature rediscovered in 2020.

$$dX_t \;=\; \left[\mu(X_t, t) - \sigma^{2}(X_t, t)\, \nabla_x \log p_t(X_t)\right] dt \;+\; \sigma(X_t, t)\, d\bar{W}_t$$

Anderson's reverse-time SDE, with $\bar{W}_t$ a Brownian motion running backward in time. The engine of every Stable Diffusion you have ever run.

These equations — the forward SDE, the Fokker-Planck PDE that governs its density, and the backward SDE that reverses it — are pure classical analysis. Black-Scholes is a special case of Feynman-Kac applied to geometric Brownian motion. Score-based generative modelling is Anderson’s reverse-time formula applied to a tractable forward corruption. The Greeks that hedge a European call ($\Delta = \partial V / \partial S$, $\Gamma = \partial^{2} V / \partial S^{2}$) and the score that guides a diffusion sample ($\nabla_x \log p_t$) are siblings: partial derivatives of prices and log-densities, treated as control signals.
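The forward side of this machinery fits in a few lines. An Euler-Maruyama sketch of the geometric Brownian motion SDE — the forward process Black-Scholes assumes — checked against the known lognormal mean. Parameters are illustrative.

```python
import numpy as np

# Sketch: Euler-Maruyama integration of geometric Brownian motion,
# dX = mu*X dt + sigma*X dW -- the forward SDE of the Black-Scholes world.
# The same discretisation scheme integrates a diffusion model's SDEs.

rng = np.random.default_rng(6)
mu, sigma, x0 = 0.05, 0.2, 100.0
T, n_steps, n_paths = 1.0, 252, 10_000
dt = T / n_steps

x = np.full(n_paths, x0)
for _ in range(n_steps):
    dW = rng.normal(scale=np.sqrt(dt), size=n_paths)  # Brownian increments
    x = x + mu * x * dt + sigma * x * dW              # drift step + random kick

# Sanity check against the exact lognormal solution: E[X_T] = x0 * exp(mu*T).
print(round(x.mean(), 1), round(x0 * np.exp(mu * T), 1))  # agree to Monte-Carlo error
```

Replace the drift and diffusion with a learned score term and the identical loop integrates Anderson’s reverse-time SDE — the sampler inside a diffusion model.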

7 · The Kalman filter has three names and one equation

If one algorithm deserves to be the mascot of this essay it is the Kalman filter. It is the right answer to so many questions that three mostly-disjoint communities rediscovered it independently, named it after themselves, and spent the next half-century comparing notes.

Statistically, the filter is the optimal Bayesian state estimator for a linear-Gaussian dynamical system: a hidden state evolves as $x_{t+1} = A x_t + w_t$ with $w_t \sim \mathcal{N}(0, Q)$, and you observe a noisy linear projection $y_t = H x_t + v_t$ with $v_t \sim \mathcal{N}(0, R)$. The filter recursively updates a Gaussian posterior over $x_t$ given all past observations.

$$\hat{x}_{t \mid t} \;=\; \hat{x}_{t \mid t-1} \;+\; K_t \left( y_t - H \hat{x}_{t \mid t-1} \right), \qquad K_t \;=\; P_{t \mid t-1} H^{\top} \left( H P_{t \mid t-1} H^{\top} + R \right)^{-1}$$

In words: start with your prior belief. Observe. Update toward the observation, weighted by how confident you were in each.

The Kalman update step. The rest of the filter is bookkeeping.

In signal processing this is a recursive least-squares estimator; in control engineering it is the observer half of the linear-quadratic-Gaussian controller; in the stats literature it is the Gaussian hidden Markov model inference algorithm.

In algorithmic trading, the filter is how you estimate a time-varying hedge ratio in a statistical arbitrage pair. The hidden state is the spread’s mean or the hedge coefficient; the observation is today’s cross-sectional price difference. Kalman is how you keep trading the pair even as the relationship between the two assets drifts — which, in real markets, it always does.
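A scalar sketch of exactly this use: a Kalman filter tracking a hedge ratio that drifts as a random walk. All noise scales are illustrative.

```python
import numpy as np

# Sketch: a scalar Kalman filter tracking a slowly drifting hedge ratio
# beta_t in y_t = beta_t * x_t + noise. Illustrative noise scales.

rng = np.random.default_rng(7)
T = 500
beta_true = 1.0 + np.cumsum(rng.normal(scale=0.01, size=T))  # random-walk hedge ratio
x = rng.normal(scale=1.0, size=T)                            # asset-1 returns
y = beta_true * x + rng.normal(scale=0.1, size=T)            # asset-2 returns

q, r = 0.01**2, 0.1**2       # state-noise and observation-noise variances
beta_hat, P = 0.0, 1.0       # prior mean and variance over beta
estimates = np.empty(T)
for t in range(T):
    # Predict: the state is a random walk, so the mean carries over.
    P = P + q
    # Update: innovation, gain, posterior.
    innovation = y[t] - beta_hat * x[t]
    S = x[t] * P * x[t] + r              # innovation variance
    K = P * x[t] / S                     # Kalman gain
    beta_hat = beta_hat + K * innovation
    P = (1 - K * x[t]) * P
    estimates[t] = beta_hat

err = np.abs(estimates[50:] - beta_true[50:]).mean()
print(round(err, 3))  # tracks the drifting ratio closely after burn-in
```

A rolling OLS regression on the same data lags the drift; the filter does not, because the random-walk state model is carrying the non-stationarity explicitly.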

In deep learning, the Kalman filter is a linear recurrent neural network with analytic gradient. The forward recursion of an LSTM, stripped of its gating nonlinearities, is the prediction step of a Kalman filter on its internal state. The connection was made explicit in the 2020s with the rise of state-space models (S4, Mamba): learned linear recurrences whose initialisation is borrowed, deliberately, from classical state-estimation theory.

Four communities. Same recursion. If you learn the Kalman filter properly once — the innovation, the gain, the Riccati equation for the posterior covariance — you have also learned, for free, a foundational tool in each of the other three fields.

8 · Reinforcement learning is algorithmic trading, exactly

Of all the parallels between ML and algorithmic trading, the reinforcement-learning parallel is the one most likely to be missed, and the one most profitable to see. They are not analogous; they are the same problem. An RL agent and a trading strategy both face a Markov decision process. Both must choose actions to maximise an expected discounted stream of rewards. Both are governed by Bellman’s equation.

$$V^{*}(s) \;=\; \max_{a} \left[ r(s, a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]$$

In words: the value of being in a state is the best action's immediate reward, plus the discounted value of where you land next.

The Bellman optimality equation. The entire field of dynamic programming in one line.

Map the symbols. The state $s$ is the market state — prices, positions, inventory, time-of-day. The action $a$ is the portfolio weight change. The reward $r$ is the realised risk-adjusted PnL over the next step. The transition $P(s' \mid s, a)$ is the market-impact model composed with the price process. The discount factor $\gamma$ encodes time preference or funding cost. Solve the Bellman equation; get the optimal trading policy. Merton did it for continuous-time consumption and portfolio choice in 1971 — forty-five years before AlphaGo.

Every RL algorithm has a direct analogue on the trading side, and many predate the ML rediscovery by decades. Value iteration is the numerical scheme that stochastic control has used since the 1960s. Policy gradients, derived as $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]$, are the same shape as the gradient of expected log-utility with respect to a policy’s parameters. Actor-critic decomposition is the trader plus the risk manager: one proposes actions, the other evaluates them. Off-policy learning is paper-trading a new strategy while running the old one on a live book.
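Value iteration itself is a dozen lines. A sketch on a toy two-state market MDP — flat or long, hold or switch, with illustrative rewards and a switching cost, not a calibrated market model.

```python
import numpy as np

# Sketch: value iteration on a toy two-state, two-action market MDP.
# States: 0 = flat, 1 = long. Actions: 0 = hold, 1 = switch position.
# Rewards and transitions are illustrative.

gamma = 0.95
# reward[s, a]: being long earns a small drift; switching pays a cost.
reward = np.array([[0.0, -0.1],     # flat:  hold earns 0, switching costs 0.1
                   [0.05, -0.05]])  # long:  hold earns drift, switching costs
# P[a, s, s']: deterministic position transitions.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],       # hold:   stay in the current state
    [[0.0, 1.0], [1.0, 0.0]],       # switch: flip the position
])

V = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: Q(s,a) = r(s,a) + gamma * sum_s' P V(s').
    Q = reward + gamma * np.einsum("ast,t->sa", P, V)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)
print(V.round(2), policy)  # optimal policy: switch into long, then hold
```

The fixed point is exact here: holding long forever is worth $0.05 / (1 - 0.95) = 1.0$, and entering from flat costs the switch, giving $-0.1 + 0.95 \times 1.0 = 0.85$. The backup loop is the same contraction that powers Q-learning, just with the transition model known.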

The pathologies of RL match the pathologies of live trading with alarming precision. Distributional shift between training and deployment is regime change. Reward hacking is the strategy that exploits a bug in the backtester’s fill model. Exploration-exploitation tradeoff is how much capital to commit to a new signal before you trust it. Credit assignment across long time horizons is why nobody can reliably backtest a monthly strategy with a five-year sample.

9 · Where the analogy breaks — and why it matters

If the mathematical substrate is this uniform, why does naive transfer of ML techniques into trading fail so reliably, and so expensively? Because the distribution the math assumes is a nicer object than the one the market actually is. Three structural differences deserve a name.

Non-stationarity. Classical ML theory assumes the training data and the test data are drawn from the same distribution. A finite amount of distribution shift is tolerable and well-studied. But the market’s distribution is a moving target on every timescale: intraday microstructure changes as venues rebalance; weekly volatility regimes are punctuated by earnings; macro regimes rotate on a business-cycle scale. The sample size over which the distribution is approximately stationary is usually smaller than the one you need to fit your model. This is why tabular ML on ten years of daily returns almost never produces a strategy that survives in year eleven.

Reflexivity. In ML, the learner consumes the dataset but does not change it. In trading, your actions change the very distribution you are trying to learn from. A sufficiently large trader moves the price. A sufficiently popular signal gets arbitraged away. The market is an adversarial environment in which every participant is trying to forecast every other, and the distribution is the endogenous equilibrium of that game. Soros called this reflexivity. Mathematically it means your Bellman equation’s transition kernel $P(s' \mid s, a)$ depends on the policy $\pi$, and in a live market the dependence is nontrivial enough that treating $P$ as exogenous is one of the ways a strategy becomes profitable in backtest and ruinous in production.

Adversarial data. Most ML datasets are cooperative: someone curated ImageNet so you could learn from it. The market is constructed by participants actively trying to keep information out of the price. Signals therefore live in the tails — the small, transient, hard-to-measure corners — and the signal-to-noise ratio is dramatically worse than even the grimmest ML benchmark. A daily return series with a genuine Sharpe of 1.0 is a sequence of Bernoulli trials with an edge on the order of one percent. The information content per observation is small enough that the overfitting budget is essentially zero — which is why the regularisation discussion above is not optional in trading; it is constitutive.

The correct response to these three structural differences is not to abandon ML in favour of hand-tuned rules. It is to pick, from the mathematical toolkit, the techniques that are robust to them. Bayesian models that carry their uncertainty forward. Ensembles of simple learners that respect the small-sample regime. Explicit regime-switching models rather than a single global fit. Time-series cross-validation that respects temporal ordering. Deflated performance metrics. The toolkit is shared. The selection criteria within it are what separate a strategy that survives from one that pays tuition.

10 · What the shared substrate means in practice

One immediate implication, and one that any practitioner can verify on themselves: the marginal hour spent deepening your understanding of a core mathematical principle — probability, linear algebra, stochastic calculus, convex optimisation — has a higher return than the marginal hour spent learning a framework. The framework will be out of fashion in five years. The Fisher information matrix will not. The Kalman recursion will not. The Bellman equation will not.

A second implication: the literatures are cross-readable, and rewarding to read that way. A quant who reads the score-matching papers of Song and Ermon as a piece of applied stochastic calculus — and recognises the Feynman-Kac machinery they’re wielding — learns things about hedging that the hedging literature has left on the table. An ML researcher who reads Merton (1971) and sees a policy-gradient derivation in half a page of undergraduate calculus learns that the modern RL toolkit is older, and in certain ways tighter, than the publication record suggests.

A third, more practical implication: if the underlying problem is the same, the failure modes transfer. The hyperparameter-tuning paranoia that a good ML engineer brings to a leaderboard is exactly the paranoia a good trader should bring to a Sharpe chart. The temptation to keep tweaking until the validation loss looks right is exactly the temptation to keep tweaking until the backtest Sharpe crosses two. Both sides of the profession have libraries of cautionary tales; reading both libraries immunises you twice.

11 · A personal note — why a love of mathematics quietly does the bridging

I was the child who underlined equations in textbooks. Not the worked examples; the equations themselves — that moment when a long argument collapsed into half a line of symbols and seemed, briefly, to explain something it had no right to explain. Two and a half decades later, the instinct that drew me to those equations is the same instinct I earn my living from. It is the part of the work I have never stopped enjoying for its own sake, and the part I have come to believe is the practitioner’s most undervalued asset.

The reason is exactly the thesis of this essay. If both fields are running the same machinery in different costumes, then the engineer who has internalised the machinery is the engineer who can move between costumes without re-learning the trade. I built fintech systems before I built AI systems. The transition was not as long as everyone insisted it would be, because the second job was the first job in a different vocabulary. The covariance matrix on the trading desk and the covariance matrix in the embedding-quality dashboard were the same matrix. I noticed because I had already, separately, loved the matrix.

Loving the mathematics has a second, more concrete benefit: it stops you from being intimidated by either field’s marketing. When a deep-learning paper unveils a new optimiser, you can read it as a particular preconditioner of a particular gradient and decide on your own whether the trick generalises. When a hedge fund advertises a proprietary risk model, you can read its prospectus as a particular shrinkage of a particular covariance estimator and decide how much of the fee is for the math. The marketing in both fields is loud; the mathematics is quiet. Whoever you trust to be quiet with you is whoever has read the same textbooks you have.

I am not arguing for a Platonist position on numbers, or for the supremacy of theory over practice. The practitioners I admire most are pragmatic almost to the point of impatience. But the pragmatism rests on a foundation, and the foundation is the same handful of theorems this essay has named twice — maximum likelihood, the spectral theorem, the contraction mapping that gives Bellman his fixed point, Itô’s lemma, the union bound. Each of them was first written down by someone who loved the abstraction enough to keep working on it long after it stopped being useful. Each of them is now a tool you use without thinking — provided that, at least once, you thought hard about the abstraction.

So when engineers ask me how to bridge machine learning and quantitative finance, my honest recommendation is: do not bridge them. Read mathematics in the morning, code applications in the afternoon, and the bridge will build itself, quietly, every time you notice a familiar derivation showing up in unfamiliar work. That noticing is the edge. And it is, in the end, the same noticing that made me underline the equations in the textbook the first time around — the small, durable joy of recognising the same shape behind two unrelated facts. I have followed that joy across two careers and through both of the disciplines this essay is about. Sometimes I think it is the only thing I have followed at all.

The vocabulary separating algorithmic trading from machine learning is a historical accident. The two fields grew up in different departments, wrote for different audiences, and published in different journals, and each built a jargon around a core of shared mathematics. When the jargon gets in the way of seeing the mathematics, the practitioner is poorer for it. When you can see through it — when weight decay and shrinkage and ridge and Tikhonov resolve into a single penalty, when policy gradient and Merton’s solution resolve into a single derivation, when the Kalman filter shows up in every other chapter without changing its shape — both fields become smaller, sharper, and far more useful. One discipline, two vocabularies. The practitioner’s job is to stop confusing them for two disciplines.

#algo-trading #trading #quant-finance #deep-learning #mathematics #linear-algebra #probability #optimisation #stochastic-calculus #time-series