Three weeks ago a friend of mine, a quant at a multi-strategy fund whose name I can’t print, sent me a terse Slack at eleven at night: we just killed the transformer. He was three months into a project to replace the team’s daily-frequency LightGBM model with something more modern. In backtest the transformer had beaten LightGBM by three Sharpe points. In paper trading, one. In live trading it was negative. They reverted. The LightGBM went back on. Nobody was surprised. I’m writing this for him and for everyone else who has watched a quarter’s worth of research time die on the altar of a more fashionable model.
The argument I’m going to make is unfashionable and, I think, true. On most trading data, at most horizons, with the amount of history most shops actually have, LightGBM is still the model to reach for first. Not out of nostalgia. Not because of hype exhaustion, though there is some of that. Because the specific shape of trading data remains, in 2026, hostile to the deep-learning methods that have eaten most of the rest of machine learning.
1 · The state of tabular ML, honestly
The literature has been quietly consistent for four years, and the honest place to start is by naming it. In 2022 Grinsztajn, Oyallon, and Varoquaux published a paper whose title is itself the entire argument: Why do tree-based models still outperform deep learning on typical tabular data? It went around research Twitter. It produced some grumbling. It has, in the four years since, been confirmed rather than overturned.
In 2023 McElfresh and collaborators ran a bigger benchmark — When do neural nets outperform boosted trees on tabular data? — across one hundred and seventy-six datasets. Their finding: boosted trees beat neural networks on the majority; neural networks only pull ahead when the row count crosses a million with dense numerical features, and even there the gap is often within noise. The intuition that modern neural architectures should eventually dominate tabular data because they dominate everything else has not shown up in the peer-reviewed record.
In the domain we actually care about, the row counts per strategy run somewhere between one thousand and fifty thousand. The feature counts are in the fifty-to-two-hundred range. The signal-to-noise ratio is a bad joke. That is the sweet spot for boosted trees. It is the drought zone for everything that assumes more data is coming.
2 · Why trading data eats neural networks for breakfast
Six reasons, ticked off. None of them is a secret. The secret is that people keep trying anyway.
Low signal-to-noise
A Sharpe of 1.5 is good. On daily data it corresponds to a directional hit rate of only about 55% on symmetric bets: the model is wrong nearly half the time and you still get paid. Neural networks are designed to reduce loss; reducing loss on noise is overfitting. Boosted trees, with shallow learners and early stopping, are structurally more willing to refuse to learn things that aren’t really there.
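For calibration, here is a sketch of that arithmetic, under the simplifying assumption of symmetric one-unit daily bets over 252 trading days (the function name is mine, not a library's):

```python
import math

def annualized_sharpe(hit_rate: float, periods: int = 252) -> float:
    """Sharpe of a strategy making symmetric +/-1 daily bets,
    each correct with probability `hit_rate`."""
    mean = 2 * hit_rate - 1                # expected daily P&L
    std = math.sqrt(1 - mean ** 2)         # std of a +/-1 payoff
    return mean / std * math.sqrt(periods)

print(round(annualized_sharpe(0.55), 2))   # → 1.6
```

The edge per bet that produces an excellent annual Sharpe is tiny, which is exactly the regime where an expressive model will happily fit noise instead.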
Limited history
Five to fifteen years of daily data is a thousand to four thousand rows. Intraday you might get tens of thousands. Deep learning needs orders of magnitude more. A transformer hungry for a million sequence pairs starves on two thousand days of prices.
Non-stationarity
Markets change. Deep nets learn one distribution beautifully; trading data is a sequence of regimes loosely stapled together. Boosting is more forgiving here because no individual shallow tree commits deeply to any particular distributional quirk. The ensemble votes.
Missingness everywhere
Real trading features are littered with NaNs. Corporate actions blank out a column. A newly-listed issuer has no history. A data vendor goes down for a morning. A holiday shifts a whole market’s calendar by a day. LightGBM handles missing values natively — it learns the direction to route them during training. Neural networks require imputation, and imputation on financial data is a second research project in itself.
Heterogeneous feature types
Trading datasets mix numerics (returns, volumes), categoricals (sector, venue, day-of-week), lagged versions of both, rolling aggregates, engineered ratios. Neural nets want everything scaled, embedded, and dense. LightGBM takes the mixed bag as-is and finds the split that helps.
No useful pretraining target
Language models pretrain on text. Vision models pretrain on images. What would you pretrain a trading transformer on? The answer is supposed to be other markets, and the cross-market-generalisation story has been tried for a decade without much to show for it. The TimeGPT / Lag-Llama / Moirai wave is interesting on retail-demand and electricity-load datasets. On financial returns, the published results are not yet convincing.
Put those six together and you have the precise conditions under which LightGBM is hardest to beat. This is not an accident; it is the reason gradient-boosted trees keep winning.
3 · The LightGBM ledger — what actually makes it win
Concrete. Six items, every one of them practitioner-visible.
- Speed. A thousand-trial Optuna sweep over a ten-thousand-row panel with a hundred and fifty features finishes in an afternoon on a single CPU. XGBoost has closed much of the gap, but on wide data LightGBM’s leaf-wise growth still wins by a meaningful factor. The quality-of-life difference is real: you can iterate on feature engineering three times a day instead of once.
- Native missingness. Pass NaNs in and LightGBM learns the direction to route them. No imputation, no missing-indicator hacks, no “fill with the median and hope” anti-patterns. Your feature pipeline is shorter by a file or two.
- Native categorical support. Mark a column categorical and LightGBM handles it via the Fisher (1958) optimal split for categorical features. One-hot expansion for a five-hundred-sector column is a memory problem on XGBoost; LightGBM handles it in-place.
- SHAP is first-class. The TreeSHAP algorithm is fast and exact. Every prediction can be attributed to the features that caused it — which is the single most useful ability to have during a bad drawdown, when the PM is standing over your desk asking why did this fire.
- Tiny artefact. A production LightGBM model is five to fifty megabytes on disk. Loads in milliseconds, predicts in microseconds, runs in a Cloudflare Worker if you are so inclined. The deployment story is almost insultingly simple.
- The hyperparameter surface is charted. A decade of practitioners tuning LightGBM on financial data means you don’t need a research project to tune it. You need an afternoon and Optuna.
Every line on that list is something you feel in your hands every day the project is in flight. None of it shows up on a NeurIPS slide. All of it compounds.
4 · The contenders, rendered fairly
I owe the essay an honest comparison, because there are serious models on the list, and I don’t want to pretend otherwise.
XGBoost
Close sibling. Nearly equivalent in predictive power on most problems, a decade of battle-scars in finance, an enormous deployed footprint. The differences are mostly quality-of-life: LightGBM is faster on big data, better on categoricals, smaller on disk, cleaner on defaults. For a new project in 2026 I reach for LightGBM by default; for a legacy XGBoost pipeline there’s rarely a reason to migrate. Both are correct answers.
CatBoost
The Yandex implementation. Best-in-class categorical handling (ordered boosting plus target encoding with leakage prevention), and often the strongest on heavily-categorical problems. Slower to train than LightGBM, slightly larger artefact, less open-source momentum in 2026. Worth trying as a second model when your feature matrix is mostly categories.
Random forest
A capable baseline that has been outperformed by boosting for a decade. Still useful as a sanity check. Not a production choice in 2026.
Linear models (Ridge, Lasso, ElasticNet)
The underrated baseline every serious workflow should start with. If your LightGBM can’t beat Ridge, your features are the problem, not your model. Ship the Ridge, fix your features, then try LightGBM again.
LSTMs and temporal CNNs
Useful for intraday order-book microstructure signals where the sequence itself carries information beyond what hand-engineered summaries capture, and where the row count is high (millions of events). Overkill for anything daily. Under-tried for anything minute-scale.
Transformers
The modern sequence model. Beautiful on text, vision, and audio. They struggle on financial tabular data for all the reasons §2 has already enumerated. Worth revisiting whenever someone publishes a pretraining scheme that actually transfers across markets. As of this writing, nobody has.
LLMs
Not a direct competitor. Useful as an upstream feature encoder for news, filings, earnings calls, analyst notes — where the LLM summarises or embeds and the LightGBM decides. The fashionable assertion that you can just prompt an LLM to trade is the 2024 version of just use a neural network, and it has aged about as well.
Ranked by default-choice priority on a new trading project in 2026, the list is short: LightGBM, then CatBoost if categoricals dominate, with a linear baseline as the sanity floor. Everything else is a specialist tool for a specialist problem.
5 · The workflow — how to run this without losing a week
I’ll walk through the pipeline I run on every new trading research project. It takes a week to set up the first time. Every subsequent project gets through it in a day.
The baseline, always
Before anything, fit a Ridge. If your features can’t produce a signal a linear model can see, go fix your features. Do not proceed. Half the “our model doesn’t work” tickets I have seen in a decade were feature-engineering problems wearing a modelling problem’s costume.
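A baseline of that shape costs a dozen lines. A sketch on synthetic data, with a chronological split and a weak planted linear signal standing in for real features:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 50))                  # stand-in feature matrix
y = 0.02 * X[:, 0] + rng.normal(0, 0.1, 2000)    # weak planted linear signal

split = 1500                                     # chronological split, never shuffled
baseline = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
baseline.fit(X[:split], y[:split])

pred = baseline.predict(X[split:])
ic = np.corrcoef(pred, y[split:])[0, 1]          # out-of-sample information coefficient
print(f"out-of-sample IC: {ic:.3f}")
```

If the out-of-sample IC of this thing is indistinguishable from zero on your real features, no tree ensemble is going to rescue them.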
Feature engineering matters more than the model
This is the sentence that nobody wants to hear and everyone should tattoo somewhere. Forward log returns as target; lagged log returns; rolling volatilities; rolling correlations; z-scores; sector and industry dummies; day-of-week, month-of-year, quarter-end flags. Use Polars, not pandas — the pandas tax at ten thousand rows and a hundred and fifty features becomes visible quickly. And please, for the love of god, fit anything stateful in your feature pipeline (z-score means, scalers, category encoders) on the training window only, never on the full history. I know this is obvious. I have seen it done wrong in production code at a fund that shall remain nameless.
A default LightGBM config for trading
Don’t over-think the starting point. Here’s what I start every project with:

```python
import lightgbm as lgb

params = {
    "objective": "binary",        # or "regression" for forward-return regression
    "metric": "auc",              # or "rmse"
    "num_leaves": 31,             # shallow is good; 15–127 is the tuning range
    "learning_rate": 0.03,        # slow and stable; pair with many rounds
    "feature_fraction": 0.7,      # column subsampling
    "bagging_fraction": 0.8,      # row subsampling
    "bagging_freq": 5,
    "min_data_in_leaf": 100,      # the knob that matters most on noisy data
    "lambda_l1": 0.1,
    "lambda_l2": 0.1,
    "verbose": -1,
}

model = lgb.train(
    params,
    lgb.Dataset(X_train, label=y_train, categorical_feature=cat_cols),
    num_boost_round=2000,
    valid_sets=[lgb.Dataset(X_val, label=y_val)],
    callbacks=[lgb.early_stopping(100)],
)
```

The one hyperparameter that matters more than the others, on financial data specifically, is `min_data_in_leaf`. Push it higher than you think. A hundred is a decent default. Three hundred is often better. A thousand is rarely too much. The purpose of that knob is to prevent the model from learning noise, and financial data is noisier than you think.
Walk-forward, not cross-validation
K-fold CV is what you reach for on i.i.d. data. You do not have i.i.d. data. You have a time series. Use scikit-learn’s `TimeSeriesSplit` or, better, roll your own purged walk-forward following López de Prado (2018). Train on year one, test on year two. Train on years one and two, test on year three. Purge the overlap between train and test; embargo a few days after test to prevent label leakage from lagged features. If any of this sounds paranoid, it is because every one of those paranoias has cost real money at some fund somewhere.
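There are several reasonable ways to implement this; one minimal sketch of an expanding walk-forward, with a gap between train and test standing in for the purge and embargo:

```python
import numpy as np

def purged_walk_forward(n_rows: int, n_folds: int = 5, gap: int = 10):
    """Expanding-window walk-forward splits. `gap` rows between the end of
    train and the start of test cover both the purge (forward labels at the
    end of train) and the embargo (lagged features at the start of test)."""
    fold = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        test_start = k * fold
        test_end = min((k + 1) * fold, n_rows)
        train_idx = np.arange(0, max(test_start - gap, 0))
        test_idx = np.arange(test_start, test_end)
        yield train_idx, test_idx

splits = list(purged_walk_forward(1200, n_folds=4, gap=10))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx[0], test_idx[-1])
```

This is the simple expanding variant; López de Prado's purged k-fold is more elaborate (train data on both sides of the test block), but the invariant to check is the same: no training row within the gap of any test row.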
Optuna sweep, but modestly
Two hundred trials over `num_leaves`, `learning_rate`, `feature_fraction`, `bagging_fraction`, `min_data_in_leaf`, `lambda_l1`, and `lambda_l2` is enough for almost any tabular trading problem. A thousand-trial sweep rarely moves the needle over a two-hundred-trial one. Extra Sharpe found in the long tail is usually hallucinated.
SHAP, then cull
Fit SHAP on your final model. Rank features by mean absolute SHAP value. Drop the bottom thirty percent. Refit. Gains on out-of-sample are almost always real. This is the most reliable free Sharpe increment I know of.
Sanity checks
Does your Sharpe survive on a different period? A different universe? With transaction costs? With volatility targeting? With position-size constraints? If any of these breaks, your model is more fragile than it looked. Stop and figure out why before you fund it.
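The transaction-cost check in particular is a few lines; a toy sketch with invented positions and a flat per-trade cost assumption:

```python
import numpy as np

def ann_sharpe(returns: np.ndarray, periods: int = 252) -> float:
    return returns.mean() / returns.std() * np.sqrt(periods)

rng = np.random.default_rng(5)
positions = np.sign(rng.normal(size=1000))                   # daily +/-1 positions
asset_ret = 0.001 * positions + rng.normal(0, 0.01, 1000)    # toy edge plus noise

gross = positions * asset_ret
turnover = np.abs(np.diff(positions, prepend=positions[0]))  # 2 units on every flip
cost_bps = 5                                                 # flat cost assumption
net = gross - turnover * cost_bps / 1e4

print(f"gross Sharpe: {ann_sharpe(gross):.2f}, net Sharpe: {ann_sharpe(net):.2f}")
```

A strategy that flips sign every other day pays the cost constantly; it is common for a plausible-looking gross Sharpe to lose half of itself to five basis points of friction.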
That’s the workflow. One week to set up. One day per project thereafter. A lot of quant teams run versions of something very close to it.
6 · Where LightGBM loses, honestly
The essay owes itself a counter-case. LightGBM is not the right answer to every signal. Five shapes where I reach for something else:
- Sequential microstructure. Modelling the flow of an order book at ten-microsecond resolution to predict the next tick. A LightGBM on engineered features will be beaten by a properly-fit LSTM or temporal CNN. The sequence itself carries information no hand-engineered summary captures, and the data volume (billions of events) supports deep-learning training. Use the neural net.
- NLP signals. Earnings call transcripts, 10-K filings, news sentiment, analyst notes. The embedding step is a transformer’s job. What you feed into LightGBM is the pooled embedding, or the topic-weighted vector, or the sentiment score. The stack is transformer-then-LightGBM; the transformer is a feature encoder and LightGBM is the decision layer.
- Image-based signals. Chart-pattern recognition, satellite-derived features (parking-lot counts, oil-storage tank levels, crop density). CNNs encode, LightGBM decides. Same pattern.
- Long-horizon macro reasoning. Modelling how the Fed’s stance evolves given this dot plot and these FOMC minutes. The feature space is narrative, not numeric, and the relevant reasoning is language-like. LLM-assisted workflow, not LightGBM.
- Fast regimes on slow data. If your regime switches weekly and you only have monthly data, no model saves you. Not LightGBM, not transformers, not anything. The problem is that you don’t have enough data to identify the regime, and no amount of model sophistication manufactures missing information.
The defensible split is clear. When the problem is make sense of a tabular feature set, reach for LightGBM. When the problem is encode an unstructured input, reach for the right specialist, then land its output as a column in your LightGBM dataset. Mixing the two inside a single end-to-end model is usually the worst of both worlds.
7 · One observation I rarely say at conferences
The reason LightGBM keeps winning is not technical. It is sociological. Trading data is small, noisy, and non-stationary. The modelling problem is not to build a model expressive enough to fit the signal; it is to build a model humble enough to refuse to fit the noise. Gradient-boosted trees, with their shallow individual learners, their early stopping, their regularisation-as-culture, are almost humble by design. Deep nets are almost arrogant by design. On data where arrogance is punished, humility wins.
That is the observation. It is also, I suspect, why the LightGBM-versus-transformer question keeps getting re-asked and keeps getting answered the same way. The people asking are persuaded by architectural elegance. The market does not care about architectural elegance. The market cares about whether your model knows what it doesn’t know.
“In domains with a lot of noise, the winner is the one who refuses to react.”
LightGBM knows what it doesn’t know. It shrinks toward the mean. It refuses to overreact. It does not chase. In a discipline where the practitioner’s emotional problem is to stop overreacting, stop chasing, stop overfitting to last month’s regime, the model whose inductive bias happens to match the practitioner’s required discipline is the model that keeps winning. That is not an argument you will find in a NeurIPS paper. I think it explains more of the persistence of boosted trees on the frontier of a field otherwise eaten by transformers than any architectural analysis has yet managed.
Train the LightGBM. Respect the walk-forward. Drop the bottom features. Deploy the tiny file. Move on. The research afternoon you save is the research afternoon you will spend on a better feature, which is the thing that was actually going to move the Sharpe.