Strip away the GPUs, the hype cycles, and the cover stories about artificial general intelligence, and what you find inside a modern neural network is surprisingly small. A dot product. A nonlinearity. A gradient. The chain rule, applied with a patience no human mathematician would tolerate. That’s roughly it. The elegance isn’t in the size of the math — it’s in how far a few calmly-chosen equations can be pushed.
This is a tour of the equations that actually run the show. I’ve tried to render them as they deserve — not as screenshots, not as vague hand-waves, but as real LaTeX — because the math is the argument, and skipping it means missing the point.
The neuron: a dot product and a squish
The atom of modern AI is the artificial neuron. It takes a vector of inputs $\mathbf{x}$, multiplies each input by a learned weight, sums the result, adds a bias, and runs the whole number through a nonlinear function $\sigma$. That is the entire operation:

$$a = \sigma\!\left(\mathbf{w}^\top \mathbf{x} + b\right) = \sigma\!\left(\sum_i w_i x_i + b\right)$$
Two choices of $\sigma$ dominate modern networks: the rectified linear unit and its softer siblings.

$$\mathrm{ReLU}(z) = \max(0, z), \qquad \mathrm{GELU}(z) = z\,\Phi(z)$$

Here $\Phi$ is the standard normal CDF; GELU is the smooth variant used in most transformers.
Both are chosen for the same reason: they are nonlinear enough to escape the straitjacket of pure linear algebra, and their derivatives are cheap enough to compute a billion times per second. Elegance and efficiency, selected for together.
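As a sanity check, the whole neuron fits in a few lines of NumPy. This is an illustrative sketch (the names `relu` and `neuron` are mine, not from any library):

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(0, z), elementwise."""
    return np.maximum(0.0, z)

def neuron(x, w, b):
    """One artificial neuron: weighted sum, plus bias, through a nonlinearity."""
    return relu(np.dot(w, x) + b)

# A neuron with three inputs.
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.25, 1.0])
b = 0.1
# 0.5*1 - 0.25*2 + 1.0*(-1) + 0.1 = -0.9, which ReLU clips to 0.0
a = neuron(x, w, b)
```

The dot product does all the work; the nonlinearity just decides what survives.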
The layer: linear algebra doing the heavy lifting
Stack many neurons in parallel and the operation becomes a matrix multiplication. A layer takes an input vector, applies a weight matrix, adds a bias vector, and runs every component through the nonlinearity:

$$\mathbf{h} = \sigma\!\left(W\mathbf{x} + \mathbf{b}\right)$$
Stack layers and you get a deep network. The forward pass of an $L$-layer network is just function composition:

$$f(\mathbf{x}) = \left(f_L \circ f_{L-1} \circ \cdots \circ f_1\right)(\mathbf{x}), \qquad f_\ell(\mathbf{z}) = \sigma\!\left(W_\ell \mathbf{z} + \mathbf{b}_\ell\right)$$
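The composition above is a short loop in code. A minimal sketch, with ReLU layers and randomly initialized weights (the layer sizes are arbitrary):

```python
import numpy as np

def layer(x, W, b):
    """One layer: matrix multiply, add bias, elementwise ReLU."""
    return np.maximum(0.0, W @ x + b)

def forward(x, params):
    """Forward pass: compose the layers left to right."""
    for W, b in params:
        x = layer(x, W, b)
    return x

rng = np.random.default_rng(0)
# A toy 3-layer network mapping 4 inputs to 2 outputs: 4 -> 8 -> 8 -> 2.
sizes = [4, 8, 8, 2]
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = forward(rng.standard_normal(4), params)
```

Every real framework is, at its core, this loop with better bookkeeping.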
That’s the architecture. The subtle part — the part that was famously hard for thirty years — is learning the weights.
The loss: a single number that means “wrong”
Training is optimization. To optimize, we need a number to minimize. That number is the loss function, and the art of choosing it is half of machine learning.
For regression, mean squared error is standard:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
For classification across $K$ classes, with the model’s output passed through softmax, the cross-entropy loss is the honest answer:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$

where $y$ is the one-hot target and $\hat{y}$ is the softmax output.
Both losses have a property that makes everything else possible: they are differentiable. Every time the model produces an output, we can compute $\mathcal{L}$, and — crucially — we can ask how $\mathcal{L}$ would change if we nudged any individual parameter.
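Both losses are a few lines each. A sketch (the log-sum-exp shift in `cross_entropy` is the standard trick for numerical stability, not part of the definition):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error over N predictions."""
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y_onehot, logits):
    """Cross-entropy of softmax(logits) against a one-hot target."""
    z = logits - logits.max()              # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -np.sum(y_onehot * log_probs)
```

A useful check: with uniform logits over $K$ classes, cross-entropy equals $\log K$, the loss of pure guessing.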
Gradient descent: the entire learning algorithm
The loss is a function of the parameters $\theta$. We want to find the $\theta$ that makes it small. The gradient $\nabla_\theta \mathcal{L}$ points in the direction of steepest increase, so we step in the opposite direction:

$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta \mathcal{L}(\theta_t)$$
The scalar $\eta$ is the learning rate. Too large, and the optimizer overshoots and oscillates; too small, and training crawls. Modern optimizers like Adam keep a running estimate of the gradient’s first and second moments and adjust the effective step size per parameter:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $g_t = \nabla_\theta \mathcal{L}(\theta_t)$.
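The Adam update above translates directly into code. A minimal sketch of a single update step, applied here to the toy problem of minimizing $\theta^2$ (the function and hyperparameters are chosen only for illustration):

```python
import numpy as np

def adam_step(theta, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: running first/second moments, bias-corrected."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * g            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g ** 2       # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)            # bias correction for the warm-up
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# Minimize f(theta) = theta^2 starting from theta = 3; the gradient is 2*theta.
theta = np.array(3.0)
state = (np.zeros(()), np.zeros(()), 0)
for _ in range(5000):
    theta, state = adam_step(theta, 2 * theta, state, lr=0.05)
```

The per-parameter division by $\sqrt{\hat{v}_t}$ is what lets one learning rate serve billions of parameters with wildly different gradient scales.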
But all of this only works if we can actually compute $\nabla_\theta \mathcal{L}$. For a network with hundreds of billions of parameters and dozens of layers, that sounds impossible. It isn’t. The solution is one of the most elegant recursive algorithms in mathematics.
Backpropagation: the chain rule, taken seriously
The chain rule from elementary calculus says that if $y = f(g(x))$, then

$$\frac{dy}{dx} = f'(g(x))\, g'(x)$$
A neural network is a deeply nested composition of differentiable operations. Backpropagation is what happens when you apply the chain rule to a composition $L$ layers deep and refuse to flinch.
Write the pre-activation of layer $\ell$ as $\mathbf{z}^{(\ell)} = W^{(\ell)}\mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}$ and the activation as $\mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)})$. Define the error signal at layer $\ell$ as the sensitivity of the loss to the pre-activation:

$$\boldsymbol{\delta}^{(\ell)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}}$$
At the final layer, this is computed directly from the loss. The magic is in the recursion:

$$\boldsymbol{\delta}^{(\ell)} = \left(W^{(\ell+1)}\right)^{\!\top} \boldsymbol{\delta}^{(\ell+1)} \odot \sigma'\!\left(\mathbf{z}^{(\ell)}\right)$$
And from $\boldsymbol{\delta}^{(\ell)}$, the gradients with respect to the actual parameters fall out in two lines:

$$\frac{\partial \mathcal{L}}{\partial W^{(\ell)}} = \boldsymbol{\delta}^{(\ell)} \left(\mathbf{a}^{(\ell-1)}\right)^{\!\top}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}$$
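The recursion and the two gradient lines can be written out by hand for a two-layer network. A sketch, using squared error $\tfrac{1}{2}\|\mathbf{z}^{(2)} - \mathbf{y}\|^2$ with a linear output layer so the final delta is simply the residual:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def backprop(x, y, W1, b1, W2, b2):
    """Manual backprop through a 2-layer network with squared-error loss.

    Returns (dW2, db2, dW1, db1)."""
    # Forward pass, caching the pre-activations.
    z1 = W1 @ x + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2                  # linear output layer
    # Final-layer delta: dL/dz2 for L = 0.5 * ||z2 - y||^2.
    d2 = z2 - y
    # The recursion: push the error back through W2 and the ReLU derivative.
    d1 = (W2.T @ d2) * (z1 > 0)
    # The gradients fall out as outer products with the cached activations.
    return d2[:, None] * a1[None, :], d2, d1[:, None] * x[None, :], d1
```

Note what the recursion needs: only the transposed weights and the cached forward values. That is why backprop costs roughly one extra forward pass, not one pass per parameter.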
Attention: a dot product, reweighted
Transformers, the architecture behind GPT, Claude, and essentially every modern language model, replaced sequential recurrence with a single, devastatingly simple operation: attention. It asks, for each position in a sequence, which other positions should I be looking at, and how much?
Given queries $Q$, keys $K$, and values $V$ — three learned projections of the input — scaled dot-product attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
Read this slowly. $QK^\top$ computes a similarity score between every query and every key. Dividing by $\sqrt{d_k}$ keeps the scores from blowing up as the key dimension grows. Softmax turns them into a probability distribution per query. Multiplying by $V$ is a weighted average of the values, with the weights being “how much query $i$ wants to pay attention to position $j$.”
That’s it. That single formula, applied in parallel over many heads, stacked in dozens of layers, trained on most of the text humans have ever written, is what you’re talking to when you talk to a modern language model.
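The formula is three lines of NumPy. A single-head sketch, without batching or masking:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over a sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query with every key
    weights = softmax(scores, axis=-1)  # one probability distribution per query
    return weights @ V                  # weighted average of the values
```

A nice degenerate case to check intuition: if all the scores are equal (say, $Q = 0$), every query attends uniformly and the output is just the mean of the values.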
Multi-head attention
Rather than computing one attention pattern, transformers compute $h$ of them in parallel and concatenate:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^O, \qquad \mathrm{head}_i = \mathrm{Attention}\!\left(QW_i^Q,\, KW_i^K,\, VW_i^V\right)$$
Each head gets its own learned projection of , , and . Different heads learn to attend to different things — some to syntax, some to coreference, some to long-range dependencies nobody has cleanly interpreted yet. The model is not programmed to specialize; it specializes because the gradient tells it to.
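The multi-head formula, sketched for self-attention (queries, keys, and values all projected from the same input $X$; the dimensions here are arbitrary, chosen so that $h \cdot d_k = d_{\text{model}}$ as in the original transformer):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head(X, heads, W_O):
    """Self-attention with h heads; `heads` is a list of (W_Q, W_K, W_V)."""
    outs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outs, axis=-1) @ W_O   # concat, then output projection

rng = np.random.default_rng(0)
d_model, h, d_k = 16, 4, 4                       # 4 heads of width 4
heads = [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))
         for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))
out = multi_head(rng.standard_normal((6, d_model)), heads, W_O)
```

Production implementations fuse the per-head projections into one matrix multiply for speed, but the math is identical.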
A worked example: softmax’s derivative
To convince yourself these equations aren’t magic, it’s worth deriving one by hand. The softmax function is used at the output of nearly every classifier and inside every attention head. Its derivative — which backprop needs — is unusually clean.
Start from the definition:

$$\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{k} e^{z_k}}$$
Differentiate with respect to $z_j$. After a careful application of the quotient rule you land at:

$$\frac{\partial\, \mathrm{softmax}(\mathbf{z})_i}{\partial z_j} = \mathrm{softmax}(\mathbf{z})_i \left(\delta_{ij} - \mathrm{softmax}(\mathbf{z})_j\right)$$

where $\delta_{ij}$ is the Kronecker delta.
Combine this with the cross-entropy loss and a miracle occurs: the messy-looking composition collapses into

$$\frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial z_j} = \hat{y}_j - y_j$$

The gradient is just the prediction minus the target.
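If you don’t trust the algebra, the computer will check it for you. A finite-difference verification of the $\hat{y} - y$ identity on random logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.standard_normal(5)
y = np.eye(5)[2]                       # one-hot target, class 2

analytic = softmax(z) - y              # the collapsed gradient

def loss(z):
    """Cross-entropy of softmax(z) against the one-hot target y."""
    return -np.sum(y * np.log(softmax(z)))

# Central finite differences along each coordinate.
eps = 1e-6
numeric = np.array([(loss(z + eps * e) - loss(z - eps * e)) / (2 * eps)
                    for e in np.eye(5)])
```

The two gradients agree to the precision of the finite-difference scheme, which is as close to a proof as arithmetic gets.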
Regularization: the geometry of not overfitting
A network with millions of parameters will happily memorize its training set. Regularization is the mathematics of preventing that. The simplest form adds a penalty on the weight norm:

$$\mathcal{L}_{\mathrm{reg}} = \mathcal{L} + \lambda \sum_i w_i^2$$
Dropout, meanwhile, randomly zeros out a fraction $p$ of activations during training, so the network cannot rely too heavily on any single path. Its expectation-preserving form scales the surviving activations:

$$\tilde{a}_i = \frac{m_i\, a_i}{1 - p}, \qquad m_i \sim \mathrm{Bernoulli}(1 - p)$$
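The scaling by $1/(1-p)$ is what keeps the expected activation unchanged, so no rescaling is needed at inference time. A minimal sketch of this “inverted dropout” form:

```python
import numpy as np

def dropout(a, p, rng, training=True):
    """Inverted dropout: zero a fraction p, rescale survivors by 1/(1-p)."""
    if not training:
        return a                        # identity at inference time
    mask = rng.random(a.shape) >= p     # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)
```

Averaged over many random masks, the output matches the input, which is exactly the expectation-preserving property the formula promises.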
Both of these are techniques to nudge the parameter space so that the minimum the optimizer finds is flat — that is, surrounded by other nearly-as-good parameter settings. Flat minima generalize. Sharp minima memorize. Most of the deep-learning tricks of the last decade are, underneath, different ways of preferring flat over sharp.
Why the math has to be elegant
You could, in principle, build neural networks with uglier mathematics — non-differentiable activations, non-convex losses with no structure, update rules that don’t come from a gradient. People tried. None of it scaled.
What makes the current architecture work at planetary scale is that every piece — matrix multiplication, softmax, attention, cross-entropy, the chain rule — has two properties at once:
- A clean closed-form derivative. Without this, backpropagation is impossible and training a billion parameters is a fantasy.
- A parallelizable computational structure. Every one of these operations is a matrix operation that maps cleanly onto GPU and TPU hardware.
Elegance, in this context, is not decorative. It’s load-bearing. The reason modern AI works is that the mathematics it rests on is simple enough to differentiate automatically, structured enough to run on specialized silicon, and expressive enough to represent nearly any function we care about. The surprise isn’t that deep learning works. It’s that the math turned out to be so small.
“The purpose of computing is insight, not numbers.” (Richard Hamming)
The insight at the bottom of all of this is, I think, that intelligence — the useful, measurable, engineerable kind — is a composition of differentiable functions fit to data by gradient descent. That might turn out to be incomplete. It might turn out to be wrong. But it is, right now, the most mathematically elegant theory of learning we have ever been able to run at scale. The equations above are what that elegance looks like, rendered honestly, with no marketing layer on top.