AI · December 11, 2025 · 14 min read

How Neural Networks Use Elegant Mathematics

From the chain rule to attention — the quiet equations that make modern AI work.

Hasan Javed
Senior Full-Stack & AI Engineer

Strip away the GPUs, the hype cycles, and the cover stories about artificial general intelligence, and what you find inside a modern neural network is surprisingly small. A dot product. A nonlinearity. A gradient. The chain rule, applied with a patience no human mathematician would tolerate. That’s roughly it. The elegance isn’t in the size of the math — it’s in how far a few calmly chosen equations can be pushed.

This is a tour of the equations that actually run the show. I’ve tried to render them as they deserve — not as screenshots, not as vague hand-waves, but as real LaTeX — because the math is the argument, and skipping it means missing the point.

The neuron: a dot product and a squish

The atom of modern AI is the artificial neuron. It takes a vector of inputs $x$, multiplies each input by a learned weight $w_i$, sums the result, adds a bias $b$, and runs the whole number through a nonlinear function $\sigma$:

$$a = \sigma\left(\sum_i w_i x_i + b\right) = \sigma(w \cdot x + b)$$

That is the entire operation.

One neuron. Everything else is composition.

Two choices of $\sigma$ dominate modern networks: the rectified linear unit and its softer siblings.

$$\mathrm{ReLU}(z) = \max(0, z), \qquad \mathrm{GELU}(z) = z\,\Phi(z)$$

ReLU is the workhorse; GELU, which weights z by the Gaussian CDF Φ, has quietly taken over most transformer stacks.

Both are chosen for the same reason: they are nonlinear enough to escape the straitjacket of pure linear algebra, and their derivatives are cheap enough to compute a billion times per second. Elegance and efficiency, selected for together.
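To make the neuron concrete, here is a minimal NumPy sketch. The function names (`relu`, `gelu`, `neuron`) and the weight values are mine, chosen for illustration; the GELU is the exact $z\,\Phi(z)$ form via the error function, not a library call:

```python
import numpy as np
from math import erf, sqrt

def relu(z):
    # ReLU: max(0, z), elementwise
    return np.maximum(0.0, z)

def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF
    phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
    return z * phi(z)

def neuron(x, w, b, activation=relu):
    # One neuron: dot product, bias, nonlinearity
    return activation(np.dot(w, x) + b)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])  # illustrative learned weights
print(neuron(x, w, b=0.2))      # relu(w·x + b) ≈ 0.1
```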

The layer: linear algebra doing the heavy lifting

Stack many neurons in parallel and the operation becomes a matrix multiplication. A layer takes an input vector, applies a weight matrix, adds a bias vector, and runs every component through the nonlinearity:

$$a = \sigma(Wx + b)$$

A single layer — the same equation on every GPU in the world, running trillions of times per second.

Stack layers and you get a deep network. The forward pass of an $L$-layer network is just function composition:

$$f(x) = f_L\left(f_{L-1}\left(\cdots f_2\left(f_1(x)\right)\cdots\right)\right)$$

Deep learning, in one line. The parentheses do all the work.
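The composition is a few lines of NumPy. This is a shape-level sketch with random, untrained weights — the layer widths and names are arbitrary:

```python
import numpy as np

def layer(a, W, b):
    # One layer: elementwise ReLU of (W a + b)
    return np.maximum(0.0, W @ a + b)

def forward(x, params):
    # A deep network's forward pass is literally a fold of layers
    a = x
    for W, b in params:
        a = layer(a, W, b)
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]  # layer widths, illustrative
params = [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = forward(rng.standard_normal(4), params)
print(y.shape)  # (2,)
```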

That’s the architecture. The subtle part — the part that was famously hard for thirty years — is learning the weights.

The loss: a single number that means “wrong”

Training is optimization. To optimize, we need a number to minimize. That number is the loss function, and the art of choosing it is half of machine learning.

For regression, mean squared error is standard:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$

Mean squared error — the loss you fall back to when you don't have a better idea.

For classification across $C$ classes, with the model’s output passed through softmax, the cross-entropy loss is the honest answer:

$$\hat{y}_c = \frac{e^{z_c}}{\sum_{c'=1}^{C} e^{z_{c'}}}, \qquad \mathcal{L}_{\text{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

Softmax turns logits into probabilities; cross-entropy penalizes confident wrongness much more than hesitant wrongness.

Both losses have a property that makes everything else possible: they are differentiable. Every time the model produces an output, we can compute $\mathcal{L}$, and — crucially — we can ask how $\mathcal{L}$ would change if we nudged any individual parameter.

Gradient descent: the entire learning algorithm

The loss is a function of the parameters $\theta$. We want to find the $\theta$ that makes it small. The gradient $\nabla_\theta \mathcal{L}$ points in the direction of steepest increase, so we step in the opposite direction:

$$\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}(\theta)$$

Stochastic gradient descent — arguably the most consequential single equation of the last twenty years.

The scalar $\eta$ is the learning rate. Too large, and the optimizer overshoots and oscillates; too small, and training crawls. Modern optimizers like Adam keep a running estimate of the gradient’s first and second moments and adjust the effective step size per parameter:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$$

Adam's moment estimates — exponentially-weighted running averages of the gradient and its square.

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Adam's update. The division by $\sqrt{\hat{v}_t}$ normalizes away the per-parameter scale — which matters enormously in deep networks.
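One Adam step fits in a few lines. A sketch with the standard default hyperparameters; the function name and the toy objective ($f(\theta) = \theta^2$) are mine:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update; t is the 1-indexed step count
    m = b1 * m + (1 - b1) * g          # first moment (running mean of gradients)
    v = b2 * v + (1 - b2) * g * g      # second moment (running mean of squares)
    m_hat = m / (1 - b1 ** t)          # bias-correct the cold-start phase
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = theta^2, whose gradient is 2*theta
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.02)
print(theta)  # hovers near the minimum at 0
```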

But all of this only works if we can actually compute $\nabla_\theta \mathcal{L}$. For a network with hundreds of billions of parameters and dozens of layers, that sounds impossible. It isn’t. The solution is one of the most elegant recursive algorithms in mathematics.

Backpropagation: the chain rule, taken seriously

The chain rule from elementary calculus says that if $y = f(g(x))$, then

$$\frac{dy}{dx} = f'(g(x))\,g'(x)$$

The humble chain rule — Leibniz, 1676.

A neural network is a deeply nested composition of differentiable operations. Backpropagation is what happens when you apply the chain rule to a composition $L$ layers deep and refuse to flinch.

Write the pre-activation of layer $l$ as $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$ and the activation as $a^{(l)} = \sigma(z^{(l)})$. Define the error signal at layer $l$ as the sensitivity of the loss to the pre-activation:

$$\delta^{(l)} = \frac{\partial \mathcal{L}}{\partial z^{(l)}}$$

The δ trick — the quantity backprop propagates backwards through the network.

At the final layer, this is computed directly from the loss. The magic is in the recursion:

$$\delta^{(l)} = \left(\left(W^{(l+1)}\right)^{\top} \delta^{(l+1)}\right) \odot \sigma'\!\left(z^{(l)}\right)$$

Backprop's recurrence relation — δ at one layer is determined by δ at the next layer, pulled back through the transpose of the weights.

And from $\delta^{(l)}$, the gradients with respect to the actual parameters fall out in two lines:

$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}$$

Weight and bias gradients — a single outer product per layer. The entire training signal, in two equations.
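The recurrence is short enough to implement and check against finite differences. A sketch for a two-layer ReLU network with a squared-error loss; all names and shapes are mine, chosen for illustration:

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward: z1 = W1 x + b1, a1 = relu(z1), z2 = W2 a1 + b2
    z1 = W1 @ x + b1
    a1 = np.maximum(0.0, z1)
    z2 = W2 @ a1 + b2
    loss = 0.5 * np.sum((z2 - y) ** 2)
    # Backward: delta at the output, then pulled back through W2 transpose
    d2 = z2 - y                              # delta at the last layer
    d1 = (W2.T @ d2) * (z1 > 0)              # recurrence, with ReLU's derivative
    grads = {"W2": np.outer(d2, a1), "b2": d2,
             "W1": np.outer(d1, x),  "b1": d1}
    return loss, grads

rng = np.random.default_rng(1)
x, y = rng.standard_normal(3), rng.standard_normal(2)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)
loss, grads = forward_backward(x, y, W1, b1, W2, b2)

# Finite-difference check on one weight: nudge it, remeasure the loss
eps = 1e-6
W2p = W2.copy(); W2p[0, 0] += eps
lp, _ = forward_backward(x, y, W1, b1, W2p, b2)
print(abs((lp - loss) / eps - grads["W2"][0, 0]))  # difference should be tiny
```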

Attention: a dot product, reweighted

Transformers, the architecture behind GPT, Claude, and essentially every modern language model, replaced sequential recurrence with a single, devastatingly simple operation: attention. It asks, for each position in a sequence, which other positions should I be looking at, and how much?

Given queries $Q$, keys $K$, and values $V$ — three learned projections of the input — scaled dot-product attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$

Vaswani et al., 2017 — one equation that replaced a decade of recurrent-network research.

Read this slowly. $QK^{\top}$ computes a similarity score between every query and every key. Dividing by $\sqrt{d_k}$ keeps the scores from blowing up as the key dimension grows. Softmax turns them into a probability distribution per query. Multiplying by $V$ takes a weighted average of the values, with the weights being “how much query $i$ wants to pay attention to position $j$.”

That’s it. That single formula, applied in parallel over many heads, stacked in dozens of layers, trained on most of the text humans have ever written, is what you’re talking to when you talk to a modern language model.
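The whole formula fits in a few lines of NumPy. A single-head sketch over a toy sequence; the sequence length and dimensions are arbitrary:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scores[i, j] = similarity of query i to key j, scaled by sqrt(d_k)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)       # one distribution per query
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))   # 5 positions, d_k = 8
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 16))
out, w = attention(Q, K, V)
print(out.shape)  # (5, 16): one mixed value vector per position
```

Each row of `w` sums to 1 — the softmax guarantee that makes the output a weighted average rather than an unbounded sum.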

Multi-head attention

Rather than computing one attention pattern, transformers compute $h$ of them in parallel and concatenate:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\right)$$

Multi-head attention — running several attention computations in parallel and mixing the results.

Each head gets its own learned projection of $Q$, $K$, and $V$. Different heads learn to attend to different things — some to syntax, some to coreference, some to long-range dependencies nobody has cleanly interpreted yet. The model is not programmed to specialize; it specializes because the gradient tells it to.
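A shape-level sketch of multi-head attention, with random untrained projections standing in for the learned ones: enough to see the per-head projections, the parallel attention calls, and the final concatenation-and-mix.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head(X, heads, d_model):
    # Each head gets its own projections (random here, for shape only)
    d_k = d_model // heads
    outs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        outs.append(attention(X @ Wq, X @ Wk, X @ Wv))
    W_o = rng.standard_normal((heads * d_k, d_model))
    return np.concatenate(outs, axis=-1) @ W_o   # concat heads, then mix

X = rng.standard_normal((6, 32))  # 6 positions, d_model = 32
out = multi_head(X, heads=4, d_model=32)
print(out.shape)  # (6, 32): same shape in, same shape out
```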

A worked example: softmax’s derivative

To convince yourself these equations aren’t magic, it’s worth deriving one by hand. The softmax function is used at the output of nearly every classifier and inside every attention head. Its derivative — which backprop needs — is unusually clean.

Start from the definition:

$$s_i = \frac{e^{z_i}}{\sum_{k} e^{z_k}}$$

Differentiate with respect to $z_j$. After a careful application of the quotient rule you land at:

$$\frac{\partial s_i}{\partial z_j} = s_i\left(\delta_{ij} - s_j\right)$$

The Jacobian of softmax — where δᵢⱼ is the Kronecker delta, 1 if i=j and 0 otherwise.

Combine this with the cross-entropy loss and a miracle occurs: the messy-looking composition collapses into

$$\frac{\partial \mathcal{L}}{\partial z_i} = s_i - y_i$$

where $s$ is the softmax output and $y$ the one-hot target. The loss gradient at the output layer — just (prediction − target). Deep learning frameworks rely on this simplification to avoid numerical instability.
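You can check both the Jacobian and the collapse numerically in a few lines; the values of `z` here are arbitrary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
s = softmax(z)

# Jacobian: ds_i/dz_j = s_i * (delta_ij - s_j)
J = np.diag(s) - np.outer(s, s)

# Sanity check one entry against a finite difference
eps = 1e-6
zp = z.copy(); zp[1] += eps
num = (softmax(zp)[0] - s[0]) / eps        # numerical ds_0/dz_1

# Chain rule through cross-entropy with one-hot target y:
# dL/ds_i = -y_i / s_i, then dL/dz = J^T (dL/ds)
y = np.array([0.0, 1.0, 0.0])
grad = J.T @ (-y / s)
print(np.allclose(grad, s - y))  # the miracle collapse
```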

Regularization: the geometry of not overfitting

A network with millions of parameters will happily memorize its training set. Regularization is the mathematics of preventing that. The simplest form adds a penalty on the weight norm:

$$\mathcal{L}_{\text{total}} = \mathcal{L} + \lambda \sum_{i} w_i^2$$

L2 regularization — a.k.a. weight decay. Encourages the model to use small weights unless the data really demands otherwise.

Dropout, meanwhile, randomly zeros out a fraction $p$ of activations during training, so the network cannot rely too heavily on any single path. Its expectation-preserving form scales the surviving activations:

$$\tilde{a}_i = \frac{m_i\,a_i}{1 - p}, \qquad m_i \sim \mathrm{Bernoulli}(1 - p)$$

Inverted dropout — the noise injection trick that forces networks to spread their bets.
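Both techniques are a couple of lines each. A sketch: the division by $1-p$ is what makes dropout "inverted" — the expected activation is unchanged, so no rescaling is needed at test time.

```python
import numpy as np

def l2_penalty(weights, lam):
    # lambda * sum of squared weights, added to the data loss
    return lam * sum(np.sum(W ** 2) for W in weights)

def inverted_dropout(a, p, rng):
    # Zero each activation with probability p; rescale survivors by 1/(1-p)
    # so that E[output] == input, train and test agree in expectation.
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(100_000)
dropped = inverted_dropout(a, p=0.5, rng=rng)
print(dropped.mean())  # close to 1.0, despite half the units being zeroed
```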

Both of these are techniques to nudge the parameter space so that the minimum the optimizer finds is flat — that is, surrounded by other nearly-as-good parameter settings. Flat minima generalize. Sharp minima memorize. Most of the deep-learning tricks of the last decade are, underneath, different ways of preferring flat over sharp.

Why the math has to be elegant

You could, in principle, build neural networks with uglier mathematics — non-differentiable activations, non-convex losses with no structure, update rules that don’t come from a gradient. People tried. None of it scaled.

What makes the current architecture work at planetary scale is that every piece — matrix multiplication, softmax, attention, cross-entropy, the chain rule — has two properties at once:

  • A clean closed-form derivative. Without this, backpropagation is impossible and training a billion parameters is a fantasy.
  • A parallelizable computational structure. Every one of these operations is a matrix operation that maps cleanly onto GPU and TPU hardware.

Elegance, in this context, is not decorative. It’s load-bearing. The reason modern AI works is that the mathematics it rests on is simple enough to differentiate automatically, structured enough to run on specialized silicon, and expressive enough to represent nearly any function we care about. The surprise isn’t that deep learning works. It’s that the math turned out to be so small.

The purpose of computing is insight, not numbers.
Richard Hamming, 1962

The insight at the bottom of all of this is, I think, that intelligence — the useful, measurable, engineerable kind — is a composition of differentiable functions fit to data by gradient descent. That might turn out to be incomplete. It might turn out to be wrong. But it is, right now, the most mathematically elegant theory of learning we have ever been able to run at scale. The equations above are what that elegance looks like, rendered honestly, with no marketing layer on top.

#ai #deep-learning #backpropagation #attention #calculus #mathematics #linear-algebra

If this resonated, let's talk.

I help startups ship production-grade systems — fintech, AI, high-throughput APIs — from MVP to 100K users. If something here sparked an idea for your stack, I'd be glad to hear it.
