Minds Made of Math
A History of Neural Networks — from McCulloch & Pitts to the Transformer Era
The Biological Spark
1943 – 1956 · Dawn Era
The core question: Can biological intelligence be reduced to mathematical logic?
In 1943, McCulloch & Pitts proposed a formal neuron that fires based on a threshold of weighted inputs.
This structure—inputs, weights, and sum—remains the mathematical skeleton of all modern AI.
The Formal Neuron
y = 1 if Σ wᵢ xᵢ ≥ θ
y = 0 otherwise
1943 McCulloch & Pitts paper
1949 Hebbian Learning: connections strengthen when the units they join fire together (Δwᵢⱼ ∝ xᵢ xⱼ)
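A minimal sketch in Python of the 1943 threshold unit, together with an illustrative Hebbian update. The AND weights, the threshold, and the learning rate eta are assumptions chosen for demonstration, not values from the original papers.

# A McCulloch-Pitts threshold neuron: fires (1) when the weighted
# sum of inputs reaches the threshold theta, otherwise stays silent (0).
def mcp_neuron(inputs, weights, theta):
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= theta else 0

# With weights (1, 1) and theta = 2 (assumed), the unit computes logical AND.
print(mcp_neuron([1, 1], [1, 1], theta=2))  # -> 1
print(mcp_neuron([1, 0], [1, 1], theta=2))  # -> 0

# Hebbian update (1949): strengthen a connection when the two units it
# joins are active together. eta is an assumed learning rate.
def hebbian_update(w, pre, post, eta=0.1):
    return w + eta * pre * post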
The Perceptron: Linear Geometry
1957 – 1969 · The First Wave
Frank Rosenblatt (1958) built on the formal neuron to create the Perceptron, adding a learning rule.
The Geometry of Decision: A single neuron defines a linear hyperplane in n-dimensional space. This flat boundary attempts to partition the data into two distinct regions.
Boundary Equation w · x + b = 0
Learning Rule Δw = η(y - ŷ)x
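A short sketch of Rosenblatt's rule in Python, run on the linearly separable AND problem; the learning rate, epoch count, and dataset are assumptions for this toy demonstration.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # AND: separable by a single line

w = np.zeros(2)
b = 0.0
eta = 0.1

for _ in range(20):                      # a few passes over the data
    for xi, yi in zip(X, y):
        y_hat = 1 if w @ xi + b >= 0 else 0
        w += eta * (yi - y_hat) * xi     # Δw = η(y - ŷ)x
        b += eta * (yi - y_hat)          # the bias gets the same correction

print(w, b)  # parameters of a separating hyperplane w·x + b = 0

The same loop never settles on XOR, which is exactly the wall the next section describes.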
The XOR Wall & the First Winter
1969 – 1980 · First AI Winter
In 1969, Minsky & Papert proved that single-layer perceptrons cannot solve problems that are not linearly separable.
The XOR Problem: XOR requires a non-linear boundary. In a 2D space, no single straight line can separate (0,0) and (1,1) from (0,1) and (1,0).
XOR Logic Truth Table
x₁ x₂ | y
------+---
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 0
(No single straight line can divide the 1s from the 0s)
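No single-layer machine can fit this table, but two layers of the same threshold units can. The weights below are hand-chosen for illustration, exploiting the identity XOR = OR AND NAND; they are not learned.

def step(z):
    return 1 if z >= 0 else 0

# Hidden unit h1 computes OR, h2 computes NAND; the output unit
# computes h1 AND h2, which is exactly XOR.
def xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # OR
    h2 = step(-x1 - x2 + 1.5)   # NAND
    return step(h1 + h2 - 1.5)  # AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))  # reproduces the truth table above

The catch, and the reason for the winter: in 1969 no one knew how to learn those hidden weights automatically.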
Backpropagation & the Chain Rule
1986 – 1998 · The Second Wave
The breakthrough came with the ability to assign “credit” for errors to weights buried in deep layers. Backpropagation (1986) treats the network as a computational graph and uses the multivariate chain rule to compute the gradient of the loss with respect to every weight.
Gradient Flow Calculation ∂L/∂w = (∂L/∂y) · (∂y/∂z) · (∂z/∂w)
Optimization thereby becomes a formal calculus procedure: weights are updated via Gradient Descent, w ← w - η∇L.
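A one-neuron sketch in Python that makes the three chain-rule factors explicit; the squared-error loss, input, target, and learning rate are assumed toy values.

import math

# One sigmoid neuron: z = w*x, y = σ(z), loss L = (y - t)².
# The three factors below mirror ∂L/∂w = (∂L/∂y)·(∂y/∂z)·(∂z/∂w).
x, t = 1.0, 0.0          # input and target (assumed)
w, eta = 0.5, 0.1        # initial weight and learning rate (assumed)

for i in range(3):
    z = w * x
    y = 1 / (1 + math.exp(-z))    # forward pass
    L = (y - t) ** 2

    dL_dy = 2 * (y - t)           # ∂L/∂y
    dy_dz = y * (1 - y)           # ∂y/∂z, the sigmoid derivative
    dz_dw = x                     # ∂z/∂w
    grad = dL_dy * dy_dz * dz_dw  # the chain rule, multiplied out

    w = w - eta * grad            # gradient descent: w ← w - η∇L
    print(f"step {i}: L={L:.4f}, grad={grad:.4f}, w={w:.4f}")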
The Activation Wall: Sigmoid vs. ReLU
1990s – 2012 · Vanishing Gradients
The Saturation Trap
Early deep networks used Sigmoid activations. The sigmoid’s derivative peaks at only 0.25 (at z = 0) and decays toward 0 as inputs move away from zero.
When multiplying these small gradients across many layers via the chain rule, the mathematical signal vanishes. The bottom layers simply stop learning.
The ReLU Solution
ReLU (max(0, x)) has a constant derivative of 1 for all positive values. This allows gradients to travel backward through hundreds of layers without shrinking.
Sigmoid’(z) ∈ (0, 0.25]
ReLU’(z) = 1 for z > 0
This constant slope is the specific mathematical engine that allowed networks to finally grow deep.
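The arithmetic of the trap is easy to reproduce. The sketch below multiplies the per-layer derivative across 50 layers, as the chain rule does; the depth and the best-case input z = 0 are assumptions chosen to make the point.

import math

def sigmoid_grad(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)   # peaks at 0.25 when z = 0

depth = 50
sigmoid_signal = sigmoid_grad(0.0) ** depth  # 0.25^50, the sigmoid's best case
relu_signal = 1.0 ** depth                   # ReLU passes the signal untouched

print(sigmoid_signal)  # ~7.9e-31, effectively zero
print(relu_signal)     # 1.0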
Transformers: Attention as Alignment
Transformers (2017) discarded sequential processing for Self-Attention, allowing every word in a sequence to “look” at every other word simultaneously.
Attention(Q, K, V) = softmax( (Q · Kᵀ) / √d ) · V
Matrix Scaling
The scaling factor 1/√d prevents the dot products from growing too large, which would otherwise push the softmax function into saturated regions with extremely small gradients.
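A minimal sketch of the formula in Python with NumPy; the sequence length, the dimension d = 4, and the random Q, K, V are assumptions for demonstration (single head, no masking or learned projections).

import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (Q · Kᵀ) / √d
    # row-wise softmax: every position attends to every other position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V              # blend the values by attention weight

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(attention(Q, K, V).shape)  # (3, 4): one mixed vector per token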
The Through-Line
80 Years of One Equation
Every milestone shares the 1943 skeleton: weighted inputs, a non-linearity, and a learning signal.
- Depth: 1 layer → 1,000+ layers.
- Data: Toy sets → The entire Web.
- Learning: Manual rules → Self-supervised Gradient Descent.
Scale is the new variable. While the math is refined (ReLU, Attention), the primary driver of the current era is the discovery that these simple geometric rules exhibit emergent capabilities at extreme scale.