A History of Neural Networks — from McCulloch & Pitts to the Transformer Era
Minds Made of Math
1943 – 1956 · Dawn Era
The core question: Can biological intelligence be reduced to mathematical logic?
In 1943, McCulloch & Pitts proposed a formal neuron that fires based on a threshold of weighted inputs.
Logic as Computation: By tuning weights (w) and a threshold (θ), this model can act as AND, OR, and NOT gates, suggesting that reasoning itself is a computational process.
This structure—inputs, weights, and sum—remains the mathematical skeleton of all modern AI.
The Formal Neuron
y = 1 if Σ wᵢ xᵢ ≥ θ, else y = 0
1943 McCulloch & Pitts paper
1949 Hebbian Learning: Δwᵢⱼ ∝ xᵢ · xⱼ — connections between co-active neurons strengthen (“cells that fire together wire together”)
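The logic-gate claim above can be sketched directly. This is a minimal illustration, not McCulloch and Pitts' original notation: one threshold unit, and three choices of weights and θ that realize AND, OR, and NOT.

```python
# A McCulloch-Pitts neuron: fires (outputs 1) when the weighted sum of its
# inputs reaches the threshold θ.
def mp_neuron(inputs, weights, theta):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0

# Fixed weights and thresholds turn the same unit into logic gates.
def AND(x1, x2):   # fires only when both inputs are 1
    return mp_neuron([x1, x2], [1, 1], theta=2)

def OR(x1, x2):    # fires when at least one input is 1
    return mp_neuron([x1, x2], [1, 1], theta=1)

def NOT(x):        # an inhibitory (negative) weight inverts the input
    return mp_neuron([x], [-1], theta=0)
```

The same structure, different parameters: this is the sense in which reasoning reduces to weighted sums and thresholds.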
1957 – 1969 · The First Wave
Frank Rosenblatt (1958) built on the formal neuron to create the Perceptron, adding a learning rule.
The Geometry of Decision: A single neuron defines a linear hyperplane in n-dimensional space. This flat boundary attempts to partition the data into two distinct regions.
Convergence Theorem: If the data is linearly separable, the model is mathematically guaranteed to find a boundary that correctly classifies every point in finite time.
Boundary Equation w · x + b = 0
Learning Rule Δw = η(y - ŷ)x
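The learning rule Δw = η(y − ŷ)x can be run end to end on a small example. The dataset (the AND function, which is linearly separable), the learning rate η, and the epoch count here are illustrative choices; per the convergence theorem, the loop finds a separating boundary in finite time.

```python
# Rosenblatt's perceptron learning rule on linearly separable data.
def train_perceptron(data, eta=0.1, epochs=100):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            y_hat = 1 if w[0]*x[0] + w[1]*x[1] + b >= 0 else 0
            err = y - y_hat                               # learning signal
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
            b += eta * err                                # bias gets the same update
    return w, b

AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND_DATA)

def predict(x):
    return 1 if w[0]*x[0] + w[1]*x[1] + b >= 0 else 0
```

Once every point is classified correctly, err is 0 for all examples and the weights stop moving: the hyperplane w · x + b = 0 has settled into a separating position.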
1969 – 1980 · First AI Winter
In 1969, Minsky & Papert proved that single-layer perceptrons cannot solve problems that are not linearly separable.
The XOR Problem: XOR requires a non-linear boundary. In a 2D space, no single straight line can separate (0,0) and (1,1) from (0,1) and (1,0).
Mathematical Deadlock: The inability to mathematically train hidden layers (which could theoretically warp the space to make it separable) led to the first “AI Winter.”
XOR Logic Truth Table
x₁  x₂ | y
-------|--
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 0
(No single straight line can divide the 1s from the 0s)
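The impossibility claim can be checked numerically. This brute-force scan over a grid of candidate lines w₁x₁ + w₂x₂ + b = 0 is an illustrative sketch (the grid bounds are an assumption); the mathematical result holds for every possible line, not just these.

```python
# Minsky & Papert's XOR result, checked by brute force: no candidate linear
# boundary classifies all four XOR points correctly.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def classifies_xor(w1, w2, b):
    return all((1 if w1*x1 + w2*x2 + b >= 0 else 0) == y
               for (x1, x2), y in XOR)

grid = [i / 4 for i in range(-20, 21)]   # weights and bias from -5.0 to 5.0
solutions = [(w1, w2, b) for w1 in grid for w2 in grid for b in grid
             if classifies_xor(w1, w2, b)]
# solutions stays empty: no straight line divides the 1s from the 0s
```

The empty result is what a short proof confirms: the four constraints on w₁, w₂, b implied by the truth table contradict each other.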
1986 – 1998 · The Second Wave
The breakthrough came with the ability to calculate “credit” for errors in deep layers. Backpropagation (1986) treats the network as a computational graph and uses the multivariate chain rule to calculate the gradient of the loss.
Gradient Flow Calculation ∂L/∂w = (∂L/∂y) · (∂y/∂z) · (∂z/∂w)
Optimization thus becomes a formal calculus process: weights are updated via Gradient Descent, w ← w − η∇L.
The Geometric Insight: Every hidden layer transforms the geometry of the data. A deep network warps the input space so that complex problems (like XOR) become linearly separable at the final output layer.
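The gradient-flow factorization ∂L/∂w = (∂L/∂y)·(∂y/∂z)·(∂z/∂w) can be verified on the smallest possible computational graph. The values below are illustrative; the point is that the analytic chain-rule product matches a numerical finite-difference gradient.

```python
# Backpropagation as the multivariate chain rule on a one-weight graph:
# x -> z = w*x -> y = sigmoid(z) -> L = (y - t)^2
import math

def forward(w, x, t):
    z = w * x                       # linear pre-activation
    y = 1 / (1 + math.exp(-z))      # sigmoid non-linearity
    L = (y - t) ** 2                # squared-error loss
    return z, y, L

def backprop(w, x, t):
    _, y, _ = forward(w, x, t)
    dL_dy = 2 * (y - t)             # ∂L/∂y
    dy_dz = y * (1 - y)             # ∂y/∂z (sigmoid derivative)
    dz_dw = x                       # ∂z/∂w
    return dL_dy * dy_dz * dz_dw    # chain rule: ∂L/∂w

# Sanity check against a numerical (finite-difference) gradient.
w, x, t, eps = 0.5, 1.5, 1.0, 1e-6
numeric = (forward(w + eps, x, t)[2] - forward(w - eps, x, t)[2]) / (2 * eps)
analytic = backprop(w, x, t)
```

In a deep network the same product simply acquires one factor per layer, which is exactly what makes the next section's vanishing-gradient problem possible.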
1990s – 2012 · Vanishing Gradients
Early deep networks used Sigmoid activations. The sigmoid’s derivative approaches 0 as inputs move away from zero, and even at its peak (z = 0) it is only 0.25.
When multiplying these small gradients across many layers via the chain rule, the mathematical signal vanishes. The bottom layers simply stop learning.
ReLU (max(0, x)) has a constant derivative of 1 for all positive values. This allows gradients to travel backward through hundreds of layers without shrinking.
Sigmoid′(z) ∈ (0, 0.25]
ReLU′(z) = 1 for z > 0
This constant slope is the specific mathematical engine that allowed networks to finally grow deep.
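A two-line calculation (an assumed best-case setup, not from the text) makes the contrast concrete: by the chain rule, the gradient reaching the bottom layer is a product of per-layer derivatives, one factor per layer.

```python
# Vanishing vs. surviving gradients across a 50-layer chain.
depth = 50
sigmoid_best_case = 0.25 ** depth   # even the sigmoid's MAXIMUM derivative,
                                    # multiplied 50 times, collapses to ~1e-31
relu_positive = 1.0 ** depth        # ReLU's constant slope of 1 on z > 0:
                                    # the signal reaches the bottom intact
```

Real sigmoid derivatives are usually far below 0.25, so the collapse in practice is even faster than this best case.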
2017 – Present · The Transformer Era
Transformers (2017) discarded sequential processing for Self-Attention, allowing every word in a sequence to “look” at every other word simultaneously.
Attention(Q, K, V) = softmax( (Q · Kᵀ) / √d ) · V
The Similarity Matrix: Q·Kᵀ calculates a dot-product alignment score between Query and Key vectors. This dynamically determines how much mathematical “attention” one token should pay to another.
The scaling factor 1/√d prevents the dot products from growing too large, which would otherwise push the softmax function into saturated regions with extremely small gradients.
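The attention formula translates almost line for line into code. This is a minimal single-head sketch; the shapes (4 tokens, dimension d = 8) and random inputs are illustrative assumptions.

```python
# Scaled dot-product attention: softmax(Q·Kᵀ / √d) · V
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # dot-product alignment, scaled by 1/√d
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V                 # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))        # 4 tokens "looking"
K = rng.standard_normal((4, 8))        # 4 tokens being "looked at"
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
```

Each output row is a convex combination of the value vectors, with the mixing weights determined dynamically by the Q·Kᵀ similarity scores.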
Every milestone shares the 1943 skeleton: weighted inputs, a non-linearity, and a learning signal.
Scale is the new variable. While the math is refined (ReLU, Attention), the primary driver of the current era is the discovery that these simple geometric rules exhibit emergent capabilities at extreme scale.