Minds Made of Math

A History of Neural Networks — from McCulloch & Pitts to the Transformer Era




The Biological Spark

1943 – 1956 · Dawn Era

The core question: Can biological intelligence be reduced to mathematical logic?
In 1943, McCulloch & Pitts proposed a formal neuron that fires based on a threshold of weighted inputs.

Logic as Computation: By tuning weights (w) and a threshold (θ), this model can act as AND, OR, and NOT gates, suggesting that reasoning itself is a computational process.

This structure—inputs, weights, and sum—remains the mathematical skeleton of all modern AI.

The Formal Neuron

y = 1 if Σᵢ wᵢxᵢ ≥ θ
y = 0 otherwise
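The formal neuron can be sketched in a few lines of Python. The specific weight and threshold values below are illustrative choices that realize the logic gates, not values from the original paper.

```python
# A minimal sketch of the 1943 McCulloch-Pitts neuron: it fires (outputs 1)
# when the weighted sum of its inputs reaches the threshold θ.

def mp_neuron(inputs, weights, theta):
    """Fire if the weighted input sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0

# Logic gates fall out of specific (w, θ) settings:
AND = lambda x1, x2: mp_neuron([x1, x2], [1, 1], theta=2)
OR  = lambda x1, x2: mp_neuron([x1, x2], [1, 1], theta=1)
NOT = lambda x:      mp_neuron([x],      [-1],  theta=0)

print(AND(1, 1), OR(0, 1), NOT(1))  # -> 1 1 0
```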

1943 McCulloch & Pitts paper

1949 Hebbian Learning: Δwᵢⱼ ∝ xᵢ · yⱼ ("cells that fire together wire together")
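Hebb's 1949 principle can be sketched as an outer-product weight update. The learning rate and the activity vectors below are illustrative stand-ins, not values from Hebb's book.

```python
# A sketch of Hebbian learning in its common modern formalization:
# Δw_ij = η · x_i · y_j — the connection strengthens in proportion
# to the joint activity of the pre- and post-synaptic units.

def hebbian_update(w, x, y, eta=0.1):
    """Strengthen w[i][j] in proportion to joint pre/post activity."""
    return [[w[i][j] + eta * x[i] * y[j] for j in range(len(y))]
            for i in range(len(x))]

w = [[0.0, 0.0], [0.0, 0.0]]
w = hebbian_update(w, x=[1, 0], y=[1, 1])
print(w)  # -> [[0.1, 0.1], [0.0, 0.0]]
```

Only the connections leaving the active input (x₀ = 1) grow; the silent input's weights stay at zero.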

The Perceptron: Linear Geometry

1957 – 1969 · The First Wave

Frank Rosenblatt (1958) built on the formal neuron to create the Perceptron, adding a learning rule.

The Geometry of Decision: A single neuron defines a linear hyperplane in n-dimensional space. This flat boundary attempts to partition the data into two distinct regions.

Convergence Theorem: If the data is linearly separable, the model is mathematically guaranteed to find a boundary that correctly classifies every point in finite time.

Boundary Equation w · x + b = 0

Learning Rule Δw = η(y - ŷ)x
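The rule Δw = η(y − ŷ)x can be sketched on a linearly separable toy problem, the AND function. Integer weights and η = 1 are illustrative choices that keep the arithmetic exact; the convergence theorem guarantees the loop settles for separable data.

```python
# A sketch of Rosenblatt's perceptron learning rule on the AND function.

def predict(w, b, x):
    """Classify by which side of the hyperplane w·x + b = 0 the point is on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

def train_perceptron(data, eta=1, epochs=25):
    w, b = [0, 0], 0
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, b, x)                       # 0 when correct
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]  # Δw = η(y - ŷ)x
            b += eta * err
    return w, b

AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(AND)
print([predict(w, b, x) for x, _ in AND])  # -> [0, 0, 0, 1]
```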

The XOR Wall & the First Winter

1969 – 1980 · First AI Winter

In 1969, Minsky & Papert proved that single-layer perceptrons cannot solve problems that are not linearly separable.

The XOR Problem: XOR requires a non-linear boundary. In a 2D space, no single straight line can separate (0,0) and (1,1) from (0,1) and (1,0).

Mathematical Deadlock: The inability to mathematically train hidden layers (which could theoretically warp the space to make it separable) led to the first “AI Winter.”

XOR Logic Truth Table
x1  x2  | y
--------|--
 0   0  | 0
 0   1  | 1
 1   0  | 1
 1   1  | 0

(No single straight line can divide the 1s from the 0s)
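The deadlock, and the hidden-layer escape from it, can be sketched with threshold units: two hand-wired hidden units (computing OR and AND) warp the inputs so that a final unit can separate them. The weights below are illustrative, not learned.

```python
# A sketch of why XOR needs a hidden layer: each step unit is still a
# straight line, but composing them produces a non-linear boundary.

def step(z):
    return 1 if z >= 0 else 0

def xor_two_layer(x1, x2):
    h_or  = step(x1 + x2 - 1)      # fires on (0,1), (1,0), (1,1)
    h_and = step(x1 + x2 - 2)      # fires only on (1,1)
    return step(h_or - h_and - 1)  # "OR but not AND" = XOR

print([xor_two_layer(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 1, 1, 0]
```

The catch in 1969: nobody knew how to *learn* those hidden weights, which is exactly the gap backpropagation later closed.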

Backpropagation & the Chain Rule

1986 – 1998 · The Second Wave

The breakthrough came with the ability to calculate “credit” for errors in deep layers. Backpropagation (1986) treats the network as a computational graph and uses the multivariate chain rule to calculate the gradient of the loss.

Gradient Flow Calculation ∂L/∂w = (∂L/∂y) · (∂y/∂z) · (∂z/∂w)

Optimization thus becomes a formal calculus process: weights are updated via Gradient Descent, w ← w − η∇L.

The Geometric Insight: Every hidden layer transforms the geometry of the data. A deep network warps the input space so that complex problems (like XOR) become linearly separable at the final output layer.
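The chain-rule bookkeeping can be sketched on a one-hidden-unit network y = w₂·σ(w₁·x) with squared loss. The architecture, initial weights, and learning rate are illustrative; each line of the backward pass is one factor of ∂L/∂w = (∂L/∂y)·(∂y/∂z)·(∂z/∂w).

```python
# A sketch of backpropagation by hand on the tiniest possible deep network.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x, t, w1, w2):
    # forward pass: build the computational graph
    z = w1 * x
    h = sigmoid(z)
    y = w2 * h
    L = (y - t) ** 2
    # backward pass: chain rule, one local derivative per node
    dL_dy = 2 * (y - t)
    dL_dw2 = dL_dy * h            # ∂y/∂w2 = h
    dL_dh  = dL_dy * w2           # ∂y/∂h  = w2
    dL_dz  = dL_dh * h * (1 - h)  # σ'(z) = σ(z)(1 - σ(z))
    dL_dw1 = dL_dz * x            # ∂z/∂w1 = x
    return L, dL_dw1, dL_dw2

# gradient descent: w ← w - η∇L
w1, w2 = 0.5, -0.3
for _ in range(200):
    L, g1, g2 = forward_backward(x=1.0, t=1.0, w1=w1, w2=w2)
    w1, w2 = w1 - 0.5 * g1, w2 - 0.5 * g2
print(L)  # the loss shrinks toward 0 as the weights descend the gradient
```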

The Activation Wall: Sigmoid vs. ReLU

1990s – 2012 · Vanishing Gradients

The Saturation Trap

Early deep networks used Sigmoid functions, whose derivative peaks at only 0.25 (at z = 0) and approaches 0 as inputs move away from zero.

When multiplying these small gradients across many layers via the chain rule, the mathematical signal vanishes. The bottom layers simply stop learning.

The ReLU Solution

ReLU (max(0, x)) has a constant derivative of 1 for all positive values. This allows gradients to travel backward through hundreds of layers without shrinking.

Sigmoid′(z) ∈ (0, 0.25]
ReLU′(z) = 1 for z > 0

This constant slope is the specific mathematical engine that allowed networks to finally grow deep.
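The arithmetic of the saturation trap can be sketched by multiplying the per-layer derivative factors across depth, using the sigmoid's *best case* of 0.25 per layer (the depth of 50 is an illustrative choice).

```python
# A sketch of gradient flow across depth: the chain rule multiplies one
# activation-derivative factor per layer.

sigmoid_best_case = 0.25  # max of σ'(z), attained at z = 0
relu_active = 1.0         # ReLU'(z) for any z > 0

depth = 50
sig_signal = sigmoid_best_case ** depth
relu_signal = relu_active ** depth

print(f"{sig_signal:.1e}")  # ~7.9e-31 — the gradient has vanished
print(relu_signal)          # 1.0 — the signal survives intact
```

Even in the sigmoid's best case, fifty layers shrink the signal by thirty orders of magnitude; ReLU's constant slope leaves it untouched.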

Transformers: Attention as Alignment
2017 – Present · The Transformer Era

Transformers (2017) discarded sequential processing for Self-Attention, allowing every word in a sequence to “look” at every other word simultaneously.

Attention(Q, K, V) = softmax( (Q · Kᵀ) / √d ) · V


The Similarity Matrix: Q·Kᵀ calculates a dot-product alignment score between Query and Key vectors. This dynamically determines how much mathematical “attention” one token should pay to another.


The scaling factor 1/√d prevents the dot products from growing too large, which would otherwise push the softmax function into saturated regions with extremely small gradients.
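The full attention equation can be sketched in NumPy for a toy sequence. In a real Transformer, Q, K, and V come from learned projections of the token embeddings; here they are random stand-ins, and the sequence length (3) and dimension (4) are illustrative.

```python
# A sketch of scaled dot-product attention: softmax(Q·Kᵀ / √d) · V.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # similarity matrix, scaled by 1/√d
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                       # each token gets a weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one mixed value vector per token
```

Each row of the softmax output sums to 1, so every token's output is a convex combination of all the value vectors — this is the "every word looks at every other word" step in matrix form.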

The Through-Line

80 Years of One Equation

Every milestone shares the 1943 skeleton: weighted inputs, a non-linearity, and a learning signal.

  • Depth: 1 layer → 1,000+ layers.
  • Data: Toy sets → The entire Web.
  • Learning: Manual rules → Self-supervised Gradient Descent.

Scale is the new variable. While the math is refined (ReLU, Attention), the primary driver of the current era is the discovery that these simple geometric rules exhibit emergent capabilities at extreme scale.