Definitions of human intelligence and artificial intelligence often differ across psychology and computer science.
The paper argues that this mismatch creates confusion about what current AI systems actually demonstrate.
It proposes a shared conceptual framework so both fields can discuss intelligence with more precision.
Core definitions proposed
Human intelligence: maximal capacity to complete novel goals successfully through perceptual-cognitive processes.
Artificial intelligence: maximal capacity to complete novel goals successfully through computational processes.
In both cases, intelligence is framed as a capacity, not just a record of past performance.
The authors also argue intelligence is better understood as multidimensional, not a single narrow skill.
Intelligence is not the same as achievement
The paper draws a sharp line between:
Intelligence = flexible capacity for success on novel goals
Achievement / expertise = strong performance built through training on specific tasks or domains
Many current AI systems may look impressive because they show artificial achievement rather than genuine artificial intelligence.
This distinction matters because benchmark success alone can overstate what systems actually generalize to.
Why “AI metrics” is needed
The authors argue AI evaluation should learn from psychometrics.
They call for an AI metrics discipline focused on:
reliability (consistent measurement)
validity (measuring the intended construct)
standardized procedures for comparing systems fairly
Without stronger measurement practice, claims about intelligence or AGI remain difficult to justify.
Main takeaway for AGI debates
The paper suggests AGI should be understood analogously to human general intelligence:
not merely broad competence,
but the shared variance across many system performances.
Bottom line: current evidence supports artificial achievement/expertise more strongly than artificial intelligence.
The path forward is interdisciplinary collaboration plus better definitions and measurement.
A Brief History
1. Origins: early neural ideas and the birth of AI
1940s–1950s
- In 1943, Warren McCulloch and Walter Pitts published a simplified mathematical model of a neuron and showed how networks of such units could implement logical operations.
- In 1956, the Dartmouth Summer Research Project helped define artificial intelligence as a field and popularized the term itself.
- Early AI combined several ambitions:
  - symbolic reasoning
  - machine learning
  - search and planning
  - neural models inspired by brains
Sources: McCulloch & Pitts (1943); Dartmouth AI history pages.
2. Perceptrons, optimism, and the first setbacks
Late 1950s–1970s
- Frank Rosenblatt’s perceptron (1958) made neural learning seem highly promising.
- Perceptrons could learn linear decision boundaries from examples.
- But enthusiasm outpaced capability:
  - early systems were limited
  - compute was weak
  - data were scarce
- Critiques of single-layer perceptrons, especially their inability to solve some nonlinearly separable problems, contributed to a slowdown in neural-network enthusiasm.
Why this mattered
- AI did not progress in a straight line.
- Periods of excitement were followed by disappointment.
- This helps explain later AI winters and why symbolic AI often overshadowed neural approaches for a time.
Sources: Rosenblatt (1958); historical reviews of neural networks and AI.
3. Backpropagation, expert systems, and statistical learning
1980s–1990s
- In the 1980s, backpropagation became the key training method for multi-layer neural networks.
- The famous 1986 paper by Rumelhart, Hinton, and Williams showed how internal representations could be learned across layers.
- At the same time, AI also advanced through:
  - expert systems
  - probabilistic models
  - statistical pattern recognition
- In the 1990s, neural nets remained important, but other methods often dominated practical machine learning.
Backpropagation made deeper, multi-layer learning much more practical.
4. Data, GPUs, and the deep learning era
2000s–2010s
- Three things changed the field:
  1. much more digital data
  2. far more compute, especially GPUs
  3. better training methods and architectures
- Neural networks began to dominate difficult perception tasks such as:
  - speech recognition
  - image classification
  - machine translation
- By the mid-2010s, deep learning had become central to AI research and industry.
Key idea: Deep learning systems learn layered representations automatically rather than relying only on hand-crafted features.
Consequence: AI moved from brittle rule systems toward large-scale representation learning from data.
Sources: LeCun, Bengio & Hinton (2015); broad historical overviews of modern AI.
5. Transformers, foundation models, and the LLM era
Late 2010s–2020s
- In 2017, the transformer architecture showed that attention could replace recurrence for many sequence tasks.
- This architecture scaled extremely well with data and compute.
- Large pretrained models then became capable of:
  - text generation
  - summarization
  - translation
  - coding assistance
  - question answering
- Today’s LLMs sit at the intersection of:
  - neural-network history
  - large-scale optimization
  - language modeling
  - post-training alignment
Bottom line: modern LLMs are not a sudden break from history; they are the latest stage in a long sequence of ideas about representation, learning, and computation.
Sources: Vaswani et al. (2017); Google ML Crash Course LLM materials; Ouyang et al. (2022).
The LLM Defined and Explained
This Specific Form of Intelligence
LLMs now power chat, search, summarization, coding help, and tutoring.
Three ideas explain much of their behavior:
tokens
embeddings
transformers
Modern systems add a fourth idea too:
alignment / post-training
From text units to a deployed assistant.
Roadmap
What language models do
Tokens and embeddings
Context and attention
Transformer architecture
Training and alignment
Strengths, limits, and interpretation
What is a language model?
A language model estimates the probability of the next token from earlier tokens:
P(t_k \mid t_1, t_2, \ldots, t_{k-1})
flowchart LR
A["The"] --> B["cat"] --> C["sat"] --> D["on"] --> E["the"] --> F{"next token?"}
F --> G["mat"]
F --> H["floor"]
F --> I["chair"]
Core idea: generation is repeated next-token prediction.
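This idea can be sketched with a toy bigram model: a deliberately tiny stand-in for a neural LLM, trained on an invented twelve-token corpus rather than real data.

```python
from collections import Counter, defaultdict

# Toy corpus; a real LLM trains on vastly more text and uses subword tokens.
corpus = "the cat sat on the mat . the cat slept on the chair .".split()

# Count bigrams: how often does each token follow each context token?
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_token_probs(prev):
    """Estimate P(next | prev) from bigram counts."""
    counts = follows[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(next_token_probs("the"))  # e.g. "cat" gets probability 0.5

# Generation is repeated next-token prediction (greedy selection here).
tok, out = "the", ["the"]
for _ in range(4):
    probs = next_token_probs(tok)
    tok = max(probs, key=probs.get)
    out.append(tok)
print(" ".join(out))
```

Real models condition on the whole context window, not just one previous token, and compute probabilities with a neural network rather than counts, but the generation loop has the same shape.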
Tokens: the model’s working units
LLMs usually do not read whole words directly.
They read tokens, such as:
full words
subwords
punctuation
number pieces
Tokenization helps with rare words and open vocabularies.
Example tokenizations

| Text | Possible tokens |
| --- | --- |
| unbelievable | un, believ, able |
| 2026 | 20, 26 |
| can’t | can, 't |
Different tokenizers split differently. The point is to turn raw text into reusable units.
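A minimal sketch of one tokenization strategy, greedy longest-match against a fixed vocabulary. The vocabulary here is hand-picked to reproduce the table above; real tokenizers (BPE, unigram models) learn their vocabularies from data.

```python
# Hand-picked vocabulary for illustration only.
VOCAB = {"un", "believ", "able", "20", "26", "can", "'t"}

def tokenize(text):
    """Greedy longest-match subword tokenization."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == " ":            # skip whitespace between words
            i += 1
            continue
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])    # unknown character: 1-char fallback token
            i += 1
    return tokens

print(tokenize("unbelievable"))   # ['un', 'believ', 'able']
print(tokenize("2026"))           # ['20', '26']
print(tokenize("can't"))          # ['can', "'t"]
```

Unknown characters fall back to single-character tokens, which is one simple way to keep the vocabulary "open" to any input.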
From text to vectors
flowchart LR
A["Raw text"] --> B["Tokenizer"]
B --> C["Token IDs"]
C --> D["Embedding lookup"]
D --> E["Dense vectors"]
E --> F["Transformer layers"]
F --> G["Next-token probabilities"]
Embedding: a learned vector representation for a token or other object.
What is an embedding?
An embedding places items in a high-dimensional space so that related items often end up near one another.
Why useful:
similarity becomes geometric
clusters can emerge
models can generalize beyond exact memorization
Illustrative only: actual embeddings live in many dimensions.
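The claim that "similarity becomes geometric" can be made concrete with cosine similarity. The 3-dimensional vectors below are invented for illustration; learned embeddings have hundreds or thousands of dimensions.

```python
import math

# Hand-set toy "embeddings"; real models learn these vectors from data.
emb = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "tulip": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

# Related items sit closer in the space.
print(cosine(emb["cat"], emb["dog"]))    # high (similar direction)
print(cosine(emb["cat"], emb["tulip"]))  # low  (different direction)
```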
Geometric intuition—and caution
Good intuition
Nearby points often mean similar usage or context.
But not perfect
Distance is learned from data, not hand-written meaning.
Also important
Embeddings can encode bias, noise, and spurious correlations.
Static vs contextual embeddings
Static embedding
- one vector per word type
- the same "bank" vector everywhere

Contextual embedding
- vector depends on surrounding text
- "bank" changes with sentence context
flowchart TB
A["river bank was muddy"] --> B["bank"]
C["bank approved the loan"] --> D["bank"]
B --> E["contextual vector A"]
D --> F["contextual vector B"]
This shift toward context-sensitive representations is one reason transformer systems became so effective.
Why earlier sequence models struggled
Earlier approaches included:
n-gram models
RNNs
LSTMs / GRUs
Common problems:
weak long-range memory
harder parallel training
unstable signals over long sequences
Sequential dependence made training harder to parallelize.
The transformer breakthrough
The 2017 transformer paper replaced recurrence with attention-heavy computation.
flowchart LR
A["Input sequence"] --> B["Attention"]
B --> C["Feed-forward"]
C --> D["Repeated layers"]
D --> E["Output probabilities"]
Why this mattered:
stronger long-range interactions
better parallelization
excellent scaling with data and compute
A simple transformer picture
flowchart TB
A["Token embeddings + position"] --> B["Self-attention"]
B --> C["Add & normalize"]
C --> D["Feed-forward network"]
D --> E["Add & normalize"]
E --> F["Repeat many times"]
F --> G["Vocabulary scores"]
For text generation, many modern LLMs use a decoder-style transformer.
Why position must be added
Embeddings alone do not encode order.
Sentence A
Dog bites man.
Sentence B
Man bites dog.
Same words, different order, different meaning.
So transformers add positional information:
learned positional vectors
sinusoidal encodings
relative position methods in newer variants
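The sinusoidal variant from the original transformer paper can be written in a few lines; this is a simplified sketch that interleaves the sine and cosine components.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding (Vaswani et al., 2017):
    PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))
    """
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

# Each position gets a distinct vector, so "Dog bites man" and
# "Man bites dog" produce different inputs to the model.
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
```

These vectors are added to the token embeddings before the first layer, which is how order information enters an otherwise order-blind architecture.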
Self-attention: the core mechanism
Self-attention lets each token look at other tokens and weight their relevance.
graph LR
A["The"] --- D["animal"]
B["tired"] --- D
C["dog"] --- D
E["slept"] --- D
When building the representation for one token, the model can ask: which other tokens are most relevant to this one right now?
You do not need every algebraic detail to get the idea:
compare token features
compute relevance weights
blend information from multiple positions
The result is a new representation that depends on context.
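The three steps above (compare, weight, blend) can be sketched as scaled dot-product attention. This is deliberately simplified: a real transformer first projects the inputs into separate query, key, and value matrices, while here Q = K = V = X.

```python
import math

def softmax(xs):
    """Turn scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Simplified self-attention over token vectors X (a list of lists)."""
    d = len(X[0])
    out = []
    for q in X:                                # one query per token
        # Compare token features: scaled dot products against every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        weights = softmax(scores)              # relevance of every position
        # Blend information from all positions according to the weights.
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # three toy token vectors
for row in self_attention(X):
    print([round(v, 3) for v in row])
```

Each output row is a context-dependent mixture of all the input rows, which is exactly the "new representation that depends on context" described above.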
Multi-head attention
flowchart TB
A["Input representations"] --> B1["Head 1"]
A --> B2["Head 2"]
A --> B3["Head 3"]
A --> B4["Head 4"]
B1 --> C["Concatenate"]
B2 --> C
B3 --> C
B4 --> C
C --> D["Project to new representation"]
Different heads can learn different relational patterns, though their roles are not always cleanly interpretable.
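The plumbing in the diagram, splitting the representation across heads and concatenating the results, is simple to show. This sketch covers only the split/concatenate steps; in a real model each head would run its own attention over its slice, and the concatenated result would pass through a learned projection.

```python
def split_heads(vec, n_heads):
    """Split one d_model-dim vector into n_heads smaller per-head vectors."""
    d_head = len(vec) // n_heads
    return [vec[i * d_head:(i + 1) * d_head] for i in range(n_heads)]

def concat_heads(heads):
    """Concatenate per-head results back into one vector."""
    return [x for h in heads for x in h]

v = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # d_model = 8
heads = split_heads(v, 4)                       # four 2-dim slices, one per head
print(heads)
print(concat_heads(heads) == v)                 # round trip preserves the vector
```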
From hidden states to next-token prediction
flowchart LR
A["Current hidden state"] --> B["Vocabulary scores (logits)"]
B --> C["Probabilities"]
C --> D["Select / sample next token"]
D --> E["Append token"]
E --> F["Repeat"]
This loop is autoregressive generation.
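The loop can be sketched end to end. The `fake_logits` function below is an invented stand-in for the model; a real LLM computes logits from the hidden state at the last position.

```python
import math
import random

VOCAB = ["mat", "floor", "chair"]

def fake_logits(context):
    """Stand-in for a trained model; these scores are invented."""
    return [2.0, 0.5, 0.1]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

random.seed(0)
context = ["The", "cat", "sat", "on", "the"]
for _ in range(3):
    probs = softmax(fake_logits(context))           # logits -> probabilities
    tok = random.choices(VOCAB, weights=probs)[0]   # sample the next token
    context.append(tok)                             # append, then repeat
print(context)
```

Sampling instead of always taking the argmax is why the same prompt can yield different completions; decoding settings like temperature adjust exactly this step.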
How LLMs are trained
Pretraining usually involves:
collect massive text data
tokenize it
predict held-out next tokens
update parameters to reduce error
repeat many times
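The "predict held-out next tokens, update parameters to reduce error" steps revolve around one quantity: the cross-entropy loss, the negative log probability the model assigns to the actual next token. A miniature version, with invented logits standing in for a real model's output:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

vocab = ["mat", "floor", "chair"]
logits = [2.0, 0.5, 0.1]   # invented model scores for the next token
target = "mat"             # the held-out next token from the training text

probs = softmax(logits)
loss = -math.log(probs[vocab.index(target)])   # cross-entropy loss
print(round(loss, 4))
# Gradient descent nudges the parameters to push this loss down,
# which is the "update parameters to reduce error" step above.
```

Pretraining is this computation repeated over trillions of token positions.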
What is learned?
grammar
discourse patterns
style regularities
broad factual patterns in the data
reusable internal features
Scaling helps—but it is not magic
Conceptual only: more scale often improves performance, but not without cost.
Trade-offs include compute cost, energy use, latency, and harder safety evaluation.
Fine-tuning, instruction tuning, and alignment
flowchart LR
A["Pretrained base model"] --> B["Supervised fine-tuning"]
B --> C["Preference / feedback tuning"]
C --> D["Aligned assistant behavior"]
Post-training aims to make the system:
follow instructions
be more useful in conversation
reduce unsafe or low-quality outputs
better match human preferences
How a system like ChatGPT differs from a raw base model
System layer: prompting, tools, formatting, and policies.
A deployed assistant is usually more than just the pretrained model.
What “understanding” means here
Strong appearance of understanding
- explanation
- summarization
- translation
- coding help
- question answering

Reasons for caution
- no guaranteed grounding
- confidence ≠ truth
- behavior can be brittle
- “understanding” remains debated
Why LLMs hallucinate and fail
- Hallucination: plausible but false output
- Bias / distortion: training-data patterns reappear in outputs
- Reasoning brittleness: multi-step tasks can break
- Context misses: an important prompt detail is ignored
- Knowledge limits: events after the training cutoff may be unknown
- Overconfidence: style can sound firmer than the evidence warrants
A compact mental model
flowchart LR
A["Text"] --> B["Tokens"]
B --> C["Embeddings"]
C --> D["Attention + transformer layers"]
D --> E["Predicted next token"]
E --> F["Repeat"]
D --> G["Post-training / alignment"]
G --> H["Assistant behavior"]
Discussion questions
In what sense are embeddings a geometry of meaning?
Why did transformers scale better than RNNs?
Does next-token prediction produce reasoning, or simulate it?
Why can a model sound confident while being wrong?
Which matters more in practice: pretraining scale or post-training alignment?
Takeaways
Tokenization breaks text into manageable units.
Embeddings map those units into vector space.
Attention lets tokens influence one another contextually.
Transformers scale this idea very effectively.
Training + post-training turn a predictor into a usable assistant.