Intelligence, AI, and LLMs

Robert W. Walker

2026-03-31

Outline

  1. Intelligence.
  2. The Basics of LLMs.
    • Embeddings.
    • Attention is all you need.
  3. The Prediction.
    • The Context.

Summary: Intelligence

  • Definitions of human intelligence and artificial intelligence often differ across psychology and computer science.
  • The paper argues that this mismatch creates confusion about what current AI systems actually demonstrate.
  • It proposes a shared conceptual framework so both fields can discuss intelligence with more precision.

Core definitions proposed

  • Human intelligence: maximal capacity to complete novel goals successfully through perceptual-cognitive processes.
  • Artificial intelligence: maximal capacity to complete novel goals successfully through computational processes.
  • In both cases, intelligence is framed as a capacity, not just a record of past performance.
  • The authors also argue intelligence is better understood as multidimensional, not a single narrow skill.

Intelligence is not the same as achievement

  • The paper draws a sharp line between:
    • Intelligence = flexible capacity for success on novel goals
    • Achievement / expertise = strong performance built through training on specific tasks or domains
  • Many current AI systems may look impressive because they show artificial achievement rather than genuine artificial intelligence.
  • This distinction matters because benchmark success alone can overstate what systems actually generalize to.

Why an “AI metrics” discipline is needed

  • The authors argue AI evaluation should learn from psychometrics.
  • They call for an AI metrics discipline focused on:
    • reliability (consistent measurement)
    • validity (measuring the intended construct)
    • standardized procedures for comparing systems fairly
  • Without stronger measurement practice, claims about intelligence or AGI remain difficult to justify.

Main takeaway for AGI debates

  • The paper suggests AGI should be understood analogously to human general intelligence:
    • not merely broad competence,
    • but the shared variance across many system performances.
  • Bottom line: current evidence supports artificial achievement/expertise more strongly than artificial intelligence.
  • The path forward is interdisciplinary collaboration plus better definitions and measurement.

A Brief History

1. Origins: early neural ideas and the birth of AI

1940s–1950s

  • In 1943, Warren McCulloch and Walter Pitts published a simplified mathematical model of a neuron and showed how networks of such units could implement logical operations.
  • In 1956, the Dartmouth Summer Research Project helped define artificial intelligence as a field and popularized the term itself.
  • Early AI combined several ambitions:
    • symbolic reasoning
    • machine learning
    • search and planning
    • neural models inspired by brains

Timeline: 1943 McCulloch–Pitts → 1956 Dartmouth workshop.

Sources: McCulloch & Pitts (1943); Dartmouth AI history pages.

2. Perceptrons, optimism, and the first setbacks

Late 1950s–1970s

  • Frank Rosenblatt’s perceptron (1958) made neural learning seem highly promising.
  • Perceptrons could learn linear decision boundaries from examples.
  • But enthusiasm outpaced capability:
    • early systems were limited
    • compute was weak
    • data were scarce
  • Critiques of single-layer perceptrons, especially their inability to solve nonlinearly separable problems such as XOR, contributed to a slowdown in neural-network enthusiasm.

Why this mattered

  • AI did not progress in a straight line.
  • Periods of excitement were followed by disappointment.
  • This pattern helps explain the later AI winters and why symbolic AI often overshadowed neural approaches for a time.

Sources: Rosenblatt (1958); historical reviews of neural networks and AI.

3. Backpropagation, expert systems, and statistical learning

1980s–1990s

  • In the 1980s, backpropagation became the key training method for multi-layer neural networks.
  • The famous 1986 paper by Rumelhart, Hinton, and Williams showed how internal representations could be learned across layers.
  • At the same time, AI also advanced through:
    • expert systems
    • probabilistic models
    • statistical pattern recognition
  • In the 1990s, neural nets remained important, but other methods often dominated practical machine learning.

Backpropagation made deeper, multi-layer learning much more practical.

Sources: Rumelhart, Hinton & Williams (1986); LeCun, Bengio & Hinton (2015).

4. Deep learning’s resurgence

2000s–2010s

  • Three things changed the field:
    1. much more digital data
    2. far more compute, especially GPUs
    3. better training methods and architectures
  • Neural networks began to dominate difficult perception tasks such as:
    • speech recognition
    • image classification
    • machine translation
  • By the mid-2010s, deep learning had become central to AI research and industry.

Key idea Deep learning systems learn layered representations automatically rather than relying only on hand-crafted features.

Consequence AI moved from brittle rule systems toward large-scale representation learning from data.

Sources: LeCun, Bengio & Hinton (2015); broad historical overviews of modern AI.

5. Transformers, foundation models, and the LLM era

Late 2010s–2020s

  • In 2017, the transformer architecture showed that attention could replace recurrence for many sequence tasks.
  • This architecture scaled extremely well with data and compute.
  • Large pretrained models then became capable of:
    • text generation
    • summarization
    • translation
    • coding assistance
    • question answering
  • Today’s LLMs sit at the intersection of:
    • neural-network history
    • large-scale optimization
    • language modeling
    • post-training alignment

Timeline: Neurons → Perceptrons → Deep nets → LLMs.

Bottom line: modern LLMs are not a sudden break from history; they are the latest stage in a long sequence of ideas about representation, learning, and computation.

Sources: Vaswani et al. (2017); Google ML Crash Course LLM materials; Ouyang et al. (2022).

The LLM Defined and Explained

This Specific Form of Intelligence

  • LLMs now power chat, search, summarization, coding help, and tutoring.
  • Three ideas explain much of their behavior:
    • tokens
    • embeddings
    • transformers
  • Modern systems add a fourth idea too:
    • alignment / post-training
Tokens → Embeddings → Transformers → LLM Assistant
From text units to a deployed assistant.

Roadmap

  1. What language models do
  2. Tokens and embeddings
  3. Context and attention
  4. Transformer architecture
  5. Training and alignment
  6. Strengths, limits, and interpretation

What is a language model?

A language model estimates the probability of the next token from earlier tokens:

P(t_k \mid t_1, t_2, \ldots, t_{k-1})

flowchart LR
    A["The"] --> B["cat"] --> C["sat"] --> D["on"] --> E["the"] --> F{"next token?"}
    F --> G["mat"]
    F --> H["floor"]
    F --> I["chair"]

Core idea: generation is repeated next-token prediction.
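The conditional probability above can be made concrete with a toy count-based model. This is a bigram sketch, not a real LLM: it estimates the next-token distribution from how often tokens follow one another in a tiny corpus.

```python
# Toy illustration (not a real LLM): estimate next-token probabilities
# from bigram counts, mirroring P(t_k | t_{k-1}).
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat sat on the chair".split()

# Count how often each token follows each preceding token.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_token_probs(prev):
    counts = follows[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(next_token_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'chair': 0.25}
```

An LLM replaces the count table with a neural network conditioned on the entire context, but the output is the same kind of object: a probability distribution over the next token.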

Tokens: the model’s working units

  • LLMs usually do not read whole words directly.
  • They read tokens, such as:
    • full words
    • subwords
    • punctuation
    • number pieces
  • Tokenization helps with rare words and open vocabularies.

Example tokenizations

Text Possible tokens
unbelievable un, believ, able
2026 20, 26
can’t can, 't

Different tokenizers split differently. The point is to turn raw text into reusable units.
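The splits in the table can be reproduced with a greedy longest-match sketch over a hypothetical hand-written vocabulary. Real tokenizers (BPE, WordPiece, SentencePiece) learn their vocabularies from data; the fixed `VOCAB` set here is an illustration only.

```python
# Hypothetical greedy longest-match subword tokenizer. Real systems
# learn the vocabulary from data; this VOCAB is hand-picked for the demo.
VOCAB = {"un", "believ", "able", "20", "26"}

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("2026"))          # ['20', '26']
```

The single-character fallback is why open-vocabulary tokenizers never fail outright: any unseen string can still be encoded, just less efficiently.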

From text to vectors

flowchart LR
    A["Raw text"] --> B["Tokenizer"]
    B --> C["Token IDs"]
    C --> D["Embedding lookup"]
    D --> E["Dense vectors"]
    E --> F["Transformer layers"]
    F --> G["Next-token probabilities"]

Embedding: a learned vector representation for a token or other object.

What is an embedding?

An embedding places items in a high-dimensional space so that related items often end up near one another.

Why useful:

  • similarity becomes geometric
  • clusters can emerge
  • models can generalize beyond exact memorization
(Figure: a 2-D sketch with cat, kitten, and dog clustered apart from Paris, London, and Tokyo.)
Illustrative only: actual embeddings live in many dimensions.

Geometric intuition—and caution

Good intuition

Nearby points often mean similar usage or context.

But not perfect

Distance is learned from data, not hand-written meaning.

Also important

Embeddings can encode bias, noise, and spurious correlations.

Static vs contextual embeddings

Static embedding

  • one vector per word type
  • same bank everywhere

Contextual embedding

  • vector depends on surrounding text
  • bank changes with sentence context

flowchart TB
    A["river bank was muddy"] --> B["bank"]
    C["bank approved the loan"] --> D["bank"]
    B --> E["contextual vector A"]
    D --> F["contextual vector B"]

This shift toward context-sensitive representations is one reason transformer systems became so effective.

Why earlier sequence models struggled

Earlier approaches included:

  • n-gram models
  • RNNs
  • LSTMs / GRUs

Common problems:

  • weak long-range memory
  • harder parallel training
  • unstable signals over long sequences
(Figure: hidden states h₁ → h₂ → h₃ → h₄, each consuming one token in sequence.)
Sequential dependence made training harder to parallelize.

The transformer breakthrough

The 2017 transformer paper replaced recurrence with attention-heavy computation.

flowchart LR
    A["Input sequence"] --> B["Attention"]
    B --> C["Feed-forward"]
    C --> D["Repeated layers"]
    D --> E["Output probabilities"]

Why this mattered:

  • stronger long-range interactions
  • better parallelization
  • excellent scaling with data and compute

A simple transformer picture

flowchart TB
    A["Token embeddings + position"] --> B["Self-attention"]
    B --> C["Add & normalize"]
    C --> D["Feed-forward network"]
    D --> E["Add & normalize"]
    E --> F["Repeat many times"]
    F --> G["Vocabulary scores"]

For text generation, many modern LLMs use a decoder-style transformer.

Why position must be added

Embeddings alone do not encode order.

Sentence A
Dog bites man.

Sentence B
Man bites dog.

Same words, different order, different meaning.

So transformers add positional information:

  • learned positional vectors
  • sinusoidal encodings
  • relative position methods in newer variants
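The sinusoidal scheme from Vaswani et al. (2017) can be sketched directly: even dimensions use sine, odd dimensions cosine, with geometrically spaced frequencies, so every position gets a distinct vector that is added to its token embedding.

```python
import math

def positional_encoding(position, d_model):
    # Sinusoidal encoding (Vaswani et al., 2017): even dimensions use
    # sine, odd dimensions cosine, with geometrically spaced frequencies.
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Distinct positions yield distinct vectors, so order becomes recoverable.
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
```

Because the vectors differ by position, "Dog bites man" and "Man bites dog" produce different inputs to the first attention layer even though the token embeddings are identical.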

Self-attention: the core mechanism

Self-attention lets each token look at other tokens and weight their relevance.

graph LR
    A["The"] --- D["animal"]
    B["tired"] --- D
    C["dog"] --- D
    E["slept"] --- D

When building the representation for one token, the model can ask:

  • Which earlier words matter most here?
  • Which words resolve ambiguity?
  • Which words define topic or syntax?

Query, key, and value intuition

  • Query: what this token is looking for
  • Key: what this token offers as a match
  • Value: the information it can contribute
  1. compare Q to K
  2. turn scores into weights
  3. mix V using those weights
  → a context-sensitive representation

Attention as weighted influence

\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

You do not need every algebraic detail to get the idea:

  • compare token features
  • compute relevance weights
  • blend information from multiple positions

The result is a new representation that depends on context.
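The three steps above (compare, weight, blend) can be written out in a few lines of plain Python. This is a minimal sketch of scaled dot-product attention with toy 2-D vectors, omitting the learned projection matrices a real layer would apply.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))  # weights favor the first key, so output leans toward [10, 0]
```

The query matches the first key more strongly, so the output mixes the value vectors with more weight on the first: a context-dependent blend rather than a hard lookup.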

Multi-head attention

flowchart TB
    A["Input representations"] --> B1["Head 1"]
    A --> B2["Head 2"]
    A --> B3["Head 3"]
    A --> B4["Head 4"]
    B1 --> C["Concatenate"]
    B2 --> C
    B3 --> C
    B4 --> C
    C --> D["Project to new representation"]

Different heads can learn different relational patterns, though their roles are not always cleanly interpretable.
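The split/attend/concatenate pattern in the diagram can be sketched as follows. Note one deliberate simplification: real multi-head attention applies learned projection matrices per head, whereas this toy version just slices the input vectors directly.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attend(Q, K, V):
    # Scaled dot-product attention over lists of vectors.
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def multi_head(X, n_heads):
    # Toy multi-head self-attention: slice each vector into n_heads equal
    # pieces, attend within each slice, then concatenate the results.
    # (Real layers use learned per-head projections instead of raw slices.)
    d = len(X[0]) // n_heads
    heads = []
    for h in range(n_heads):
        sub = [x[h * d:(h + 1) * d] for x in X]
        heads.append(attend(sub, sub, sub))
    # Concatenate head outputs position by position.
    return [sum((heads[h][i] for h in range(n_heads)), []) for i in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head(X, n_heads=2)
print(len(out), len(out[0]))  # 2 positions, 4 dimensions preserved
```

Each head sees a different subspace of the representation, which is one informal way to understand why different heads can end up tracking different relationships.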

From hidden states to next-token prediction

flowchart LR
    A["Current hidden state"] --> B["Vocabulary scores (logits)"]
    B --> C["Probabilities"]
    C --> D["Select / sample next token"]
    D --> E["Append token"]
    E --> F["Repeat"]

This loop is autoregressive generation.
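The loop itself can be sketched with a stand-in scoring function. The hypothetical `next_probs` table below plays the role of the whole transformer forward pass; everything else mirrors the diagram: score, select, append, repeat.

```python
# Autoregressive generation with a stand-in scoring function; a real LLM
# would replace next_probs with a full transformer forward pass.
def next_probs(context):
    # Hypothetical fixed lookup table standing in for the model.
    table = {"the": {"cat": 0.6, "mat": 0.4},
             "cat": {"sat": 1.0},
             "sat": {"on": 1.0},
             "on":  {"the": 1.0}}
    return table.get(context[-1], {"<eos>": 1.0})

def generate(context, max_tokens=4):
    for _ in range(max_tokens):
        probs = next_probs(context)
        token = max(probs, key=probs.get)   # greedy: pick the most likely token
        if token == "<eos>":
            break
        context = context + [token]         # append and feed back in
    return context

print(generate(["the"]))  # ['the', 'cat', 'sat', 'on', 'the']
```

Greedy selection is only one decoding strategy; deployed systems often sample from the distribution (with temperature, top-k, or nucleus sampling) instead of always taking the argmax.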

How LLMs are trained

Pretraining usually involves:

  1. collect massive text data
  2. tokenize it
  3. predict held-out next tokens
  4. update parameters to reduce error
  5. repeat many times
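Step 4, "update parameters to reduce error," optimizes a concrete quantity: the average negative log-likelihood (cross-entropy) of the true next token under the model's predicted distribution. A minimal sketch of that loss, with made-up predictions:

```python
import math

# Sketch of the pretraining objective: average negative log-likelihood
# of the true next token under the model's predicted distribution.
def cross_entropy(predictions, targets):
    # predictions: list of {token: prob} dicts, one per position
    # targets: the true next token at each position
    losses = [-math.log(p[t]) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

preds = [{"cat": 0.7, "dog": 0.3}, {"sat": 0.9, "ran": 0.1}]
print(cross_entropy(preds, ["cat", "sat"]))  # lower is better; 0 means certainty on every target
```

Pretraining repeats this measurement over enormous corpora, nudging billions of parameters so the loss falls, which is how the regularities listed under "What is learned?" get absorbed.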

What is learned?

  • grammar
  • discourse patterns
  • style regularities
  • broad factual patterns in the data
  • reusable internal features

Scaling helps—but it is not magic

(Figure: capability rising with compute, data, and parameters.)
Conceptual only: more scale often improves performance, but not without cost.

Trade-offs include compute cost, energy use, latency, and harder safety evaluation.

Fine-tuning, instruction tuning, and alignment

flowchart LR
    A["Pretrained base model"] --> B["Supervised fine-tuning"]
    B --> C["Preference / feedback tuning"]
    C --> D["Aligned assistant behavior"]

Post-training aims to make the system:

  • follow instructions
  • be more useful in conversation
  • reduce unsafe or low-quality outputs
  • better match human preferences

How a system like ChatGPT differs from a raw base model

Base model
predicts next tokens

Post-training
instruction following, preference shaping, safety

System layer
prompting, tools, formatting, policies

A deployed assistant is usually more than just the pretrained model.

What “understanding” means here

Strong appearance of understanding

  • explanation
  • summarization
  • translation
  • coding help
  • question answering

Reasons for caution

  • no guaranteed grounding
  • confidence ≠ truth
  • behavior can be brittle
  • “understanding” remains debated

Why LLMs hallucinate and fail

Hallucination
plausible but false output

Bias / distortion
training data patterns reappear

Reasoning brittleness
multi-step tasks can break

Context misses
important prompt detail is ignored

Knowledge limits
post-training events may be unknown

Overconfidence
style can sound firmer than evidence

A compact mental model

flowchart LR
    A["Text"] --> B["Tokens"]
    B --> C["Embeddings"]
    C --> D["Attention + transformer layers"]
    D --> E["Predicted next token"]
    E --> F["Repeat"]
    D --> G["Post-training / alignment"]
    G --> H["Assistant behavior"]

Discussion questions

  1. In what sense are embeddings a geometry of meaning?
  2. Why did transformers scale better than RNNs?
  3. Does next-token prediction produce reasoning, or simulate it?
  4. Why can a model sound confident while being wrong?
  5. Which matters more in practice: pretraining scale or post-training alignment?

Takeaways

  • Tokenization breaks text into manageable units.
  • Embeddings map those units into vector space.
  • Attention lets tokens influence one another contextually.
  • Transformers scale this idea very effectively.
  • Training + post-training turn a predictor into a usable assistant.
  • Limits remain fundamental, not incidental.
