flowchart LR
A["The"] --> B["cat"] --> C["sat"] --> D["on"] --> E["the"] --> F{"next token?"}
F --> G["mat"]
F --> H["floor"]
F --> I["chair"]
Intelligence, AI, and LLMs
Outline
- Intelligence.
- The Basics of LLMs.
- Embeddings.
- Attention is all you need.
- The Prediction.
- The Context.
Summary: Intelligence
- Definitions of human intelligence and artificial intelligence often differ across psychology and computer science.
- The paper argues that this mismatch creates confusion about what current AI systems actually demonstrate.
- It proposes a shared conceptual framework so both fields can discuss intelligence with more precision.
Core definitions proposed
- Human intelligence: maximal capacity to complete novel goals successfully through perceptual-cognitive processes.
- Artificial intelligence: maximal capacity to complete novel goals successfully through computational processes.
- In both cases, intelligence is framed as a capacity, not just a record of past performance.
- The authors also argue intelligence is better understood as multidimensional, not a single narrow skill.
Intelligence is not the same as achievement
- The paper draws a sharp line between:
  - Intelligence = flexible capacity for success on novel goals
  - Achievement / expertise = strong performance built through training on specific tasks or domains
- Many current AI systems may look impressive because they show artificial achievement rather than genuine artificial intelligence.
- This distinction matters because benchmark success alone can overstate what systems actually generalize to.
Why “AI metrics” is needed
- The authors argue AI evaluation should learn from psychometrics.
- They call for an AI metrics discipline focused on:
  - reliability (consistent measurement)
  - validity (measuring the intended construct)
  - standardized procedures for comparing systems fairly
- Without stronger measurement practice, claims about intelligence or AGI remain difficult to justify.
Main takeaway for AGI debates
- The paper suggests AGI should be understood analogously to human general intelligence:
  - not merely broad competence,
  - but the shared variance across many system performances.
- Bottom line: current evidence supports artificial achievement/expertise more strongly than artificial intelligence.
- The path forward is interdisciplinary collaboration plus better definitions and measurement.
A Brief History
1. Origins: early neural ideas and the birth of AI
1940s–1950s
- In 1943, Warren McCulloch and Walter Pitts published a simplified mathematical model of a neuron and showed how networks of such units could implement logical operations.
- In 1956, the Dartmouth Summer Research Project helped define artificial intelligence as a field and popularized the term itself.
- Early AI combined several ambitions:
  - symbolic reasoning
  - machine learning
  - search and planning
  - neural models inspired by brains
Sources: McCulloch & Pitts (1943); Dartmouth AI history pages.
2. Perceptrons, optimism, and the first setbacks
Late 1950s–1970s
- Frank Rosenblatt’s perceptron (1958) made neural learning seem highly promising.
- Perceptrons could learn linear decision boundaries from examples.
- But enthusiasm outpaced capability:
  - early systems were limited
  - compute was weak
  - data were scarce
- Critiques of single-layer perceptrons, especially their inability to solve some nonlinearly separable problems, contributed to a slowdown in neural-network enthusiasm.
Why this mattered
- AI did not progress in a straight line.
- Periods of excitement were followed by disappointment.
- This helps explain later AI winters and why symbolic AI often overshadowed neural approaches for a time.
Sources: Rosenblatt (1958); historical reviews of neural networks and AI.
3. Backpropagation, expert systems, and statistical learning
1980s–1990s
- In the 1980s, backpropagation became the key training method for multi-layer neural networks.
- The famous 1986 Rumelhart, Hinton, and Williams paper showed how internal representations could be learned across layers.
- At the same time, AI also advanced through:
  - expert systems
  - probabilistic models
  - statistical pattern recognition
- In the 1990s, neural nets remained important, but other methods often dominated practical machine learning.
Sources: Rumelhart, Hinton & Williams (1986); LeCun, Bengio & Hinton (2015).
4. Deep learning’s resurgence
2000s–2010s
- Three things changed the field:
  1. much more digital data
  2. far more compute, especially GPUs
  3. better training methods and architectures
- Neural networks began to dominate difficult perception tasks such as:
  - speech recognition
  - image classification
  - machine translation
- By the mid-2010s, deep learning had become central to AI research and industry.
Key idea
Deep learning systems learn layered representations automatically rather than relying only on hand-crafted features.

Consequence
AI moved from brittle rule systems toward large-scale representation learning from data.
Sources: LeCun, Bengio & Hinton (2015); broad historical overviews of modern AI.
5. Transformers, foundation models, and the LLM era
Late 2010s–2020s
- In 2017, the transformer architecture showed that attention could replace recurrence for many sequence tasks.
- This architecture scaled extremely well with data and compute.
- Large pretrained models then became capable of:
  - text generation
  - summarization
  - translation
  - coding assistance
  - question answering
- Today’s LLMs sit at the intersection of:
  - neural-network history
  - large-scale optimization
  - language modeling
  - post-training alignment
Bottom line: modern LLMs are not a sudden break from history; they are the latest stage in a long sequence of ideas about representation, learning, and computation.
Sources: Vaswani et al. (2017); Google ML Crash Course LLM materials; Ouyang et al. (2022).
The LLM Defined and Explained
This Specific Form of Intelligence
- LLMs now power chat, search, summarization, coding help, and tutoring.
- Three ideas explain much of their behavior:
- tokens
- embeddings
- transformers
- Modern systems add a fourth idea too:
- alignment / post-training
Roadmap
- What language models do
- Tokens and embeddings
- Context and attention
- Transformer architecture
- Training and alignment
- Strengths, limits, and interpretation
What is a language model?
A language model estimates the probability of the next token from earlier tokens:
\[ P(t_k \mid t_1, t_2, \ldots, t_{k-1}) \]
Core idea: generation is repeated next-token prediction.
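
To make this concrete, here is a minimal sketch of the generation loop. The vocabulary, the `next_token_probs` helper, and its hand-written lookup table are all made up for illustration; a real LLM computes these probabilities with a trained neural network.

```python
# Toy vocabulary and a hand-written stand-in for the model.
# A real LLM computes next-token probabilities with a neural network.
VOCAB = ["The", "cat", "sat", "on", "the", "mat", ".", "<eos>"]

def next_token_probs(context):
    """Return one probability per VOCAB entry, given the earlier tokens."""
    table = {"The": "cat", "cat": "sat", "sat": "on",
             "on": "the", "the": "mat", "mat": ".", ".": "<eos>"}
    likely = table.get(context[-1], "<eos>")
    return [0.9 if t == likely else 0.1 / (len(VOCAB) - 1) for t in VOCAB]

def generate(prompt, max_new_tokens=10):
    """Generation is just repeated next-token prediction."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        next_tok = VOCAB[probs.index(max(probs))]   # greedy: take the argmax
        if next_tok == "<eos>":
            break
        tokens.append(next_tok)
    return tokens

print(generate(["The"]))   # ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```

Sampling from `probs` instead of always taking the argmax is what lets real systems produce varied continuations such as mat, floor, or chair.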
Tokens: the model’s working units
- LLMs usually do not read whole words directly.
- They read tokens, such as:
- full words
- subwords
- punctuation
- number pieces
- Tokenization helps with rare words and open vocabularies.
Example tokenizations
| Text | Possible tokens |
|---|---|
| unbelievable | un, believ, able |
| 2026 | 20, 26 |
| can’t | can, 't |
Different tokenizers split differently. The point is to turn raw text into reusable units.
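
As a sketch of how such splits can arise, here is a toy greedy longest-match subword tokenizer. The vocabulary is hand-made so that it reproduces the table above; real tokenizers (BPE, WordPiece, SentencePiece) learn their vocabularies from data, so actual splits differ by model.

```python
# A toy greedy longest-match subword tokenizer.
# The vocabulary is hand-made for illustration; real tokenizers
# learn theirs from large corpora.
SUBWORDS = {"un", "believ", "able", "can", "'t", "20", "26"}

def tokenize(word, vocab=SUBWORDS):
    """Split one word into the longest matching subwords, left to right."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining prefix first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No subword matches: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("2026"))          # ['20', '26']
print(tokenize("can't"))         # ['can', "'t"]
```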
From text to vectors
flowchart LR
A["Raw text"] --> B["Tokenizer"]
B --> C["Token IDs"]
C --> D["Embedding lookup"]
D --> E["Dense vectors"]
E --> F["Transformer layers"]
F --> G["Next-token probabilities"]
Embedding: a learned vector representation for a token or other object.
What is an embedding?
An embedding places items in a high-dimensional space so that related items often end up near one another.
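
A tiny numeric sketch of that geometry, using a made-up embedding table (three tokens, three dimensions); real models learn tables with tens of thousands of rows and hundreds or thousands of dimensions.

```python
import numpy as np

# Toy embedding table: one row per token ID. The values are made up
# for illustration; a trained model learns them from data.
embeddings = np.array([
    [0.9, 0.1, 0.0],   # id 0: "cat"
    [0.8, 0.2, 0.1],   # id 1: "dog"
    [0.0, 0.1, 0.9],   # id 2: "loan"
])

def cosine(u, v):
    """Cosine similarity: near 1.0 = same direction, near 0.0 = unrelated."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cat, dog, loan = embeddings[0], embeddings[1], embeddings[2]
print(cosine(cat, dog))    # high: "cat" and "dog" point the same way
print(cosine(cat, loan))   # low: "cat" and "loan" point apart
```

Cosine similarity is one common notion of geometric closeness; Euclidean distance is another.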
Why useful:
- similarity becomes geometric
- clusters can emerge
- models can generalize beyond exact memorization
Geometric intuition—and caution
Good intuition
Nearby points often mean similar usage or context.
But not perfect
Distance is learned from data, not hand-written meaning.
Also important
Embeddings can encode bias, noise, and spurious correlations.
Static vs contextual embeddings
Static embedding
- one vector per word type
- the word bank gets the same vector everywhere

Contextual embedding
- the vector depends on surrounding text
- the word bank gets a different vector in each sentence context
flowchart TB
A["river bank was muddy"] --> B["bank"]
C["bank approved the loan"] --> D["bank"]
B --> E["contextual vector A"]
D --> F["contextual vector B"]
This shift toward context-sensitive representations is one reason transformer systems became so effective.
Why earlier sequence models struggled
Earlier approaches included:
- n-gram models
- RNNs
- LSTMs / GRUs
Common problems:
- weak long-range memory
- harder parallel training
- unstable signals over long sequences
The transformer breakthrough
The 2017 transformer paper replaced recurrence with attention-heavy computation.
flowchart LR
A["Input sequence"] --> B["Attention"]
B --> C["Feed-forward"]
C --> D["Repeated layers"]
D --> E["Output probabilities"]
Why this mattered:
- stronger long-range interactions
- better parallelization
- excellent scaling with data and compute
A simple transformer picture
flowchart TB
A["Token embeddings + position"] --> B["Self-attention"]
B --> C["Add & normalize"]
C --> D["Feed-forward network"]
D --> E["Add & normalize"]
E --> F["Repeat many times"]
F --> G["Vocabulary scores"]
For text generation, many modern LLMs use a decoder-style transformer.
Why position must be added
Embeddings alone do not encode order.
Sentence A
Dog bites man.
Sentence B
Man bites dog.
Same words, different order, different meaning.
So transformers add positional information:
- learned positional vectors
- sinusoidal encodings (sketched below)
- relative position methods in newer variants
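
A minimal numpy version of the sinusoidal option, following the formula from Vaswani et al. (2017); the sequence length and model width below are arbitrary.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal position encodings, as in Vaswani et al. (2017).
    Each position gets a distinct pattern of sines and cosines."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))     # (seq_len, d_model/2)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)   # even dimensions
    enc[:, 1::2] = np.cos(angles)   # odd dimensions
    return enc

pos = sinusoidal_positions(seq_len=3, d_model=8)
print(pos.shape)   # (3, 8): one encoding vector per position
```

These vectors are added to the token embeddings, so identical tokens at different positions enter the network with different inputs, which is how "Dog bites man" and "Man bites dog" stay distinct.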
Self-attention: the core mechanism
Self-attention lets each token look at other tokens and weight their relevance.
graph LR
A["The"] --- D["animal"]
B["tired"] --- D
C["dog"] --- D
E["slept"] --- D
When building the representation for one token, the model can ask:
- Which earlier words matter most here?
- Which words resolve ambiguity?
- Which words define topic or syntax?
Query, key, and value intuition
- Query: what this token is looking for
- Key: what this token offers as a match
- Value: the information it can contribute
Attention as weighted influence
\[ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
You do not need every algebraic detail to get the idea:
- compare token features
- compute relevance weights
- blend information from multiple positions
The result is a new representation that depends on context.
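
The formula translates almost line for line into numpy. The random Q, K, V matrices below stand in for learned projections of the token representations; a decoder-style LLM would also mask out future positions, which this sketch omits.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compare token features
    weights = softmax(scores, axis=-1)  # relevance weights, rows sum to 1
    return weights @ V                  # blend information across positions

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                     # 4 tokens, 8-dimensional features
Q = rng.normal(size=(seq_len, d_k))     # queries: what each token looks for
K = rng.normal(size=(seq_len, d_k))     # keys: what each token offers
V = rng.normal(size=(seq_len, d_k))     # values: what each token contributes
print(attention(Q, K, V).shape)         # (4, 8): one contextual vector per token
```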
Multi-head attention
flowchart TB
A["Input representations"] --> B1["Head 1"]
A --> B2["Head 2"]
A --> B3["Head 3"]
A --> B4["Head 4"]
B1 --> C["Concatenate"]
B2 --> C
B3 --> C
B4 --> C
C --> D["Project to new representation"]
Different heads can learn different relational patterns, though their roles are not always cleanly interpretable.
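
A rough sketch of the multi-head wiring: split the feature dimension across heads, attend within each head, concatenate, and project. The random matrices stand in for learned weights, and real implementations batch all heads into a single tensor operation rather than looping.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Attend separately in each head, then concatenate and project.
    All weight matrices are random stand-ins for learned parameters."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq = rng.normal(size=(d_model, d_head))   # per-head projections
        Wk = rng.normal(size=(d_model, d_head))
        Wv = rng.normal(size=(d_model, d_head))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)                 # (seq_len, d_head)
    concat = np.concatenate(heads, axis=-1)       # (seq_len, d_model)
    Wo = rng.normal(size=(d_model, d_model))      # output projection
    return concat @ Wo

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                      # 5 tokens, 16 features
print(multi_head_attention(X, num_heads=4, rng=rng).shape)   # (5, 16)
```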
How LLMs are trained
Pretraining usually involves:
- collect massive text data
- tokenize it
- predict held-out next tokens
- update parameters to reduce error
- repeat many times (a toy sketch follows this list)
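
Here is a toy version of that loop, assuming the "model" is nothing more than a table of next-token logits indexed by the previous token. Real pretraining does the same thing (cross-entropy on next tokens, gradient-based parameter updates) with a transformer and vastly more data.

```python
import numpy as np

# Toy corpus, already tokenized into integer IDs (vocabulary size 4).
corpus = np.array([0, 1, 2, 3, 1, 2, 3, 1])
vocab_size = 4

rng = np.random.default_rng(0)
# The "model": a table of logits, one row per previous token.
# A real LLM replaces this table with a deep transformer.
logits_table = rng.normal(size=(vocab_size, vocab_size)) * 0.01
lr = 0.5

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for step in range(200):
    prev, nxt = corpus[:-1], corpus[1:]          # predict each next token
    probs = softmax(logits_table[prev])          # (n, vocab_size)
    loss = -np.log(probs[np.arange(len(nxt)), nxt]).mean()
    # Gradient of cross-entropy w.r.t. logits: probs - one_hot(target).
    grad = probs.copy()
    grad[np.arange(len(nxt)), nxt] -= 1.0
    np.add.at(logits_table, prev, -lr * grad / len(nxt))   # update parameters
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss:.3f}")          # loss decreases
```

Running it shows the loss falling as the logits table absorbs the corpus statistics, which is the whole point of pretraining in miniature.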
What is learned?
- grammar
- discourse patterns
- style regularities
- broad factual patterns in the data
- reusable internal features
Scaling helps—but it is not magic
Trade-offs include compute cost, energy use, latency, and harder safety evaluation.
Fine-tuning, instruction tuning, and alignment
flowchart LR
A["Pretrained base model"] --> B["Supervised fine-tuning"]
B --> C["Preference / feedback tuning"]
C --> D["Aligned assistant behavior"]
Post-training aims to make the system:
- follow instructions
- be more useful in conversation
- reduce unsafe or low-quality outputs
- better match human preferences
How a system like ChatGPT differs from a raw base model
Base model
predicts next tokens
Post-training
instruction following, preference shaping, safety
System layer
prompting, tools, formatting, policies
A deployed assistant is usually more than just the pretrained model.
What “understanding” means here
Strong appearance of understanding
- explanation
- summarization
- translation
- coding help
- question answering

Reasons for caution
- no guaranteed grounding
- confidence ≠ truth
- behavior can be brittle
- “understanding” remains debated
Why LLMs hallucinate and fail
Hallucination
plausible but false output
Bias / distortion
training data patterns reappear
Reasoning brittleness
multi-step tasks can break
Context misses
important prompt detail is ignored
Knowledge limits
events after the training cutoff may be unknown
Overconfidence
style can sound firmer than evidence
A compact mental model
flowchart LR
A["Text"] --> B["Tokens"]
B --> C["Embeddings"]
C --> D["Attention + transformer layers"]
D --> E["Predicted next token"]
E --> F["Repeat"]
D --> G["Post-training / alignment"]
G --> H["Assistant behavior"]
Discussion questions
- In what sense are embeddings a geometry of meaning?
- Why did transformers scale better than RNNs?
- Does next-token prediction produce reasoning, or simulate it?
- Why can a model sound confident while being wrong?
- Which matters more in practice: pretraining scale or post-training alignment?
Takeaways
- Tokenization breaks text into manageable units.
- Embeddings map those units into vector space.
- Attention lets tokens influence one another contextually.
- Transformers scale this idea very effectively.
- Training + post-training turn a predictor into a usable assistant.
- Limits remain fundamental, not incidental.
References
- Google. Machine Learning Crash Course: Embeddings module. https://developers.google.com/machine-learning/crash-course/embeddings
- Google. Machine Learning Crash Course: Introduction to Large Language Models. https://developers.google.com/machine-learning/crash-course/llm
- Google. LLMs: What’s a large language model? https://developers.google.com/machine-learning/crash-course/llm/transformers
- Vaswani, Ashish, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems. https://arxiv.org/abs/1706.03762
- Ouyang, Long, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” NeurIPS. https://arxiv.org/abs/2203.02155