2026-04-02
A language model estimates the probability of the next token from earlier tokens:
P(t_k \mid t_1, t_2, \ldots, t_{k-1})
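Applying this conditional repeatedly gives the probability of a whole sequence via the chain rule, which is what makes token-by-token generation possible:

P(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} P(t_k \mid t_1, \ldots, t_{k-1})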
\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
You do not need every algebraic detail to get the idea:
The result is a new representation that depends on context.
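The formula above can be sketched in a few lines of NumPy. This is a toy illustration with random inputs, not a real model; the dimensions and variable names are assumptions for the example.

```python
# Minimal sketch of scaled dot-product attention (toy data, NumPy).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V                   # context-dependent mixture of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, key dimension d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # one new 8-dimensional representation per token
```

Each output row is a weighted average of the value vectors, with weights determined by how well that token's query matches every key.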
flowchart TB
A["Input representations"] --> B1["Head 1"]
A --> B2["Head 2"]
A --> B3["Head 3"]
A --> B4["Head 4"]
B1 --> C["Concatenate"]
B2 --> C
B3 --> C
B4 --> C
C --> D["Project to new representation"]
Different heads can learn different relational patterns, though their roles are not always cleanly interpretable.
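The split-attend-concatenate-project pattern in the diagram can be sketched as follows. The head count, widths, and random weight matrices are illustrative stand-ins, not values from any real model.

```python
# Toy multi-head attention: split the width across heads, attend per head,
# concatenate the results, then project ("Project to new representation").
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=4):
    n, d = X.shape
    d_h = d // n_heads                       # per-head width
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_h, (h + 1) * d_h)   # this head's slice of the width
        Q, K, V = X @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]
        A = softmax(Q @ K.T / np.sqrt(d_h))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo  # concatenate, then project

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 16))                  # 5 tokens, model width 16
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)  # (5, 16)
```

Because each head sees only a slice of the width, the four heads together cost about the same as one full-width attention, while attending to four potentially different patterns.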
flowchart LR
A["Current hidden state"] --> B["Vocabulary scores (logits)"]
B --> C["Probabilities"]
C --> D["Select / sample next token"]
D --> E["Append token"]
E --> F["Repeat"]
This loop is autoregressive generation.
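The loop in the diagram can be sketched with a toy "model". Here a hypothetical bigram score table stands in for the real network; the vocabulary and greedy selection are illustrative assumptions.

```python
# Toy autoregressive loop: score the vocabulary, convert scores to
# probabilities, select a token, append it, and repeat.
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "."]
rng = np.random.default_rng(2)
bigram_logits = rng.normal(size=(len(vocab), len(vocab)))  # stand-in "model"

def generate(start="<s>", max_tokens=5):
    tokens = [vocab.index(start)]
    for _ in range(max_tokens):
        logits = bigram_logits[tokens[-1]]             # vocabulary scores
        probs = np.exp(logits) / np.exp(logits).sum()  # probabilities
        tokens.append(int(probs.argmax()))             # greedy selection
    return [vocab[i] for i in tokens]

print(generate())
```

Swapping `argmax` for sampling from `probs` turns greedy decoding into stochastic generation, which is where temperature (discussed later) comes in.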
flowchart LR
A["Pretrained base model"] --> B["Supervised fine-tuning"]
B --> C["Preference / feedback tuning"]
C --> D["Aligned assistant behavior"]
Post-training aims to shape the base model into an aligned, assistant-like system.
Strong appearance of understanding
* explanation
* summarization
* translation
* coding help
* question answering

Reasons for caution
* no guaranteed grounding
* confidence ≠ truth
* behavior can be brittle
* “understanding” remains debated
flowchart LR
A["Text"] --> B["Tokens"]
B --> C["Embeddings"]
C --> D["Attention + transformer layers"]
D --> E["Predicted next token"]
E --> F["Repeat"]
D --> G["Post-training / alignment"]
G --> H["Assistant behavior"]
The Structural “Skeleton”
Before training, engineers define the model’s capacity and complexity. These values are fixed once training starts.
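As a concrete sketch, these fixed structural choices are often recorded in a configuration object. The numbers below are hypothetical, chosen only to illustrate the kinds of values involved.

```python
# Hypothetical structural hyperparameters; the values are illustrative,
# not those of any particular model.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen mirrors "fixed once training starts"
class ModelConfig:
    n_layers: int = 24        # depth: number of transformer layers
    d_model: int = 2048       # width: size of each token representation
    n_heads: int = 16         # attention heads per layer
    vocab_size: int = 32_000  # number of distinct tokens

cfg = ModelConfig()
print(cfg)
```

Note that `d_model` must divide evenly by `n_heads`, since each head works on a slice of the model width.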
The “Learning Recipe”
These settings control the optimization process as the model learns from massive datasets.
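A hypothetical "learning recipe" might look like the fragment below; every value is an illustrative placeholder, not a recommendation for any specific model.

```python
# Hypothetical optimization settings; values are illustrative only.
train_config = {
    "learning_rate": 3e-4,   # step size for weight updates
    "batch_size": 1024,      # sequences processed per optimization step
    "warmup_steps": 2000,    # ramp the learning rate up before decaying it
    "weight_decay": 0.1,     # regularization strength
    "total_steps": 100_000,  # length of the training run
}
print(train_config["learning_rate"])
```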
Selection Strategies
Researchers use various methods to find the most efficient hyperparameter combinations.
| Method | Description |
|---|---|
| Grid Search | Testing every possible combination (exhaustive but slow). |
| Random Search | Randomly sampling combinations; surprisingly efficient. |
| Bayesian Optimization | Mathematical modeling to predict optimal settings based on past trials. |
| Scaling Laws | Using formulas (like Chinchilla) to predict performance before scaling up. |
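Random search, the second row of the table, is simple enough to sketch directly. The objective function here is a made-up stand-in for validation loss; a real search would train and evaluate a model at each trial.

```python
# Minimal random-search sketch over two hyperparameters.
import random

def fake_validation_loss(lr, batch_size):
    # Hypothetical bowl-shaped objective with its best point near lr = 3e-4.
    return (lr - 3e-4) ** 2 * 1e6 + abs(batch_size - 512) / 512

random.seed(0)
best = None
for _ in range(20):
    lr = 10 ** random.uniform(-5, -2)          # sample lr on a log scale
    bs = random.choice([128, 256, 512, 1024])  # sample a batch size
    loss = fake_validation_loss(lr, bs)
    if best is None or loss < best[0]:
        best = (loss, lr, bs)

print(best)  # (best loss found, its learning rate, its batch size)
```

Sampling the learning rate on a log scale is the standard trick: plausible values span several orders of magnitude, so uniform sampling in linear space would waste most trials.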
The “Personality” Settings
These values are adjustable after training to control how the model generates responses.
Temperature (\theta) is a hyperparameter used to control the randomness and confidence of a language model’s output.
Temperature modifies the standard Softmax function applied to the model’s raw output scores (logits).
\sigma(z_i) = \frac{e^{z_i/\theta}}{\sum_{j=1}^{N} e^{z_j/\theta}}
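A minimal sketch of this temperature-scaled softmax, using toy logits (the specific scores are illustrative):

```python
# Temperature-scaled softmax: divide logits by theta before normalizing.
import numpy as np

def softmax_with_temperature(logits, theta=1.0):
    z = np.asarray(logits, dtype=float) / theta  # divide by temperature
    z -= z.max()                                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]  # toy vocabulary scores
for theta in (0.2, 1.0, 2.0):
    print(theta, softmax_with_temperature(logits, theta).round(3))
```

Lower θ sharpens the distribution toward the top-scoring token (approaching greedy decoding as θ → 0), while higher θ flattens it toward uniform, making low-scoring tokens more likely to be sampled.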
Choosing the right temperature depends entirely on the specific task:
| Task Type | Temp Setting | Benefit |
|---|---|---|
| Fact-based Q&A | Low (< 0.3) | Increases accuracy and reduces hallucinations. |
| Code Generation | Low (0.0 - 0.2) | Ensures syntactic correctness and stability. |
| Creative Writing | High (0.7 - 1.2) | Produces diverse, varied, and interesting prose. |
| Brainstorming | High (> 1.0) | Explores unconventional or “edge-case” ideas. |
From an LLM architecture perspective:
For many things, isn’t temperature 0 ideal?
* When?
* When not?
In your own case, what is the model depth, width, how many attention heads, and what vocabulary size?
The “Flash” Efficiency Factor
Instead of a single “dense” block, I use a Sparse Mixture-of-Experts (MoE) architecture.
Breadth of Understanding
My vocabulary isn’t just a list of words; it’s a bridge between different types of data.
Industry Estimates for Flash-Class Models
While specific counts are proprietary, “Flash-class” architectures typically operate within the ranges expected of smaller, efficiency-focused models.
Summary of Gemini 3 Flash Specs
| Hyperparameter | Specification |
|---|---|
| Model Type | Sparse Mixture-of-Experts (MoE) |
| Vocabulary Size | 256,000 tokens |
| Context Window | Up to 1,000,000 tokens |
| Distillation | Optimized from larger Gemini 3 variants |
Key Takeaway: My performance is the result of distillation, where I am trained to mimic the logic of larger models while maintaining a lean “active” parameter count.
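Distillation is commonly implemented by training the student to match the teacher's softened output distribution. The sketch below shows the standard KL-divergence formulation with toy logits; it is a generic illustration of the technique, not the actual training recipe of any production model.

```python
# Generic knowledge-distillation loss: KL divergence between the teacher's
# and student's temperature-softened output distributions (toy logits).
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """T > 1 softens both distributions, exposing the teacher's
    relative preferences among non-top tokens."""
    p = softmax(teacher_logits, T)  # teacher targets
    q = softmax(student_logits, T)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [4.0, 1.0, 0.2]
print(distillation_loss([4.1, 0.9, 0.3], teacher))  # student close to teacher
print(distillation_loss([0.0, 3.0, 1.0], teacher))  # student far from teacher
```

The loss is near zero when the student's distribution matches the teacher's and grows as they diverge, so minimizing it pushes the smaller model to mimic the larger one's "logic" over the vocabulary.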

BUS 1301-SP26