From raw text to next-token prediction, through a full transformer block: LN · attention · W_o · residual · MLP · LM head. A single forward pass, every step exposed.
Language models can't see letters or words — only numbers. So the first job: turn text into ids via a vocabulary (stoi).
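A minimal sketch of this step, assuming whitespace tokenization and the six-word vocabulary that appears in the tables of this walkthrough. The helper name `encode` and the input sentence "the cat sat on" are illustrative assumptions, not something the text spells out.

```python
# Vocabulary: string -> id (stoi) and its inverse, id -> string (itos).
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
stoi = {word: i for i, word in enumerate(vocab)}
itos = {i: word for word, i in stoi.items()}

def encode(text):
    """Split on whitespace and map each word to its id (hypothetical helper)."""
    return [stoi[word] for word in text.split()]

ids = encode("the cat sat on")
print(ids)  # [0, 1, 2, 3]
```

A real tokenizer usually works on sub-word pieces rather than whole words; the dictionary lookup is the same idea at a smaller scale.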
tok_emb — a table holding a 4-dim vector for every vocab word. Each token id "reads" its own row. Meaning begins here: words get positioned in space.
| id | d0 | d1 | d2 | d3 |
|---|---|---|---|---|
| 0 · the | 0.2 | −0.1 | 0.7 | 0.2 |
| 1 · cat | 0.1 | 0.4 | −0.3 | 0.6 |
| 2 · sat | 0.5 | 0.2 | 0.1 | −0.3 |
| 3 · on | −0.2 | 0.2 | 0.4 | 0.1 |
| 4 · mat | 0.4 | 0.1 | 0.3 | 0.5 |
| 5 · dog | 0.2 | 0.3 | −0.2 | 0.4 |
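The "reads its own row" step is plain array indexing. A small sketch with the table above copied into NumPy; the input ids `[0, 1, 2, 3]` ("the cat sat on") are an assumed example sentence.

```python
import numpy as np

# tok_emb: one 4-dim row per vocab id (values copied from the table above).
tok_emb = np.array([
    [ 0.2, -0.1,  0.7,  0.2],  # 0: the
    [ 0.1,  0.4, -0.3,  0.6],  # 1: cat
    [ 0.5,  0.2,  0.1, -0.3],  # 2: sat
    [-0.2,  0.2,  0.4,  0.1],  # 3: on
    [ 0.4,  0.1,  0.3,  0.5],  # 4: mat
    [ 0.2,  0.3, -0.2,  0.4],  # 5: dog
])

ids = np.array([0, 1, 2, 3])  # "the cat sat on" (assumed input)
tok = tok_emb[ids]            # integer indexing selects one row per token
print(tok.shape)              # (4, 4): four tokens, four dimensions each
print(tok[3])                 # the row for "on"
```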
On its own, attention treats tokens like a "bag of words" — order is lost. The fix: add a separate vector per position (pos_emb). Now the model knows who came first and who came last.
| pos | d0 | d1 | d2 | d3 |
|---|---|---|---|---|
| 0 | 0.1 | 0.0 | 0.1 | 0.0 |
| 1 | 0.0 | 0.1 | 0.0 | 0.1 |
| 2 | 0.1 | 0.0 | 0.0 | −0.1 |
| 3 | 0.0 | 0.1 | 0.1 | 0.0 |
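Adding the two tables is an element-wise sum, one position row per token. A sketch assuming the input "the cat sat on" (ids 0 to 3); for "on" at position 3 it reproduces the x_on vector used later in the walkthrough.

```python
import numpy as np

tok_emb = np.array([
    [ 0.2, -0.1,  0.7,  0.2],  # the
    [ 0.1,  0.4, -0.3,  0.6],  # cat
    [ 0.5,  0.2,  0.1, -0.3],  # sat
    [-0.2,  0.2,  0.4,  0.1],  # on
    [ 0.4,  0.1,  0.3,  0.5],  # mat
    [ 0.2,  0.3, -0.2,  0.4],  # dog
])
pos_emb = np.array([
    [0.1, 0.0, 0.1,  0.0],  # position 0
    [0.0, 0.1, 0.0,  0.1],  # position 1
    [0.1, 0.0, 0.0, -0.1],  # position 2
    [0.0, 0.1, 0.1,  0.0],  # position 3
])

ids = np.array([0, 1, 2, 3])       # "the cat sat on" (assumed input)
positions = np.arange(len(ids))    # 0, 1, 2, 3
x = tok_emb[ids] + pos_emb[positions]

print(np.round(x[3], 2))  # "on" at position 3: approximately [-0.2, 0.3, 0.5, 0.1]
```

The same word at a different position would get a different x, which is how order information enters the model.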
Attention is sensitive to activations that grow or shrink in scale, which makes training unstable. LayerNorm fixes this: rescale each token's vector so its mean = 0 and std ≈ 1. In the modern "pre-norm" design, LN sits at the input of every sub-module (attention, MLP).
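A minimal LayerNorm sketch. It assumes no learned scale or shift (gain 1, bias 0) and an epsilon of 1e-5; both are assumptions, since the text does not give these details. The example vector is the x_on used in the residual step of this walkthrough.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize along the last axis to mean 0 and std ~1 (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # eps guards against division by zero

x_on = np.array([-0.2, 0.3, 0.5, 0.1])
ln_on = layer_norm(x_on)
print(ln_on.mean(), ln_on.std())  # mean close to 0, std close to 1
```

Because the statistics are taken per token, each row of a (tokens × dims) matrix is normalized independently of the others.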
The heart of the transformer. Each token computes its own Q, dots it with every token's K, softmaxes the scores into attention weights, then takes the weighted sum of the V's. The result: a contextual vector per token, c_i.
[[ 0.5, 0.2, −0.3, 0.1], [−0.2, 0.4, 0.1, 0.3], [ 0.1, −0.1, 0.5, 0.2], [ 0.3, 0.2, −0.1, 0.4]]
[[ 0.4, −0.2, 0.3, 0.1], [ 0.2, 0.5, −0.1, 0.3], [−0.1, 0.3, 0.4, −0.2], [ 0.3, 0.1, 0.2, 0.5]]
[[ 0.2, 0.4, −0.1, 0.3], [ 0.5, −0.1, 0.2, 0.4], [−0.2, 0.3, 0.5, −0.1], [ 0.1, 0.2, 0.3, 0.6]]
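A sketch of single-head attention under several assumptions: the three 4 × 4 matrices above are taken to be W_q, W_k and W_v in that order, vectors are rows (x @ W), scores are divided by √d, and a causal mask stops tokens from looking ahead. None of these details is confirmed by the text, so the numbers this produces are illustrative only.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x, W_q, W_k, W_v):
    """Single-head attention over a (T, d) input; returns (context, weights)."""
    T, d = x.shape
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d)                       # (T, T) similarity scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # hide later tokens
    weights = softmax(scores)                           # each row sums to 1
    return weights @ v, weights                         # weighted sum of V rows

# Assumed order: W_q, W_k, W_v.
W_q = np.array([[ 0.5,  0.2, -0.3,  0.1], [-0.2,  0.4,  0.1,  0.3],
                [ 0.1, -0.1,  0.5,  0.2], [ 0.3,  0.2, -0.1,  0.4]])
W_k = np.array([[ 0.4, -0.2,  0.3,  0.1], [ 0.2,  0.5, -0.1,  0.3],
                [-0.1,  0.3,  0.4, -0.2], [ 0.3,  0.1,  0.2,  0.5]])
W_v = np.array([[ 0.2,  0.4, -0.1,  0.3], [ 0.5, -0.1,  0.2,  0.4],
                [-0.2,  0.3,  0.5, -0.1], [ 0.1,  0.2,  0.3,  0.6]])

# Stand-in input: tok_emb + pos_emb for "the cat sat on", normalized per row.
x = np.array([[ 0.3, -0.1,  0.8,  0.2], [ 0.1,  0.5, -0.3,  0.7],
              [ 0.6,  0.2,  0.1, -0.4], [-0.2,  0.3,  0.5,  0.1]])
x_norm = (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

c, weights = causal_self_attention(x_norm, W_q, W_k, W_v)
print(np.round(weights, 2))  # lower-triangular: row i attends to tokens 0..i
```

Row i of `weights` says how much token i draws from each earlier token, and row i of `c` is the contextual vector c_i.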
The attention output gets a final linear transform via W_o (in multi-head attention, this is where the model learns to mix across heads). Then the residual connection: the original x is added back. Without skip connections, deep networks are much harder to train.
W_o (4 × 4) = [[ 0.8, 0.1, −0.1, 0.2], [ 0.1, 0.7, 0.1, 0.1], [−0.1, 0.2, 0.8, 0.1], [ 0.2, 0.1, 0.1, 0.7]]
h1_on = x_on + attn_out_on = [ −0.20, 0.30, 0.50, 0.10 ] + [ 0.14, 0.19, 0.22, 0.24 ] = [ −0.06, 0.49, 0.72, 0.34 ]
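The projection and the skip connection fit in one line of code. The sketch below checks the residual sum with the numbers quoted above for "on", then uses a made-up context vector `c_demo` (hypothetical, not from the walkthrough) only to show what multiplying by W_o does. The row-vector convention (c @ W_o) is an assumption.

```python
import numpy as np

W_o = np.array([
    [ 0.8, 0.1, -0.1, 0.2],
    [ 0.1, 0.7,  0.1, 0.1],
    [-0.1, 0.2,  0.8, 0.1],
    [ 0.2, 0.1,  0.1, 0.7],
])

def attention_sublayer_output(x, c, W_o):
    """Project the attention context c through W_o, then add the residual x."""
    return x + c @ W_o

# Residual step with the numbers quoted above for "on".
x_on = np.array([-0.20, 0.30, 0.50, 0.10])
attn_out_on = np.array([0.14, 0.19, 0.22, 0.24])  # already projected by W_o
h1_on = x_on + attn_out_on
print(np.round(h1_on, 2))  # approximately [-0.06, 0.49, 0.72, 0.34]

# Hypothetical context vector, only to illustrate the projection.
c_demo = np.array([1.0, 0.0, 0.0, 0.0])
print(c_demo @ W_o)  # a one-hot input picks out the first row of W_o
```

Since x is added back unchanged, the attention sub-layer only has to learn a correction to x rather than a full replacement.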
Attention mixes across tokens; MLP applies a non-linear transform to each token independently — the model's "thinking capacity". The sequence: LN → MLP → Residual.
[[ 0.3, −0.1, 0.2, 0.1], [ 0.1, 0.4, −0.1, 0.2], [−0.2, 0.1, 0.3, −0.1], [ 0.1, 0.2, 0.1, 0.3]]
[[ 0.2, 0.1, −0.1, 0.3], [ 0.1, 0.4, 0.2, −0.1], [−0.1, 0.2, 0.3, 0.1], [ 0.3, −0.1, 0.1, 0.2]]
h2_on = h1_on + mlp_out_on = [ −0.06, 0.49, 0.72, 0.34 ] + [ 0.04, 0.17, 0.09, −0.04 ] = [ −0.02, 0.66, 0.81, 0.30 ]
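A sketch of the LN → MLP → Residual sequence for the "on" token. It assumes the two 4 × 4 matrices above are the first and second MLP layers in that order (named `W_1` and `W_2` here for illustration), a ReLU non-linearity, no biases, and a LayerNorm without learned gain or bias. Under these assumptions the result rounds to the mlp_out_on and h2_on values quoted above.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize to mean 0 and std ~1 (no learned gain/bias, an assumption)."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def mlp(x, W_1, W_2):
    """Two linear layers with a ReLU in between, applied to each token on its own."""
    hidden = np.maximum(0.0, x @ W_1)  # ReLU: negative values become 0
    return hidden @ W_2

# Assumed roles: first matrix = W_1, second matrix = W_2.
W_1 = np.array([[ 0.3, -0.1,  0.2,  0.1], [ 0.1,  0.4, -0.1,  0.2],
                [-0.2,  0.1,  0.3, -0.1], [ 0.1,  0.2,  0.1,  0.3]])
W_2 = np.array([[ 0.2,  0.1, -0.1,  0.3], [ 0.1,  0.4,  0.2, -0.1],
                [-0.1,  0.2,  0.3,  0.1], [ 0.3, -0.1,  0.1,  0.2]])

h1_on = np.array([-0.06, 0.49, 0.72, 0.34])

mlp_out_on = mlp(layer_norm(h1_on), W_1, W_2)  # LN -> MLP
h2_on = h1_on + mlp_out_on                     # residual

print(np.round(mlp_out_on, 2))  # approximately [0.04, 0.17, 0.09, -0.04]
print(np.round(h2_on, 2))       # approximately [-0.02, 0.66, 0.81, 0.30]
```

Production models usually make the hidden layer about four times wider than the input and often use GELU instead of ReLU; the 4 → 4 → 4 shape here just keeps the arithmetic readable.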
h2_on (4-dim) × W_lm (4 × 6) = 6 numbers: one logit (raw score) per vocab word. Each column is a word's "signature"; whichever column h2_on resembles most (largest dot product) gets the highest score.
| | the | cat | sat | on | mat | dog |
|---|---|---|---|---|---|---|
| r0 | 0.1 | 0.2 | 0.2 | 0.2 | 0.1 | 0.0 |
| r1 | 0.1 | 0.4 | 0.2 | 0.2 | 0.1 | 0.1 |
| r2 | 0.1 | 0.5 | 0.2 | 0.3 | 0.2 | 0.0 |
| r3 | 0.1 | 0.5 | 0.2 | 0.2 | 0.2 | 0.1 |
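The LM head is a single matrix product. A sketch using the W_lm table above and the h2_on vector from the previous step, with row-vector convention (h2_on @ W_lm) assumed.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog"]

# W_lm: rows r0..r3 are hidden dimensions, columns are vocab words.
W_lm = np.array([
    [0.1, 0.2, 0.2, 0.2, 0.1, 0.0],
    [0.1, 0.4, 0.2, 0.2, 0.1, 0.1],
    [0.1, 0.5, 0.2, 0.3, 0.2, 0.0],
    [0.1, 0.5, 0.2, 0.2, 0.2, 0.1],
])
h2_on = np.array([-0.02, 0.66, 0.81, 0.30])

logits = h2_on @ W_lm  # one dot product per column, so 6 scores
for word, score in zip(vocab, logits):
    print(f"{word:>3}: {score:.3f}")
# the 0.175, cat 0.815, sat 0.350, on 0.431, mat 0.286, dog 0.096
```

The "cat" column has the largest entries in the dimensions where h2_on is large, which is why it ends up with the highest logit.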
softmax exponentiates each logit, then divides by their sum — the result is a distribution that sums to 1. Then argmax: the word with the highest probability = the model's prediction.
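A short sketch of softmax followed by argmax. The logits are h2_on × W_lm computed from the tables above; subtracting the maximum before exponentiating is a standard numerical-stability step added here, and it does not change the result.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
logits = np.array([0.175, 0.815, 0.350, 0.431, 0.286, 0.096])  # h2_on @ W_lm

def softmax(z):
    """Exponentiate, then divide by the sum so the outputs add up to 1."""
    e = np.exp(z - z.max())  # shifting by the max avoids overflow
    return e / e.sum()

probs = softmax(logits)
prediction = vocab[int(np.argmax(probs))]

print(np.round(probs, 3))  # "cat" gets roughly 0.26, "mat" roughly 0.15
print(prediction)          # cat
```

Softmax preserves the ordering of the logits, so argmax over the probabilities picks the same word as argmax over the raw scores.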
The target is "mat" (natural continuation: the cat sat on mat). But the model said "cat". Loss measures how little probability the model assigned to the correct answer.
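One common way to turn "how little probability" into a number is cross-entropy: loss = −log p(target). The text does not name the loss function, so cross-entropy is an assumption here. The logits are h2_on × W_lm from the tables above, and the target is "mat".

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
logits = np.array([0.175, 0.815, 0.350, 0.431, 0.286, 0.096])  # h2_on @ W_lm
target = vocab.index("mat")

# Cross-entropy from logits: log(sum(exp(logits))) - logits[target],
# which equals -log(softmax(logits)[target]).
shift = logits.max()
log_sum_exp = shift + np.log(np.exp(logits - shift).sum())
loss = log_sum_exp - logits[target]

print(round(float(loss), 3))  # about 1.893
```

If the model had put all its probability on "mat", the loss would be 0; the less probability the correct word receives, the larger the loss grows.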