mini transformer · forward pass

A token's
journey.

From raw text to next-token prediction — through a full transformer block: LN · attention · Wo · residual · MLP · LM head. A single forward pass, every step exposed.

00
Tokenization

Break the text into pieces.

Language models can't see letters or words — only numbers. So the first job: turn text into ids via a vocabulary (stoi).

"the cat sat on"
the · ID 0
cat · ID 1
sat · ID 2
on · ID 3
stoi = { "the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5 }
vocab_size = 6 · tokens = 4
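A minimal sketch of this lookup in plain Python. It assumes whitespace splitting, which is enough for the toy sentence; real tokenizers usually work on subword pieces. The helper names encode, decode, and itos are made up for illustration.

```python
# Toy vocabulary from the walkthrough: string -> id.
stoi = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5}
# Reverse map (id -> string); itos is an illustrative name, not from the page.
itos = {i: s for s, i in stoi.items()}

def encode(text):
    """Split on whitespace and map each word to its id (KeyError if unknown)."""
    return [stoi[word] for word in text.split()]

def decode(ids):
    """Map ids back to words and join them with spaces."""
    return " ".join(itos[i] for i in ids)

ids = encode("the cat sat on")
print(ids)           # [0, 1, 2, 3]
print(decode(ids))   # the cat sat on
print(len(stoi))     # 6
```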
01
Token embedding

Turn each id into a vector.

tok_emb — a table holding a 4-dim vector for every vocab word. Each token id "reads" its own row. Meaning begins here: words get positioned in space.

id         d0     d1     d2     d3
0 · the    0.2   −0.1    0.7    0.2
1 · cat    0.1    0.4   −0.3    0.6
2 · sat    0.5    0.2    0.1   −0.3
3 · on    −0.2    0.2    0.4    0.1
4 · mat    0.4    0.1    0.3    0.5
5 · dog    0.2    0.3   −0.2    0.4
Selected rows: e_i = tok_emb[id_i]
e_the = [ 0.2, −0.1, 0.7, 0.2 ]
e_cat = [ 0.1, 0.4, −0.3, 0.6 ]
e_sat = [ 0.5, 0.2, 0.1, −0.3 ]
e_on  = [ −0.2, 0.2, 0.4, 0.1 ]
After training, words with similar meanings end up close together in this space. The values start out random; each training step nudges the coordinates until they encode meaning.
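The row lookup can be sketched in a few lines of Python. The table values are the ones listed above; one_hot and vecmat are illustrative helper names. The second half shows why a lookup counts as a layer: picking row i gives the same result as multiplying a one-hot vector by the table.

```python
# Embedding table from the walkthrough: one 4-dim row per vocab id.
tok_emb = [
    [ 0.2, -0.1,  0.7,  0.2],  # 0 the
    [ 0.1,  0.4, -0.3,  0.6],  # 1 cat
    [ 0.5,  0.2,  0.1, -0.3],  # 2 sat
    [-0.2,  0.2,  0.4,  0.1],  # 3 on
    [ 0.4,  0.1,  0.3,  0.5],  # 4 mat
    [ 0.2,  0.3, -0.2,  0.4],  # 5 dog
]

ids = [0, 1, 2, 3]                       # "the cat sat on"
embedded = [tok_emb[i] for i in ids]     # each id reads its own row

def one_hot(i, n):
    """Length-n vector with a 1.0 at position i."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[r] * M[r][c] for r in range(len(v))) for c in range(len(M[0]))]

# Same row, obtained as a matrix product instead of an index.
via_matmul = vecmat(one_hot(1, len(tok_emb)), tok_emb)
print(embedded[1])   # [0.1, 0.4, -0.3, 0.6]
```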
02
Positional encoding

Add the order.

On its own, attention treats tokens like a "bag of words" — order is lost. The fix: add a separate vector per position (pos_emb). Now the model knows who came first and who came last.

pos    d0     d1     d2     d3
0      0.1    0.0    0.1    0.0
1      0.0    0.1    0.0    0.1
2      0.1    0.0    0.0   −0.1
3      0.0    0.1    0.1    0.0
x_i = e_i + pos_emb[i] → attention input
x_the = [ 0.2, −0.1, 0.7, 0.2 ] + [ 0.1, 0.0, 0.1, 0.0 ] = [ 0.3, −0.1, 0.8, 0.2 ]
x_cat = [ 0.1, 0.4, −0.3, 0.6 ] + [ 0.0, 0.1, 0.0, 0.1 ] = [ 0.1, 0.5, −0.3, 0.7 ]
x_sat = [ 0.5, 0.2, 0.1, −0.3 ] + [ 0.1, 0.0, 0.0, −0.1 ] = [ 0.6, 0.2, 0.1, −0.4 ]
x_on  = [ −0.2, 0.2, 0.4, 0.1 ] + [ 0.0, 0.1, 0.1, 0.0 ] = [ −0.2, 0.3, 0.5, 0.1 ]
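The four sums above are plain element-wise additions. A short sketch in Python, with the token and position vectors copied from the tables; rounding to two decimals is only there to hide floating-point noise such as 0.30000000000000004.

```python
# Token embeddings for "the cat sat on" (rows 0..3 of tok_emb).
e = [
    [ 0.2, -0.1,  0.7,  0.2],
    [ 0.1,  0.4, -0.3,  0.6],
    [ 0.5,  0.2,  0.1, -0.3],
    [-0.2,  0.2,  0.4,  0.1],
]
# Learned position vectors, one per position 0..3.
pos_emb = [
    [0.1, 0.0, 0.1,  0.0],
    [0.0, 0.1, 0.0,  0.1],
    [0.1, 0.0, 0.0, -0.1],
    [0.0, 0.1, 0.1,  0.0],
]

# x_i = e_i + pos_emb[i], element by element.
x = [[round(a + b, 2) for a, b in zip(e_i, p_i)] for e_i, p_i in zip(e, pos_emb)]

for name, row in zip(["the", "cat", "sat", "on"], x):
    print(name, row)
```

Because the position vector differs per slot, the same word at two different positions would enter attention as two different vectors.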
03
LayerNorm #1

Normalize before
attending.

Attention is sensitive to activations that grow or shrink — training gets unstable. LayerNorm fixes this: rescale each token's vector so its mean = 0 and std ≈ 1. In the modern "pre-norm" design, LN sits at the input of every sub-module (attention, MLP).

formula
LN(x) = γ · (x − μ) / √(σ² + ε) + β
μ = mean(x); σ² = var(x); γ and β are learnable (here γ = 1, β = 0)
example · step-by-step for x_on
x_on      = [ −0.20, 0.30, 0.50, 0.10 ]
μ         = 0.175
x − μ     = [ −0.375, 0.125, 0.325, −0.075 ]
σ²        = 0.0669, σ = 0.259
LN(x_on)  = [ −1.45, 0.48, 1.26, −0.29 ]
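Note: the Attention section below feeds the raw x values into Wq, Wk, Wv so the numbers carry over from the previous step; treat the weights as if they had been calibrated for that input. In a real pre-norm transformer, Q, K, V are computed from the normalized input: Q = LN(x) · Wq, K = LN(x) · Wk, V = LN(x) · Wv.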
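The formula translates almost line for line into Python. This is a sketch under two assumptions: the variance is the population variance (divide by n, which matches σ² = 0.0669 above), and ε = 1e-5, a common default that the walkthrough does not state.

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """LN(x) = gamma * (x - mu) / sqrt(var + eps) + beta, per token vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)   # population variance
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in x]

x_on = [-0.2, 0.3, 0.5, 0.1]
ln_on = layer_norm(x_on)
print([round(v, 2) for v in ln_on])   # [-1.45, 0.48, 1.26, -0.29]
```

The output has mean 0 and a standard deviation very close to 1, whatever the scale of the input vector was.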
04
Scaled dot-product attention

How much should each token
"attend" to the others?

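The heart of the transformer. Each token computes its own Q, dots it with every token's K, softmaxes the scores into attention weights, then takes the weighted sum of the V's. The result is one contextual vector per token, c_i. To keep the arithmetic short, the worked numbers below use plain dot products: the 1/√d_k scaling that gives the method its name is left out, and no causal mask is applied, so every token attends to all four positions.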

Wq query
[[ 0.5,  0.2, −0.3,  0.1],
 [−0.2,  0.4,  0.1,  0.3],
 [ 0.1, −0.1,  0.5,  0.2],
 [ 0.3,  0.2, −0.1,  0.4]]
Wk key
[[ 0.4, −0.2,  0.3,  0.1],
 [ 0.2,  0.5, −0.1,  0.3],
 [−0.1,  0.3,  0.4, −0.2],
 [ 0.3,  0.1,  0.2,  0.5]]
Wv value
[[ 0.2,  0.4, −0.1,  0.3],
 [ 0.5, −0.1,  0.2,  0.4],
 [−0.2,  0.3,  0.5, −0.1],
 [ 0.1,  0.2,  0.3,  0.6]]
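Each of Q, K, V is one row-vector-times-matrix product. The sketch below computes them for "the" from x_the and the three matrices above; vecmat is an illustrative helper name. The products are left unrounded here, whereas the walkthrough shows them rounded to one decimal.

```python
def vecmat(v, M):
    """Row vector (length n) times matrix (n x m) -> row vector (length m)."""
    return [sum(v[r] * M[r][c] for r in range(len(v))) for c in range(len(M[0]))]

Wq = [[ 0.5,  0.2, -0.3,  0.1],
      [-0.2,  0.4,  0.1,  0.3],
      [ 0.1, -0.1,  0.5,  0.2],
      [ 0.3,  0.2, -0.1,  0.4]]
Wk = [[ 0.4, -0.2,  0.3,  0.1],
      [ 0.2,  0.5, -0.1,  0.3],
      [-0.1,  0.3,  0.4, -0.2],
      [ 0.3,  0.1,  0.2,  0.5]]
Wv = [[ 0.2,  0.4, -0.1,  0.3],
      [ 0.5, -0.1,  0.2,  0.4],
      [-0.2,  0.3,  0.5, -0.1],
      [ 0.1,  0.2,  0.3,  0.6]]

x_the = [0.3, -0.1, 0.8, 0.2]

q_the = vecmat(x_the, Wq)   # about [ 0.31, -0.02, 0.28,  0.24]
k_the = vecmat(x_the, Wk)   # about [ 0.08,  0.15, 0.46, -0.06]
v_the = vecmat(x_the, Wv)   # about [-0.13,  0.41, 0.41,  0.09]
print(q_the, k_the, v_the)
```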
the · pos 0
Q_the = x_the · Wq = [ 0.3, 0.0, 0.3, 0.2 ]
K_the = x_the · Wk = [ 0.1, 0.2, 0.5, −0.1 ]
V_the = x_the · Wv = [ −0.1, 0.4, 0.4, 0.1 ]
Q_the · K_the = 0.16   (K_the = [ 0.1, 0.2, 0.5, −0.1 ])
Q_the · K_cat = 0.24   (K_cat = [ 0.4, 0.2, 0.0, 0.6 ])
Q_the · K_sat = 0.07   (K_sat = [ 0.2, 0.0, 0.1, −0.1 ])
Q_the · K_on  = 0.03   (K_on  = [ 0.0, 0.4, 0.1, 0.0 ])
A_the = [ 0.16, 0.24, 0.07, 0.03 ]
attention vector — how much the token looks at each other token's V (raw scores; becomes probabilities after softmax).
softmax(A_the)
raw  = [ 0.16, 0.24, 0.07, 0.03 ]
exp  = [ 1.17, 1.27, 1.07, 1.03 ]   Σ = 4.55
÷ Σ  = [ 0.26, 0.28, 0.24, 0.23 ]   Σ ≈ 1
Σ softmax · V = context
c_the = 0.26 × V_the + 0.28 × V_cat + 0.24 × V_sat + 0.23 × V_on
c_the = [ 0.13, 0.17, 0.23, 0.25 ]
context vector — the token's relationship with all others (weighted sum of V's — contextual representation).
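The three stages for "the" (scores, softmax, weighted sum) fit in a short Python sketch. It starts from the rounded Q, K, V vectors shown in the walkthrough and, like the worked example, uses unscaled dot products. Subtracting the maximum inside softmax is a standard numerical-stability trick that leaves the result unchanged; the helper names are illustrative.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def softmax(z):
    m = max(z)                               # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

q_the = [0.3, 0.0, 0.3, 0.2]
K = [[0.1, 0.2, 0.5, -0.1],   # K_the
     [0.4, 0.2, 0.0,  0.6],   # K_cat
     [0.2, 0.0, 0.1, -0.1],   # K_sat
     [0.0, 0.4, 0.1,  0.0]]   # K_on
V = [[-0.1, 0.4,  0.4, 0.1],  # V_the
     [ 0.4, 0.0,  0.2, 0.7],  # V_cat
     [ 0.2, 0.2, -0.1, 0.0],  # V_sat
     [ 0.0, 0.1,  0.4, 0.1]]  # V_on

scores = [dot(q_the, k) for k in K]          # raw attention scores A_the
weights = softmax(scores)                    # attention weights, sum to 1
context = [sum(w * v[d] for w, v in zip(weights, V)) for d in range(4)]

print([round(s, 2) for s in scores])    # [0.16, 0.24, 0.07, 0.03]
print([round(w, 2) for w in weights])   # [0.26, 0.28, 0.24, 0.23]
print(context)                          # close to c_the, up to rounding
```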
cat · pos 1
Q_cat = x_cat · Wq = [ 0.1, 0.4, −0.2, 0.4 ]
K_cat = x_cat · Wk = [ 0.4, 0.2, 0.0, 0.6 ]
V_cat = x_cat · Wv = [ 0.4, 0.0, 0.2, 0.7 ]
Q_cat · K_the = −0.05   (K_the = [ 0.1, 0.2, 0.5, −0.1 ])
Q_cat · K_cat = 0.36    (K_cat = [ 0.4, 0.2, 0.0, 0.6 ])
Q_cat · K_sat = −0.04   (K_sat = [ 0.2, 0.0, 0.1, −0.1 ])
Q_cat · K_on  = 0.14    (K_on  = [ 0.0, 0.4, 0.1, 0.0 ])
A_cat = [ −0.05, 0.36, −0.04, 0.14 ]
softmax(A_cat)
raw  = [ −0.05, 0.36, −0.04, 0.14 ]
exp  = [ 0.95, 1.43, 0.96, 1.15 ]   Σ = 4.50
÷ Σ  = [ 0.21, 0.32, 0.21, 0.26 ]   Σ = 1.00
Σ softmax · V
The highest score is cat → cat (0.36), so softmax gives it the largest weight (0.32). Softmax preserves the ranking of the scores.
c_cat = [ 0.15, 0.15, 0.22, 0.27 ]
sat · pos 2
Q_sat = x_sat · Wq = [ 0.2, 0.1, −0.1, 0.0 ]
K_sat = x_sat · Wk = [ 0.2, 0.0, 0.1, −0.1 ]
V_sat = x_sat · Wv = [ 0.2, 0.2, −0.1, 0.0 ]
Q_sat · K_the = −0.01   (K_the = [ 0.1, 0.2, 0.5, −0.1 ])
Q_sat · K_cat = 0.10    (K_cat = [ 0.4, 0.2, 0.0, 0.6 ])
Q_sat · K_sat = 0.03    (K_sat = [ 0.2, 0.0, 0.1, −0.1 ])
Q_sat · K_on  = 0.03    (K_on  = [ 0.0, 0.4, 0.1, 0.0 ])
A_sat = [ −0.01, 0.10, 0.03, 0.03 ]
softmax(A_sat)
raw  = [ −0.01, 0.10, 0.03, 0.03 ]
exp  = [ 0.99, 1.11, 1.03, 1.03 ]   Σ = 4.16
÷ Σ  = [ 0.24, 0.27, 0.25, 0.25 ]   Σ = 1.00
Σ softmax · V
The scores are close, so the distribution is nearly uniform: "sat" attends roughly equally to everyone.
c_sat = [ 0.14, 0.18, 0.22, 0.24 ]
on · pos 3 · final token
Q_on = x_on · Wq = [ −0.1, 0.1, 0.3, 0.2 ]
K_on = x_on · Wk = [ 0.0, 0.4, 0.1, 0.0 ]
V_on = x_on · Wv = [ 0.0, 0.1, 0.4, 0.1 ]
Q_on · K_the = 0.14    (K_the = [ 0.1, 0.2, 0.5, −0.1 ])
Q_on · K_cat = 0.10    (K_cat = [ 0.4, 0.2, 0.0, 0.6 ])
Q_on · K_sat = −0.01   (K_sat = [ 0.2, 0.0, 0.1, −0.1 ])
Q_on · K_on  = 0.07    (K_on  = [ 0.0, 0.4, 0.1, 0.0 ])
A_on = [ 0.14, 0.10, −0.01, 0.07 ]
softmax(A_on)
raw  = [ 0.14, 0.10, −0.01, 0.07 ]
exp  = [ 1.15, 1.11, 0.99, 1.07 ]   Σ = 4.32
÷ Σ  = [ 0.27, 0.26, 0.23, 0.25 ]   Σ ≈ 1
Σ softmax · V = context (final token)
"on" attends most to "the" (0.27); it had the highest dot product. The context for the final position now lives in c_on.
c_on = [ 0.12, 0.19, 0.24, 0.24 ]
Step 04 output
c_the = [ 0.13, 0.17, 0.23, 0.25 ]
c_cat = [ 0.15, 0.15, 0.22, 0.27 ]
c_sat = [ 0.14, 0.18, 0.22, 0.24 ]
c_on = [ 0.12, 0.19, 0.24, 0.24 ]
Since we're predicting the next word, only c_on (the last position) is followed through the next layers: first Wo, then the residual, then the MLP.
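The scores in this section are plain dot products. The "scaled" in the section title refers to dividing each score by √d_k before the softmax, with d_k = 4 here. The sketch below applies that to the scores for "on" to show the effect; it illustrates the standard formula and is not a recomputation of the walkthrough's numbers.

```python
import math

def softmax(z):
    m = max(z)                               # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 4                                      # key dimension in this demo
scores_on = [0.14, 0.10, -0.01, 0.07]        # A_on from the walkthrough

unscaled = softmax(scores_on)
scaled = softmax([s / math.sqrt(d_k) for s in scores_on])

print([round(w, 2) for w in unscaled])   # [0.27, 0.26, 0.23, 0.25]
print([round(w, 2) for w in scaled])     # [0.26, 0.25, 0.24, 0.25]
```

Scaling shrinks the gaps between scores, so the weights move toward uniform. With 4 dimensions the change is small; with larger d_k, unscaled dot products grow and the softmax would saturate without it.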
05
Output projection + residual

Project, then
reconnect.

The attention output gets a final linear transform via Wo (where multi-head attention learns to mix across heads). Then the residual connection: the original x is added back. Without skip connections, very deep networks become hard to train.

1 · attn_out = c_on · W_o
W_o  (4 × 4) =
[[ 0.8,  0.1, −0.1,  0.2],
 [ 0.1,  0.7,  0.1,  0.1],
 [−0.1,  0.2,  0.8,  0.1],
 [ 0.2,  0.1,  0.1,  0.7]]
c_on                     = [ 0.12, 0.19, 0.24, 0.24 ]
c_on · W_o = attn_out_on = [ 0.14, 0.19, 0.22, 0.24 ]
2 · residual #1 · h1 = x + attn_out
    x_on          = [ −0.20,   0.30,   0.50,   0.10 ]
  + attn_out_on   = [  0.14,   0.19,   0.22,   0.24 ]
  ────────────────────────────────────────────
  = h1_on         = [ −0.06,   0.49,   0.72,   0.34 ]
The residual lets attention "add nothing" without losing information — x passes through. It also keeps gradient flow safe across deep networks.
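A minimal sketch of this step in plain Python, using the row-vector convention (x · W) from the rest of the walkthrough; vecmat is an illustrative helper name. Computed this way from the rounded c_on, the second component of attn_out comes out near 0.22 rather than the 0.19 shown above, so treat the printed digits as approximate. The last lines check the "add nothing" property: a zero attention output leaves x unchanged.

```python
def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[r] * M[r][c] for r in range(len(v))) for c in range(len(M[0]))]

W_o = [[ 0.8, 0.1, -0.1, 0.2],
       [ 0.1, 0.7,  0.1, 0.1],
       [-0.1, 0.2,  0.8, 0.1],
       [ 0.2, 0.1,  0.1, 0.7]]

c_on = [0.12, 0.19, 0.24, 0.24]   # attention context for the last token
x_on = [-0.2, 0.3, 0.5, 0.1]      # input to the block for the last token

attn_out = vecmat(c_on, W_o)                   # output projection
h1_on = [a + b for a, b in zip(x_on, attn_out)]  # residual #1: x + attn_out
print([round(v, 3) for v in attn_out])
print([round(v, 3) for v in h1_on])

# If attention contributes nothing, the residual passes x through untouched.
h1_if_zero = [a + b for a, b in zip(x_on, [0.0] * 4)]
print(h1_if_zero == x_on)   # True
```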
06
Feed-forward block

Per-token
non-linear transform.

Attention mixes across tokens; MLP applies a non-linear transform to each token independently — the model's "thinking capacity". The sequence: LN → MLP → Residual.

1 · LN #2 — normalize before MLP
h1_on       = [ −0.06, 0.49, 0.72, 0.34 ]
μ = 0.3725, σ = 0.284
LN2(h1_on)  = [ −1.52, 0.41, 1.22, −0.11 ]
2 · MLP · linear → ReLU → linear
W₁ (4 × 4)
[[ 0.3, −0.1,  0.2,  0.1],
 [ 0.1,  0.4, −0.1,  0.2],
 [−0.2,  0.1,  0.3, −0.1],
 [ 0.1,  0.2,  0.1,  0.3]]
W₂ (4 × 4)
[[ 0.2,  0.1, −0.1,  0.3],
 [ 0.1,  0.4,  0.2, −0.1],
 [−0.1,  0.2,  0.3,  0.1],
 [ 0.3, −0.1,  0.1,  0.2]]
LN2(h1_on)                       = [ −1.52, 0.41, 1.22, −0.11 ]
LN2(h1_on) · W₁ (pre-activation) = [ −0.67, 0.42, 0.01, −0.23 ]
ReLU(...)                        = [ 0.00, 0.42, 0.01, 0.00 ]
ReLU(...) · W₂ = mlp_out         = [ 0.04, 0.17, 0.09, −0.04 ]
3 · residual #2 · h2 = h1 + mlp_out · block output
    h1_on         = [ −0.06,   0.49,   0.72,   0.34 ]
  + mlp_out_on    = [  0.04,   0.17,   0.09,  −0.04 ]
  ────────────────────────────────────────────
  = h2_on         = [ −0.02,   0.66,   0.81,   0.30 ]
h2_on is now the output of a full transformer block. In a real LLM this would flow through 12–96 blocks (each doing LN → Attn → Res → LN → MLP → Res). In this demo, a single block feeds directly into the LM head.
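The whole feed-forward half of the block (LN → linear → ReLU → linear → residual) can be sketched as follows, starting from h1_on as given above. Assumptions: no bias terms (none are shown in the walkthrough), population variance in LayerNorm, and ε = 1e-5. Helper names are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]   # gamma = 1, beta = 0

def vecmat(v, M):
    return [sum(v[r] * M[r][c] for r in range(len(v))) for c in range(len(M[0]))]

W1 = [[ 0.3, -0.1,  0.2,  0.1],
      [ 0.1,  0.4, -0.1,  0.2],
      [-0.2,  0.1,  0.3, -0.1],
      [ 0.1,  0.2,  0.1,  0.3]]
W2 = [[ 0.2,  0.1, -0.1,  0.3],
      [ 0.1,  0.4,  0.2, -0.1],
      [-0.1,  0.2,  0.3,  0.1],
      [ 0.3, -0.1,  0.1,  0.2]]

h1_on = [-0.06, 0.49, 0.72, 0.34]

z = layer_norm(h1_on)                          # LN #2
pre = vecmat(z, W1)                            # first linear layer
hidden = [max(0.0, v) for v in pre]            # ReLU zeroes the negatives
mlp_out = vecmat(hidden, W2)                   # second linear layer
h2_on = [a + b for a, b in zip(h1_on, mlp_out)]  # residual #2

print([round(v, 2) for v in mlp_out])   # [0.04, 0.17, 0.09, -0.04]
print([round(v, 2) for v in h2_on])     # [-0.02, 0.66, 0.81, 0.3]
```

Production models usually make the hidden layer wider than the model dimension (often 4×); the 4 × 4 shapes here keep the arithmetic readable.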
07
Language model head

From block output to vocab.

h2_on (4-dim) × W_lm (4 × 6) = 6 numbers: one logit (raw score) per vocab word. Each column is a word's "signature"; whichever column h2_on resembles most gets the highest score.

input h2_on = [ −0.02, 0.66, 0.81, 0.30 ] shape [1, 4]
W_lm (4 × 6) — columns: the · cat · sat · on · mat · dog
      the   cat   sat   on    mat   dog
r0    0.1   0.2   0.2   0.2   0.1   0.0
r1    0.1   0.4   0.2   0.2   0.1   0.1
r2    0.1   0.5   0.2   0.3   0.2   0.0
r3    0.1   0.5   0.2   0.2   0.2   0.1
logits = h2_on · W_lm · shape [1, 6]
the  0.18
cat  0.82
sat  0.35
on   0.43
mat  0.29
dog  0.10
Logits aren't probabilities yet: they can be positive or negative, and they don't sum to 1. Next step: softmax turns these raw scores into a proper probability distribution, with every value in [0, 1] and a total of 1.
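The LM head is one more row-vector-times-matrix product. A short sketch with the h2_on and W_lm values above; vecmat is an illustrative helper name, and the logits are printed unrounded so the link to the two-decimal figures is visible.

```python
def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[r] * M[r][c] for r in range(len(v))) for c in range(len(M[0]))]

vocab = ["the", "cat", "sat", "on", "mat", "dog"]

# One column per vocab word, one row per hidden dimension.
W_lm = [[0.1, 0.2, 0.2, 0.2, 0.1, 0.0],
        [0.1, 0.4, 0.2, 0.2, 0.1, 0.1],
        [0.1, 0.5, 0.2, 0.3, 0.2, 0.0],
        [0.1, 0.5, 0.2, 0.2, 0.2, 0.1]]

h2_on = [-0.02, 0.66, 0.81, 0.30]

logits = vecmat(h2_on, W_lm)   # shape [1, 6]: one raw score per word
for word, score in zip(vocab, logits):
    print(word, round(score, 3))
# about: the 0.175, cat 0.815, sat 0.35, on 0.431, mat 0.286, dog 0.096
```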
08
Softmax & argmax

Raw scores to probabilities,
probabilities to a prediction.

softmax exponentiates each logit, then divides by their sum — the result is a distribution that sums to 1. Then argmax: the word with the highest probability = the model's prediction.

softmax(logits) — in 3 steps
raw  = [ 0.18, 0.82, 0.35, 0.43, 0.29, 0.10 ]
exp  = [ 1.20, 2.27, 1.42, 1.54, 1.34, 1.11 ]   Σ = 8.88
÷ Σ  = [ 0.14, 0.26, 0.16, 0.17, 0.15, 0.12 ]   Σ ≈ 1
probability distribution (vocab)
the  0.14
cat  0.26
sat  0.16
on   0.17
mat  0.15
dog  0.12
argmax → prediction: cat (P = 0.26)
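The same three steps and the final argmax, sketched in Python from the two-decimal logits above. Subtracting the largest logit before exponentiating is a common stability trick that does not change the result; it is an addition here, not something the walkthrough does.

```python
import math

def softmax(z):
    m = max(z)                               # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
logits = [0.18, 0.82, 0.35, 0.43, 0.29, 0.10]

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])   # argmax over the vocab
prediction = vocab[best]

print([round(p, 2) for p in probs])      # [0.14, 0.26, 0.16, 0.17, 0.15, 0.12]
print(prediction, round(probs[best], 2))  # cat 0.26
```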
09
Cross-entropy loss

Measure the error.

The target is "mat" (natural continuation: the cat sat on mat). But the model said "cat". Loss measures how little probability the model assigned to the correct answer.

Prediction vs. Target
prediction: cat (P = 0.26)
target: mat (P(mat) = 0.15)
Prediction wrong: the model is untrained.
Cross-entropy loss
L = −log( P(target) )      (natural log)
  = −log( P(mat) )
  = −log( 0.15 )
  ≈ 1.89      (from the unrounded P(mat) ≈ 0.151; −log(0.15) itself is ≈ 1.90)
Why this value? The model is untrained, so it gave only 0.15 probability to the correct answer (mat). Ideal case: P(target) = 1 → L = −log(1) = 0. A uniform random guess would give P = 1/6 ≈ 0.17, so L ≈ 1.79; at 1.89 this model is doing slightly worse than chance. This single forward pass is complete. Next comes the backward pass: gradients update Wq, Wk, Wv, Wo, W1, W2, and Wlm (along with the embedding tables and the LayerNorm γ, β), P(mat) rises, and the loss drops.
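The loss values quoted here can be reproduced in a few lines. The sketch computes the cross-entropy directly from the logits with the log-sum-exp form, which equals −log(softmax(logits)[target]), and compares it with the uniform-guess baseline. Variable names are illustrative.

```python
import math

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
logits = [0.18, 0.82, 0.35, 0.43, 0.29, 0.10]
target = vocab.index("mat")

# Cross-entropy from logits: log(sum(exp(logits))) - logit[target].
log_sum_exp = math.log(sum(math.exp(v) for v in logits))
loss = log_sum_exp - logits[target]

# Same loss if the probability is first rounded to 0.15.
loss_from_rounded_p = -math.log(0.15)

# Baseline: a uniform guess over the 6-word vocabulary.
uniform_loss = -math.log(1 / len(vocab))

print(round(loss, 2))                 # 1.89
print(round(loss_from_rounded_p, 2))  # 1.9
print(round(uniform_loss, 2))         # 1.79
```

A perfectly confident correct prediction would drive the first number to 0; training aims to push it below the 1.79 baseline and onward.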