A token's journey — mini transformer, single forward pass

00

Tokenization

Break the text into pieces.

Language models can't see letters or words — only numbers. So the first job: turn text into ids via a vocabulary (stoi).

"the cat sat on"

the → ID 0

cat → ID 1

sat → ID 2

on → ID 3

stoi = { "the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5 }
vocab_size = 6 · tokens = 4
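
The lookup above can be sketched in a few lines. This is a minimal sketch that assumes the text is split on whitespace (the page does not say how it is split, and real tokenizers usually work on subword pieces); the helper name `encode` is hypothetical.

```python
# Vocabulary from the example: string-to-id map (stoi).
stoi = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5}


def encode(text):
    """Map whitespace-separated words to ids (hypothetical helper)."""
    return [stoi[word] for word in text.split()]


ids = encode("the cat sat on")
vocab_size = len(stoi)
print(ids, vocab_size)
```

A word outside the vocabulary would raise a `KeyError` here; handling unknown words is left out of the sketch.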

01

Token embedding

Turn each id into a vector.

tok_emb — a table holding a 4-dim vector for every vocab word. Each token id "reads" its own row. Meaning begins here: words get positioned in space.

id	d0	d1	d2	d3
0 · the	0.2	−0.1	0.7	0.2
1 · cat	0.1	0.4	−0.3	0.6
2 · sat	0.5	0.2	0.1	−0.3
3 · on	−0.2	0.2	0.4	0.1
4 · mat	0.4	0.1	0.3	0.5
5 · dog	0.2	0.3	−0.2	0.4

Selected rows — e_i = tok_emb[id_i]

e_the = [ 0.2, −0.1, 0.7, 0.2 ]

e_cat = [ 0.1, 0.4, −0.3, 0.6 ]

e_sat = [ 0.5, 0.2, 0.1, −0.3 ]

e_on = [ −0.2, 0.2, 0.4, 0.1 ]

After training, words with similar meanings end up close together in this space. The values start out random; each training step nudges the coordinates so that they come to encode meaning.
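
Reading a row per id is plain list indexing. The sketch below copies the table values from above; the variable names `ids` and `embeddings` are illustrative only.

```python
# tok_emb: one 4-dim row per vocab id, values copied from the table above.
tok_emb = [
    [0.2, -0.1, 0.7, 0.2],   # 0 · the
    [0.1, 0.4, -0.3, 0.6],   # 1 · cat
    [0.5, 0.2, 0.1, -0.3],   # 2 · sat
    [-0.2, 0.2, 0.4, 0.1],   # 3 · on
    [0.4, 0.1, 0.3, 0.5],    # 4 · mat
    [0.2, 0.3, -0.2, 0.4],   # 5 · dog
]

ids = [0, 1, 2, 3]  # "the cat sat on"

# e_i = tok_emb[id_i]: each token id selects its own row.
embeddings = [tok_emb[i] for i in ids]
for token_id, e in zip(ids, embeddings):
    print(token_id, e)
```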

02

Positional encoding

Add the order.

On its own, attention treats tokens like a "bag of words" — order is lost. The fix: add a separate vector per position (pos_emb). Now the model knows who came first and who came last.

pos	d0	d1	d2	d3
0	0.1	0.0	0.1	0.0
1	0.0	0.1	0.0	0.1
2	0.1	0.0	0.0	−0.1
3	0.0	0.1	0.1	0.0

x_i = e_i + pos_emb[i] → attention input

x_the = [ 0.2, −0.1, 0.7, 0.2 ] + [ 0.1, 0.0, 0.1, 0.0 ] = [ 0.3, −0.1, 0.8, 0.2 ]

x_cat = [ 0.1, 0.4, −0.3, 0.6 ] + [ 0.0, 0.1, 0.0, 0.1 ] = [ 0.1, 0.5, −0.3, 0.7 ]

x_sat = [ 0.5, 0.2, 0.1, −0.3 ] + [ 0.1, 0.0, 0.0, −0.1 ] = [ 0.6, 0.2, 0.1, −0.4 ]

x_on = [ −0.2, 0.2, 0.4, 0.1 ] + [ 0.0, 0.1, 0.1, 0.0 ] = [ −0.2, 0.3, 0.5, 0.1 ]
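
The four sums above are an element-wise addition per position. A small sketch, with values copied from the two tables; rounding to two decimals is only there to hide floating-point noise.

```python
# Token embeddings for "the cat sat on" (rows selected in step 01).
e = [
    [0.2, -0.1, 0.7, 0.2],
    [0.1, 0.4, -0.3, 0.6],
    [0.5, 0.2, 0.1, -0.3],
    [-0.2, 0.2, 0.4, 0.1],
]

# pos_emb: one vector per position 0..3.
pos_emb = [
    [0.1, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.0, -0.1],
    [0.0, 0.1, 0.1, 0.0],
]

# x_i = e_i + pos_emb[i], element by element.
x = [
    [round(a + b, 2) for a, b in zip(e_i, p_i)]
    for e_i, p_i in zip(e, pos_emb)
]
for row in x:
    print(row)
```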

03

LayerNorm #1

Normalize before attending.

Attention is sensitive to activations that grow or shrink — training gets unstable. LayerNorm fixes this: rescale each token's vector so its mean = 0 and std ≈ 1. In the modern "pre-norm" design, LN sits at the input of every sub-module (attention, MLP).

formula

LN(x) = γ · (x − μ) / √(σ² + ε) + β
μ = mean(x), σ² = var(x), both taken over the token's own 4 dimensions; γ and β are learnable (here γ = 1, β = 0)

example · step-by-step for x_on

x_on = [ −0.20, 0.30, 0.50, 0.10 ]
μ = 0.175
x − μ = [ −0.375, 0.125, 0.325, −0.075 ]
σ² = 0.0669, σ = 0.259
LN(x_on) = [ −1.45, 0.48, 1.26, −0.29 ]

Note: the Attention section below multiplies the raw x values, not LN(x), by W_q, W_k, W_v, so that the numbers carry over directly from step 02 (in other words, W_q/W_k/W_v are treated as if they were already calibrated for LN(x)). In a real pre-norm transformer, Q, K, V = LN(x) · W.
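
The step-by-step LayerNorm example for x_on can be reproduced with a short function. This is a sketch under two assumptions: the variance is the population variance over the 4 dimensions (which matches σ² = 0.0669 above), and ε = 1e-5, a common default that the page does not state.

```python
import math


def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """LN(x) = gamma * (x - mu) / sqrt(var + eps) + beta, per token vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)  # population variance
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in x]


x_on = [-0.20, 0.30, 0.50, 0.10]
ln_x_on = layer_norm(x_on)
print([round(v, 2) for v in ln_x_on])
```

With γ = 1 and β = 0 the output has mean 0 and a standard deviation very close to 1, which is the property the section describes.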

04

Scaled dot-product attention

How much should each token "attend" to the others?

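The heart of the transformer. Each token computes its own Q, dots it with every token's K, softmaxes the scores into attention weights, then takes the weighted sum of the V's. The result is one contextual vector per token, c_i. The worked numbers below use the raw dot products: they skip the division by √d_k (here √4 = 2) that the "scaled" in the name refers to, and they apply no causal mask, so every token can attend to every position.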

W_q query

[[ 0.5,  0.2, −0.3,  0.1],
 [−0.2,  0.4,  0.1,  0.3],
 [ 0.1, −0.1,  0.5,  0.2],
 [ 0.3,  0.2, −0.1,  0.4]]

W_k key

[[ 0.4, −0.2,  0.3,  0.1],
 [ 0.2,  0.5, −0.1,  0.3],
 [−0.1,  0.3,  0.4, −0.2],
 [ 0.3,  0.1,  0.2,  0.5]]

W_v value

[[ 0.2,  0.4, −0.1,  0.3],
 [ 0.5, −0.1,  0.2,  0.4],
 [−0.2,  0.3,  0.5, −0.1],
 [ 0.1,  0.2,  0.3,  0.6]]
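
The projections Q = x · W_q, K = x · W_k, V = x · W_v are row-vector-times-matrix products. The sketch below computes them for x_the with the three matrices above; the helper name `vec_mat` is hypothetical. The page shows Q, K and V rounded to one decimal, so the unrounded results here agree with it only to within about 0.05.

```python
def vec_mat(v, W):
    """Row vector times matrix: out[j] = sum_i v[i] * W[i][j]."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]


W_q = [[0.5, 0.2, -0.3, 0.1],
       [-0.2, 0.4, 0.1, 0.3],
       [0.1, -0.1, 0.5, 0.2],
       [0.3, 0.2, -0.1, 0.4]]

W_k = [[0.4, -0.2, 0.3, 0.1],
       [0.2, 0.5, -0.1, 0.3],
       [-0.1, 0.3, 0.4, -0.2],
       [0.3, 0.1, 0.2, 0.5]]

W_v = [[0.2, 0.4, -0.1, 0.3],
       [0.5, -0.1, 0.2, 0.4],
       [-0.2, 0.3, 0.5, -0.1],
       [0.1, 0.2, 0.3, 0.6]]

x_the = [0.3, -0.1, 0.8, 0.2]  # from step 02

Q_the = vec_mat(x_the, W_q)
K_the = vec_mat(x_the, W_k)
V_the = vec_mat(x_the, W_v)
print([round(v, 2) for v in Q_the])
print([round(v, 2) for v in K_the])
print([round(v, 2) for v in V_the])
```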

the · pos 0

Q_the = x_the · W_q = [ 0.3, 0.0, 0.3, 0.2 ]

K_the = x_the · W_k = [ 0.1, 0.2, 0.5, −0.1 ]

V_the = x_the · W_v = [ −0.1, 0.4, 0.4, 0.1 ]

Q_the · K_the = Q_the · [ 0.1, 0.2, 0.5, −0.1 ] = 0.16

Q_the · K_cat = Q_the · [ 0.4, 0.2, 0.0, 0.6 ] = 0.24

Q_the · K_sat = Q_the · [ 0.2, 0.0, 0.1, −0.1 ] = 0.07

Q_the · K_on = Q_the · [ 0.0, 0.4, 0.1, 0.0 ] = 0.03

A_the = [ 0.16, 0.24, 0.07, 0.03 ]

attention vector — how much the token looks at each other token's V (raw scores; becomes probabilities after softmax).

softmax(A_the)

raw = [ 0.16, 0.24, 0.07, 0.03 ]

exp = [ 1.17, 1.27, 1.07, 1.03 ], Σ = 4.55

÷ Σ = [ 0.26, 0.28, 0.24, 0.23 ], Σ ≈ 1

Σ softmax · V = context

c_the = 0.26 × V_the + 0.28 × V_cat + 0.24 × V_sat + 0.23 × V_on

c_the = [ 0.13, 0.17, 0.23, 0.25 ]

context vector — the token's relationship with all others (weighted sum of V's — contextual representation).
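
The three stages for "the" (dot products, softmax, weighted sum of V's) can be sketched as follows. It starts from the one-decimal Q, K, V values shown above and, to match the worked numbers, applies neither the 1/√d_k scaling nor a causal mask. Because the page rounds at every stage, the context vector printed here agrees with c_the only to within about 0.01.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def softmax(scores):
    """Exponentiate (shifted by the max for stability) and normalize."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


Q_the = [0.3, 0.0, 0.3, 0.2]
K = [[0.1, 0.2, 0.5, -0.1],   # K_the
     [0.4, 0.2, 0.0, 0.6],    # K_cat
     [0.2, 0.0, 0.1, -0.1],   # K_sat
     [0.0, 0.4, 0.1, 0.0]]    # K_on
V = [[-0.1, 0.4, 0.4, 0.1],   # V_the
     [0.4, 0.0, 0.2, 0.7],    # V_cat
     [0.2, 0.2, -0.1, 0.0],   # V_sat
     [0.0, 0.1, 0.4, 0.1]]    # V_on

scores = [dot(Q_the, k) for k in K]            # A_the (raw scores)
weights = softmax(scores)                      # attention weights
context = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(4)]

print([round(s, 2) for s in scores])
print([round(w, 2) for w in weights])
print([round(c, 2) for c in context])
```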

cat · pos 1

Q_cat = x_cat · W_q = [ 0.1, 0.4, −0.2, 0.4 ]

K_cat = x_cat · W_k = [ 0.4, 0.2, 0.0, 0.6 ]

V_cat = x_cat · W_v = [ 0.4, 0.0, 0.2, 0.7 ]

Q_cat · K_the = Q_cat · [ 0.1, 0.2, 0.5, −0.1 ] = −0.05

Q_cat · K_cat = Q_cat · [ 0.4, 0.2, 0.0, 0.6 ] = 0.36

Q_cat · K_sat = Q_cat · [ 0.2, 0.0, 0.1, −0.1 ] = −0.04

Q_cat · K_on = Q_cat · [ 0.0, 0.4, 0.1, 0.0 ] = 0.14

A_cat = [−0.05, 0.36, −0.04, 0.14 ]

attention vector — how much the token looks at each other token's V (raw scores; becomes probabilities after softmax).

softmax(A_cat)

raw = [ −0.05, 0.36, −0.04, 0.14 ]

exp = [ 0.95, 1.43, 0.96, 1.15 ], Σ = 4.50

÷ Σ = [ 0.21, 0.32, 0.21, 0.26 ], Σ = 1.00

Σ softmax · V = context

The highest score is cat → cat (0.36), so softmax gives it the largest weight, 0.32. Softmax preserves the ranking of the scores.

c_cat = [ 0.15, 0.15, 0.22, 0.27 ]

context vector — the token's relationship with all others (weighted sum of V's — contextual representation).

sat · pos 2

Q_sat = x_sat · W_q = [ 0.2, 0.1, −0.1, 0.0 ]

K_sat = x_sat · W_k = [ 0.2, 0.0, 0.1, −0.1 ]

V_sat = x_sat · W_v = [ 0.2, 0.2, −0.1, 0.0 ]

Q_sat · K_the = Q_sat · [ 0.1, 0.2, 0.5, −0.1 ] = −0.01

Q_sat · K_cat = Q_sat · [ 0.4, 0.2, 0.0, 0.6 ] = 0.10

Q_sat · K_sat = Q_sat · [ 0.2, 0.0, 0.1, −0.1 ] = 0.03

Q_sat · K_on = Q_sat · [ 0.0, 0.4, 0.1, 0.0 ] = 0.03

A_sat = [−0.01, 0.10, 0.03, 0.03 ]

attention vector — how much the token looks at each other token's V (raw scores; becomes probabilities after softmax).

softmax(A_sat)

raw = [ −0.01, 0.10, 0.03, 0.03 ]

exp = [ 0.99, 1.11, 1.03, 1.03 ], Σ = 4.16

÷ Σ = [ 0.24, 0.27, 0.25, 0.25 ], Σ ≈ 1

Σ softmax · V = context

The scores are close together, so the distribution is nearly uniform: "sat" attends roughly equally to every token.

c_sat = [ 0.14, 0.18, 0.22, 0.24 ]

context vector — the token's relationship with all others (weighted sum of V's — contextual representation).

on · pos 3 · final token

Q_on = x_on · W_q = [ −0.1, 0.1, 0.3, 0.2 ]

K_on = x_on · W_k = [ 0.0, 0.4, 0.1, 0.0 ]

V_on = x_on · W_v = [ 0.0, 0.1, 0.4, 0.1 ]

Q_on · K_the = Q_on · [ 0.1, 0.2, 0.5, −0.1 ] = 0.14

Q_on · K_cat = Q_on · [ 0.4, 0.2, 0.0, 0.6 ] = 0.10

Q_on · K_sat = Q_on · [ 0.2, 0.0, 0.1, −0.1 ] = −0.01

Q_on · K_on = Q_on · [ 0.0, 0.4, 0.1, 0.0 ] = 0.07

A_on = [ 0.14, 0.10, −0.01, 0.07 ]

attention vector — how much the token looks at each other token's V (raw scores; becomes probabilities after softmax).

softmax(A_on)

raw = [ 0.14, 0.10, −0.01, 0.07 ]

exp = [ 1.15, 1.11, 0.99, 1.07 ], Σ = 4.32

÷ Σ = [ 0.27, 0.26, 0.23, 0.25 ], Σ ≈ 1

Σ softmax · V = context (final token)

"on" attends most to "the" (0.27), which had the highest dot product (0.14). As the final token, its context vector c_on is the one that goes on to predict the next word.

c_on = [ 0.12, 0.19, 0.24, 0.24 ]

context vector — the token's relationship with all others (weighted sum of V's — contextual representation).

Step 04 output

c_the = [ 0.13, 0.17, 0.23, 0.25 ]

c_cat = [ 0.15, 0.15, 0.22, 0.27 ]

c_sat = [ 0.14, 0.18, 0.22, 0.24 ]

c_on = [ 0.12, 0.19, 0.24, 0.24 ]

Since we're predicting the next word, c_on flows to the next layers — first W_o, then residual, then MLP.
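
The four per-token panels all follow one recipe, so they can be run in a single loop. The sketch below uses the one-decimal Q, K, V rows shown above (again with no √d_k scaling and no causal mask, to match the worked numbers) and records which token each query attends to most. The function name `attend` is hypothetical, and the contexts agree with the listed c_i only to within a couple of hundredths because of rounding.

```python
import math

tokens = ["the", "cat", "sat", "on"]
Q = [[0.3, 0.0, 0.3, 0.2], [0.1, 0.4, -0.2, 0.4],
     [0.2, 0.1, -0.1, 0.0], [-0.1, 0.1, 0.3, 0.2]]
K = [[0.1, 0.2, 0.5, -0.1], [0.4, 0.2, 0.0, 0.6],
     [0.2, 0.0, 0.1, -0.1], [0.0, 0.4, 0.1, 0.0]]
V = [[-0.1, 0.4, 0.4, 0.1], [0.4, 0.0, 0.2, 0.7],
     [0.2, 0.2, -0.1, 0.0], [0.0, 0.1, 0.4, 0.1]]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q):
    """Return (weights, context) for one query against all keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    weights = softmax(scores)
    context = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
    return weights, context


most_attended = {}
contexts = {}
for tok, q in zip(tokens, Q):
    weights, context = attend(q)
    most_attended[tok] = tokens[weights.index(max(weights))]
    contexts[tok] = context
    print(tok, [round(w, 2) for w in weights], [round(c, 2) for c in context])
```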

05

Output projection + residual

Project, then reconnect.

The attention output gets a final linear transform via W_o (in multi-head attention, this is where the model learns to mix information across heads; this demo has a single head). Then comes the residual connection: the original x is added back. Without skip connections, deep networks are very hard to train.

1 · output projection · attn_out = c_on · W_o

W_o  (4 × 4) =
[[ 0.8,  0.1, −0.1,  0.2],
 [ 0.1,  0.7,  0.1,  0.1],
 [−0.1,  0.2,  0.8,  0.1],
 [ 0.2,  0.1,  0.1,  0.7]]

c_on = [ 0.12, 0.19, 0.24, 0.24 ]
attn_out_on = c_on · W_o = [ 0.14, 0.19, 0.22, 0.24 ]

2 · residual #1 · h1 = x + attn_out

    x_on          = [ −0.20,   0.30,   0.50,   0.10 ]
  + attn_out_on   = [  0.14,   0.19,   0.22,   0.24 ]
  ────────────────────────────────────────────
  = h1_on         = [ −0.06,   0.49,   0.72,   0.34 ]

The residual lets attention "add nothing" without losing information: if attn_out is near zero, x passes through unchanged. The identity path also gives gradients a direct route back through deep stacks of blocks.
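
The residual step is a plain element-wise addition. The sketch below takes attn_out_on as given from the projection step and also shows the "add nothing" case, where a zero update leaves x untouched. The helper name `residual_add` is hypothetical.

```python
def residual_add(x, update):
    """Skip connection: element-wise x + update."""
    return [a + b for a, b in zip(x, update)]


x_on = [-0.20, 0.30, 0.50, 0.10]
attn_out_on = [0.14, 0.19, 0.22, 0.24]  # taken as given from the projection step

# h1 = x + attn_out (rounded to hide floating-point noise)
h1_on = [round(v, 2) for v in residual_add(x_on, attn_out_on)]

# If the sub-layer contributes nothing, the input passes through unchanged.
unchanged = residual_add(x_on, [0.0] * len(x_on))

print(h1_on)
print(unchanged)
```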

06

Feed-forward block

Per-token non-linear transform.

Attention mixes across tokens; MLP applies a non-linear transform to each token independently — the model's "thinking capacity". The sequence: LN → MLP → Residual.

1 · LN #2 — normalize before MLP

h1_on = [ −0.06, 0.49, 0.72, 0.34 ]
μ = 0.3725, σ = 0.284
LN2(h1_on) = [ −1.52, 0.41, 1.22, −0.11 ]

2 · MLP · linear → ReLU → linear

W₁ (4 × 4)

[[ 0.3, −0.1,  0.2,  0.1],
 [ 0.1,  0.4, −0.1,  0.2],
 [−0.2,  0.1,  0.3, −0.1],
 [ 0.1,  0.2,  0.1,  0.3]]

W₂ (4 × 4)

[[ 0.2,  0.1, −0.1,  0.3],
 [ 0.1,  0.4,  0.2, −0.1],
 [−0.1,  0.2,  0.3,  0.1],
 [ 0.3, −0.1,  0.1,  0.2]]

LN2(h1_on) = [ −1.52, 0.41, 1.22, −0.11 ]
LN2(h1_on) · W₁ (pre-activation) = [ −0.67, 0.42, 0.01, −0.23 ]
ReLU(...) = [ 0.00, 0.42, 0.01, 0.00 ]
ReLU(...) · W₂ = mlp_out = [ 0.04, 0.17, 0.09, −0.04 ]

3 · residual #2 · h2 = h1 + mlp_out · block output

    h1_on         = [ −0.06,   0.49,   0.72,   0.34 ]
  + mlp_out_on    = [  0.04,   0.17,   0.09,  −0.04 ]
  ────────────────────────────────────────────
  = h2_on         = [ −0.02,   0.66,   0.81,   0.30 ]
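
The whole feed-forward sub-block (LN → linear → ReLU → linear → residual) fits in a few lines. This is a sketch with the same assumptions as the worked numbers: γ = 1, β = 0, no bias terms in the two linear layers, plus ε = 1e-5, which the page does not specify. The helper names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def vec_mat(v, W):
    """Row vector times matrix."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]


W1 = [[0.3, -0.1, 0.2, 0.1],
      [0.1, 0.4, -0.1, 0.2],
      [-0.2, 0.1, 0.3, -0.1],
      [0.1, 0.2, 0.1, 0.3]]

W2 = [[0.2, 0.1, -0.1, 0.3],
      [0.1, 0.4, 0.2, -0.1],
      [-0.1, 0.2, 0.3, 0.1],
      [0.3, -0.1, 0.1, 0.2]]

h1_on = [-0.06, 0.49, 0.72, 0.34]

normed = layer_norm(h1_on)                       # LN #2
pre_act = vec_mat(normed, W1)                    # first linear layer
hidden = [max(0.0, v) for v in pre_act]          # ReLU zeroes the negatives
mlp_out = vec_mat(hidden, W2)                    # second linear layer
h2_on = [a + b for a, b in zip(h1_on, mlp_out)]  # residual #2

print([round(v, 2) for v in mlp_out])
print([round(v, 2) for v in h2_on])
```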

h2_on is now the output of a full transformer block. In a real LLM this would flow through 12–96 blocks (each doing LN → Attn → Res → LN → MLP → Res). In this demo, a single block feeds directly into the LM head.

07

Language model head

From block output to vocab.

h2_on (4-dim) × W_lm (4 × 6) = 6 numbers, one logit (raw score) per vocab word. Each column is a word's "signature"; the word whose column has the largest dot product with h2_on gets the highest score.

input h2_on = [ −0.02, 0.66, 0.81, 0.30 ] shape [1, 4]

W_lm (4 × 6) — columns: the · cat · sat · on · mat · dog

	the	cat	sat	on	mat	dog
r0	0.1	0.2	0.2	0.2	0.1	0.0
r1	0.1	0.4	0.2	0.2	0.1	0.1
r2	0.1	0.5	0.2	0.3	0.2	0.0
r3	0.1	0.5	0.2	0.2	0.2	0.1

logits = h2_on · W_lm · shape [1, 6]

word	logit
the	0.18
cat	0.82
sat	0.35
on	0.43
mat	0.29
dog	0.10

Logits aren't probabilities yet: they can be positive or negative, and they don't sum to 1. Next step: softmax turns these raw scores into a proper probability distribution, with every value in [0, 1] and a total of 1.
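
The logits are six dot products, one per column of W_lm. A minimal sketch with the values from the table above; no bias term is assumed, since none is shown.

```python
vocab = ["the", "cat", "sat", "on", "mat", "dog"]

h2_on = [-0.02, 0.66, 0.81, 0.30]

# W_lm: 4 rows (hidden dims) × 6 columns (vocab words).
W_lm = [
    [0.1, 0.2, 0.2, 0.2, 0.1, 0.0],
    [0.1, 0.4, 0.2, 0.2, 0.1, 0.1],
    [0.1, 0.5, 0.2, 0.3, 0.2, 0.0],
    [0.1, 0.5, 0.2, 0.2, 0.2, 0.1],
]

# logit_j = sum_i h2_on[i] * W_lm[i][j]
logits = [
    sum(h2_on[i] * W_lm[i][j] for i in range(len(h2_on)))
    for j in range(len(vocab))
]

for word, logit in zip(vocab, logits):
    print(f"{word}: {logit:.3f}")
```

The "cat" column has the largest entries in the rows where h2_on is largest, which is why it ends up with the highest logit.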

08

Softmax & argmax

Raw scores to probabilities, probabilities to a prediction.

softmax exponentiates each logit, then divides by their sum — the result is a distribution that sums to 1. Then argmax: the word with the highest probability = the model's prediction.

softmax(logits) — in 3 steps

raw = [ 0.18,  0.82,  0.35,  0.43,  0.29,  0.10 ]
exp = [ 1.20,  2.27,  1.42,  1.54,  1.34,  1.11 ], Σ = 8.88
÷ Σ = [ 0.14,  0.26,  0.16,  0.17,  0.15,  0.12 ], Σ ≈ 1

probability distribution (vocab)

word	P
the	0.14
cat	0.26
sat	0.16
on	0.17
mat	0.15
dog	0.12

argmax → prediction: cat (P = 0.26)
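
The three softmax steps and the argmax can be sketched directly from the logits. Subtracting the maximum before exponentiating is a standard stability trick added here; it does not change the result.

```python
import math

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
logits = [0.18, 0.82, 0.35, 0.43, 0.29, 0.10]


def softmax(scores):
    """exp each score (shifted by the max), then divide by the sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)

# argmax: index of the largest probability.
best = max(range(len(probs)), key=lambda j: probs[j])
prediction = vocab[best]

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.2f}")
print("prediction:", prediction)
```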

09

Cross-entropy loss

Measure the error.

The target is "mat" (the intended continuation: "the cat sat on mat"). But the model said "cat". The loss measures how little probability the model assigned to the correct answer.

Prediction vs. Target

prediction: cat, P(cat) = 0.26

target: mat, P(mat) = 0.15

Prediction wrong — the model is untrained.

Cross-entropy loss

L = −log( P(target) )
  = −log( P(mat) )
  = −log( 0.15 )
  ≈ 1.89   (natural log, evaluated at the unrounded P(mat) ≈ 0.1507)

Why this value? The model is untrained: it gave only 0.15 probability to the correct answer (mat). Ideal case: P(target) = 1 → L = −log(1) = 0. A uniform random guess would give P = 1/6 ≈ 0.17, so L ≈ 1.79, which means this untrained model currently does slightly worse than guessing. This single forward pass is complete. Next comes the backward pass: gradients update W_q, W_k, W_v, W_o, W₁, W₂ and W_lm (along with the embedding tables and the LayerNorm γ, β), so that P(mat) rises and the loss drops.
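
The loss and the uniform-guess baseline can be checked with a short sketch. It recomputes the probabilities from the step 07 logits rather than from the rounded 0.15, which is why it lands on 1.89; the log is the natural logarithm.

```python
import math

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
logits = [0.18, 0.82, 0.35, 0.43, 0.29, 0.10]
target = "mat"

# Softmax over the logits.
exps = [math.exp(v) for v in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Cross-entropy for a single target: L = -log P(target).
p_target = probs[vocab.index(target)]
loss = -math.log(p_target)

# Baseline: a uniform guess over the 6-word vocabulary.
uniform_loss = -math.log(1 / len(vocab))

print(f"P(mat) = {p_target:.4f}, loss = {loss:.2f}")
print(f"uniform baseline = {uniform_loss:.2f}")
```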
