All You Need Is V?

Q, K, V explanations tend to fall back on the same shortcut. Q says what it’s looking for, K says what it has, V carries the information to be passed along. Not bad at first glance.

But on closer look, this framing obscures the real mechanism. Put that way, it feels like there are three separate things whose meanings were defined from the start. What actually happens is closer to this: the model projects a single token representation into three different views so it can use it in three different roles. Once that becomes the starting point, Q, K, V start to look a little less mystical.

The first question to surface, naturally, is this:

If the token vector already exists, why produce three more representations?

A little thought makes something clear: there isn’t really one job here. On one side there’s the problem of establishing relationships between tokens. On the other side there’s the problem of deciding what gets carried across those relationships. Instead of trying to do everything with a single vector, the model separates these roles.

Once that split becomes visible, the question “why three?” starts looking less strange. A single representation isn’t being asked to do relationship-making, relationship-answering, and content-carrying all at once.
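A minimal NumPy sketch of that split, assuming a single attention head. The names `W_q`, `W_k`, `W_v` and all the sizes are illustrative choices, and the matrices here are random; in a trained model they would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_head = 8, 4              # hypothetical sizes, chosen for illustration
X = rng.normal(size=(3, d_model))   # 3 tokens, one row per token representation

# Three separate projections of the same input (random stand-ins for learned weights).
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q = X @ W_q   # the view used to ask about other tokens
K = X @ W_k   # the view used to be matched against
V = X @ W_v   # the view that carries content

print(Q.shape, K.shape, V.shape)
```

All three come from the same `X`. The only thing that differs is which matrix the token representation is multiplied by.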

The first part that clicks into place is Q and K, because the problem being solved there is clearer.

There’s a token. A calculation is being made between its Q representation and the K representations of the tokens in its context. And the calculation used is fixed from the outset: the dot product.

What the dot product does here is compute the relationship between one token and the others. It does this by taking both direction and magnitude into account at once: it combines how much two vectors point the same way and how long they are into a single number.
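To make "direction and magnitude at once" concrete, here is a small sketch that splits a dot product into its two lengths and the cosine of the angle between the vectors. The vectors `q` and `k` and the helper `dot_parts` are made up for illustration.

```python
import numpy as np

def dot_parts(a, b):
    """Return (length of a, length of b, cosine of the angle between them)."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    cos = (a @ b) / (norm_a * norm_b)
    return norm_a, norm_b, cos

q = np.array([1.0, 2.0, 0.0])
k = np.array([2.0, 4.0, 0.0])   # same direction as q, twice as long

norm_q, norm_k, cos = dot_parts(q, k)

# The dot product is exactly the product of the three parts.
print(q @ k, norm_q * norm_k * cos)
```

Doubling the length of `k` doubles the score, and rotating it away from `q` shrinks the score. Both effects land in the same single number.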

And the goal is this: using where the vectors sit in space, figure out how much information the current token should pull from each token in its context. The placement isn’t designed by hand. Training adjusts the projections that produce Q and K, so that pairs of tokens that should attend to each other end up with large dot products.

No information is being carried yet. Only the relationship is being set up. Which token will orient toward which, and to what degree.

Thinking this way, the Q and K side starts to look like the skeleton of the mechanism.

Dot products are taken, scores come out, they get normalized (in the standard formulation, with a softmax), and in the end a distribution forms. Roughly a table saying “a little contribution from here, a lot from there, medium from the other.” At this point there’s structure, there are proportions, there’s direction.
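A sketch of that table, assuming the usual scaled dot-product formulation: the scores are divided by the square root of the key dimension and normalized with a softmax. The function name `attention_weights` and the random `Q` and `K` are illustrative.

```python
import numpy as np

def attention_weights(Q, K):
    """Turn raw Q·K scores into one distribution per query token."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # every Q against every K
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability only
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)           # each row sums to 1

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 tokens, stand-in values
K = rng.normal(size=(3, 4))

weights = attention_weights(Q, K)
print(weights.round(2))       # row i: how much token i pulls from each token
```

Notice that V appears nowhere in this function. The table of proportions is complete without it.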

But there’s no content yet.

And the critical break happens right here. Because at some point, the Q and K side starts to feel complete. The relationship has been found. Then the question naturally arises:

Okay, how much contribution is coming has been figured out. But what exactly is coming?

The V side starts to open up right there. The simplest version of the question is this:

If the token’s own vector already exists, why is V needed at all? If the attention weights are found, why not just continue with the original token representation?

The answer comes down to this.

The input token vector tells what the token is. Its identity. But what passes from one token to another during attention isn’t only identity. The model wants to decide separately what the token will offer to the mix.

Thinking of V as the content a token offers to the mix puts a lot of things in place. The token itself and what gets transferred from that token to others don’t have to be the same thing. The model wants the freedom to separate these two, which is why it keeps V as a distinct representation.

Contextuality itself doesn’t live in V either; it emerges in the actual attention output. Each token produces its own V, but what comes out is the sum of these Vs, each weighted by its attention weight. Context is born right there.

Each individual V still depends only on its own token. But once the weighted sum is taken, what comes out is a representation shaped by the current context. That’s why continuing directly with the input vector starts to feel wrong. The model doesn’t want to carry just what the token is; it wants to carry what that token should contribute to this particular context.
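The weighted sum itself is a single matrix product. In this sketch both the attention weights and the V vectors are hand-picked numbers, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hand-picked attention weights: row i says how much token i pulls from each token.
weights = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

# Hand-picked V vectors, one row per token.
V = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# Each output row is a mix of all V rows, in the proportions given by the weights.
output = weights @ V
print(output)
```

The first output row is 0.7 of the first V, 0.2 of the second, and 0.1 of the third, which gives `[0.8, 0.3]`. No single V contains that vector; it only exists after mixing.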

Once that becomes visible, the structure gets clearer.

Q and K give the contribution proportions. They set up the relationship. They work out who will be influenced by whom and by how much. V gives the content of that contribution.

So on the Q and K side there’s only the answer to “how much.” On the V side, for the first time, the answer to “what” enters the picture. Without Q and K there’s no direction; without V there’s no content.

The moment that feels like something clicking is right here. At first attention looks more like the question “which token is being looked at?” But that’s incomplete. Because finding the direction to look in is one thing, and deciding which representation to pull from that direction is another.

The real power of the mechanism is precisely in keeping these two separate. First the relational order is set up, then content is given to that relationship.
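Put together, the two stages fit in a few lines. This is a minimal single-head sketch under the same assumptions as the standard scaled dot-product formulation (random weights, no masking, no multi-head split, no output projection); `self_attention` is a hypothetical helper, not a library function.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # numerical stability only
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Stage 1: the relational order. Only Q and K are involved.
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    # Stage 2: content is poured into that order. Only here does V appear.
    return weights, weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                                  # 3 tokens, stand-in values
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))  # random, not learned

weights, output = self_attention(X, W_q, W_k, W_v)

# Zeroing W_v leaves the weights untouched but empties the output.
same_weights, empty = self_attention(X, W_q, W_k, np.zeros_like(W_v))
print(np.allclose(weights, same_weights), np.allclose(empty, 0.0))
```

The last two lines are the separation in miniature: the relationships survive intact without V, but nothing travels across them.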

Looking at it this way, Q, K, V no longer feel like three technical letters with meanings assigned from the start. They look more like three views of the same token representation, carved out for three different jobs.

Q and K set up the skeleton. V actually puts something inside that skeleton.

The joke in the title works exactly here. Of course V is not all you need. Without Q and K there’s no attention to begin with. But once the Q and K side is in place, the feeling that the knot really gets untied at V is still very strong.

Because that’s where the structure finally takes on real content.
