Reasoning effort tends to be described as a technical setting: a budget that decides how many “thinking” tokens the model is allowed to spend before producing the final answer. Low, medium, high. More budget → deeper thinking → better results on hard problems. Not wrong at first glance.
But settling for that framing keeps the actual mechanism out of view. Because once it’s called a “budget,” it starts to feel as if there’s a control valve sitting somewhere inside the model. As if a piece of architecture has been wired up to regulate reasoning effort, and turning the knob makes the model think longer or shorter. The real mechanism sits somewhere else entirely.
A better starting point is this:
If the model can choose to think longer or shorter, where did that ability come from?
The answer isn’t in the architecture, it’s in the training process. More precisely: reasoning effort is not an architectural feature, it’s a behavior the model acquires through reinforcement learning. Once that’s accepted, the rest of the pieces start clicking into place.
First, an observation. A standard LLM produces its answer token by token. At each step it computes a probability distribution, samples a token from it, continues. That’s the entirety of what’s called autoregressive generation. The same mechanic applies to reasoning models; no extra engine has been bolted on.
The only difference is that some models wrap part of this generation inside <think>...</think> tags. Tokens between the tags stay hidden, tokens outside become visible. But the underlying mechanic is one and the same: token prediction, token prediction, token prediction.
So “thinking longer” isn’t really a separate mechanism. There’s only the late arrival of the </think> token. If the model produces it early, it thought briefly. If it produces it late, it thought at length. The whole story is locked inside that single decision.
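A minimal sketch of that idea, assuming a toy stand-in for the model: toy_next_token_distribution is hypothetical and only returns made-up probabilities, where a real model would run a forward pass. The point it illustrates is that thinking length is simply the position at which </think> happens to be sampled.

```python
import random

THINK_OPEN, THINK_CLOSE, EOS = "<think>", "</think>", "<eos>"

def toy_next_token_distribution(context):
    """Hypothetical stand-in for a transformer forward pass: {token: probability}."""
    if THINK_CLOSE not in context:
        # While thinking, closing the block is just one more token that may be sampled.
        return {"step": 0.8, THINK_CLOSE: 0.2}
    return {"answer": 0.5, EOS: 0.5}

def generate(rng, max_tokens=200):
    """Plain autoregressive loop: compute a distribution, sample, append, repeat."""
    context = [THINK_OPEN]
    while len(context) < max_tokens:
        dist = toy_next_token_distribution(context)
        tokens, probs = zip(*dist.items())
        token = rng.choices(tokens, weights=probs, k=1)[0]
        context.append(token)
        if token == EOS:
            break
    return context

def split_thinking(tokens):
    """Return (hidden thinking tokens, visible tokens)."""
    if THINK_CLOSE not in tokens:
        return tokens[1:], []
    close = tokens.index(THINK_CLOSE)
    return tokens[1:close], tokens[close + 1:]

tokens = generate(random.Random(0))
hidden, visible = split_thinking(tokens)
thinking_length = len(hidden)  # how long the model "thought" in this sample
```

Raising the probability of </think> in the toy distribution shortens the thinking on average, and lowering it lengthens it. Nothing else in the loop changes.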
A new question follows naturally:
How does the model decide when to produce </think>?
The decision comes from training. More specifically, from reinforcement learning.
Pretraining ends. The model has learned a distribution over language, the world, roughly everything. But its approach to a question is inconsistent. Sometimes it cuts to the answer, sometimes it walks through intermediate steps, sometimes it catches a mistake, sometimes it doesn’t.
At this point, what you want to tell the model is: behaviors that lead to correct answers should be reinforced, behaviors that don’t should fade. Easy to say. The question is how to implement it technically.
The first thing that comes to mind is to write correct reasoning examples by hand and have the model imitate them. Classical supervised fine-tuning. It works. But the ceiling is low because the model only imitates human-authored reasoning patterns, never discovering its own. And every example takes human labor; it doesn’t scale.
RL breaks past this. Because the logic of RL isn’t imitation:
Give the model a question. Let it produce its own answer. Check automatically whether the result is right or wrong. Nudge its weights toward the reward.
The crucial thing here is that the model learns from its own rollouts. There’s no pre-written reasoning trace. There’s only the question and the final answer. The path of thinking in between is something the model picks for itself.
The mechanic, roughly, goes like this. A batch of questions is pulled. For each question, the model produces a group of answers, say 16. Same question, but thanks to sampling randomness, 16 different paths. Some land on the correct answer, others don’t. A verifier (an answer-checker for math, a sandbox for code) assigns each answer a score. One if correct, zero if not.
The scores are then normalized within the group. Correct ones get a positive advantage, incorrect ones get a negative advantage. The advantage determines the direction of the gradient. The probability of every token choice inside a positive-advantage answer goes up a notch; the probability of every token choice inside a negative-advantage answer goes down a notch. The loss is built on this principle, the gradient flows back into the transformer’s weights. One step has been taken. Then another batch, other rollouts, another step. Thousands of times.
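The scoring and normalization step can be sketched in a few lines. This is an illustration under stated assumptions, not any particular lab's implementation: verify and group_advantages are hypothetical helper names, the verifier is a plain string comparison, and the group is four rollouts instead of 16 to keep it readable.

```python
from statistics import mean, pstdev

def verify(answer, expected):
    """Toy verifier: 1.0 if the final answer matches, 0.0 otherwise."""
    return 1.0 if answer.strip() == expected.strip() else 0.0

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled final answers to "137 x 24 = ?" (the correct result is 3288).
rollout_answers = ["3288", "3288", "3298", "3288"]
rewards = [verify(a, "3288") for a in rollout_answers]
advantages = group_advantages(rewards)

# If all rollouts agree, every advantage is zero and the group gives no signal.
no_signal = group_advantages([1.0, 1.0, 1.0, 1.0])
```

In a real training step each advantage would then weight the log-probabilities of every token in its rollout. That part is omitted here because it needs an actual model.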
Parts of this mechanic might look unfamiliar. But mathematically there’s nothing new to it. What RL is doing here is classical Monte Carlo gradient estimation.
What the model wants to maximize is expected reward. Expected reward is a sum over all possible answer sequences, each weighted by its probability. Not something that can be computed directly, because the number of possible sequences is astronomical. So instead, N samples are drawn, and their average is used as the approximation.
The 16 rollouts are exactly those samples. Expensive to produce, but without them there’s no gradient signal. Within-group normalization is also a classical variance reduction technique: subtracting the group mean acts as a baseline, which leaves the expected gradient essentially unchanged and just reduces the noise of the estimate. The last thirty years of RL research can largely be read as “ways to estimate this gradient with less variance.”
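Written out in standard policy-gradient notation, the objective and its sampled estimate look roughly like this. The symbols are introduced here for illustration: x is the question, y an answer sequence, \pi_\theta the model's distribution over answers, R the verifier's score, and N the number of rollouts.

```latex
\[
\begin{aligned}
J(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(y) \big]
           = \sum_{y} \pi_\theta(y \mid x)\, R(y) \\
\nabla_\theta J(\theta)
          &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big] \\
          &\approx \frac{1}{N} \sum_{i=1}^{N} R(y_i)\, \nabla_\theta \log \pi_\theta(y_i \mid x),
           \qquad y_i \sim \pi_\theta(\cdot \mid x)
\end{aligned}
\]
```

The sum in the first line runs over every possible answer sequence, which is why it cannot be evaluated directly. The last line replaces it with an average over N sampled answers.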
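A toy experiment makes the variance claim concrete. The setup is an assumption made for illustration: a three-way softmax over "strategies" stands in for the model, only strategy 0 is rewarded, and a fixed baseline equal to the true expected reward stands in for the group mean (the group mean computed from the rollouts themselves is a noisy approximation of it). sample_gradient is a hypothetical helper that returns a one-sample estimate of the gradient with respect to the first logit.

```python
import math
import random
from statistics import mean, pvariance

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_gradient(rng, probs, rewards, baseline):
    """One-sample score-function estimate of d(expected reward) / d(logit 0)."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Derivative of log softmax probability of `action` with respect to logit 0.
    score = (1.0 if action == 0 else 0.0) - probs[0]
    return (rewards[action] - baseline) * score

probs = softmax([0.0, 0.0, 0.0])   # three equally likely strategies
rewards = [1.0, 0.0, 0.0]          # only strategy 0 reaches the right answer
expected_reward = sum(p * r for p, r in zip(probs, rewards))
# Closed form for a softmax policy: pi_0 * (r_0 - expected reward).
exact_gradient = probs[0] * (rewards[0] - expected_reward)

rng = random.Random(0)
plain = [sample_gradient(rng, probs, rewards, 0.0) for _ in range(20000)]
centered = [sample_gradient(rng, probs, rewards, expected_reward) for _ in range(20000)]

plain_mean, plain_var = mean(plain), pvariance(plain)
centered_mean, centered_var = mean(centered), pvariance(centered)
```

Both averages should land near exact_gradient, while the centered samples scatter far less around it. That is the sense in which a baseline changes the noise but not the target.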
So the mystery of RL is smaller than it seems. A gradient estimate, a variance-reduction trick, a layer of engineering built on top.
A small SFT phase called cold-start is also done right before this loop. A few thousand reasoning examples settle the <think>...</think> format into the model. The real work doesn’t happen at this stage; it just brings the model to a point where RL can begin.
RL without cold-start is also possible — DeepSeek-R1-Zero is the proof. It works, but the output quality is poor; languages get mixed, the format wobbles. Cold-start is an initial push given to RL’s engine before it starts running. Not to teach knowledge, but to settle the format.
The important distinction is here: the contents of the cold-start data are taken as “correct” because that’s how they’re presented. That’s the basic assumption of supervised learning. But the assumption isn’t teaching much; it’s just establishing a starting point so the model can produce more useful rollouts during RL. The actual reasoning capability develops inside the RL loop, with the model earning rewards from its own rollouts.
There’s something worth looking at more closely here. Mathematically, 16 rollouts are a sample for gradient estimation, yes. But mechanically they do another job as well.
Producing 16 answers to the same question means trying 16 different approaches. Not just word variations. On “137 × 24 = ?” one tries distribution, another approximation, another column multiplication, another a factoring attempt. Each is a distinct reasoning template.
Where does this diversity come from? Pretraining. The model ate the internet, and every kind of reasoning pattern was already passing through there. Discriminant, factoring, “start with an approximation and correct it,” “let me check that again” — all of these are sitting in the pretraining pool. The model knows the patterns; it just hasn’t learned, systematically, when each one works.
Sampling randomness acts as the exploration engine. In each rollout the model picks a slightly different pattern from this pool. Some lead to the correct answer, some don’t. The reward signal catches the distinction and reflects it in the gradient. Which approach works on what kind of question — the model learns this over thousands of iterations.
An interesting consequence follows from this:
Through RL, the model can learn to use reasoning paths that were never demonstrated to it in supervised examples.
In classical supervised learning, the model is bound by the ceiling of the examples it sees. If the dataset contains examples using the discriminant, it learns that; if not, it doesn’t. RL is different because the reward doesn’t say “use this method,” it says “here’s where you need to land.” Method selection is left to the model. Which method works, the distribution of rewards tells it.
The structural consequence is significant. SFT’s ceiling is bounded by what human trainers wrote. RL’s ceiling is bounded by the breadth of the pretraining pool. The pretraining pool is far wider than what humans can write one by one.
Even so, exploration isn’t unbounded. RL can’t discover a concept never seen in pretraining. RL’s job isn’t to reinvent existing patterns from scratch but to teach which one functionally works when. A more accurate phrasing might be “awakening” — teaching dormant capacities to engage in the appropriate situations.
This awakening is still powerful. The pretraining pool is so wide that simply learning when each capacity should kick in already produces an enormous gain in capability.
What emerges once all this is set up is interesting.
As training progresses, the average answer length grows on its own. No one tells the model to “think longer.” A statistical fact simply takes hold:
Long answers with verification steps and self-questioning land on correct answers more often. Those behaviors get reinforced because they get rewarded.
At some point, models spontaneously start writing things like “wait, let me reconsider.” These patterns were already present in pretraining; the internet is full of such text. But RL teaches the model to use them functionally — not just imitating them, but actually using them to catch and correct mistakes.
That’s why this is called emergent behavior. It wasn’t programmed. It surfaced on its own under the pressure of the reward.
Once training ends, the picture simplifies. Model weights are frozen. While the model serves user requests, none of that elaborate machinery is running — no rollout pool, no reward function, no advantage computation. There’s only the transformer, doing forward passes, producing tokens. Plain autoregressive generation.
Confusion is natural at this point. A reasoning model must be doing something special at inference time, one might think. It isn’t. The behavior learned during training sits etched into the weights, and it shows itself at inference.
When a request comes in, the effort parameter is translated into a conditioning signal, either text in the system prompt or a special control token. During training, the model had learned to think long under the “high effort” signal; the same signal now plays a triggering role. There’s no real “switching” going on. In the presence of that signal, due to training, the token distributions the model produces become distributions that yield a late </think>. That’s it.
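As a sketch of what "translated into a conditioning signal" can mean in practice, the example below puts the effort level into the text the model is conditioned on. The prompt layout, the bracketed role markers, and the build_prompt helper are all hypothetical; real serving stacks use their own chat templates or dedicated control tokens.

```python
EFFORT_LEVELS = ("low", "medium", "high")

def build_prompt(question, effort="medium"):
    """Render a request into the text the model actually conditions on."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    # The effort setting is not a switch inside the model; it is more input tokens.
    system = f"Reasoning effort: {effort}"
    return f"[system]\n{system}\n[user]\n{question}\n[assistant]\n<think>"

low_prompt = build_prompt("What is 137 x 24?", effort="low")
high_prompt = build_prompt("What is 137 x 24?", effort="high")
```

The two prompts differ only in one word. Any difference in thinking length has to come from how the weights respond to that word, which is exactly what training shaped.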
Generation begins. Token by token. Forward pass at each step, sampling from logits, moving on to the next. The behaviors reinforced during training surface here — intermediate calculations, verification steps, hypothesis trials, “let me check that again.” Not programmed, flowing from the weights.
There’s a token budget, and a counter is kept. Two management strategies are possible. In the soft approach the model is trusted; training has taught it to stop at certain lengths, and it’s expected to produce its own </think>. In the hard approach, when the budget is exhausted, an artificial bias is injected into the logit of the </think> token, or that token is written in deterministically. The thinking is forced to end and the model shifts into summarization mode.
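The two hard-stop variants can be sketched as follows. Everything here is a toy under stated assumptions: toy_thinking_logits is a hypothetical stand-in for the model's forward pass, the bias value is arbitrary, and the budget counts thinking tokens excluding the closing tag.

```python
import math
import random

THINK_CLOSE = "</think>"

def toy_thinking_logits(context):
    """Hypothetical forward pass: this toy model prefers to keep thinking."""
    return {"step": 2.0, THINK_CLOSE: 0.0}

def sample_token(rng, logits):
    top = max(logits.values())
    weights = [math.exp(value - top) for value in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]

def think(rng, budget, mode="force", bias=50.0):
    """Generate thinking tokens and end the block once `budget` is spent."""
    thinking = []
    while True:
        over_budget = len(thinking) >= budget
        if over_budget and mode == "force":
            thinking.append(THINK_CLOSE)      # written in deterministically
            return thinking
        logits = toy_thinking_logits(thinking)
        if over_budget and mode == "bias":
            logits[THINK_CLOSE] += bias       # makes </think> overwhelmingly likely
        token = sample_token(rng, logits)
        thinking.append(token)
        if token == THINK_CLOSE:              # the model closed the block by itself
            return thinking

forced = think(random.Random(0), budget=5, mode="force")
biased = think(random.Random(0), budget=5, mode="bias")
```

Before the budget runs out, both modes behave like the soft approach: the model is free to sample </think> on its own. The intervention only appears at the limit.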
After </think> closes, the final answer streams out to the user. The hidden thinking tokens are still on the bill — even if the user only sees the final answer, they’re paying for the thousands of tokens behind it. That’s why cost rises sharply as reasoning effort goes up.
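A back-of-the-envelope calculation shows how quickly this adds up. The prices and token counts below are invented for illustration, and billing thinking tokens at the output rate is an assumption; actual pricing varies by provider.

```python
def request_cost(prompt_tokens, thinking_tokens, answer_tokens,
                 input_price_per_million, output_price_per_million):
    """Cost of one request when hidden thinking tokens are billed as output."""
    billed_output = thinking_tokens + answer_tokens
    total = (prompt_tokens * input_price_per_million
             + billed_output * output_price_per_million)
    return total / 1_000_000

# Hypothetical prices: 1.00 per million input tokens, 4.00 per million output tokens.
low_effort = request_cost(200, 500, 300, 1.0, 4.0)     # 0.0034
high_effort = request_cost(200, 8000, 300, 1.0, 4.0)   # 0.0334
```

The visible answer is 300 tokens in both cases. Almost the entire difference in cost comes from tokens the user never sees.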
There’s an interesting symmetry as well. The engine producing 16 rollouts during training is typically the same kind of system, with the same optimizations, that handles production inference. When training says “produce parallel answers, reward the best ones,” systems doing best-of-N or self-consistency at inference time are doing the same thing: producing parallel rollouts, applying a selection mechanism on top. Both are variations on the theme of “draw more samples from the same model, filter them with gradients or with selection.” The line between them is thinner than it looks.
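The inference-time side of that symmetry fits in a few lines. This is a minimal sketch: the sampled answers are hard-coded rather than drawn from a model, and toy_score is a hypothetical stand-in for whatever verifier or reward model a real best-of-N system would use.

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over the final answers of several rollouts."""
    return Counter(final_answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    """Keep the candidate that a scoring function rates highest."""
    return max(candidates, key=score)

# Final answers from four hypothetical rollouts on "137 x 24 = ?".
sampled = ["3288", "3298", "3288", "3188"]
voted = self_consistency(sampled)

def toy_score(answer):
    # Hypothetical scorer: checks the answer by dividing it back out.
    return 1.0 if int(answer) % 137 == 0 and int(answer) // 137 == 24 else 0.0

picked = best_of_n(sampled, toy_score)
```

The sampling step is the same one used to produce training rollouts. The only difference is what happens afterwards: a selection here, a gradient update there.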
Coming back to the original question, reasoning effort feels less mysterious now.
It’s not a control inside the architecture. It’s the intensity dial of a behavior learned through RL. During training, the model was taught to reach correct answers under different effort signals. At inference, switching that signal selects a behavioral mode.
There is a token budget, yes. But a budget on its own produces nothing; it’s only a limit. The model’s ability to think longer or shorter comes from having learned, during training, to make that choice.
Reasoning is not a capability, it’s a behavior. And behavior comes not from architecture, but from the training signal. Change the shape of the reward, and the way the model thinks changes with it.
That’s why the matter doesn’t really live in the architecture — it lives in the training pipeline.