Next-Token Prediction
The core mechanism of language models: given a sequence of tokens, compute a probability distribution over the entire vocabulary and select the next token — then repeat.
What is it?
Every response a language model generates — whether it is a poem, a code function, or a factual answer — is produced by one operation repeated hundreds or thousands of times: predict the next token.1
The model receives a sequence of tokens (your prompt, plus any tokens it has already generated). It processes this sequence through its neural network layers and produces a probability distribution — a list of every token in its vocabulary, each with a probability between 0 and 1, all summing to 1.0. The model then selects one token from this distribution, appends it to the sequence, and runs the entire process again to predict the token after that.2
This is not a metaphor. It is literally all the model does. There is no separate “understanding” module, no “reasoning” engine, no “planning” system. Everything the model produces emerges from repeated next-token prediction. The sophistication comes from the scale of the patterns it learned during training and the depth of the neural network that computes the probabilities.1
In plain terms
Next-token prediction is the world’s most powerful autocomplete. Your phone suggests three words; a language model ranks its entire vocabulary of 100,000+ tokens by likelihood and picks one. It does this once per token in the response, building the output one piece at a time.
At a glance
The autoregressive generation loop
```mermaid
graph LR
    A[Input tokens] --> B[Model]
    B --> C[Probability distribution]
    C --> D[Select token]
    D -->|append to input| A
    D --> E[Output token]
    style B fill:#4a9ede,color:#fff
```
Key: The model takes the current token sequence, produces probabilities for the next token, selects one, and feeds the extended sequence back in. This loop is called autoregressive generation — each prediction depends on all previous predictions.
How does it work?
1. From tokens to logits
The model receives a sequence of token IDs and processes them through its layers (embeddings, attention, feed-forward networks). The final layer produces a logit for each token in the vocabulary — a raw numerical score indicating how likely that token is to come next.2
Logits are not probabilities. They are unbounded numbers: a high logit means the model considers that token very likely; a negative logit means unlikely. The logit for “mat” after “The cat sat on the” might be 5.2, while the logit for “quantum” might be -3.8.3
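As a concrete illustration of this step, here is how you might inspect the raw scores yourself. This is a minimal sketch, assuming the Hugging Face transformers library and GPT-2 as a small stand-in model; neither is prescribed by the text above, and the logit values you see will differ from the illustrative 5.2 and -3.8.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal language model works here; GPT-2 is just small and easy to download.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode the prompt into token IDs.
inputs = tokenizer("The cat sat on the", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# logits has shape (batch, sequence_length, vocab_size);
# the last position holds the raw scores for whatever token comes next.
next_token_logits = outputs.logits[0, -1]
print(next_token_logits.shape)   # torch.Size([50257]): one score per token in GPT-2's vocabulary
```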
2. Softmax — turning scores into probabilities
The softmax function converts logits into a proper probability distribution:3
- Exponentiate each logit (making all values positive)
- Divide each value by the sum of all values
- The result: every token gets a probability between 0 and 1, and all probabilities sum to 1.0
| Token | Logit | After softmax |
|---|---|---|
| mat | 5.2 | 0.18 |
| floor | 4.1 | 0.06 |
| bed | 3.5 | 0.033 |
| table | 3.1 | 0.022 |
| … | … | … |
| quantum | -3.8 | 0.00002 |
The softmax amplifies differences: tokens with high logits get disproportionately more probability than tokens with low logits.
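A minimal sketch of the softmax step, using the illustrative logits from the table. Because this toy example normalises over only five tokens rather than a full 100,000-plus-token vocabulary, the resulting probabilities come out much larger than those shown above; the mechanics are the same.

```python
import torch

# Illustrative logits for a handful of candidate tokens: mat, floor, bed, table, quantum.
# A real model produces one logit for every entry in its vocabulary.
logits = torch.tensor([5.2, 4.1, 3.5, 3.1, -3.8])

# Softmax: exponentiate every logit, then divide by the sum so everything adds up to 1.0.
probs = torch.exp(logits) / torch.exp(logits).sum()
# Equivalent built-in: torch.softmax(logits, dim=-1)

print(probs)         # roughly [0.610, 0.203, 0.112, 0.075, 0.0001]
print(probs.sum())   # 1.0
```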
3. Sampling — choosing a token
With probabilities in hand, the model must pick a single token. Several strategies exist:4
Greedy decoding: Always pick the highest-probability token. Deterministic but often repetitive and dull.
Random sampling: Pick a token randomly according to the probabilities. “Mat” (0.18) gets picked 18% of the time, “floor” (0.06) gets picked 6% of the time. More varied but can produce incoherent text.
Temperature sampling: Divide all logits by a temperature value before applying softmax. This reshapes the distribution:5
- Temperature < 1.0: Probabilities concentrate on the top tokens. Output becomes more focused and predictable.
- Temperature = 1.0: The original distribution. Balanced.
- Temperature > 1.0: Probabilities flatten out. Low-probability tokens get a real chance. Output becomes more creative but potentially erratic.
```mermaid
graph LR
    A[Low temp 0.2] -->|focused| B[Nearly deterministic]
    C[Medium temp 1.0] -->|balanced| D[Natural variety]
    E[High temp 1.8] -->|flat| F[Wild and creative]
    style B fill:#4a9ede,color:#fff
    style D fill:#5cb85c,color:#fff
    style F fill:#e8b84b,color:#fff
```
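The sketch below contrasts the three strategies on the same toy distribution. It is an illustration under the assumptions above, not a prescribed implementation.

```python
import torch

tokens = ["mat", "floor", "bed", "table", "quantum"]
logits = torch.tensor([5.2, 4.1, 3.5, 3.1, -3.8])   # toy scores; a real model has one per vocabulary entry

# Greedy decoding: always take the single highest-scoring token.
greedy = tokens[torch.argmax(logits).item()]

# Random sampling: draw a token in proportion to its softmax probability.
probs = torch.softmax(logits, dim=-1)
sampled = tokens[torch.multinomial(probs, num_samples=1).item()]

# Temperature sampling: divide the logits by a temperature before softmax.
# Temperatures below 1.0 sharpen the distribution; above 1.0 flatten it.
def sample_with_temperature(logits, temperature):
    probs = torch.softmax(logits / temperature, dim=-1)
    return tokens[torch.multinomial(probs, num_samples=1).item()]

print("greedy:  ", greedy)
print("sampled: ", sampled)
print("temp 0.2:", sample_with_temperature(logits, 0.2))   # almost always "mat"
print("temp 1.8:", sample_with_temperature(logits, 1.8))   # underdogs win far more often
```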
Think of it like...
A weighted lottery. Every token in the vocabulary has a ticket, but high-probability tokens have more tickets. Temperature adjusts how many extra tickets the favourites get. Low temperature gives the favourite almost all the tickets. High temperature redistributes tickets more evenly, giving underdogs a chance.
4. Top-k and top-p — filtering before sampling
Two additional controls restrict which tokens are eligible:4
Top-k: Only consider the k most probable tokens. If k=40, the model ignores everything outside the top 40, no matter how reasonable they might be.
Top-p (nucleus sampling): Only consider tokens whose cumulative probability reaches p. If p=0.9, sort tokens by probability and include tokens until their probabilities sum to 0.9. This adapts: when the model is confident, few tokens are included; when it is uncertain, more are included.
These filters prevent the model from selecting wildly improbable tokens (like “quantum” after “The cat sat on the”) while still allowing variety among reasonable options.
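Here is one way the two filters might be applied to the toy distribution from earlier. Production inference libraries implement equivalent logic with minor variations, for example in exactly how the token that crosses the p threshold is handled.

```python
import torch

def top_k_filter(probs, k):
    # Keep only the k most probable tokens, then renormalise the survivors.
    filtered = torch.zeros_like(probs)
    values, indices = torch.topk(probs, k)
    filtered[indices] = values
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of most-probable tokens whose cumulative probability reaches p.
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # A token survives if the mass *before* it is still below p, so the set just reaches p.
    keep = (cumulative - sorted_probs) < p
    filtered = torch.zeros_like(probs)
    filtered[sorted_indices[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()

probs = torch.softmax(torch.tensor([5.2, 4.1, 3.5, 3.1, -3.8]), dim=-1)
print(top_k_filter(probs, k=2))    # only the two widest channels stay open
print(top_p_filter(probs, p=0.9))  # just enough tokens to cover 90% of the probability mass
```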
5. The autoregressive loop
Generation is fundamentally sequential. The model predicts one token, appends it, then predicts the next. This means:1
- Each token is influenced by every token before it (including tokens the model itself generated)
- Errors compound: if the model generates a wrong token early, subsequent predictions are conditioned on that error
- Generation speed is limited by this sequential dependency — the model cannot predict token 50 until it has predicted tokens 1 through 49
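Putting the pieces together, the sketch below runs this loop by hand, again assuming the Hugging Face transformers library and GPT-2. It recomputes the whole sequence at every step for clarity; real inference engines cache intermediate results, but the sequential dependency is the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids

for _ in range(20):                                    # generate at most 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]        # raw scores for the next token only
    probs = torch.softmax(logits / 0.8, dim=-1)        # temperature 0.8
    next_id = torch.multinomial(probs, num_samples=1)  # sample one token ID
    if next_id.item() == tokenizer.eos_token_id:       # stop condition: end-of-sequence token
        break
    # Append the chosen token and feed the longer sequence back in: autoregression.
    input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```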
Why do we use it?
Key reasons
1. Universality. Next-token prediction is a single objective that produces models capable of translation, summarisation, code generation, question answering, and creative writing. No task-specific architecture is needed — the same prediction mechanism handles all of them.1
2. Scalability. The training objective is simple (predict the next token) and the data is abundant (all text ever written). This combination allows training at a scale impossible for more complex objectives.2
3. Emergent capabilities. When next-token prediction is trained at sufficient scale, models develop abilities that were never explicitly taught — reasoning, arithmetic, code generation, multilingual translation. These emerge from the statistical patterns in language.1
When do we use it?
- When generating any text from an LLM — every word in the response was selected through this process
- When adjusting temperature or sampling parameters — you are directly controlling how the probability distribution is shaped
- When understanding hallucination — the model produces confident-sounding text because it selects high-probability tokens, even when those tokens describe things that are not true
- When evaluating model output — knowing that generation is probabilistic helps you understand why the same prompt produces different outputs each time
Rule of thumb
The model does not “decide” to write something. It predicts the most likely next token, given everything before it. If the output is wrong, it means the probability distribution pointed the wrong way — not that the model “knew better.”
How can I think about it?
The sentence-completion game
Imagine a game where one person starts a sentence and another person must continue it, one word at a time, choosing the most natural continuation.
- “The cat sat on the…” — most people say “mat” or “floor”
- The model plays this game hundreds or thousands of times per response, predicting one token at a time
- Each prediction uses every word so far — not just the last few
- Temperature controls how adventurous the player is: low temperature means they always pick the obvious word; high temperature means they sometimes pick surprising ones
- The game never ends until a stop condition is met (a special end-of-sequence token, a maximum length, or a stop word)
The probability waterfall
Picture a waterfall of coloured water. At the top, the model pours probability (a total of 1.0) into a wide basin. The basin has thousands of channels, one for each token in the vocabulary.
- Most water flows through a few channels — the tokens the model considers most likely
- A trickle goes through many channels — tokens that are technically possible but unlikely
- Temperature widens or narrows the basin — high temperature spreads water more evenly; low temperature funnels it into fewer channels
- Top-k blocks all but k channels — only the widest channels remain open
- The model picks one channel (one token) proportional to how much water flows through it
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| transformer-architecture | The neural network architecture that computes the probability distribution | complete |
| pre-training | How the model learns to predict the next token from billions of text pages | complete |
| hallucination | What happens when high-probability tokens describe things that are not true | complete |
Check your understanding
Test yourself
- Explain what “autoregressive generation” means and why the model must generate tokens sequentially.
- Describe the three steps that turn a token sequence into a probability distribution: logits, softmax, and sampling.
- Distinguish between greedy decoding, temperature sampling, and top-p sampling. In what situations would you prefer each?
- Interpret this scenario: you set temperature to 2.0 and the model produces nonsensical output. Explain what happened to the probability distribution and why the output quality degraded.
- Connect next-token prediction to hallucination. Why does a model sometimes produce confident-sounding false statements, given that it is selecting high-probability tokens?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    AI[AI and Machine Learning] --> NTP[Next-Token Prediction]
    AI --> TOK[Tokenization]
    AI --> EMB[Embeddings]
    AI --> TRA[Transformer Architecture]
    AI --> HAL[Hallucination]
    TOK -.->|feeds| NTP
    EMB -.->|feeds| NTP
    TRA -.->|computes| NTP
    style NTP fill:#4a9ede,color:#fff
```
Related concepts:
- hallucination — hallucination occurs when next-token prediction produces plausible-sounding but factually incorrect sequences
- transformer-architecture — the transformer is the neural network that computes the probability distribution over the vocabulary
- pre-training — the model learns to predict the next token during pre-training on massive text corpora
Sources
Further reading
Resources
- LLM Temperature, Top-P, and Top-K Explained (Machine Learning Plus) — Practical guide with Python simulations showing how each parameter reshapes the distribution
- LLM breakdown 2/6: Logits and next-token prediction (Mike X Cohen) — Clear walkthrough of the math from logits to probabilities
- How Temperature, Top-K, Top-P, and Min-P Control LLM Output (Ken Muse) — Technical reference covering all major sampling parameters
- LLM Sampling Parameters Guide (smcleod.net) — Comprehensive guide to all parameters that control generation
Footnotes
1. Wolfe, C. (2025). Next token prediction thread. X/Twitter.
2. WWW Insights. (2025). How Large Language Models Really Work: Next-Token Prediction at the Core. WWW Insights.
3. Cohen, M. (2025). LLM breakdown 2/6: Logits and next-token prediction. Mike X Cohen’s Substack.
4. Muse, K. (2025). How Temperature, Top-K, Top-P, and Min-P Control LLM Output. Ken Muse Blog.
5. IBM. (2025). What is LLM Temperature? IBM.
