Next-Token Prediction

The core mechanism of language models: given a sequence of tokens, compute a probability distribution over the entire vocabulary and select the next token — then repeat.


What is it?

Every response a language model generates — whether it is a poem, a code function, or a factual answer — is produced by one operation repeated hundreds or thousands of times: predict the next token.1

The model receives a sequence of tokens (your prompt, plus any tokens it has already generated). It processes this sequence through its neural network layers and produces a probability distribution — a list of every token in its vocabulary, each with a probability between 0 and 1, all summing to 1.0. The model then selects one token from this distribution, appends it to the sequence, and runs the entire process again to predict the token after that.2

This is not a metaphor. It is literally all the model does. There is no separate “understanding” module, no “reasoning” engine, no “planning” system. Everything the model produces emerges from repeated next-token prediction. The sophistication comes from the scale of the patterns it learned during training and the depth of the neural network that computes the probabilities.1

In plain terms

Next-token prediction is the world’s most powerful autocomplete. Your phone suggests three words; a language model ranks its entire vocabulary of 100,000+ tokens by likelihood and picks one. It does this once per token in the response, building the output one piece at a time.


How does it work?

1. From tokens to logits

The model receives a sequence of token IDs and processes them through its layers (embeddings, attention, feed-forward networks). The final layer produces a logit for each token in the vocabulary — a raw numerical score indicating how likely that token is to come next.2

Logits are not probabilities. They are unbounded numbers: a high logit means the model considers that token very likely; a negative logit means unlikely. The logit for “mat” after “The cat sat on the” might be 5.2, while the logit for “quantum” might be -3.8.3

2. Softmax — turning scores into probabilities

The softmax function converts logits into a proper probability distribution:3

  1. Exponentiate each logit (making all values positive)
  2. Divide each value by the sum of all values
  3. The result: every token gets a probability between 0 and 1, and all probabilities sum to 1.0
| Token | Logit | After softmax |
|---|---|---|
| mat | 5.2 | 0.18 |
| floor | 4.1 | 0.12 |
| bed | 3.5 | 0.09 |
| table | 3.1 | 0.07 |
| quantum | -3.8 | 0.0001 |

The softmax amplifies differences: tokens with high logits get disproportionately more probability than tokens with low logits.
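The softmax steps above can be sketched in a few lines of Python. This sketch normalises over just the five example tokens rather than a full vocabulary, so the resulting probabilities differ from the table, but the mechanics are identical:

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating — a standard trick for
    # numerical stability that does not change the resulting probabilities
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for "The cat sat on the ..."
logits = {"mat": 5.2, "floor": 4.1, "bed": 3.5, "table": 3.1, "quantum": -3.8}
probs = dict(zip(logits, softmax(list(logits.values()))))
# probs sum to 1.0; "mat" gets the largest share, "quantum" almost none
```

Note how the exponentiation does the amplifying: "mat" leads "floor" by only 1.1 in logit space, but receives roughly three times the probability.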

3. Sampling — choosing a token

With probabilities in hand, the model must pick a single token. Several strategies exist:4

Greedy decoding: Always pick the highest-probability token. Deterministic but often repetitive and dull.

Random sampling: Pick a token randomly according to the probabilities. “mat” (0.18) gets picked 18% of the time, “floor” (0.12) gets picked 12% of the time. More varied, but can produce incoherent text.
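The two strategies can be contrasted in a short sketch, reusing the illustrative probabilities from the table above. `random.choices` normalises the weights itself, which conveniently matches the fact that these four tokens are only a slice of the full vocabulary:

```python
import random

def greedy(probs):
    # Deterministic: always return the single most likely token
    return max(probs, key=probs.get)

def sample(probs):
    # Stochastic: draw one token with probability proportional to its weight
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"mat": 0.18, "floor": 0.12, "bed": 0.09, "table": 0.07}
greedy(probs)  # always "mat"
sample(probs)  # usually "mat", sometimes "floor", "bed", or "table"
```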

Temperature sampling: Divide all logits by a temperature value before applying softmax. This reshapes the distribution:5

  • Temperature < 1.0: Probabilities concentrate on the top tokens. Output becomes more focused and predictable.
  • Temperature = 1.0: The original distribution. Balanced.
  • Temperature > 1.0: Probabilities flatten out. Low-probability tokens get a real chance. Output becomes more creative but potentially erratic.
```mermaid
graph LR
    A[Low temp 0.2] -->|focused| B[Nearly deterministic]
    C[Medium temp 1.0] -->|balanced| D[Natural variety]
    E[High temp 1.8] -->|flat| F[Wild and creative]

    style B fill:#4a9ede,color:#fff
    style D fill:#5cb85c,color:#fff
    style F fill:#e8b84b,color:#fff
```
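A minimal sketch of temperature scaling: divide the logits by the temperature, then apply a standard softmax. The logit values are illustrative:

```python
import math

def softmax_t(logits, temperature=1.0):
    # Divide each logit by the temperature, then apply softmax
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.2, 4.1, 3.5]       # illustrative scores for three tokens
cold = softmax_t(logits, 0.2)  # low temperature: top token dominates
warm = softmax_t(logits, 1.0)  # the unchanged distribution
hot = softmax_t(logits, 1.8)   # high temperature: flatter distribution
```

At temperature 0.2 the top token absorbs nearly all the probability; at 1.8 the same three logits produce a much flatter distribution.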

Think of it like...

A weighted lottery. Every token in the vocabulary has a ticket, but high-probability tokens have more tickets. Temperature adjusts how many extra tickets the favourites get. Low temperature gives the favourite almost all the tickets. High temperature redistributes tickets more evenly, giving underdogs a chance.

4. Top-k and top-p — filtering before sampling

Two additional controls restrict which tokens are eligible:4

Top-k: Only consider the k most probable tokens. If k=40, the model ignores everything outside the top 40, no matter how reasonable they might be.

Top-p (nucleus sampling): Only consider the smallest set of top tokens whose cumulative probability reaches p. If p=0.9, sort tokens by descending probability and include them until their probabilities sum to at least 0.9. This adapts to the model’s confidence: when the model is confident, few tokens are included; when it is uncertain, more are included.

These filters prevent the model from selecting wildly improbable tokens (like “quantum” after “The cat sat on the”) while still allowing variety among reasonable options.
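Both filters can be sketched over a toy distribution. The probability values here are hypothetical and chosen to sum to 1.0; real implementations apply the same idea to the full vocabulary:

```python
def top_k(probs, k):
    # Keep only the k most probable tokens, then renormalise
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

def top_p(probs, p):
    # Keep the smallest set of top tokens whose cumulative probability
    # reaches p, then renormalise
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

probs = {"mat": 0.50, "floor": 0.25, "bed": 0.15, "table": 0.09, "quantum": 0.01}
top_k(probs, 2)   # only "mat" and "floor" survive
top_p(probs, 0.9) # "mat", "floor", "bed" — "quantum" is excluded
```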

5. The autoregressive loop

Generation is fundamentally sequential. The model predicts one token, appends it, then predicts the next. This means:1

  • Each token is influenced by every token before it (including tokens the model itself generated)
  • Errors compound: if the model generates a wrong token early, subsequent predictions are conditioned on that error
  • Generation speed is limited by this sequential dependency — the model cannot predict token 50 until it has predicted tokens 1 through 49
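The loop itself is tiny. In this sketch, `model` is a stand-in for the whole network: any function that maps a token sequence to the next token. The toy model below simply counts upward, which is enough to show the predict-append-repeat structure and the stop conditions:

```python
def generate(model, prompt_tokens, max_new_tokens, eos_token):
    # Autoregressive loop: predict one token, append it, repeat
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(tokens)   # condition on everything so far
        tokens.append(next_token)
        if next_token == eos_token:  # end-of-sequence stop condition
            break
    return tokens

# Toy stand-in "model": always predicts the last token plus one
counter = lambda tokens: tokens[-1] + 1
generate(counter, [0], max_new_tokens=10, eos_token=3)  # → [0, 1, 2, 3]
```

Everything discussed above — softmax, temperature, top-k, top-p, sampling — lives inside the `model(tokens)` call; the outer loop never changes.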

Why do we use it?

Key reasons

1. Universality. Next-token prediction is a single objective that produces models capable of translation, summarisation, code generation, question answering, and creative writing. No task-specific architecture is needed — the same prediction mechanism handles all of them.1

2. Scalability. The training objective is simple (predict the next token) and the data is abundant (all text ever written). This combination allows training at a scale impossible for more complex objectives.2

3. Emergent capabilities. When next-token prediction is trained at sufficient scale, models develop abilities that were never explicitly taught — reasoning, arithmetic, code generation, multilingual translation. These emerge from the statistical patterns in language.1


When do we use it?

  • When generating any text from an LLM — every word in the response was selected through this process
  • When adjusting temperature or sampling parameters — you are directly controlling how the probability distribution is shaped
  • When understanding hallucination — the model produces confident-sounding text because it selects high-probability tokens, even when those tokens describe things that are not true
  • When evaluating model output — knowing that generation is probabilistic helps you understand why the same prompt produces different outputs each time

Rule of thumb

The model does not “decide” to write something. It predicts the most likely next token, given everything before it. If the output is wrong, it means the probability distribution pointed the wrong way — not that the model “knew better.”


How can I think about it?

The sentence-completion game

Imagine a game where one person starts a sentence and another person must continue it, one word at a time, choosing the most natural continuation.

  • “The cat sat on the…” — most people say “mat” or “floor”
  • The model plays this game hundreds or thousands of times per response, predicting one token at a time
  • Each prediction uses every word so far — not just the last few
  • Temperature controls how adventurous the player is: low temperature means they always pick the obvious word; high temperature means they sometimes pick surprising ones
  • The game never ends until a stop condition is met (a special end-of-sequence token, a maximum length, or a stop word)

The probability waterfall

Picture a waterfall of coloured water. At the top, the model pours probability (a total of 1.0) into a wide basin. The basin has thousands of channels, one for each token in the vocabulary.

  • Most water flows through a few channels — the tokens the model considers most likely
  • A trickle goes through many channels — tokens that are technically possible but unlikely
  • Temperature widens or narrows the basin — high temperature spreads water more evenly; low temperature funnels it into fewer channels
  • Top-k blocks all but k channels — only the widest channels remain open
  • The model picks one channel (one token) proportional to how much water flows through it

Concepts to explore next

| Concept | What it covers | Status |
|---|---|---|
| transformer-architecture | The neural network architecture that computes the probability distribution | complete |
| pre-training | How the model learns to predict the next token from billions of text pages | complete |
| hallucination | What happens when high-probability tokens describe things that are not true | complete |


Where this concept fits

Position in the knowledge graph

```mermaid
graph TD
    AI[AI and Machine Learning] --> NTP[Next-Token Prediction]
    AI --> TOK[Tokenization]
    AI --> EMB[Embeddings]
    AI --> TRA[Transformer Architecture]
    AI --> HAL[Hallucination]
    TOK -.->|feeds| NTP
    EMB -.->|feeds| NTP
    TRA -.->|computes| NTP
    style NTP fill:#4a9ede,color:#fff
```

Related concepts:

  • hallucination — hallucination occurs when next-token prediction produces plausible-sounding but factually incorrect sequences
  • transformer-architecture — the transformer is the neural network that computes the probability distribution over the vocabulary
  • pre-training — the model learns to predict the next token during pre-training on massive text corpora

Footnotes

  1. Wolfe, C. (2025). Next token prediction thread. X/Twitter.

  2. WWW Insights. (2025). How Large Language Models Really Work: Next-Token Prediction at the Core. WWW Insights.

  3. Cohen, M. (2025). LLM breakdown 2/6: Logits and next-token prediction. Mike X Cohen’s Substack.

  4. Muse, K. (2025). How Temperature, Top-K, Top-P, and Min-P Control LLM Output. Ken Muse Blog.

  5. IBM. (2025). What is LLM Temperature? IBM.