Next-Token Prediction

The core mechanism of language models: given a sequence of tokens, compute a probability distribution over the entire vocabulary and select the next token — then repeat.


What is it?

Every response a language model generates — whether it is a poem, a code function, or a factual answer — is produced by one operation repeated hundreds or thousands of times: predict the next token.1

The model receives a sequence of tokens (your prompt, plus any tokens it has already generated). It processes this sequence through its neural network layers and produces a probability distribution — a list of every token in its vocabulary, each with a probability between 0 and 1, all summing to 1.0. The model then selects one token from this distribution, appends it to the sequence, and runs the entire process again to predict the token after that.2

This is not a metaphor. It is literally all the model does. There is no separate “understanding” module, no “reasoning” engine, no “planning” system. Everything the model produces emerges from repeated next-token prediction. The sophistication comes from the scale of the patterns it learned during training and the depth of the neural network that computes the probabilities.1

In plain terms

Next-token prediction is the world’s most powerful autocomplete. Your phone suggests three words; a language model ranks its entire vocabulary of 100,000+ tokens by likelihood and picks one. It does this once per token in the response, building the output one piece at a time.


How does it work?

1. From tokens to logits

The model receives a sequence of token IDs and processes them through its layers (embeddings, attention, feed-forward networks). The final layer produces a logit for each token in the vocabulary — a raw numerical score indicating how likely that token is to come next.2

Logits are not probabilities. They are unbounded numbers: a high logit means the model considers that token very likely; a negative logit means unlikely. The logit for “mat” after “The cat sat on the” might be 5.2, while the logit for “quantum” might be -3.8.3

2. Softmax — turning scores into probabilities

The softmax function converts logits into a proper probability distribution:3

  1. Exponentiate each logit (making all values positive)
  2. Divide each value by the sum of all values
  3. The result: every token gets a probability between 0 and 1, and all probabilities sum to 1.0
| Token | Logit | After softmax |
|---|---|---|
| mat | 5.2 | 0.18 |
| floor | 4.1 | 0.12 |
| bed | 3.5 | 0.09 |
| table | 3.1 | 0.07 |
| quantum | -3.8 | 0.0001 |

The softmax amplifies differences: tokens with high logits get disproportionately more probability than tokens with low logits.
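The softmax steps above can be sketched in a few lines of Python. This sketch normalises over just the five example tokens rather than a full vocabulary, so the resulting probabilities differ from the table, but the mechanics are identical:

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating — a standard trick for
    # numerical stability that does not change the resulting probabilities
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for "The cat sat on the ..."
logits = {"mat": 5.2, "floor": 4.1, "bed": 3.5, "table": 3.1, "quantum": -3.8}
probs = dict(zip(logits, softmax(list(logits.values()))))
# probs sum to 1.0; "mat" gets the largest share, "quantum" almost none
```

Note how the exponentiation does the amplifying: "mat" leads "floor" by only 1.1 in logit space, but receives roughly three times the probability.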

3. Sampling — choosing a token

With probabilities in hand, the model must pick a single token. Several strategies exist:4

Greedy decoding: Always pick the highest-probability token. Deterministic but often repetitive and dull.

Random sampling: Pick a token randomly according to the probabilities. “mat” (0.18) gets picked 18% of the time, “floor” (0.12) gets picked 12% of the time. More varied, but can produce incoherent text.
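The two strategies can be contrasted in a short sketch, reusing the illustrative probabilities from the table above. `random.choices` normalises the weights itself, which conveniently matches the fact that these four tokens are only a slice of the full vocabulary:

```python
import random

def greedy(probs):
    # Deterministic: always return the single most likely token
    return max(probs, key=probs.get)

def sample(probs):
    # Stochastic: draw one token with probability proportional to its weight
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"mat": 0.18, "floor": 0.12, "bed": 0.09, "table": 0.07}
greedy(probs)  # always "mat"
sample(probs)  # usually "mat", sometimes "floor", "bed", or "table"
```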

Temperature sampling: Divide all logits by a temperature value before applying softmax. This reshapes the distribution:5

  • Temperature < 1.0: Probabilities concentrate on the top tokens. Output becomes more focused and predictable.
  • Temperature = 1.0: The original distribution. Balanced.
  • Temperature > 1.0: Probabilities flatten out. Low-probability tokens get a real chance. Output becomes more creative but potentially erratic.
```mermaid
graph LR
    A[Low temp 0.2] -->|focused| B[Nearly deterministic]
    C[Medium temp 1.0] -->|balanced| D[Natural variety]
    E[High temp 1.8] -->|flat| F[Wild and creative]

    style B fill:#4a9ede,color:#fff
    style D fill:#5cb85c,color:#fff
    style F fill:#e8b84b,color:#fff
```
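A minimal sketch of temperature scaling: divide the logits by the temperature, then apply a standard softmax. The logit values are illustrative:

```python
import math

def softmax_t(logits, temperature=1.0):
    # Divide each logit by the temperature, then apply softmax
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.2, 4.1, 3.5]       # illustrative scores for three tokens
cold = softmax_t(logits, 0.2)  # low temperature: top token dominates
warm = softmax_t(logits, 1.0)  # the unchanged distribution
hot = softmax_t(logits, 1.8)   # high temperature: flatter distribution
```

At temperature 0.2 the top token absorbs nearly all the probability; at 1.8 the same three logits produce a much flatter distribution.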

Think of it like...

A weighted lottery. Every token in the vocabulary has a ticket, but high-probability tokens have more tickets. Temperature adjusts how many extra tickets the favourites get. Low temperature gives the favourite almost all the tickets. High temperature redistributes tickets more evenly, giving underdogs a chance.

4. Top-k and top-p — filtering before sampling

Two additional controls restrict which tokens are eligible:4

Top-k: Only consider the k most probable tokens. If k=40, the model ignores everything outside the top 40, no matter how reasonable they might be.

Top-p (nucleus sampling): Only consider the smallest set of top tokens whose cumulative probability reaches p. If p=0.9, sort tokens by descending probability and include them until their probabilities sum to at least 0.9. This adapts to the model’s confidence: when the model is confident, few tokens are included; when it is uncertain, more are included.

These filters prevent the model from selecting wildly improbable tokens (like “quantum” after “The cat sat on the”) while still allowing variety among reasonable options.
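Both filters can be sketched over a toy distribution. The probability values here are hypothetical and chosen to sum to 1.0; real implementations apply the same idea to the full vocabulary:

```python
def top_k(probs, k):
    # Keep only the k most probable tokens, then renormalise
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

def top_p(probs, p):
    # Keep the smallest set of top tokens whose cumulative probability
    # reaches p, then renormalise
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

probs = {"mat": 0.50, "floor": 0.25, "bed": 0.15, "table": 0.09, "quantum": 0.01}
top_k(probs, 2)   # only "mat" and "floor" survive
top_p(probs, 0.9) # "mat", "floor", "bed" — "quantum" is excluded
```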

5. The autoregressive loop

Generation is fundamentally sequential. The model predicts one token, appends it, then predicts the next. This means:1

  • Each token is influenced by every token before it (including tokens the model itself generated)
  • Errors compound: if the model generates a wrong token early, subsequent predictions are conditioned on that error
  • Generation speed is limited by this sequential dependency — the model cannot predict token 50 until it has predicted tokens 1 through 49
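The loop itself is tiny. In this sketch, `model` is a stand-in for the whole network: any function that maps a token sequence to the next token. The toy model below simply counts upward, which is enough to show the predict-append-repeat structure and the stop conditions:

```python
def generate(model, prompt_tokens, max_new_tokens, eos_token):
    # Autoregressive loop: predict one token, append it, repeat
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(tokens)   # condition on everything so far
        tokens.append(next_token)
        if next_token == eos_token:  # end-of-sequence stop condition
            break
    return tokens

# Toy stand-in "model": always predicts the last token plus one
counter = lambda tokens: tokens[-1] + 1
generate(counter, [0], max_new_tokens=10, eos_token=3)  # → [0, 1, 2, 3]
```

Everything discussed above — softmax, temperature, top-k, top-p, sampling — lives inside the `model(tokens)` call; the outer loop never changes.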

Why do we use it?

Key reasons

1. Universality. Next-token prediction is a single objective that produces models capable of translation, summarisation, code generation, question answering, and creative writing. No task-specific architecture is needed — the same prediction mechanism handles all of them.1

2. Scalability. The training objective is simple (predict the next token) and the data is abundant (all text ever written). This combination allows training at a scale impossible for more complex objectives.2

3. Emergent capabilities. When next-token prediction is trained at sufficient scale, models develop abilities that were never explicitly taught — reasoning, arithmetic, code generation, multilingual translation. These emerge from the statistical patterns in language.1


When do we use it?

  • When generating any text from an LLM — every word in the response was selected through this process
  • When adjusting temperature or sampling parameters — you are directly controlling how the probability distribution is shaped
  • When understanding hallucination — the model produces confident-sounding text because it selects high-probability tokens, even when those tokens describe things that are not true
  • When evaluating model output — knowing that generation is probabilistic helps you understand why the same prompt produces different outputs each time

Rule of thumb

The model does not “decide” to write something. It predicts the most likely next token, given everything before it. If the output is wrong, it means the probability distribution pointed the wrong way — not that the model “knew better.”


How can I think about it?

The sentence-completion game

Imagine a game where one person starts a sentence and another person must continue it, one word at a time, choosing the most natural continuation.

  • “The cat sat on the…” — most people say “mat” or “floor”
  • The model plays this game hundreds or thousands of times per response, predicting one token at a time
  • Each prediction uses every word so far — not just the last few
  • Temperature controls how adventurous the player is: low temperature means they always pick the obvious word; high temperature means they sometimes pick surprising ones
  • The game never ends until a stop condition is met (a special end-of-sequence token, a maximum length, or a stop word)

The probability waterfall

Picture a waterfall of coloured water. At the top, the model pours probability (a total of 1.0) into a wide basin. The basin has thousands of channels, one for each token in the vocabulary.

  • Most water flows through a few channels — the tokens the model considers most likely
  • A trickle goes through many channels — tokens that are technically possible but unlikely
  • Temperature widens or narrows the basin — high temperature spreads water more evenly; low temperature funnels it into fewer channels
  • Top-k blocks all but k channels — only the widest channels remain open
  • The model picks one channel (one token) proportional to how much water flows through it

Concepts to explore next

| Concept | What it covers | Status |
|---|---|---|
| transformer-architecture | The neural network architecture that computes the probability distribution | complete |
| pre-training | How the model learns to predict the next token from billions of text pages | complete |
| hallucination | What happens when high-probability tokens describe things that are not true | complete |


Where this concept fits

Position in the knowledge graph

```mermaid
graph TD
    AI[AI and Machine Learning] --> NTP[Next-Token Prediction]
    AI --> TOK[Tokenization]
    AI --> EMB[Embeddings]
    AI --> TRA[Transformer Architecture]
    AI --> HAL[Hallucination]
    TOK -.->|feeds| NTP
    EMB -.->|feeds| NTP
    TRA -.->|computes| NTP
    style NTP fill:#4a9ede,color:#fff
```

Related concepts:

  • hallucination — hallucination occurs when next-token prediction produces plausible-sounding but factually incorrect sequences
  • transformer-architecture — the transformer is the neural network that computes the probability distribution over the vocabulary
  • pre-training — the model learns to predict the next token during pre-training on massive text corpora

Footnotes

  1. Wolfe, C. (2025). Next token prediction thread. X/Twitter.

  2. WWW Insights. (2025). How Large Language Models Really Work: Next-Token Prediction at the Core. WWW Insights.

  3. Cohen, M. (2025). LLM breakdown 2/6: Logits and next-token prediction. Mike X Cohen’s Substack.

  4. Muse, K. (2025). How Temperature, Top-K, Top-P, and Min-P Control LLM Output. Ken Muse Blog.

  5. IBM. (2025). What is LLM Temperature? IBM.