Tokenization

The process of breaking text into smaller units called tokens and mapping each one to a number a language model can process, because models work with mathematics, not letters.


What is it?

Language models cannot read. They are mathematical systems that operate on numbers. Before a model can do anything with your text, it needs to convert every character into a numerical representation. Tokenization is that conversion step: it splits raw text into a sequence of tokens and maps each token to an integer ID from a fixed vocabulary.1

A token is not always a word. The word “the” is a single token. The word “unhappiness” might be split into [“un”, “happiness”] — two tokens. A rare technical term like “defenestration” could be split into [“def”, “en”, “est”, “ration”] — four tokens. Common words stay whole; uncommon words are broken into familiar pieces.2
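
To see this concretely, here is a small sketch using OpenAI's tiktoken library and its cl100k_base tokenizer. This is one tokenizer chosen for illustration; other tokenizers will split the same words differently.

```python
# Sketch: how one real tokenizer splits common vs. rare words.
# Exact splits are tokenizer-specific; treat the output as illustrative.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "unhappiness", "defenestration"]:
    ids = enc.encode(word)                     # token IDs for the word
    pieces = [enc.decode([i]) for i in ids]    # the text each ID maps back to
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
```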

This design is deliberate. A model that stored every possible word as a single token would need an impossibly large vocabulary (English alone has over a million word forms). A model that stored every individual character would need very few tokens but would lose the efficiency of recognising common patterns. Tokenization finds the middle ground: a vocabulary of typically 32,000 to 128,000 tokens that can represent any text efficiently.3

The most common algorithm for building these vocabularies is byte-pair encoding (BPE), which learns a vocabulary from training data by iteratively merging the most frequent pairs of adjacent symbols.
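
The full algorithm is covered under byte-pair-encoding, but the core merge loop is simple enough to sketch. The toy trainer below works on characters and a tiny made-up corpus; production tokenizers operate on bytes, apply pre-tokenization rules, and are far more optimised.

```python
# A minimal sketch of the BPE merge loop (illustrative only).
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start from characters: each word is a tuple of symbols, with a frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins this round
        merges.append(best)
        # Replace every occurrence of the winning pair with a single merged symbol.
        updated = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            updated[tuple(out)] += freq
        words = updated
    return merges

# Toy corpus; the learned merges depend on frequencies and tie-breaking.
print(train_bpe(["low", "lower", "lowest", "newest", "widest"] * 10, num_merges=5))
```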

In plain terms

Tokenization is like a stenographer’s shorthand. Common phrases get a single symbol. Less common words are spelled out in shorter abbreviations. Rare words are written letter by letter. The goal is the same: represent any text using a manageable set of symbols, as efficiently as possible.


How does it work?

1. The vocabulary — a fixed dictionary of tokens

Every tokenizer has a vocabulary: a fixed list of tokens, each mapped to an integer ID. GPT-4’s tokenizer (cl100k_base) has about 100,000 tokens. Llama 3’s tokenizer has 128,000. The vocabulary is built once during training and frozen — it does not change when you use the model.3

The vocabulary includes:

  • Whole common words: “the” (ID 1820), “cat” (ID 4344)
  • Common subwords: “ing” (ID 278), “tion” (ID 1159)
  • Individual characters and bytes: “x” (ID 87), “ ” (ID 220)
  • Special tokens: beginning-of-text, end-of-text, padding
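
With a library such as tiktoken you can inspect this vocabulary directly. The snippet below assumes the cl100k_base encoding mentioned above; IDs are specific to that tokenizer.

```python
# Sketch: peeking at a tokenizer's fixed vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.n_vocab)                           # total vocabulary size (~100K here)
print(enc.encode("the"))                     # ID(s) for a whole common word
print(enc.encode(" "))                       # even a single space has an ID
print(enc.decode_single_token_bytes(1820))   # raw bytes behind one ID
```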

Think of it like...

A set of rubber stamps. The tokenizer has a box of 100,000 stamps, each representing a piece of text — some for whole words, some for word fragments, some for single characters. To “tokenize” a sentence, it finds the combination of stamps that covers the text using the fewest stamps possible.

2. Splitting rules — how text becomes tokens

When you send text to a model, the tokenizer scans it left to right and greedily matches the longest token in its vocabulary at each position:1

| Input text | Tokens | Token count |
| --- | --- | --- |
| “The cat sat” | [“The”, “ cat”, “ sat”] | 3 |
| “unhappiness” | [“un”, “happiness”] | 2 |
| “ChatGPT” | [“Chat”, “G”, “PT”] | 3 |
| “12345” | [“123”, “45”] | 2 |

Notice that spaces are often attached to the following word (“ cat” is one token, not “cat” with a separate space). This is an implementation detail that affects how the model handles whitespace and formatting.2
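
A quick way to check this yourself, again assuming tiktoken and cl100k_base; other tokenizers differ in the details.

```python
# Sketch: the leading space matters, and spaces attach to the following word.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode("cat"), enc.encode(" cat"))   # two different IDs
ids = enc.encode("The cat sat")
print(ids, [enc.decode([i]) for i in ids])     # expected pieces: 'The', ' cat', ' sat'
```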

3. Why tokenization explains model quirks

Several well-known LLM behaviours trace directly to tokenization:2

Letter counting fails. Ask a model to count the letters in “strawberry” and it often gets it wrong. The model never sees individual letters — it sees tokens like [“str”, “aw”, “berry”]. It has no direct access to the character-level representation.4

Code and names split unpredictably. A function name like calculateTotalPrice might become [“calculate”, “Total”, “Price”] or [“calc”, “ulate”, “Total”, “Price”] depending on the tokenizer. The model must learn that these fragments form a single name.

Non-English text costs more tokens. Tokenizers trained primarily on English text have more English subwords in their vocabulary. The same concept expressed in Japanese or Arabic often requires more tokens, using more of the context window and costing more.3
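
All three quirks are easy to observe directly. The sketch below uses cl100k_base via tiktoken; the exact splits and counts are illustrative, and the Japanese sentence is just an arbitrary example.

```python
# Sketch: three tokenization quirks in practice.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pieces(text: str) -> list[str]:
    return [enc.decode([i]) for i in enc.encode(text)]

# 1. The model never sees letters, only token fragments.
print(pieces("strawberry"))

# 2. Code identifiers split into fragments the model must learn to reassemble.
print(pieces("calculateTotalPrice"))

# 3. The same idea usually costs more tokens outside English.
en = "The weather is nice today."
ja = "今日は天気がいいですね。"
print(len(enc.encode(en)), len(enc.encode(ja)))
```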

4. Vocabulary size — a design tradeoff

Vocabulary size is a fundamental design choice:3

| Size | Advantage | Disadvantage |
| --- | --- | --- |
| Small (32K) | Smaller model, faster training | More tokens per text, shorter effective context |
| Large (128K) | Fewer tokens per text, more fits in context | Larger embedding table, more memory |

Llama 3 expanded its vocabulary from 32K to 128K tokens, and this alone improved performance: more text fits in the same context window because each piece of text requires fewer tokens.3
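
You can get a feel for the tradeoff by tokenizing the same text with vocabularies of different sizes. The comparison below uses two OpenAI encodings shipped with tiktoken as stand-ins (roughly 50K vs. 100K tokens), not Llama's actual tokenizers.

```python
# Sketch: a larger vocabulary typically needs fewer tokens for the same text.
import tiktoken

text = "Tokenization finds the middle ground between characters and whole words."

small = tiktoken.get_encoding("gpt2")         # ~50K vocabulary
large = tiktoken.get_encoding("cl100k_base")  # ~100K vocabulary

print(len(small.encode(text)), len(large.encode(text)))
```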


Why do we use it?

Key reasons

1. Mathematical necessity. Neural networks compute with numbers. Text must become numbers before any processing can happen. Tokenization is the bridge between human language and machine computation.1

2. Efficiency. A vocabulary of 100,000 tokens compresses text far more efficiently than character-level encoding. Fewer tokens means faster processing and more text fits in the model’s context window.3

3. Handling any text. Because tokenizers can fall back to individual characters or bytes, they can represent any text in any language, including code, emoji, and Unicode — no special handling required.2
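
A small illustration of the byte-level fallback, again assuming tiktoken and cl100k_base: nothing is out of vocabulary, it just costs more tokens.

```python
# Sketch: emoji, code, and accented text all tokenize; rare characters
# simply fall back to more (byte-level) tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "🙂", "def greet(): return '¡hola!'"]:
    print(repr(text), "->", len(enc.encode(text)), "tokens")
```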


When do we use it?

  • When sending text to any LLM — tokenization happens automatically, but understanding it helps you manage context window limits
  • When estimating costs — API pricing is per token, not per word or character (see the sketch after this list)
  • When debugging unexpected model behaviour — character-level tasks, arithmetic, and multilingual text are all affected by how text is tokenized
  • When choosing a model — different tokenizers handle different languages and domains with different efficiency
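
As a sketch of the cost-estimation point above: count tokens, then multiply by your provider's rate. The price constant below is a made-up placeholder, not a real rate.

```python
# Sketch: back-of-the-envelope cost estimate from a token count.
import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.0005  # hypothetical USD rate; substitute your provider's

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Summarise the attached report in three bullet points."

n_tokens = len(enc.encode(prompt))
cost = n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"{n_tokens} tokens -> estimated ${cost:.6f}")
```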

Rule of thumb

If the model behaves oddly on a task that seems simple to you, check how the input is tokenized. The model may be seeing something very different from what you typed.


How can I think about it?

The Scrabble tile rack

Imagine you have a rack of Scrabble tiles, but instead of individual letters, your tiles have common letter combinations printed on them: “THE”, “ING”, “TION”, “CAT”, “A”, “B”, “C”…

  • Spelling “CATCHING” = place “CAT” + “CH” + “ING” — three tiles
  • Spelling “XYLOPHONE” = place “X” + “Y” + “LO” + “PHONE” — four tiles (less common combinations require more tiles)
  • Your tile rack = the tokenizer’s vocabulary
  • Each tile = one token
  • Spelling a word = tokenizing it
  • Fewer tiles is better = fewer tokens means more efficient processing

The compression algorithm

Think of text compression like ZIP files. A ZIP file finds repeated patterns in data and replaces them with shorter codes. “THE” appears thousands of times in English text, so it gets a very short code. “DEFENESTRATION” appears rarely, so it gets a longer code or is broken into pieces that each have their own short codes.

  • Frequent patterns = single tokens (short codes)
  • Rare patterns = split into subword tokens (assembled from shorter codes)
  • The codebook = the vocabulary
  • Compressed size = token count
  • BPE builds the codebook = the algorithm decides which patterns deserve their own code

Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| byte-pair-encoding | The algorithm that builds tokenizer vocabularies by iteratively merging frequent pairs | complete |
| embeddings | How token IDs become meaningful vectors the model can compute with | complete |
| next-token-prediction | What the model does with tokens once it has them | complete |


Where this concept fits

Position in the knowledge graph

graph TD
    AI[AI and Machine Learning] --> TOK[Tokenization]
    TOK --> BPE[Byte-Pair Encoding]
    AI --> EMB[Embeddings]
    AI --> NTP[Next-Token Prediction]
    style TOK fill:#4a9ede,color:#fff

Related concepts:

  • embeddings — after tokenization produces integer IDs, the embedding layer converts each ID into a vector that encodes meaning
  • next-token-prediction — the model’s core task is predicting the next token in a sequence; tokenization determines what counts as a “token”
  • structured-data-vs-prose — tokenization is one way that unstructured prose gets converted into a structured, machine-processable format

Sources


Footnotes

  1. Trott, S. (2024). Tokenization in large language models, explained. Sean Trott’s Substack.

  2. Sierra Cloud. (2025). How Large Language Models Work: From tokenization to output. Sierra Cloud.

  3. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.

  4. Vizuara. (2025). Understanding Byte Pair Encoding (BPE) in Large Language Models. Vizuara Substack.