Tokenization
The process of breaking text into smaller units called tokens and mapping each one to a number a language model can process — because models work with mathematics, not letters.
What is it?
Language models cannot read. They are mathematical systems that operate on numbers. Before a model can do anything with your text, it needs to convert every character into a numerical representation. Tokenization is that conversion step: it splits raw text into a sequence of tokens and maps each token to an integer ID from a fixed vocabulary.1
A token is not always a word. The word “the” is a single token. The word “unhappiness” might be split into [“un”, “happiness”] — two tokens. A rare technical term like “defenestration” could be split into [“def”, “en”, “est”, “ration”] — four tokens. Common words stay whole; uncommon words are broken into familiar pieces.2
This design is deliberate. A model that stored every possible word as a single token would need an impossibly large vocabulary (English alone has over a million word forms). A model that stored every individual character would need only a tiny vocabulary but would lose the efficiency of recognising common patterns, spending many tokens on even short texts. Tokenization finds the middle ground: a vocabulary of typically 32,000 to 128,000 tokens that can represent any text efficiently.3
The most common algorithm for building these vocabularies is byte-pair encoding (BPE), which builds the token set from training data by iteratively merging the most frequent adjacent symbol pairs.
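The merge loop can be sketched in a few lines. This is a minimal toy version of BPE vocabulary learning over an invented word-frequency corpus — real tokenizers (e.g. those behind cl100k_base) work over bytes, apply pre-tokenization rules, and run hundreds of thousands of merges, none of which is shown here:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a toy word-frequency corpus.

    `words` maps a word (as a tuple of symbols) to its frequency.
    Returns the list of learned merges, most frequent first.
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy corpus: because "l"+"o" and "lo"+"w" co-occur often, "low" is
# assembled into a single symbol within a few merges.
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w"): 6,
    ("n", "e", "w", "e", "r"): 3,
}
merges = learn_bpe_merges(corpus, 4)
```

Each learned merge becomes a vocabulary entry, which is why frequent words end up as single tokens while rare ones stay in pieces.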
In plain terms
Tokenization is like a stenographer’s shorthand. Common phrases get a single symbol. Less common words are spelled out in shorter abbreviations. Rare words are written letter by letter. The goal is the same: represent any text using a manageable set of symbols, as efficiently as possible.
At a glance
From text to token IDs
```mermaid
graph LR
  A[Raw text] --> B[Tokenizer]
  B --> C[Token sequence]
  C --> D[Integer IDs]
  D --> E[Model input]
  style B fill:#4a9ede,color:#fff
```

Key: The tokenizer splits text into tokens and assigns each one a numerical ID from its vocabulary. These IDs are what the model actually receives.
How does it work?
1. The vocabulary — a fixed dictionary of tokens
Every tokenizer has a vocabulary: a fixed list of tokens, each mapped to an integer ID. GPT-4’s tokenizer (cl100k_base) has about 100,000 tokens. Llama 3’s tokenizer has 128,000. The vocabulary is built once during training and frozen — it does not change when you use the model.3
The vocabulary includes:
- Whole common words: “the” (ID 1820), “cat” (ID 4344)
- Common subwords: “ing” (ID 278), “tion” (ID 1159)
- Individual characters and bytes: “x” (ID 87), “ ” (ID 220)
- Special tokens: beginning-of-text, end-of-text, padding
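In code, the vocabulary is just a two-way mapping between token strings and integer IDs. The sketch below uses a handful of invented entries and IDs, not any real tokenizer's values:

```python
# A toy vocabulary: token string -> integer ID. Real vocabularies hold
# 32,000-128,000 entries; these IDs are invented for illustration.
vocab = {
    "<|begin|>": 0,   # special token: beginning-of-text
    "<|end|>": 1,     # special token: end-of-text
    "the": 2,
    " cat": 3,
    "ing": 4,
    "un": 5,
    "happiness": 6,
    "x": 7,
}

# The reverse mapping turns IDs back into text when the model generates output.
id_to_token = {i: t for t, i in vocab.items()}

ids = [vocab[t] for t in ["un", "happiness"]]     # encode: tokens -> IDs
text = "".join(id_to_token[i] for i in ids)       # decode: IDs -> text
```

Encoding and decoding are exact inverses: the same frozen table is used in both directions, which is why the mapping cannot change after training.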
Think of it like...
A set of rubber stamps. The tokenizer has a box of 100,000 stamps, each representing a piece of text — some for whole words, some for word fragments, some for single characters. To “tokenize” a sentence, it covers the text with stamps, always reaching for the largest stamp that fits.
2. Splitting rules — how text becomes tokens
When you send text to a model, the tokenizer scans it left to right and greedily matches the longest token in its vocabulary at each position:1
| Input text | Tokens | Token count |
|---|---|---|
| “The cat sat” | [“The”, ” cat”, ” sat”] | 3 |
| “unhappiness” | [“un”, “happiness”] | 2 |
| “ChatGPT” | [“Chat”, “G”, “PT”] | 3 |
| “12345” | [“123”, “45”] | 2 |
Notice that spaces are often attached to the following word (” cat” is one token, not “cat” with a separate space). This is an implementation detail that affects how the model handles whitespace and formatting.2
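The greedy longest-match rule described above can be sketched directly. Note this is a simplification: real BPE tokenizers apply their learned merge rules rather than scanning a flat vocabulary, but the longest-match intuition is close. The vocabulary here is a toy set:

```python
def tokenize_greedy(text, vocab):
    """Greedy longest-match tokenization: at each position, take the
    longest vocabulary entry that matches the remaining text."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, down to a single character.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Fallback: an unknown character becomes its own token.
            tokens.append(text[i])
            i += 1
    return tokens

# Toy vocabulary; note the leading spaces baked into " cat" and " sat".
toy_vocab = {"The", " cat", " sat", "un", "happiness"}
```

Running `tokenize_greedy("The cat sat", toy_vocab)` reproduces the table's first row, including the space-attached tokens.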
3. Why tokenization explains model quirks
Several well-known LLM behaviours trace directly to tokenization:2
Letter counting fails. Ask a model to count the letters in “strawberry” and it often gets it wrong. The model never sees individual letters — it sees tokens like [“str”, “aw”, “berry”]. It has no direct access to the character-level representation.4
Code and names split unpredictably. A function name like calculateTotalPrice might become [“calculate”, “Total”, “Price”] or [“calc”, “ulate”, “Total”, “Price”] depending on the tokenizer. The model must learn that these fragments form a single name.
Non-English text costs more tokens. Tokenizers trained primarily on English text have more English subwords in their vocabulary. The same concept expressed in Japanese or Arabic often requires more tokens, using more of the context window and costing more.3
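One reason for the non-English cost gap can be seen with the standard library alone. When a character is not covered by any learned token, a byte-fallback tokenizer spends one token per UTF-8 byte — and many non-Latin scripts use 3 bytes per character. Actual token counts depend entirely on the tokenizer; this sketch only shows the byte-level worst case:

```python
# Worst case for a byte-fallback tokenizer: one token per UTF-8 byte.
# ASCII letters are 1 byte each; Japanese kana are 3 bytes each, so
# uncovered Japanese text can cost 3x the tokens per character.
english = "hello"
japanese = "こんにちは"  # also five characters

english_bytes = len(english.encode("utf-8"))
japanese_bytes = len(japanese.encode("utf-8"))
```

Tokenizers with good multilingual coverage avoid this worst case by including whole non-English words and subwords in the vocabulary, which is part of why vocabulary composition matters as much as size.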
Example: tokenizing a real sentence
Using GPT-4’s tokenizer on: “Tokenization is surprisingly important.”
| Token | ID |
|---|---|
| “Token” | 3962 |
| “ization” | 2065 |
| “ is” | 374 |
| “ surprisingly” | 29700 |
| “ important” | 3062 |
| “.” | 13 |

6 tokens for 5 words. The word “Tokenization” splits into two pieces because it is less common than “Token” or “ization” individually. Every other word is common enough to be a single token.
4. Vocabulary size — a design tradeoff
Vocabulary size is a fundamental design choice:3
| Size | Advantage | Disadvantage |
|---|---|---|
| Small (32K) | Smaller model, faster training | More tokens per text, shorter effective context |
| Large (128K) | Fewer tokens per text, more fits in context | Larger embedding table, more memory |
Llama 3 expanded its vocabulary from the 32K tokens used by Llama 2 to 128K, and this alone improved performance — more text fits in the same context window because each piece of text requires fewer tokens.3
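The "larger embedding table" cost in the table above is easy to make concrete. The arithmetic below assumes an embedding dimension of 4096 and 2-byte (fp16) weights — illustrative figures, not any particular model's configuration:

```python
# Memory footprint of the embedding table alone:
#   vocab_size x embedding_dim x bytes_per_weight
# Assumes dim=4096 and fp16 (2 bytes) -- illustrative numbers only.
def embedding_table_bytes(vocab_size, dim=4096, bytes_per_weight=2):
    return vocab_size * dim * bytes_per_weight

small = embedding_table_bytes(32_000)    # 32K vocabulary
large = embedding_table_bytes(128_000)   # 128K vocabulary
```

Under these assumptions the 32K table takes roughly 0.26 GB and the 128K table roughly 1.05 GB — a 4x increase in memory, traded against fewer tokens per text.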
Why do we use it?
Key reasons
1. Mathematical necessity. Neural networks compute with numbers. Text must become numbers before any processing can happen. Tokenization is the bridge between human language and machine computation.1
2. Efficiency. A vocabulary of 100,000 tokens compresses text far more efficiently than character-level encoding. Fewer tokens means faster processing and more text fits in the model’s context window.3
3. Handling any text. Because tokenizers can fall back to individual characters or bytes, they can represent any text in any language, including code, emoji, and Unicode — no special handling required.2
When do we use it?
- When sending text to any LLM — tokenization happens automatically, but understanding it helps you manage context window limits
- When estimating costs — API pricing is per token, not per word or character
- When debugging unexpected model behaviour — character-level tasks, arithmetic, and multilingual text are all affected by how text is tokenized
- When choosing a model — different tokenizers handle different languages and domains with different efficiency
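For quick cost estimates before calling an API, a common heuristic is roughly 4 characters per token for English prose. The function below uses that heuristic and a placeholder price — neither figure comes from any real provider, and for accurate counts you should use the provider's own tokenizer:

```python
def estimate_cost(text, price_per_1k_tokens=0.01, chars_per_token=4):
    """Rough API-cost estimate for English text.

    The ~4-characters-per-token figure is a common heuristic, not exact;
    the price is a placeholder, not any real provider's rate.
    Returns (estimated_tokens, estimated_cost).
    """
    est_tokens = max(1, len(text) // chars_per_token)
    return est_tokens, est_tokens / 1000 * price_per_1k_tokens

# An 8,000-character document at 4 chars/token -> ~2,000 tokens.
tokens, cost = estimate_cost("x" * 8000)
```

The heuristic breaks down for code, non-English text, and unusual formatting — exactly the cases the sections above flag as tokenization-sensitive — so treat it as an upper-level sanity check only.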
Rule of thumb
If the model behaves oddly on a task that seems simple to you, check how the input is tokenized. The model may be seeing something very different from what you typed.
How can I think about it?
The Scrabble tile rack
Imagine you have a rack of Scrabble tiles, but instead of individual letters, your tiles have common letter combinations printed on them: “THE”, “ING”, “TION”, “CAT”, “A”, “B”, “C”…
- Spelling “CATCHING” = place “CAT” + “CH” + “ING” — three tiles
- Spelling “XYLOPHONE” = place “X” + “Y” + “LO” + “PHONE” — four tiles (less common combinations require more tiles)
- Your tile rack = the tokenizer’s vocabulary
- Each tile = one token
- Spelling a word = tokenizing it
- Fewer tiles is better = fewer tokens means more efficient processing
The compression algorithm
Think of text compression like ZIP files. A ZIP file finds repeated patterns in data and replaces them with shorter codes. “THE” appears thousands of times in English text, so it gets a very short code. “DEFENESTRATION” appears rarely, so it gets a longer code or is broken into pieces that each have their own short codes.
- Frequent patterns = single tokens (short codes)
- Rare patterns = split into subword tokens (assembled from shorter codes)
- The codebook = the vocabulary
- Compressed size = token count
- BPE builds the codebook = the algorithm decides which patterns deserve their own code
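The compression parallel is directly observable with Python's standard library. The sketch below compresses repetitive text versus patternless bytes with zlib — this is an analogy for how tokenizers reward frequent patterns, not tokenization itself:

```python
import random
import zlib

random.seed(0)  # make the "patternless" input reproducible

# Text full of repeated patterns, like common English phrases.
repetitive = ("the cat sat on the mat. " * 100).encode("utf-8")
# The same number of bytes with no patterns to exploit.
varied = bytes(random.randrange(256) for _ in range(2400))

# Compression ratio: compressed size / original size (lower = better).
rep_ratio = len(zlib.compress(repetitive)) / len(repetitive)
var_ratio = len(zlib.compress(varied)) / len(varied)
```

The repetitive input shrinks to a small fraction of its size while the patternless input barely shrinks at all — the same reason "the cat sat" costs 3 tokens while a rare string of the same length costs many more.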
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| byte-pair-encoding | The algorithm that builds tokenizer vocabularies by iteratively merging frequent pairs | complete |
| embeddings | How token IDs become meaningful vectors the model can compute with | complete |
| next-token-prediction | What the model does with tokens once it has them | complete |
Check your understanding
Test yourself
- Explain why a language model cannot process raw text directly and what tokenization does to solve this.
- Describe the tradeoff between small and large vocabularies. What improves and what gets worse as vocabulary size increases?
- Distinguish between a word, a token, and a token ID. Give an example where a single word becomes multiple tokens.
- Interpret this scenario: you ask a model “How many letters are in the word ‘strawberry’?” and it answers “8.” Using what you know about tokenization, explain why the model might get this wrong.
- Connect tokenization to context window limits. Why does a more efficient tokenizer let the model process more text in a single request?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
  AI[AI and Machine Learning] --> TOK[Tokenization]
  TOK --> BPE[Byte-Pair Encoding]
  AI --> EMB[Embeddings]
  AI --> NTP[Next-Token Prediction]
  style TOK fill:#4a9ede,color:#fff
```

Related concepts:
- embeddings — after tokenization produces integer IDs, the embedding layer converts each ID into a vector that encodes meaning
- next-token-prediction — the model’s core task is predicting the next token in a sequence; tokenization determines what counts as a “token”
- structured-data-vs-prose — tokenization is one way that unstructured prose gets converted into a structured, machine-processable format
Sources
Further reading
Resources
- Tokenization in Large Language Models, Explained (Sean Trott) — Clear, beginner-friendly walkthrough of why tokenization matters and how it shapes model behaviour
- Understanding Byte Pair Encoding in Large Language Models (Vizuara) — Visual explanation of BPE with step-by-step examples
- BPE Tokenizer: Training and Tokenization Explained (Langformers) — Technical deep-dive into how BPE tokenizers are trained
- How Large Language Models Work: The Complete Technical Guide (Starmorph) — End-to-end reference covering the full LLM pipeline including tokenization
Footnotes
1. Trott, S. (2024). Tokenization in large language models, explained. Sean Trott’s Substack.
2. Sierra Cloud. (2025). How Large Language Models Work: From tokenization to output. Sierra Cloud.
3. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.
4. Vizuara. (2025). Understanding Byte Pair Encoding (BPE) in Large Language Models. Vizuara Substack.
