Tokenization

The process of breaking text into smaller units called tokens and mapping each one to a number a language model can process, because models work with mathematics, not letters.


What is it?

Language models cannot read. They are mathematical systems that operate on numbers. Before a model can do anything with your text, it needs to convert every character into a numerical representation. Tokenization is that conversion step: it splits raw text into a sequence of tokens and maps each token to an integer ID from a fixed vocabulary.1

A token is not always a word. The word “the” is a single token. The word “unhappiness” might be split into [“un”, “happiness”] — two tokens. A rare technical term like “defenestration” could be split into [“def”, “en”, “est”, “ration”] — four tokens. Common words stay whole; uncommon words are broken into familiar pieces.2
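
To see this concretely, here is a small sketch using OpenAI's tiktoken library and its cl100k_base tokenizer. This is one tokenizer chosen for illustration; other tokenizers will split the same words differently.

```python
# Sketch: how one real tokenizer splits common vs. rare words.
# Exact splits are tokenizer-specific; treat the output as illustrative.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "unhappiness", "defenestration"]:
    ids = enc.encode(word)                     # token IDs for the word
    pieces = [enc.decode([i]) for i in ids]    # the text each ID maps back to
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
```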

This design is deliberate. A model that stored every possible word as a single token would need an impossibly large vocabulary (English alone has over a million word forms). A model that stored every individual character would need very few tokens but would lose the efficiency of recognising common patterns. Tokenization finds the middle ground: a vocabulary of typically 32,000 to 128,000 tokens that can represent any text efficiently.3

The most common algorithm for building these vocabularies is byte-pair encoding (BPE), which learns a vocabulary from training data by iteratively merging the most frequent pairs of adjacent symbols.
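
The full algorithm is covered under byte-pair-encoding, but the core merge loop is simple enough to sketch. The toy trainer below works on characters and a tiny made-up corpus; production tokenizers operate on bytes, apply pre-tokenization rules, and are far more optimised.

```python
# A minimal sketch of the BPE merge loop (illustrative only).
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start from characters: each word is a tuple of symbols, with a frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins this round
        merges.append(best)
        # Replace every occurrence of the winning pair with a single merged symbol.
        updated = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            updated[tuple(out)] += freq
        words = updated
    return merges

# Toy corpus; the learned merges depend on frequencies and tie-breaking.
print(train_bpe(["low", "lower", "lowest", "newest", "widest"] * 10, num_merges=5))
```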

In plain terms

Tokenization is like a stenographer’s shorthand. Common phrases get a single symbol. Less common words are spelled out in shorter abbreviations. Rare words are written letter by letter. The goal is the same: represent any text using a manageable set of symbols, as efficiently as possible.


How does it work?

1. The vocabulary — a fixed dictionary of tokens

Every tokenizer has a vocabulary: a fixed list of tokens, each mapped to an integer ID. GPT-4’s tokenizer (cl100k_base) has about 100,000 tokens. Llama 3’s tokenizer has 128,000. The vocabulary is built once during training and frozen — it does not change when you use the model.3

The vocabulary includes:

  • Whole common words: “the” (ID 1820), “cat” (ID 4344)
  • Common subwords: “ing” (ID 278), “tion” (ID 1159)
  • Individual characters and bytes: “x” (ID 87), “ ” (ID 220)
  • Special tokens: beginning-of-text, end-of-text, padding
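
With a library such as tiktoken you can inspect this vocabulary directly. The snippet below assumes the cl100k_base encoding mentioned above; IDs are specific to that tokenizer.

```python
# Sketch: peeking at a tokenizer's fixed vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.n_vocab)                           # total vocabulary size (~100K here)
print(enc.encode("the"))                     # ID(s) for a whole common word
print(enc.encode(" "))                       # even a single space has an ID
print(enc.decode_single_token_bytes(1820))   # raw bytes behind one ID
```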

Think of it like...

A set of rubber stamps. The tokenizer has a box of 100,000 stamps, each representing a piece of text — some for whole words, some for word fragments, some for single characters. To “tokenize” a sentence, it finds the combination of stamps that covers the text using the fewest stamps possible.

2. Splitting rules — how text becomes tokens

When you send text to a model, the tokenizer scans it left to right and greedily matches the longest token in its vocabulary at each position:1

| Input text | Tokens | Token count |
| --- | --- | --- |
| “The cat sat” | [“The”, “ cat”, “ sat”] | 3 |
| “unhappiness” | [“un”, “happiness”] | 2 |
| “ChatGPT” | [“Chat”, “G”, “PT”] | 3 |
| “12345” | [“123”, “45”] | 2 |

Notice that spaces are often attached to the following word (“ cat” is one token, not “cat” with a separate space). This is an implementation detail that affects how the model handles whitespace and formatting.2
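
A quick way to check this yourself, again assuming tiktoken and cl100k_base; other tokenizers differ in the details.

```python
# Sketch: the leading space matters, and spaces attach to the following word.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode("cat"), enc.encode(" cat"))   # two different IDs
ids = enc.encode("The cat sat")
print(ids, [enc.decode([i]) for i in ids])     # expected pieces: 'The', ' cat', ' sat'
```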

3. Why tokenization explains model quirks

Several well-known LLM behaviours trace directly to tokenization:2

Letter counting fails. Ask a model to count the letters in “strawberry” and it often gets it wrong. The model never sees individual letters — it sees tokens like [“str”, “aw”, “berry”]. It has no direct access to the character-level representation.4

Code and names split unpredictably. A function name like calculateTotalPrice might become [“calculate”, “Total”, “Price”] or [“calc”, “ulate”, “Total”, “Price”] depending on the tokenizer. The model must learn that these fragments form a single name.

Non-English text costs more tokens. Tokenizers trained primarily on English text have more English subwords in their vocabulary. The same concept expressed in Japanese or Arabic often requires more tokens, using more of the context window and costing more.3
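
All three quirks are easy to observe directly. The sketch below uses cl100k_base via tiktoken; the exact splits and counts are illustrative, and the Japanese sentence is just an arbitrary example.

```python
# Sketch: three tokenization quirks in practice.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pieces(text: str) -> list[str]:
    return [enc.decode([i]) for i in enc.encode(text)]

# 1. The model never sees letters, only token fragments.
print(pieces("strawberry"))

# 2. Code identifiers split into fragments the model must learn to reassemble.
print(pieces("calculateTotalPrice"))

# 3. The same idea usually costs more tokens outside English.
en = "The weather is nice today."
ja = "今日は天気がいいですね。"
print(len(enc.encode(en)), len(enc.encode(ja)))
```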

4. Vocabulary size — a design tradeoff

Vocabulary size is a fundamental design choice:3

| Size | Advantage | Disadvantage |
| --- | --- | --- |
| Small (32K) | Smaller model, faster training | More tokens per text, shorter effective context |
| Large (128K) | Fewer tokens per text, more fits in context | Larger embedding table, more memory |

Llama 3 expanded its vocabulary from 32K to 128K tokens, and this alone improved performance: more text fits in the same context window because each piece of text requires fewer tokens.3
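
You can get a feel for the tradeoff by tokenizing the same text with vocabularies of different sizes. The comparison below uses two OpenAI encodings shipped with tiktoken as stand-ins (roughly 50K vs. 100K tokens), not Llama's actual tokenizers.

```python
# Sketch: a larger vocabulary typically needs fewer tokens for the same text.
import tiktoken

text = "Tokenization finds the middle ground between characters and whole words."

small = tiktoken.get_encoding("gpt2")         # ~50K vocabulary
large = tiktoken.get_encoding("cl100k_base")  # ~100K vocabulary

print(len(small.encode(text)), len(large.encode(text)))
```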


Why do we use it?

Key reasons

1. Mathematical necessity. Neural networks compute with numbers. Text must become numbers before any processing can happen. Tokenization is the bridge between human language and machine computation.1

2. Efficiency. A vocabulary of 100,000 tokens compresses text far more efficiently than character-level encoding. Fewer tokens means faster processing and more text fits in the model’s context window.3

3. Handling any text. Because tokenizers can fall back to individual characters or bytes, they can represent any text in any language, including code, emoji, and Unicode — no special handling required.2
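
A small illustration of the byte-level fallback, again assuming tiktoken and cl100k_base: nothing is out of vocabulary, it just costs more tokens.

```python
# Sketch: emoji, code, and accented text all tokenize; rare characters
# simply fall back to more (byte-level) tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "🙂", "def greet(): return '¡hola!'"]:
    print(repr(text), "->", len(enc.encode(text)), "tokens")
```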


When do we use it?

  • When sending text to any LLM — tokenization happens automatically, but understanding it helps you manage context window limits
  • When estimating costs — API pricing is per token, not per word or character (see the sketch after this list)
  • When debugging unexpected model behaviour — character-level tasks, arithmetic, and multilingual text are all affected by how text is tokenized
  • When choosing a model — different tokenizers handle different languages and domains with different efficiency
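
As a sketch of the cost-estimation point above: count tokens, then multiply by your provider's rate. The price constant below is a made-up placeholder, not a real rate.

```python
# Sketch: back-of-the-envelope cost estimate from a token count.
import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.0005  # hypothetical USD rate; substitute your provider's

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Summarise the attached report in three bullet points."

n_tokens = len(enc.encode(prompt))
cost = n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"{n_tokens} tokens -> estimated ${cost:.6f}")
```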

Rule of thumb

If the model behaves oddly on a task that seems simple to you, check how the input is tokenized. The model may be seeing something very different from what you typed.


How can I think about it?

The Scrabble tile rack

Imagine you have a rack of Scrabble tiles, but instead of individual letters, your tiles have common letter combinations printed on them: “THE”, “ING”, “TION”, “CAT”, “A”, “B”, “C”…

  • Spelling “CATCHING” = place “CAT” + “CH” + “ING” — three tiles
  • Spelling “XYLOPHONE” = place “X” + “Y” + “LO” + “PHONE” — four tiles (less common combinations require more tiles)
  • Your tile rack = the tokenizer’s vocabulary
  • Each tile = one token
  • Spelling a word = tokenizing it
  • Fewer tiles is better = fewer tokens means more efficient processing

The compression algorithm

Think of text compression like ZIP files. A ZIP file finds repeated patterns in data and replaces them with shorter codes. “THE” appears thousands of times in English text, so it gets a very short code. “DEFENESTRATION” appears rarely, so it gets a longer code or is broken into pieces that each have their own short codes.

  • Frequent patterns = single tokens (short codes)
  • Rare patterns = split into subword tokens (assembled from shorter codes)
  • The codebook = the vocabulary
  • Compressed size = token count
  • BPE builds the codebook = the algorithm decides which patterns deserve their own code

Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| byte-pair-encoding | The algorithm that builds tokenizer vocabularies by iteratively merging frequent pairs | complete |
| embeddings | How token IDs become meaningful vectors the model can compute with | complete |
| next-token-prediction | What the model does with tokens once it has them | complete |


Where this concept fits

Position in the knowledge graph

graph TD
    AI[AI and Machine Learning] --> TOK[Tokenization]
    TOK --> BPE[Byte-Pair Encoding]
    AI --> EMB[Embeddings]
    AI --> NTP[Next-Token Prediction]
    style TOK fill:#4a9ede,color:#fff

Related concepts:

  • embeddings — after tokenization produces integer IDs, the embedding layer converts each ID into a vector that encodes meaning
  • next-token-prediction — the model’s core task is predicting the next token in a sequence; tokenization determines what counts as a “token”
  • structured-data-vs-prose — tokenization is one way that unstructured prose gets converted into a structured, machine-processable format

Sources


Footnotes

  1. Trott, S. (2024). Tokenization in large language models, explained. Sean Trott’s Substack.

  2. Sierra Cloud. (2025). How Large Language Models Work: From tokenization to output. Sierra Cloud.

  3. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.

  4. Vizuara. (2025). Understanding Byte Pair Encoding (BPE) in Large Language Models. Vizuara Substack.