Byte-Pair Encoding (BPE)

A compression algorithm that builds a tokenizer’s vocabulary by repeatedly merging the most frequent pair of adjacent symbols — starting from individual characters and working up to common subwords and whole words.


What is it?

Every language model needs a vocabulary: a fixed set of tokens it can recognise. But how do you decide what those tokens should be? You cannot include every word in every language — the vocabulary would be infinite. You cannot use only individual characters — the sequences would be too long and the model would lose efficiency.

Byte-Pair Encoding (BPE) solves this by building the vocabulary automatically from training data. The algorithm starts with the smallest possible tokens (individual characters or bytes) and iteratively merges the most common adjacent pairs until the vocabulary reaches a target size — typically 32,000 to 128,000 tokens.1

The name comes from its origin as a data compression technique, invented by Philip Gage in 1994. In 2016, researchers at the University of Edinburgh adapted it for neural machine translation, and it became the standard tokenization method for modern language models including GPT, Llama, and Claude.2

In plain terms

BPE is like learning abbreviations by watching what people write most often. If “th” appears together constantly, make it one symbol. If “the” appears even more, merge that too. Keep merging until you have a set of symbols that covers most text efficiently.


How does it work?

1. Starting point — every character is a token

BPE begins with the smallest possible units. In the original version, these are individual characters. In modern implementations (like GPT’s tokenizer), the starting units are individual bytes, which means the system can handle any text in any language or encoding, including emoji and binary data.1

The starting vocabulary for English text might look like:

a, b, c, d, e, f, g, h, i, ... z, A, B, ... Z, 0, 1, ... 9, space, ., !, ...

At this stage, the word “lower” is five tokens: l, o, w, e, r.
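
As a rough sketch in plain Python (nothing beyond the standard library), the two starting points look like this:

```python
# Character-level start: every character is its own token.
word = "lower"
print(list(word))                       # ['l', 'o', 'w', 'e', 'r']

# Byte-level start (the GPT-style variant): every UTF-8 byte is a token,
# so accents, emoji, and any other text can still be represented.
print(list("café".encode("utf-8")))     # [99, 97, 102, 195, 169] ('é' is two bytes)
```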

2. Counting pairs

The algorithm scans the entire training corpus and counts how often each pair of adjacent tokens appears. For example, in a corpus containing the words “lower,” “lowest,” “newer,” and “wider”:2

| Pair | Frequency |
| --- | --- |
| l, o | 2 (lower, lowest) |
| o, w | 2 (lower, lowest) |
| w, e | 3 (lower, lowest, newer) |
| e, r | 3 (lower, newer, wider) |
| e, s | 1 (lowest) |
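
Here is a minimal Python sketch of this counting step, using the same four-word toy corpus. The `corpus` dictionary and the `count_pairs` helper are invented for illustration, not part of any particular library:

```python
from collections import Counter

# Toy corpus: each word is a tuple of its current symbols, mapped to how often it occurs.
corpus = {
    ("l", "o", "w", "e", "r"): 1,
    ("l", "o", "w", "e", "s", "t"): 1,
    ("n", "e", "w", "e", "r"): 1,
    ("w", "i", "d", "e", "r"): 1,
}

def count_pairs(corpus):
    """Count how often each pair of adjacent symbols occurs across the corpus."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

print(count_pairs(corpus).most_common(3))
# e.g. [(('w', 'e'), 3), (('e', 'r'), 3), (('l', 'o'), 2)], matching the table above
```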

3. Merging the top pair

The most frequent pair gets merged into a single new token. If “e, r” and “w, e” are tied at 3, the algorithm picks one (typically the first found). Say it merges “e” + “r” into “er”:1

  • “lower” is now: l, o, w, er
  • “newer” is now: n, e, w, er
  • “wider” is now: w, i, d, er

The new token “er” is added to the vocabulary. The algorithm counts pairs again with the updated tokenization and merges the next most frequent pair.
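
A matching sketch of the merge step, again with invented helper names rather than any library's API:

```python
def merge_pair(corpus, pair):
    """Replace every occurrence of the adjacent pair with a single merged symbol."""
    merged = "".join(pair)
    new_corpus = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)      # found the pair: emit one merged symbol
                i += 2
            else:
                out.append(symbols[i])  # otherwise keep the symbol as-is
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

corpus = {("l", "o", "w", "e", "r"): 1, ("l", "o", "w", "e", "s", "t"): 1,
          ("n", "e", "w", "e", "r"): 1, ("w", "i", "d", "e", "r"): 1}
print(list(merge_pair(corpus, ("e", "r"))))
# [('l', 'o', 'w', 'er'), ('l', 'o', 'w', 'e', 's', 't'), ('n', 'e', 'w', 'er'), ('w', 'i', 'd', 'er')]
```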

4. Repeating until done

After thousands of merges, the vocabulary contains tokens of increasing length:2

| Merge round | New token | Example |
| --- | --- | --- |
| Early | “er” | Common suffix |
| Middle | “the” | Common word |
| Late | “tion” | Common suffix |
| Very late | “ the” | Word with preceding space |

The process stops when the vocabulary reaches the target size; reaching a modern vocabulary of 32,000 to 128,000 tokens takes tens of thousands of merge operations on top of the base alphabet.3
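
Putting the two helpers together gives a toy trainer. This sketch reuses the `count_pairs` and `merge_pair` functions from the earlier snippets and stops at an arbitrary target vocabulary size; real trainers follow the same loop, just over billions of words and with heavy optimisation:

```python
def train_bpe(corpus, target_vocab_size):
    """Toy BPE trainer: keep merging the most frequent pair until the vocabulary is big enough."""
    vocab = {sym for word in corpus for sym in word}    # base alphabet
    merges = []
    while len(vocab) < target_vocab_size:
        pairs = count_pairs(corpus)
        if not pairs:
            break                                       # nothing left to merge
        best = max(pairs, key=pairs.get)                # most frequent adjacent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)
        vocab.add("".join(best))
    return merges, vocab

corpus = {("l", "o", "w", "e", "r"): 1, ("l", "o", "w", "e", "s", "t"): 1,
          ("n", "e", "w", "e", "r"): 1, ("w", "i", "d", "e", "r"): 1}
merges, vocab = train_bpe(corpus, target_vocab_size=15)
print(merges)    # the learned merge rules, in the order they were learned
```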

Think of it like...

Learning a language’s common patterns by frequency. A child learning English quickly notices “th” appears together constantly, then “the”, then “ing.” BPE does the same thing, but mechanically and over billions of words, building up from individual characters to the most useful chunks of text.

5. Using the trained vocabulary

Once the vocabulary is built, tokenizing new text is straightforward. The tokenizer splits the input into characters (or bytes) and replays the learned merge rules in their training order, so frequent sequences collapse back into long known tokens:1

"Tokenization" → "Token" + "ization"
"the" → "the"
"xyzzy" → "x" + "y" + "z" + "zy"

Common words match as whole tokens. Rare words get split into known subword pieces. Truly unknown character sequences fall back to individual characters or bytes. Nothing is ever “out of vocabulary.”
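
A small sketch of that encoding step, reusing the `merges` list produced by the toy trainer above (again an illustration, not any particular tokenizer's API):

```python
def encode_word(word, merges):
    """Encode one word by replaying the learned merge rules in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        merged, out, i = "".join(pair), [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(encode_word("lower", merges))   # collapses into the subword pieces learned during training
print(encode_word("xyzzy", merges))   # no rules apply, so it stays ['x', 'y', 'z', 'z', 'y']
```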


Why do we use it?

Key reasons

1. Open vocabulary. BPE can represent any text, in any language, including text it has never seen before. Unknown words simply get split into smaller known pieces.1

2. Efficiency. Common words and subwords become single tokens, compressing text and allowing more content to fit within the model’s context window.3

3. Data-driven. The vocabulary is learned from the training data, not designed by hand. This means it automatically adapts to the domain and languages present in the corpus.2


When do we use it?

  • When building or training a language model — BPE is the standard tokenization algorithm for GPT, Llama, Claude, and most modern LLMs
  • When understanding token costs — knowing that BPE merges common sequences explains why some languages and domains use more tokens than others
  • When debugging tokenization issues — understanding the merge process helps explain why certain words split in unexpected ways

Rule of thumb

If you are working with a modern language model, you are already using BPE (or a close variant like Unigram). Understanding the algorithm helps you predict how text will be tokenized and why some inputs cost more than others.
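
One way to see this in practice, assuming you have OpenAI's open-source tiktoken package installed, is to load one of its published byte-level BPE encodings and inspect how different strings split:

```python
import tiktoken  # pip install tiktoken

# cl100k_base is one of tiktoken's published byte-level BPE encodings.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["the", "Tokenization", "internationalization", "xyzzy"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:>24} -> {len(ids)} token(s): {pieces}")

# Frequent words usually come back as a single token, while rarer or made-up words
# split into several subword pieces, which is why some inputs cost more tokens.
```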


How can I think about it?

The LEGO brick factory

Imagine a factory that makes LEGO bricks by watching what people build most often.

  • Start: Every brick is a 1x1 single stud
  • Observation: People constantly connect 1x1 red + 1x1 red side by side
  • Merge: The factory produces a new 1x2 red brick
  • Observation: People connect 1x2 red + 1x1 blue constantly
  • Merge: The factory produces a new 1x3 red-red-blue brick
  • Repeat until the factory has 100,000 different brick shapes
  • Any structure can be built from these bricks, but common structures use fewer, larger bricks
  • The factory’s catalogue = the BPE vocabulary

The text message abbreviations

Think of how text message shorthand evolves naturally.

  • People start writing every word in full: “laughing out loud”
  • The most common phrases get shortened: “LOL”
  • Less common phrases stay long: “rolling on the floor laughing” only became “ROFL” once it appeared often enough
  • Very rare phrases never get abbreviations
  • BPE follows the same logic: frequently co-occurring character sequences earn their own symbol; rare ones stay split into smaller pieces


Where this concept fits

Position in the knowledge graph

graph TD
    AI[AI and Machine Learning] --> TOK[Tokenization]
    TOK --> BPE[Byte-Pair Encoding]
    style BPE fill:#4a9ede,color:#fff

Related concepts:

  • tokenization — BPE is the algorithm that builds a tokenizer’s vocabulary; tokenization is the broader process of converting text to tokens
  • embeddings — after BPE produces token IDs, the embedding layer converts each ID into a vector that encodes meaning

Sources


Footnotes

  1. Langformers. (2025). BPE Tokenizer: Training and Tokenization Explained. Langformers Blog.

  2. Vizuara. (2025). Understanding Byte Pair Encoding (BPE) in Large Language Models. Vizuara Substack.

  3. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.