Byte-Pair Encoding (BPE)

A compression algorithm that builds a tokenizer’s vocabulary by repeatedly merging the most frequent pair of adjacent symbols — starting from individual characters and working up to common subwords and whole words.


What is it?

Every language model needs a vocabulary: a fixed set of tokens it can recognise. But how do you decide what those tokens should be? You cannot include every word in every language — the vocabulary would be infinite. You cannot use only individual characters — the sequences would be too long and the model would lose efficiency.

Byte-Pair Encoding (BPE) solves this by building the vocabulary automatically from training data. The algorithm starts with the smallest possible tokens (individual characters or bytes) and iteratively merges the most common adjacent pairs until the vocabulary reaches a target size — typically 32,000 to 128,000 tokens.1

The name comes from its origin as a data compression technique, invented by Philip Gage in 1994. In 2016, researchers at the University of Edinburgh adapted it for neural machine translation, and it became the standard tokenization method for modern language models including GPT, Llama, and Claude.2

In plain terms

BPE is like learning abbreviations by watching what people write most often. If “th” appears together constantly, make it one symbol. If “the” appears even more, merge that too. Keep merging until you have a set of symbols that covers most text efficiently.


How does it work?

1. Starting point — every character is a token

BPE begins with the smallest possible units. In the original version, these are individual characters. In modern implementations (like GPT’s tokenizer), the starting units are individual bytes, which means the system can handle any text in any language or encoding, including emoji and binary data.1

The starting vocabulary for English text might look like:

a, b, c, d, e, f, g, h, i, ... z, A, B, ... Z, 0, 1, ... 9, space, ., !, ...

At this stage, the word “lower” is five tokens: l, o, w, e, r.
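
As a rough sketch in plain Python (nothing beyond the standard library), the two starting points look like this:

```python
# Character-level start: every character is its own token.
word = "lower"
print(list(word))                       # ['l', 'o', 'w', 'e', 'r']

# Byte-level start (the GPT-style variant): every UTF-8 byte is a token,
# so accents, emoji, and any other text can still be represented.
print(list("café".encode("utf-8")))     # [99, 97, 102, 195, 169] ('é' is two bytes)
```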

2. Counting pairs

The algorithm scans the entire training corpus and counts how often each pair of adjacent tokens appears. For example, in a corpus containing the words “lower,” “lowest,” “newer,” and “wider”:2

| Pair | Frequency |
| --- | --- |
| l, o | 2 (lower, lowest) |
| o, w | 2 (lower, lowest) |
| w, e | 3 (lower, lowest, newer) |
| e, r | 3 (lower, newer, wider) |
| e, s | 1 (lowest) |
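
Here is a minimal Python sketch of this counting step, using the same four-word toy corpus. The `corpus` dictionary and the `count_pairs` helper are invented for illustration, not part of any particular library:

```python
from collections import Counter

# Toy corpus: each word is a tuple of its current symbols, mapped to how often it occurs.
corpus = {
    ("l", "o", "w", "e", "r"): 1,
    ("l", "o", "w", "e", "s", "t"): 1,
    ("n", "e", "w", "e", "r"): 1,
    ("w", "i", "d", "e", "r"): 1,
}

def count_pairs(corpus):
    """Count how often each pair of adjacent symbols occurs across the corpus."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

print(count_pairs(corpus).most_common(3))
# e.g. [(('w', 'e'), 3), (('e', 'r'), 3), (('l', 'o'), 2)], matching the table above
```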

3. Merging the top pair

The most frequent pair gets merged into a single new token. If “e, r” and “w, e” are tied at 3, the algorithm picks one (typically the first found). Say it merges “e” + “r” into “er”:1

  • “lower” is now: l, o, w, er
  • “newer” is now: n, e, w, er
  • “wider” is now: w, i, d, er

The new token “er” is added to the vocabulary. The algorithm counts pairs again with the updated tokenization and merges the next most frequent pair.
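
A matching sketch of the merge step, again with invented helper names rather than any library's API:

```python
def merge_pair(corpus, pair):
    """Replace every occurrence of the adjacent pair with a single merged symbol."""
    merged = "".join(pair)
    new_corpus = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)      # found the pair: emit one merged symbol
                i += 2
            else:
                out.append(symbols[i])  # otherwise keep the symbol as-is
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

corpus = {("l", "o", "w", "e", "r"): 1, ("l", "o", "w", "e", "s", "t"): 1,
          ("n", "e", "w", "e", "r"): 1, ("w", "i", "d", "e", "r"): 1}
print(list(merge_pair(corpus, ("e", "r"))))
# [('l', 'o', 'w', 'er'), ('l', 'o', 'w', 'e', 's', 't'), ('n', 'e', 'w', 'er'), ('w', 'i', 'd', 'er')]
```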

4. Repeating until done

After thousands of merges, the vocabulary contains tokens of increasing length:2

| Merge round | New token | Example |
| --- | --- | --- |
| Early | “er” | Common suffix |
| Middle | “the” | Common word |
| Late | “tion” | Common suffix |
| Very late | “ the” | Word with preceding space |

The process stops when the vocabulary reaches the target size; reaching a modern vocabulary of 32,000 to 128,000 tokens takes tens of thousands of merge operations on top of the base alphabet.3
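
Putting the two helpers together gives a toy trainer. This sketch reuses the `count_pairs` and `merge_pair` functions from the earlier snippets and stops at an arbitrary target vocabulary size; real trainers follow the same loop, just over billions of words and with heavy optimisation:

```python
def train_bpe(corpus, target_vocab_size):
    """Toy BPE trainer: keep merging the most frequent pair until the vocabulary is big enough."""
    vocab = {sym for word in corpus for sym in word}    # base alphabet
    merges = []
    while len(vocab) < target_vocab_size:
        pairs = count_pairs(corpus)
        if not pairs:
            break                                       # nothing left to merge
        best = max(pairs, key=pairs.get)                # most frequent adjacent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)
        vocab.add("".join(best))
    return merges, vocab

corpus = {("l", "o", "w", "e", "r"): 1, ("l", "o", "w", "e", "s", "t"): 1,
          ("n", "e", "w", "e", "r"): 1, ("w", "i", "d", "e", "r"): 1}
merges, vocab = train_bpe(corpus, target_vocab_size=15)
print(merges)    # the learned merge rules, in the order they were learned
```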

Think of it like...

Learning a language’s common patterns by frequency. A child learning English quickly notices “th” appears together constantly, then “the”, then “ing.” BPE does the same thing, but mechanically and over billions of words, building up from individual characters to the most useful chunks of text.

5. Using the trained vocabulary

Once the vocabulary is built, tokenizing new text is straightforward. The tokenizer splits the input into characters (or bytes) and replays the learned merge rules in their training order, so frequent sequences collapse back into long known tokens:1

"Tokenization" → "Token" + "ization"
"the" → "the"
"xyzzy" → "x" + "y" + "z" + "zy"

Common words match as whole tokens. Rare words get split into known subword pieces. Truly unknown character sequences fall back to individual characters or bytes. Nothing is ever “out of vocabulary.”
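
A small sketch of that encoding step, reusing the `merges` list produced by the toy trainer above (again an illustration, not any particular tokenizer's API):

```python
def encode_word(word, merges):
    """Encode one word by replaying the learned merge rules in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        merged, out, i = "".join(pair), [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(encode_word("lower", merges))   # collapses into the subword pieces learned during training
print(encode_word("xyzzy", merges))   # no rules apply, so it stays ['x', 'y', 'z', 'z', 'y']
```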


Why do we use it?

Key reasons

1. Open vocabulary. BPE can represent any text, in any language, including text it has never seen before. Unknown words simply get split into smaller known pieces.1

2. Efficiency. Common words and subwords become single tokens, compressing text and allowing more content to fit within the model’s context window.3

3. Data-driven. The vocabulary is learned from the training data, not designed by hand. This means it automatically adapts to the domain and languages present in the corpus.2


When do we use it?

  • When building or training a language model — BPE is the standard tokenization algorithm for GPT, Llama, Claude, and most modern LLMs
  • When understanding token costs — knowing that BPE merges common sequences explains why some languages and domains use more tokens than others
  • When debugging tokenization issues — understanding the merge process helps explain why certain words split in unexpected ways

Rule of thumb

If you are working with a modern language model, you are already using BPE (or a close variant like Unigram). Understanding the algorithm helps you predict how text will be tokenized and why some inputs cost more than others.
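
One way to see this in practice, assuming you have OpenAI's open-source tiktoken package installed, is to load one of its published byte-level BPE encodings and inspect how different strings split:

```python
import tiktoken  # pip install tiktoken

# cl100k_base is one of tiktoken's published byte-level BPE encodings.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["the", "Tokenization", "internationalization", "xyzzy"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:>24} -> {len(ids)} token(s): {pieces}")

# Frequent words usually come back as a single token, while rarer or made-up words
# split into several subword pieces, which is why some inputs cost more tokens.
```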


How can I think about it?

The LEGO brick factory

Imagine a factory that makes LEGO bricks by watching what people build most often.

  • Start: Every brick is a 1x1 single stud
  • Observation: People constantly connect 1x1 red + 1x1 red side by side
  • Merge: The factory produces a new 1x2 red brick
  • Observation: People connect 1x2 red + 1x1 blue constantly
  • Merge: The factory produces a new 1x3 red-red-blue brick
  • Repeat until the factory has 100,000 different brick shapes
  • Any structure can be built from these bricks, but common structures use fewer, larger bricks
  • The factory’s catalogue = the BPE vocabulary

The text message abbreviations

Think of how text message shorthand evolves naturally.

  • People start writing every word in full: “laughing out loud”
  • The most common phrases get shortened: “LOL”
  • Less common phrases stay long: “rolling on the floor laughing” only became “ROFL” once it appeared often enough
  • Very rare phrases never get abbreviations
  • BPE follows the same logic: frequently co-occurring character sequences earn their own symbol; rare ones stay split into smaller pieces


Where this concept fits

Position in the knowledge graph

graph TD
    AI[AI and Machine Learning] --> TOK[Tokenization]
    TOK --> BPE[Byte-Pair Encoding]
    style BPE fill:#4a9ede,color:#fff

Related concepts:

  • tokenization — BPE is the algorithm that builds a tokenizer’s vocabulary; tokenization is the broader process of converting text to tokens
  • embeddings — after BPE produces token IDs, the embedding layer converts each ID into a vector that encodes meaning

Sources


Footnotes

  1. Langformers. (2025). BPE Tokenizer: Training and Tokenization Explained. Langformers Blog.

  2. Vizuara. (2025). Understanding Byte Pair Encoding (BPE) in Large Language Models. Vizuara Substack.

  3. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.