Byte-Pair Encoding (BPE)
Overview
Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm. Originally introduced as a data compression method, it has been widely adopted in Natural Language Processing (NLP) to build subword vocabularies for models such as GPT (BERT uses the closely related WordPiece scheme).
Key Idea
BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol. Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
Algorithm Steps
1. Initialization
   - Treat each character of the input text as a token.
2. Find Frequent Pairs
   - Count all adjacent token pairs in the sequence.
3. Merge Most Frequent Pair
   - Replace the most frequent pair with a new symbol not used in the text.
4. Repeat
   - Continue until no frequent pairs remain or the desired vocabulary size is reached.
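In code, these steps translate into a short loop. The sketch below is a minimal illustration, not a reference implementation: the names bpe_compress and most_frequent_pair are ours, and instead of inventing fresh symbols like "Z" it uses the concatenated pair as the new token, as subword tokenizers typically do.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count all adjacent token pairs and return the most frequent with its count."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None, 0
    pair = max(pairs, key=pairs.get)   # ties break toward the first pair seen
    return pair, pairs[pair]

def bpe_compress(text, max_merges=None):
    """Iteratively merge the most frequent adjacent pair (steps 1-4 above).

    Returns the final token sequence and the ordered list of merges,
    which doubles as the replacement table for decompression.
    """
    tokens = list(text)                    # step 1: one token per character
    merges = []
    while max_merges is None or len(merges) < max_merges:
        pair, count = most_frequent_pair(tokens)   # step 2
        if pair is None or count < 2:              # stop: no pair repeats
            break
        merged = pair[0] + pair[1]         # the "new symbol" is the concatenated pair
        merges.append(pair)
        out, i = [], 0                     # step 3: replace every occurrence
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out                       # step 4: repeat on the new sequence
    return tokens, merges
```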
Example
Suppose the data to be encoded is:
aaabdaaabac
Step 1: Merge "aa"
Most frequent pair: "aa" → replace with "Z"
ZabdZabac
Z = aa
Step 2: Merge "ab"
Most frequent pair: "ab" → replace with "Y"
ZYdZYac
Y = ab
Z = aa
Step 3: Merge "ZY"
Most frequent pair: "ZY" → replace with "X"
XdXac
X = ZY
Y = ab
Z = aa
At this point, no pairs occur more than once, so the process stops.
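Running the bpe_compress sketch from above on this string ends with the same segmentation, though because of the tie at Step 2, a different tie-break can change the intermediate merges:

```python
tokens, merges = bpe_compress("aaabdaaabac")
print(tokens)   # ['aaab', 'd', 'aaab', 'a', 'c'], i.e. XdXac with X = aaab
print(merges)   # the ordered replacement table; applied in reverse to decode
```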
Decompression
To recover the original data, replacements are applied in reverse order:
XdXac
→ ZYdZYac
→ ZabdZabac
→ aaabdaaabac
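A matching decoder simply walks the replacement table backwards, splitting each merged token into the pair it came from. Again a minimal sketch, paired with the bpe_compress function above (it assumes each merged string is produced by exactly one merge, which holds in this scheme for the example):

```python
def bpe_decompress(tokens, merges):
    """Undo the recorded merges in reverse order to recover the original text."""
    for pair in reversed(merges):
        merged = pair[0] + pair[1]
        out = []
        for tok in tokens:
            if tok == merged:
                out.extend(pair)    # split the merged token back into its pair
            else:
                out.append(tok)
        tokens = out
    return "".join(tokens)

tokens, merges = bpe_compress("aaabdaaabac")
assert bpe_decompress(tokens, merges) == "aaabdaaabac"
```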
Advantages
- Efficient vocabulary building: reduces the need for massive word lists.
- Handles rare words: breaks them into smaller, reusable subword units instead of mapping them to an unknown token.
- Balances character- and word-level tokenization.
Limitations
- Does not consider linguistic meaning: merges are purely frequency-based.
- May create tokens that are not linguistically natural.
- Vocabulary is fixed after training.