Byte-Pair Encoding (BPE)
Overview
Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm. Originally introduced as a data compression method, it has been widely adopted in Natural Language Processing (NLP) to build subword vocabularies for models such as GPT (BERT uses the closely related WordPiece scheme).
Key Idea
BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol. Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
Algorithm Steps
1. Initialization
   - Treat each character of the input text as a token.
2. Find Frequent Pairs
   - Count all adjacent token pairs in the sequence.
3. Merge Most Frequent Pair
   - Replace the most frequent pair with a new symbol not used in the text.
4. Repeat
   - Continue until no frequent pairs remain or the desired vocabulary size is reached.
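In code, these steps translate into a short loop. The sketch below is a minimal illustration, not a reference implementation: the names bpe_compress and most_frequent_pair are ours, and instead of inventing fresh symbols like "Z" it uses the concatenated pair as the new token, as subword tokenizers typically do.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count all adjacent token pairs and return the most frequent with its count."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None, 0
    pair = max(pairs, key=pairs.get)   # ties break toward the first pair seen
    return pair, pairs[pair]

def bpe_compress(text, max_merges=None):
    """Iteratively merge the most frequent adjacent pair (steps 1-4 above).

    Returns the final token sequence and the ordered list of merges,
    which doubles as the replacement table for decompression.
    """
    tokens = list(text)                    # step 1: one token per character
    merges = []
    while max_merges is None or len(merges) < max_merges:
        pair, count = most_frequent_pair(tokens)   # step 2
        if pair is None or count < 2:              # stop: no pair repeats
            break
        merged = pair[0] + pair[1]         # the "new symbol" is the concatenated pair
        merges.append(pair)
        out, i = [], 0                     # step 3: replace every occurrence
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out                       # step 4: repeat on the new sequence
    return tokens, merges
```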
Example
Suppose the data to be encoded is:
aaabdaaabac
Step 1: Merge "aa"
Most frequent pair: "aa" → replace with "Z"
ZabdZabac
Z = aa
Step 2: Merge "ab"
Most frequent pair: "ab" → replace with "Y"
ZYdZYac
Y = ab
Z = aa
Step 3: Merge "ZY"
Most frequent pair: "ZY" → replace with "X"
XdXac
X = ZY
Y = ab
Z = aa
At this point, no pairs occur more than once, so the process stops.
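Running the bpe_compress sketch from above on this string ends with the same segmentation, though because of the tie at Step 2, a different tie-break can change the intermediate merges:

```python
tokens, merges = bpe_compress("aaabdaaabac")
print(tokens)   # ['aaab', 'd', 'aaab', 'a', 'c'], i.e. XdXac with X = aaab
print(merges)   # the ordered replacement table; applied in reverse to decode
```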
Decompression
To recover the original data, replacements are applied in reverse order:
XdXac
→ ZYdZYac
→ ZabdZabac
→ aaabdaaabac
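A matching decoder simply walks the replacement table backwards, splitting each merged token into the pair it came from. Again a minimal sketch, paired with the bpe_compress function above (it assumes each merged string is produced by exactly one merge, which holds in this scheme for the example):

```python
def bpe_decompress(tokens, merges):
    """Undo the recorded merges in reverse order to recover the original text."""
    for pair in reversed(merges):
        merged = pair[0] + pair[1]
        out = []
        for tok in tokens:
            if tok == merged:
                out.extend(pair)    # split the merged token back into its pair
            else:
                out.append(tok)
        tokens = out
    return "".join(tokens)

tokens, merges = bpe_compress("aaabdaaabac")
assert bpe_decompress(tokens, merges) == "aaabdaaabac"
```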
Advantages
- Efficient vocabulary building: reduces the need for massive word lists.
- Handles rare words: breaks them into smaller, reusable subword units instead of mapping them to an unknown token.
- Balances character- and word-level tokenization.
Limitations
- Does not consider linguistic meaning: merges are purely frequency-based.
- May create tokens that are not linguistically natural.
- Vocabulary is fixed after training.