# Byte-Pair Encoding (BPE)

## Overview

Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm. Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT; the closely related WordPiece scheme used by BERT is built on the same idea.

---

## Key Idea

BPE works by iteratively replacing the most frequent pair of adjacent symbols (initially individual characters) with a new symbol. Over time, frequent character sequences (e.g., common morphemes, prefixes, and suffixes) are merged into single tokens.

---

## Algorithm Steps

1. **Initialization**
   - Treat each character of the input text as a token.
2. **Find Frequent Pairs**
   - Count all adjacent token pairs in the sequence.
3. **Merge the Most Frequent Pair**
   - Replace every occurrence of the most frequent pair with a new symbol not used in the text.
4. **Repeat**
   - Continue until no pair occurs more than once, or until the desired vocabulary size is reached.

A runnable sketch of these steps is given at the end of this document.

---

## Example

Suppose the data to be encoded is:

```
aaabdaaabac
```

### Step 1: Merge `"aa"`

Most frequent pair: `"aa"` → replace with `"Z"`

```
ZabdZabac
Z = aa
```

---

### Step 2: Merge `"ab"`

Most frequent pair: `"ab"` → replace with `"Y"`

```
ZYdZYac
Y = ab
Z = aa
```

---

### Step 3: Merge `"ZY"`

Most frequent pair: `"ZY"` → replace with `"X"`

```
XdXac
X = ZY
Y = ab
Z = aa
```

---

At this point, no pair occurs more than once, so the process stops.

---

## Decompression

To recover the original data, the replacements are applied in **reverse order**:

```
XdXac → ZYdZYac → ZabdZabac → aaabdaaabac
```

---

## Advantages

- **Efficient vocabulary building**: reduces the need for massive word lists.
- **Handles rare words**: breaks them into meaningful subword units.
- **Balances character- and word-level tokenization.**

---

## Limitations

- Does not consider linguistic meaning; merges are purely frequency-based.
- May create tokens that are not linguistically natural.
- The vocabulary is fixed after training.
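
---

## Implementation Sketch

The Python sketch below is a minimal illustration of the compression-style BPE described above; the helper names `bpe_compress` and `bpe_decompress` are made up for this example and are not part of any library. It repeatedly counts adjacent pairs, merges the most frequent one into a fresh single-character symbol, and stops once no pair occurs more than once. Ties between equally frequent pairs are broken by first occurrence, so the intermediate merges can differ from the worked example (here `"Za"` happens to be merged before `"ab"`), but decompression still recovers the original text.

```python
from collections import Counter

def bpe_compress(text, symbols="ZYXWVUTSRQPONMLKJIHGFEDCBA"):
    """Minimal BPE compression sketch: repeatedly merge the most
    frequent adjacent pair into a fresh single-character symbol."""
    tokens = list(text)                 # start with one token per character
    merges = []                         # (new_symbol, pair) in merge order
    fresh = (s for s in symbols if s not in text)

    while True:
        pairs = Counter(zip(tokens, tokens[1:]))   # count adjacent pairs
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                   # stop when no pair repeats
            break
        new_symbol = next(fresh)        # symbol not used in the text
        merges.append((new_symbol, pair))

        merged, i = [], 0
        while i < len(tokens):          # left-to-right, non-overlapping merge
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_symbol)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

    return "".join(tokens), merges

def bpe_decompress(encoded, merges):
    """Undo the merges in reverse order to recover the original text."""
    text = encoded
    for new_symbol, (left, right) in reversed(merges):
        text = text.replace(new_symbol, left + right)
    return text

encoded, merges = bpe_compress("aaabdaaabac")
print(encoded)                         # XdXac
print(merges)                          # [('Z', ('a', 'a')), ('Y', ('Z', 'a')), ('X', ('Y', 'b'))]
print(bpe_decompress(encoded, merges)) # aaabdaaabac
```

For NLP tokenization, the same merge loop is typically run over a corpus of words rather than a single string, and the learned merge table (rather than the compressed text) is kept as the subword vocabulary.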