2025-09-17 12:06:01 +02:00
# Byte-Pair Encoding (BPE)
## Overview
Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and RoBERTa.
---
## Key Idea
BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.
Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
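
To make "most frequent pair" concrete, here is a minimal Python sketch of the counting step. The helper name `count_pairs` is an assumption made for illustration. Note that overlapping occurrences are counted, so a run like `aaa` contributes two `aa` pairs even though only one of them can be merged.

```python
from collections import Counter

def count_pairs(tokens):
    """Count every adjacent pair of tokens, including overlapping ones."""
    return Counter(zip(tokens, tokens[1:]))

# Start from single characters, as BPE does before any merge.
pairs = count_pairs(list("aaabdaaabac"))
most_frequent_pair, frequency = pairs.most_common(1)[0]
print(most_frequent_pair, frequency)  # ('a', 'a') 4
```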
---
## Algorithm Steps
1. **Initialization**
   - Treat each character of the input text as a token.
2. **Find Frequent Pairs**
   - Count all adjacent token pairs in the sequence.
3. **Merge Most Frequent Pair**
   - Replace every occurrence of the most frequent pair with a new symbol not used in the text.
4. **Repeat**
   - Continue until no pair occurs more than once or the desired vocabulary size is reached.
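
The four steps above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the names `merge_pair` and `bpe_compress`, the `new_symbols` parameter (a caller-supplied pool of symbols that do not occur in the text), and the `max_merges` limit (standing in for a target vocabulary size) are all assumptions. Ties between equally frequent pairs are broken by first occurrence, which is an arbitrary choice; a different tie-break can produce different intermediate rules.

```python
from collections import Counter

def merge_pair(tokens, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe_compress(text, new_symbols, max_merges=None):
    """Return the compressed tokens and the rules in the order they were learned."""
    tokens = list(text)              # Step 1: one token per character
    rules = []                       # (new_symbol, pair) entries
    symbols = iter(new_symbols)      # assumed to be absent from `text`
    while max_merges is None or len(rules) < max_merges:
        pairs = Counter(zip(tokens, tokens[1:]))   # Step 2: count adjacent pairs
        if not pairs:
            break
        # Step 3: pick the most frequent pair (ties: first one seen).
        pair, frequency = max(pairs.items(), key=lambda item: item[1])
        if frequency < 2:
            break                    # Step 4: nothing occurs more than once
        new_symbol = next(symbols, None)
        if new_symbol is None:
            break                    # ran out of unused symbols
        tokens = merge_pair(tokens, pair, new_symbol)
        rules.append((new_symbol, pair))
    return tokens, rules

tokens, rules = bpe_compress("aaabdaaabac", new_symbols="ZYX")
print("".join(tokens))  # XdXac
```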
---
## Example
Suppose the data to be encoded is:
```text
aaabdaaabac
```
### Step 1: Merge `"aa"`
Most frequent pair: `"aa"` → replace with `"Z"`
```text
ZabdZabac
Z = aa
```
---
### Step 2: Merge `"ab"`
Most frequent pair: `"ab"` (tied with `"Za"` at two occurrences each; the tie is broken arbitrarily) → replace with `"Y"`
```text
ZYdZYac
Y = ab
Z = aa
```
---
### Step 3: Merge `"ZY"`
Most frequent pair: `"ZY"` → replace with `"X"`
```text
XdXac
X = ZY
Y = ab
Z = aa
```
---
At this point, no pairs occur more than once, so the process stops.
---
## Decompression
To recover the original data, replacements are applied in **reverse order**:
```text
XdXac
→ ZYdZYac
→ ZabdZabac
→ aaabdaaabac
```
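The reverse-order expansion can be sketched in a few lines of Python. The function name `bpe_decompress` and the list-of-tuples rule format are assumptions made for illustration; the sketch also assumes each replacement symbol is a single character that never appears in the original data.

```python
def bpe_decompress(text, rules):
    """Undo merges by expanding the most recently created symbol first."""
    for symbol, expansion in reversed(rules):
        text = text.replace(symbol, expansion)
    return text

# Rules from the example, listed in the order they were created.
rules = [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]
print(bpe_decompress("XdXac", rules))  # aaabdaaabac
```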
---
## Advantages
- **Efficient vocabulary building**: reduces the need for massive word lists.
- **Handles rare words**: breaks them into meaningful subword units.
- **Balances character- and word-level tokenization**.
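
As a rough illustration of how learned merges split an unfamiliar word into subword units, the Python sketch below applies an ordered merge list to a word. The merge list `HYPOTHETICAL_MERGES` is invented for this example and does not come from any real tokenizer, and `apply_merges` is an assumed helper name. Production tokenizers typically add details (such as word-boundary markers or byte-level fallback) that are omitted here.

```python
# Hypothetical merges, in the order they would have been learned.
HYPOTHETICAL_MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

def apply_merges(word, merges):
    """Tokenize `word` by replaying learned merges in their original order."""
    tokens = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)   # fuse the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(apply_merges("lower", HYPOTHETICAL_MERGES))   # ['low', 'er']
print(apply_merges("lowest", HYPOTHETICAL_MERGES))  # ['low', 'e', 's', 't']
```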
---
## Limitations
- Does not consider linguistic meaning; merges are purely frequency-based.
- May create tokens that are not linguistically natural.
- Vocabulary is fixed after training; adding new merges requires retraining.