Fixed Markdown violations

Christian Risi 2025-09-17 12:51:14 +02:00
parent cececa14ce
commit 72eb937b47

# Byte-Pair Encoding (BPE)

## Overview

Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and BERT.

---

## Key Idea

BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.
Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
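
As a minimal illustration of the pair-counting idea (a Python sketch of our own, using only `collections.Counter`; the string is the one used in the example below):

```python
from collections import Counter

tokens = list("aaabdaaabac")                     # start from single characters
pair_counts = Counter(zip(tokens, tokens[1:]))   # adjacent pairs

print(pair_counts.most_common(1))                # [(('a', 'a'), 4)]
```

Note that `zip` counts overlapping occurrences; the merge step itself replaces pairs left to right without overlap.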

---

## Algorithm Steps

1. **Initialization**
   - Treat each character of the input text as a token.
2. **Find Frequent Pairs**
   - Count all adjacent token pairs in the sequence.
3. **Merge Most Frequent Pair**
   - Replace the most frequent pair with a new symbol not used in the text.
4. **Repeat**
   - Continue until no frequent pairs remain or a desired vocabulary size is reached (a runnable sketch of this loop follows the list).
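
The loop above can be sketched in a few lines of Python (a minimal sketch with our own function names and stopping rule, not a reference implementation):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return (pair, count) for the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0] if pairs else (None, 0)

def bpe_train(text, max_merges=10):
    """Greedily merge the most frequent adjacent pair until none repeats."""
    tokens = list(text)                    # 1. initialization: one token per character
    merges = []                            # ordered merge table
    for _ in range(max_merges):
        pair, count = most_frequent_pair(tokens)   # 2. find frequent pairs
        if pair is None or count < 2:
            break                          # 4. stop: no pair occurs more than once
        merges.append(pair)
        merged = pair[0] + pair[1]         # 3. new symbol standing for the pair
        out, i = [], 0
        while i < len(tokens):             # replace occurrences left to right
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges
```

In practical tokenizers the new symbol is simply the concatenation of the pair rather than a fresh letter like `Z`, but the loop is the same.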

---

## Example

Suppose the data to be encoded is:

```text
aaabdaaabac
```

### Step 1: Merge `"aa"`

Most frequent pair: `"aa"` → replace with `"Z"`

```text
ZabdZabac
Z = aa
```

---

### Step 2: Merge `"ab"`

Most frequent pair: `"ab"` → replace with `"Y"`

```text
ZYdZYac
Y = ab
Z = aa
```

---

### Step 3: Merge `"ZY"`

Most frequent pair: `"ZY"` → replace with `"X"`

```text
XdXac
X = ZY
Y = ab
Z = aa
```

At this point, no pairs occur more than once, so the process stops.
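
Running the sketch from the Algorithm Steps section on this string reproduces the first merge exactly. One caveat: at Step 2 the pairs `Za` and `ab` each occur twice, so picking `ab` is one valid tie-break among equals, and an implementation may legitimately choose the other:

```python
tokens, merges = bpe_train("aaabdaaabac")
print(merges[0])   # ('a', 'a'), i.e. the "Z = aa" merge
print(tokens)      # final tokens; later merges depend on how ties are broken
```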

---

## Decompression

To recover the original data, replacements are applied in **reverse order**:

```text
XdXac
→ ZYdZYac
→ ZabdZabac
→ aaabdaaabac
```
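
Decoding is a straightforward reverse pass over the merge table. A minimal sketch (the table hardcodes the `Z`/`Y`/`X` symbols from this example):

```python
def bpe_decode(text, table):
    """Expand symbols in reverse order of their creation."""
    for symbol, expansion in reversed(table):
        text = text.replace(symbol, expansion)
    return text

table = [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]
print(bpe_decode("XdXac", table))  # aaabdaaabac
```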

---

## Advantages

- **Efficient vocabulary building**: reduces the need for massive word lists.
- **Handles rare words**: breaks them into meaningful subword units.
- **Balances character- and word-level tokenization**.

---

## Limitations

- Does not consider linguistic meaning; merges are purely frequency-based.
- May create tokens that are not linguistically natural.
- Vocabulary is fixed after training.