From 72eb937b471ac863d17bdac4c7e8cb28412b844e Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Wed, 17 Sep 2025 12:51:14 +0200
Subject: [PATCH] Fixed Markdown violations

---
 docs/RESOURCES.md | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/docs/RESOURCES.md b/docs/RESOURCES.md
index 65c47d6..225b83c 100644
--- a/docs/RESOURCES.md
+++ b/docs/RESOURCES.md
@@ -1,28 +1,31 @@
 # Byte-Pair Encoding (BPE)
 
 ## Overview
-Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
+
+Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
 Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and BERT.
 
 ---
 
 ## Key Idea
-BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.
+
+BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.
 Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
 
 ---
 
 ## Algorithm Steps
-1. **Initialization**
+
+1. **Initialization**
    - Treat each character of the input text as a token.
 
-2. **Find Frequent Pairs**
+2. **Find Frequent Pairs**
    - Count all adjacent token pairs in the sequence.
 
-3. **Merge Most Frequent Pair**
+3. **Merge Most Frequent Pair**
    - Replace the most frequent pair with a new symbol not used in the text.
 
-4. **Repeat**
+4. **Repeat**
    - Continue until no frequent pairs remain or a desired vocabulary size is reached.
 
 ---
@@ -31,14 +34,15 @@ Over time, frequent character sequences (e.g., common morphemes, prefixes, suffi
 
 Suppose the data to be encoded is:
 
-```
+```text
 aaabdaaabac
 ```
 
 ### Step 1: Merge `"aa"`
+
 Most frequent pair: `"aa"` → replace with `"Z"`
 
-```
+```text
 ZabdZabac
 Z = aa
 ```
@@ -46,9 +50,10 @@ Z = aa
 ---
 
 ### Step 2: Merge `"ab"`
+
 Most frequent pair: `"ab"` → replace with `"Y"`
 
-```
+```text
 ZYdZYac
 Y = ab
 Z = aa
@@ -57,9 +62,10 @@ Z = aa
 ---
 
 ### Step 3: Merge `"ZY"`
+
 Most frequent pair: `"ZY"` → replace with `"X"`
 
-```
+```text
 XdXac
 X = ZY
 Y = ab
@@ -73,9 +79,10 @@ At this point, no pairs occur more than once, so the process stops.
 ---
 
 ## Decompression
+
 To recover the original data, replacements are applied in **reverse order**:
 
-```
+```text
 XdXac
 → ZYdZYac
 → ZabdZabac
@@ -85,13 +92,15 @@ XdXac
 ---
 
 ## Advantages
-- **Efficient vocabulary building**: reduces the need for massive word lists.
-- **Handles rare words**: breaks them into meaningful subword units.
-- **Balances character- and word-level tokenization**.
+
+- **Efficient vocabulary building**: reduces the need for massive word lists.
+- **Handles rare words**: breaks them into meaningful subword units.
+- **Balances character- and word-level tokenization**.
 
 ---
 
 ## Limitations
-- Does not consider linguistic meaning—merges are frequency-based.
-- May create tokens that are not linguistically natural.
-- Vocabulary is fixed after training.
+
+- Does not consider linguistic meaning—merges are frequency-based.
+- May create tokens that are not linguistically natural.
+- Vocabulary is fixed after training.
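The merge loop and reverse-order decompression described in the patched `docs/RESOURCES.md` can be reproduced in a few lines. Below is a minimal Python sketch, not part of the patch itself: the helper names (`most_frequent_pair`, `bpe_compress`, `bpe_decompress`) and the pool of fresh symbols are invented for this illustration, and ties between equally frequent pairs are broken arbitrarily, so the recorded merges may differ from the Z/Y/X table in the document even though the final output (`XdXac`) and the round trip match.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Count all adjacent token pairs and return the most common one with its count."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None, 0
    return pairs.most_common(1)[0]


def bpe_compress(text, fresh_symbols="ZYXWVUTSRQ"):
    """Greedy BPE as in the document: repeatedly replace the most frequent
    adjacent pair with a fresh symbol until no pair occurs more than once.
    The fresh symbols are assumed not to occur in the input text."""
    tokens = list(text)          # Step 1: every character is its own token
    merges = []                  # (new_symbol, pair) in the order they were applied
    for symbol in fresh_symbols:
        pair, count = most_frequent_pair(tokens)
        if pair is None or count < 2:
            break                # no pair repeats -> stop
        merges.append((symbol, pair))
        merged, i = [], 0
        while i < len(tokens):   # replace every occurrence of the pair
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(symbol)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return "".join(tokens), merges


def bpe_decompress(encoded, merges):
    """Undo the merges in reverse order to recover the original text."""
    text = encoded
    for symbol, (left, right) in reversed(merges):
        text = text.replace(symbol, left + right)
    return text


if __name__ == "__main__":
    encoded, merges = bpe_compress("aaabdaaabac")
    print(encoded)       # XdXac
    print(merges)        # the three recorded merges
    assert bpe_decompress(encoded, merges) == "aaabdaaabac"
```

Running the script prints the compressed string and the merge table; as the document notes, subword tokenizers typically stop at a target vocabulary size rather than waiting for pairs to stop repeating.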