Resources file updated with Byte-Pair Encoding

a technique we will use to tokenize the engress' words
2025-09-17 12:06:01 +02:00
parent 12bd781fd3
commit 553b86cac2
1 changed files with 97 additions and 0 deletions
--- a/docs/RESOURCES.md
+++ b/docs/RESOURCES.md
@@ -0,0 +1,97 @@
 # Byte-Pair Encoding (BPE)
 ## Overview
 Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.  
 Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and RoBERTa (BERT itself uses the related WordPiece algorithm).
 ---
 ## Key Idea
 BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.  
 Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
 ---
 ## Algorithm Steps
 1. **Initialization**  
   - Treat each character of the input text as a token.
 2. **Find Frequent Pairs**  
   - Count all adjacent token pairs in the sequence.
 3. **Merge Most Frequent Pair**  
   - Replace the most frequent pair with a new symbol not used in the text.
 4. **Repeat**  
   - Continue until no pair occurs more than once or a desired vocabulary size is reached.
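
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a production tokenizer: the names `count_pairs`, `merge_pair`, `train_bpe`, and the `max_merges` parameter are hypothetical, and each new symbol is formed by concatenating the merged pair (a common convention in NLP tokenizers) rather than by picking an unused placeholder character.

```python
from collections import Counter


def count_pairs(tokens):
    """Step 2: count every adjacent token pair in the sequence."""
    return Counter(zip(tokens, tokens[1:]))


def merge_pair(tokens, pair, new_symbol):
    """Step 3: replace non-overlapping occurrences of `pair`, left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2  # skip both halves of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, max_merges):
    """Return the final token sequence and the ordered list of merges."""
    tokens = list(text)  # Step 1: one token per character
    merges = []
    for _ in range(max_merges):  # Step 4: repeat
        pairs = count_pairs(tokens)
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:  # no pair occurs more than once
            break
        # Assumption: the new symbol is the concatenation of the pair.
        tokens = merge_pair(tokens, pair, pair[0] + pair[1])
        merges.append(pair)
    return tokens, merges


tokens, merges = train_bpe("aaabdaaabac", max_merges=10)
print(tokens)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

Ties between equally frequent pairs are broken here by whichever pair `Counter` returns first; real implementations may use a different rule, which can change the intermediate merges.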
 ---
 ## Example
 Suppose the data to be encoded is:
 ```
 aaabdaaabac
 ```
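
To see why `"aa"` is chosen first, the adjacent pairs in this string can be counted directly. The snippet below is only a sketch; the variable names are illustrative.

```python
from collections import Counter

data = "aaabdaaabac"

# Count every adjacent character pair, overlapping occurrences included.
pair_counts = Counter(a + b for a, b in zip(data, data[1:]))

for pair, count in pair_counts.most_common(2):
    print(pair, count)
# aa 4
# ab 2
```

Note that `"aa"` is counted four times because overlapping occurrences inside each `"aaa"` run are included, yet only two non-overlapping replacements are made when the merge is applied left to right.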
 ### Step 1: Merge `"aa"`
 Most frequent pair: `"aa"` → replace with `"Z"`
 ```
 ZabdZabac
 Z = aa
 ```
 ---
 ### Step 2: Merge `"ab"`
 Most frequent pair: `"ab"` (tied with `"Za"` at two occurrences each; either choice is valid) → replace with `"Y"`
 ```
 ZYdZYac
 Y = ab
 Z = aa
 ```
 ---
 ### Step 3: Merge `"ZY"`
 Most frequent pair: `"ZY"` → replace with `"X"`
 ```
 XdXac
 X = ZY
 Y = ab
 Z = aa
 ```
 ---
 At this point, no pairs occur more than once, so the process stops.
 ---
 ## Decompression
 To recover the original data, replacements are applied in **reverse order**:
 ```
 XdXac
 → ZYdZYac
 → ZabdZabac
 → aaabdaaabac
 ```
 ---
 ## Advantages
 - **Efficient vocabulary building**: reduces the need for massive word lists.  
 - **Handles rare words**: breaks them into meaningful subword units.  
 - **Balances character- and word-level tokenization**.  
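
As a small illustration of the rare-word point, the sketch below applies a list of learned merges to new words. The `apply_merges` function and the merge list are hypothetical, not taken from any trained model; merges are replayed in the order they were learned.

```python
def apply_merges(word, merges):
    """Tokenize `word` by replaying learned merges in training order."""
    tokens = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merges, as if learned from a corpus with many "low"/"-er" words.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(apply_merges("lower", merges))   # ['low', 'er']
print(apply_merges("lowest", merges))  # ['low', 'e', 's', 't']
```

A word never seen as a whole, such as `lowest`, still gets a usable tokenization: the known subword `low` plus single characters for the remainder.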
 ---
 ## Limitations
 - Does not consider linguistic meaning: merges are purely frequency-based.  
 - May create tokens that are not linguistically natural.  
 - Vocabulary is fixed after training: new words are split into existing subwords rather than added as new tokens.