From 553b86cac27118501d9590251a19feb0a36f3e57 Mon Sep 17 00:00:00 2001
From: GassiGiuseppe <g.gassi@studenti.poliba.it>
Date: Wed, 17 Sep 2025 12:06:01 +0200
Subject: [PATCH] Resources file updated with Byte-Pair Encoding a technique we
 will use to tokenize the engress' words

---
 docs/RESOURCES.md | 97 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)

diff --git a/docs/RESOURCES.md b/docs/RESOURCES.md
index e69de29..65c47d6 100644
--- a/docs/RESOURCES.md
+++ b/docs/RESOURCES.md
@@ -0,0 +1,97 @@
+# Byte-Pair Encoding (BPE)
+
+## Overview
+Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.  
+Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT-2 and RoBERTa (BERT uses the related WordPiece algorithm).
+
+---
+
+## Key Idea
+BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.  
+Over successive merges, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are combined into single tokens.
+
+---
+
+## Algorithm Steps
+1. **Initialization**  
+   - Treat each character of the input text as a token.
+
+2. **Find Frequent Pairs**  
+   - Count all adjacent token pairs in the sequence.
+
+3. **Merge Most Frequent Pair**  
+   - Replace the most frequent pair with a new symbol not used in the text.
+
+4. **Repeat**  
+   - Continue until no pair occurs more than once or a desired vocabulary size is reached.
+
+---
+
+## Example
+
+Suppose the data to be encoded is:
+
+```
+aaabdaaabac
+```
+
+### Step 1: Merge `"aa"`
+Most frequent pair: `"aa"` → replace with `"Z"`
+
+```
+ZabdZabac
+Z = aa
+```
+
+---
+
+### Step 2: Merge `"ab"`
+Most frequent pair: `"ab"` (tied with `"Za"` at two occurrences each; either may be chosen) → replace with `"Y"`
+
+```
+ZYdZYac
+Y = ab
+Z = aa
+```
+
+---
+
+### Step 3: Merge `"ZY"`
+Most frequent pair: `"ZY"` → replace with `"X"`
+
+```
+XdXac
+X = ZY
+Y = ab
+Z = aa
+```
+
+---
+
+At this point, no pairs occur more than once, so the process stops.
+
+---
+
+## Decompression
+To recover the original data, replacements are applied in **reverse order**:
+
+```
+XdXac
+→ ZYdZYac
+→ ZabdZabac
+→ aaabdaaabac
+```
+
+---
+
+## Advantages
+- **Efficient vocabulary building**: reduces the need for massive word lists.  
+- **Handles rare words**: breaks them into smaller subword units that are already in the vocabulary.  
+- **Balances character- and word-level tokenization**.  
+
+---
+
+## Limitations
+- Does not consider linguistic meaning; merges are purely frequency-based.  
+- May create tokens that are not linguistically natural.  
+- Vocabulary is fixed after training.
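
Note on the patch above: the algorithm steps and the worked example can be sketched in Python as follows. This is a minimal illustration, not the project's tokenizer. The names `merge_pair`, `bpe_compress`, and `bpe_decompress` are hypothetical, and so is the pool of placeholder symbols `"ZYXWVU"`, which is assumed not to occur in the input. Ties between equally frequent pairs are broken by taking the lexicographically larger pair; that rule is an arbitrary assumption made for determinism, and it happens to pick `"ab"` over `"Za"` as the example does.

```python
from collections import Counter


def merge_pair(tokens, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bpe_compress(text, new_symbols="ZYXWVU"):
    """Merge the most frequent adjacent pair until no pair occurs twice.

    Assumes none of `new_symbols` appears in `text` (hypothetical placeholder pool).
    Also stops when the placeholder pool is exhausted.
    """
    tokens = list(text)  # step 1: one token per character
    rules = []           # (new_symbol, pair), in the order the merges were applied
    for symbol in new_symbols:
        # Step 2: count adjacent pairs (overlapping occurrences are counted too).
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break
        # Step 3: pick the most frequent pair; ties go to the larger pair (assumption).
        best = max(counts, key=lambda p: (counts[p], p))
        if counts[best] < 2:
            break        # step 4: no pair occurs more than once
        tokens = merge_pair(tokens, best, symbol)
        rules.append((symbol, best))
    return "".join(tokens), rules


def bpe_decompress(compressed, rules):
    """Undo the merges by expanding symbols in reverse order of creation."""
    for symbol, (left, right) in reversed(rules):
        compressed = compressed.replace(symbol, left + right)
    return compressed


compressed, rules = bpe_compress("aaabdaaabac")
print(compressed)                         # XdXac
print(bpe_decompress(compressed, rules))  # aaabdaaabac
```

A tokenizer for NLP would typically keep the learned merge rules as its vocabulary and apply them to new text, rather than storing a compressed string; the sketch covers only the compression view used in the example.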