From 553b86cac27118501d9590251a19feb0a36f3e57 Mon Sep 17 00:00:00 2001
From: GassiGiuseppe
Date: Wed, 17 Sep 2025 12:06:01 +0200
Subject: [PATCH] Resources file updated with Byte-Pair Encoding a technique we will use to tokenize the engress' words

---
 docs/RESOURCES.md | 97 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)

diff --git a/docs/RESOURCES.md b/docs/RESOURCES.md
index e69de29..65c47d6 100644
--- a/docs/RESOURCES.md
+++ b/docs/RESOURCES.md
@@ -0,0 +1,97 @@
+# Byte-Pair Encoding (BPE)
+
+## Overview
+Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
+Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and RoBERTa.
+
+---
+
+## Key Idea
+BPE works by iteratively replacing the most frequent pair of adjacent symbols (initially characters) with a new symbol.
+Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
+
+---
+
+## Algorithm Steps
+1. **Initialization**
+   - Treat each character of the input text as a token.
+
+2. **Find Frequent Pairs**
+   - Count all adjacent token pairs in the sequence.
+
+3. **Merge Most Frequent Pair**
+   - Replace every occurrence of the most frequent pair with a new symbol not used in the text.
+
+4. **Repeat**
+   - Continue until no pair occurs more than once or the desired vocabulary size is reached.
+
+---
+
+## Example
+
+Suppose the data to be encoded is:
+
+```
+aaabdaaabac
+```
+
+### Step 1: Merge `"aa"`
+Most frequent pair: `"aa"` → replace with `"Z"`
+
+```
+ZabdZabac
+Z = aa
+```
+
+---
+
+### Step 2: Merge `"ab"`
+Most frequent pair: `"ab"` (tied with `"Za"`, both occur twice) → replace with `"Y"`
+
+```
+ZYdZYac
+Y = ab
+Z = aa
+```
+
+---
+
+### Step 3: Merge `"ZY"`
+Most frequent pair: `"ZY"` → replace with `"X"`
+
+```
+XdXac
+X = ZY
+Y = ab
+Z = aa
+```
+
+---
+
+At this point, no pairs occur more than once, so the process stops.
+
+---
+
+## Decompression
+To recover the original data, replacements are applied in **reverse order**:
+
+```
+XdXac
+→ ZYdZYac
+→ ZabdZabac
+→ aaabdaaabac
+```
+
+---
+
+## Advantages
+- **Efficient vocabulary building**: reduces the need for massive word lists.
+- **Handles rare words**: breaks them into meaningful subword units.
+- **Balances character- and word-level tokenization**.
+
+---
+
+## Limitations
+- Does not consider linguistic meaning; merges are frequency-based.
+- May create tokens that are not linguistically natural.
+- Vocabulary is fixed after training.
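
As a companion to the patch above, the Algorithm Steps, Example, and Decompression sections of `docs/RESOURCES.md` can be sketched in Python. This is only a minimal sketch and not part of the commit. The names `count_pairs`, `merge_pair`, `bpe_compress`, `bpe_expand`, and the `max_merges` parameter are hypothetical. Single uppercase letters stand in for the new symbols, which assumes the input contains few of them. Ties between equally frequent pairs are broken by comparing the pairs themselves, an arbitrary rule chosen here because it happens to reproduce the merge order of the worked example.

```python
from collections import Counter


def count_pairs(tokens):
    """Count adjacent token pairs; overlapping occurrences are all counted."""
    return Counter(zip(tokens, tokens[1:]))


def merge_pair(tokens, pair, new_symbol):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bpe_compress(text, max_merges=None,
                 new_symbols="ZYXWVUTSRQPONMLKJIHGFEDCBA"):
    """Return (tokens, merges); merges are listed in the order they were made."""
    tokens = list(text)  # Step 1: every character starts as its own token
    merges = []
    # Assumption: only symbols absent from the input may be used as new symbols.
    fresh = (s for s in new_symbols if s not in text)
    while max_merges is None or len(merges) < max_merges:
        pairs = count_pairs(tokens)  # Step 2
        if not pairs:
            break
        # Assumption: ties are broken by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        if pairs[best] < 2:
            break  # no pair occurs more than once
        symbol = next(fresh, None)
        if symbol is None:
            break  # ran out of unused symbols
        tokens = merge_pair(tokens, best, symbol)  # Step 3
        merges.append((symbol, best))
    return tokens, merges


def bpe_expand(tokens, merges):
    """Undo the merges in reverse order to recover the original text."""
    text = "".join(tokens)
    for symbol, (left, right) in reversed(merges):
        text = text.replace(symbol, left + right)
    return text


tokens, merges = bpe_compress("aaabdaaabac")
print("".join(tokens))             # XdXac
print(bpe_expand(tokens, merges))  # aaabdaaabac
```

Passing `max_merges` caps the number of learned merges, which plays the role of the desired vocabulary size in the Repeat step. A tokenizer for real text would normally count pairs over a word-frequency table rather than one raw string, so this sketch mirrors the compression example only.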