diff --git a/docs/RESOURCES.md b/docs/RESOURCES.md
index 225b83c..3f79952 100644
--- a/docs/RESOURCES.md
+++ b/docs/RESOURCES.md
@@ -1,20 +1,22 @@
-# Byte-Pair Encoding (BPE)
+# Resources
 
-## Overview
+## Byte-Pair Encoding (BPE)
+
+### Overview
 
 Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
 Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and BERT.
 
 ---
 
-## Key Idea
+### Key Idea
 
 BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.
 Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
 
 ---
 
-## Algorithm Steps
+### Algorithm Steps
 
 1. **Initialization**
    - Treat each character of the input text as a token.
@@ -30,7 +32,7 @@ Over time, frequent character sequences (e.g., common morphemes, prefixes, suffi
 
 ---
 
-## Example
+### Example
 
 Suppose the data to be encoded is:
 
@@ -38,7 +40,7 @@ Suppose the data to be encoded is:
 ```
 aaabdaaabac
 ```
-### Step 1: Merge `"aa"`
+#### Step 1: Merge `"aa"`
 
 Most frequent pair: `"aa"` → replace with `"Z"`
 
@@ -49,7 +51,7 @@ Z = aa
 
 ---
 
-### Step 2: Merge `"ab"`
+#### Step 2: Merge `"ab"`
 
 Most frequent pair: `"ab"` → replace with `"Y"`
 
@@ -61,7 +63,7 @@ Z = aa
 
 ---
 
-### Step 3: Merge `"ZY"`
+#### Step 3: Merge `"ZY"`
 
 Most frequent pair: `"ZY"` → replace with `"X"`
 
@@ -78,7 +80,7 @@ At this point, no pairs occur more than once, so the process stops.
 
 ---
 
-## Decompression
+### Decompression
 
 To recover the original data, replacements are applied in **reverse order**:
 
@@ -91,7 +93,7 @@ XdXac
 
 ---
 
-## Advantages
+### Advantages
 
 - **Efficient vocabulary building**: reduces the need for massive word lists.
 - **Handles rare words**: breaks them into meaningful subword units.
@@ -99,7 +101,7 @@ XdXac
 
 ---
 
-## Limitations
+### Limitations
 
 - Does not consider linguistic meaning—merges are frequency-based.
 - May create tokens that are not linguistically natural.
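
Below is a minimal, self-contained sketch of the compression and decompression procedure that the Example and Decompression sections in the patched document walk through, using the same `aaabdaaabac` input. It is illustrative only: the function names `bpe_compress` and `bpe_decompress` are invented for this sketch, replacement symbols are assumed to be single unused uppercase letters, and ties between equally frequent pairs are broken by first occurrence, so the printed merge table can differ from the hand-worked steps even though the compressed string and the round trip come out the same.

```python
# Sketch of the BPE example above: merge the most frequent adjacent pair
# until no pair occurs more than once, then undo the merges in reverse.
# Assumptions (not from the document): single uppercase replacement symbols,
# ties between equally frequent pairs broken by first occurrence.
from collections import Counter


def bpe_compress(data: str) -> tuple[str, list[tuple[str, str]]]:
    """Repeatedly replace the most frequent adjacent pair with a fresh symbol."""
    merges: list[tuple[str, str]] = []          # (symbol, replaced pair), in merge order
    unused = [c for c in "ZYXWVUTSRQPONMLKJIHGFEDCBA" if c not in data]
    while True:
        pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
        if not pairs or not unused:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                           # no pair occurs more than once: stop
            break
        symbol = unused.pop(0)
        merges.append((symbol, pair))
        data = data.replace(pair, symbol)
    return data, merges


def bpe_decompress(data: str, merges: list[tuple[str, str]]) -> str:
    """Apply the recorded replacements in reverse order to recover the input."""
    for symbol, pair in reversed(merges):
        data = data.replace(symbol, pair)
    return data


if __name__ == "__main__":
    compressed, merges = bpe_compress("aaabdaaabac")
    print(compressed, merges)                   # compressed string (XdXac here) and merge table
    assert bpe_decompress(compressed, merges) == "aaabdaaabac"
```

The merge table doubles as a toy vocabulary: each recorded pair is one learned subword unit, which is the same mechanism NLP tokenizers rely on when BPE is run over a training corpus rather than a single string.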