From 1c715dc5694a96bd8fc92a447ba1f3aad4be8b45 Mon Sep 17 00:00:00 2001
From: GassiGiuseppe
Date: Thu, 18 Sep 2025 20:24:11 +0200
Subject: [PATCH] Typo correction in the markdown

---
 docs/RESOURCES.md | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/docs/RESOURCES.md b/docs/RESOURCES.md
index 225b83c..3f79952 100644
--- a/docs/RESOURCES.md
+++ b/docs/RESOURCES.md
@@ -1,20 +1,22 @@
-# Byte-Pair Encoding (BPE)
+# Resources
 
-## Overview
+## Byte-Pair Encoding (BPE)
+
+### Overview
 
 Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
 Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and BERT.
 
 ---
 
-## Key Idea
+### Key Idea
 
 BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.
 Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
 
 ---
 
-## Algorithm Steps
+### Algorithm Steps
 
 1. **Initialization**
    - Treat each character of the input text as a token.
@@ -30,7 +32,7 @@ Over time, frequent character sequences (e.g., common morphemes, prefixes, suffi
 
 ---
 
-## Example
+### Example
 
 Suppose the data to be encoded is:
 
@@ -38,7 +40,7 @@ Suppose the data to be encoded is:
 aaabdaaabac
 ```
 
-### Step 1: Merge `"aa"`
+#### Step 1: Merge `"aa"`
 
 Most frequent pair: `"aa"` → replace with `"Z"`
 
@@ -49,7 +51,7 @@ Z = aa
 
 ---
 
-### Step 2: Merge `"ab"`
+#### Step 2: Merge `"ab"`
 
 Most frequent pair: `"ab"` → replace with `"Y"`
 
@@ -61,7 +63,7 @@ Z = aa
 
 ---
 
-### Step 3: Merge `"ZY"`
+#### Step 3: Merge `"ZY"`
 
 Most frequent pair: `"ZY"` → replace with `"X"`
 
@@ -78,7 +80,7 @@ At this point, no pairs occur more than once, so the process stops.
 
 ---
 
-## Decompression
+### Decompression
 
 To recover the original data, replacements are applied in **reverse order**:
 
@@ -91,7 +93,7 @@ XdXac
 
 ---
 
-## Advantages
+### Advantages
 
 - **Efficient vocabulary building**: reduces the need for massive word lists.
 - **Handles rare words**: breaks them into meaningful subword units.
@@ -99,7 +101,7 @@ XdXac
 
 ---
 
-## Limitations
+### Limitations
 
 - Does not consider linguistic meaning—merges are frequency-based.
 - May create tokens that are not linguistically natural.
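
The worked example carried through this diff (`aaabdaaabac` compressed to `XdXac`, then restored by undoing the replacements in reverse order) can be checked with a minimal Python sketch. The names `count_pairs`, `compress`, `decompress`, and `MERGES` are illustrative assumptions and are not part of the repository. The merge table is copied from the example's three steps rather than learned from the data, because after the first merge the pairs `"ab"` and `"Za"` both occur twice, so a learner could break that tie differently from the document.

```python
from collections import Counter


def count_pairs(text):
    """Count adjacent symbol pairs, including overlapping occurrences."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def compress(text, merges):
    """Apply each (pair, symbol) replacement in the order given."""
    for pair, symbol in merges:
        text = text.replace(pair, symbol)
    return text


def decompress(text, merges):
    """Undo the replacements in reverse order."""
    for pair, symbol in reversed(merges):
        text = text.replace(symbol, pair)
    return text


# Merge table copied from Steps 1-3 of the example (hypothetical name).
MERGES = [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]

data = "aaabdaaabac"
packed = compress(data, MERGES)

print(count_pairs(data).most_common(1))  # [('aa', 4)]
print(packed)                            # XdXac
print(decompress(packed, MERGES))        # aaabdaaabac
```

Passing `MERGES[:1]` or `MERGES[:2]` to `compress` reproduces the intermediate strings `ZabdZabac` and `ZYdZYac` shown in Steps 1 and 2.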