Typo correction in the markdown

GassiGiuseppe 2025-09-18 20:24:11 +02:00
parent 6686b47328
commit 1c715dc569

@@ -1,20 +1,22 @@
-# Byte-Pair Encoding (BPE)
-## Overview
+# Resources
+## Byte-Pair Encoding (BPE)
+### Overview
 Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
 Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and BERT.
 ---
-## Key Idea
+### Key Idea
 BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.
 Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
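
The key idea above can be sketched as a single merge step in Python. This is an illustrative sketch, not code from the repository: the helper names `most_frequent_pair` and `merge_pair` and the sample string are assumptions, and ties between equally frequent pairs are broken by first occurrence, which the text does not specify.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the most common adjacent pair of symbols, or None if there is none.

    Pairs are counted with overlap (e.g. "aaa" contains ("a", "a") twice).
    Ties go to the pair seen first (an assumption; the text does not say).
    """
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# One merge step on a hypothetical sample text.
tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)
tokens = merge_pair(tokens, pair, "".join(pair))
```

Repeating this step until a target vocabulary size is reached, or until no pair occurs more than once, gives the full procedure.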
 ---
-## Algorithm Steps
+### Algorithm Steps
 1. **Initialization**
    - Treat each character of the input text as a token.
@@ -30,7 +32,7 @@ Over time, frequent character sequences (e.g., common morphemes, prefixes, suffi
 ---
-## Example
+### Example
 Suppose the data to be encoded is:
@@ -38,7 +40,7 @@ Suppose the data to be encoded is:
 aaabdaaabac
``` ```
-### Step 1: Merge `"aa"`
+#### Step 1: Merge `"aa"`
 Most frequent pair: `"aa"` → replace with `"Z"`
@@ -49,7 +51,7 @@ Z = aa
 ---
-### Step 2: Merge `"ab"`
+#### Step 2: Merge `"ab"`
 Most frequent pair: `"ab"` → replace with `"Y"`
@@ -61,7 +63,7 @@ Z = aa
 ---
-### Step 3: Merge `"ZY"`
+#### Step 3: Merge `"ZY"`
 Most frequent pair: `"ZY"` → replace with `"X"`
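
The three merges in this example can be replayed with a short Python sketch. The merge list is copied from the steps shown in the diff. The function name `compress` is an assumption for illustration, and the pairs are hard-coded rather than chosen by frequency, so this only reproduces the worked example.

```python
# Merge rules in the order the example applies them: (pair, replacement symbol).
merges = [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]


def compress(data, merges):
    """Apply each merge in order and return every intermediate string."""
    states = [data]
    for pair, symbol in merges:
        # str.replace substitutes non-overlapping occurrences, left to right.
        data = data.replace(pair, symbol)
        states.append(data)
    return states


states = compress("aaabdaaabac", merges)
print(" -> ".join(states))
# prints: aaabdaaabac -> ZabdZabac -> ZYdZYac -> XdXac
```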
@@ -78,7 +80,7 @@ At this point, no pairs occur more than once, so the process stops.
 ---
-## Decompression
+### Decompression
 To recover the original data, replacements are applied in **reverse order**:
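
Reverse-order expansion can be sketched in Python as follows. The function name `decompress` is hypothetical, the merge list is taken from the example steps, and the sketch assumes the replacement symbols (`Z`, `Y`, `X`) never occur in the original data.

```python
# Merge rules from the example, in the order they were applied.
merges = [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]


def decompress(data, merges):
    """Undo the merges by expanding the most recently introduced symbol first."""
    for pair, symbol in reversed(merges):
        data = data.replace(symbol, pair)
    return data


restored = decompress("XdXac", merges)
print(restored)
# prints: aaabdaaabac
```

The order matters because `X` expands to `ZY`, which only becomes plain characters once `Y` and `Z` are expanded afterwards.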
@@ -91,7 +93,7 @@ XdXac
 ---
-## Advantages
+### Advantages
 - **Efficient vocabulary building**: reduces the need for massive word lists.
 - **Handles rare words**: breaks them into meaningful subword units.
@@ -99,7 +101,7 @@ XdXac
 ---
-## Limitations
+### Limitations
 - Does not consider linguistic meaning—merges are frequency-based.
 - May create tokens that are not linguistically natural.