Fixed Markdown violations
parent cececa14ce
commit 72eb937b47
@ -1,28 +1,31 @@
|
|||||||
# Byte-Pair Encoding (BPE)
|
# Byte-Pair Encoding (BPE)
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
|
|
||||||
|
Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
|
||||||
Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and RoBERTa.
|
Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and RoBERTa.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Key Idea
|
## Key Idea
|
||||||
BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.
|
|
||||||
|
BPE works by iteratively replacing the most frequent pair of symbols (initially characters) with a new symbol.
|
||||||
Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
|
Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Algorithm Steps
|
## Algorithm Steps
|
||||||
1. **Initialization**
|
|
||||||
|
1. **Initialization**
|
||||||
- Treat each character of the input text as a token.
|
- Treat each character of the input text as a token.
|
||||||
|
|
||||||
2. **Find Frequent Pairs**
|
2. **Find Frequent Pairs**
|
||||||
- Count all adjacent token pairs in the sequence.
|
- Count all adjacent token pairs in the sequence.
|
||||||
|
|
||||||
3. **Merge Most Frequent Pair**
|
3. **Merge Most Frequent Pair**
|
||||||
- Replace the most frequent pair with a new symbol not used in the text.
|
- Replace the most frequent pair with a new symbol not used in the text.
|
||||||
|
|
||||||
4. **Repeat**
|
4. **Repeat**
|
||||||
- Continue until no pair occurs more than once or a desired vocabulary size is reached.
|
- Continue until no pair occurs more than once or a desired vocabulary size is reached.
|
||||||
|
|
||||||
---
|
---
|
||||||
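The four steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `merge_pair`, `bpe_compress`, and `fresh_symbols` are hypothetical, pairs are counted over overlapping windows, and ties are broken toward the lexicographically larger pair. The text does not specify a counting or tie-breaking rule, so both are assumptions.

```python
from collections import Counter


def merge_pair(tokens, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def bpe_compress(text, fresh_symbols="ZYXWVU"):
    """Return (tokens, rules); `fresh_symbols` is an assumed pool of unused symbols."""
    if any(symbol in text for symbol in fresh_symbols):
        raise ValueError("replacement symbols must not occur in the text")
    tokens = list(text)  # 1. every character starts as its own token
    rules = []           # (new_symbol, pair) in the order the merges were made
    for symbol in fresh_symbols:
        counts = Counter(zip(tokens, tokens[1:]))  # 2. count adjacent pairs
        if not counts:
            break
        # Assumed tie-break: highest count first, then the larger pair.
        pair, freq = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < 2:  # 4. stop once no pair occurs more than once
            break
        tokens = merge_pair(tokens, pair, symbol)  # 3. merge the chosen pair
        rules.append((symbol, pair))
    return tokens, rules


tokens, rules = bpe_compress("aaabdaaabac")
print("".join(tokens))  # XdXac
print(rules)  # [('Z', ('a', 'a')), ('Y', ('a', 'b')), ('X', ('Z', 'Y'))]
```

In this sketch the loop also stops when the pool of fresh symbols runs out, which plays the role of a vocabulary-size limit.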
@ -31,14 +34,15 @@ Over time, frequent character sequences (e.g., common morphemes, prefixes, suffi
|
|||||||
|
|
||||||
Suppose the data to be encoded is:
|
Suppose the data to be encoded is:
|
||||||
|
|
||||||
```
|
```text
|
||||||
aaabdaaabac
|
aaabdaaabac
|
||||||
```
|
```
|
||||||
|
|
||||||
### Step 1: Merge `"aa"`
|
### Step 1: Merge `"aa"`
|
||||||
|
|
||||||
Most frequent pair: `"aa"` → replace with `"Z"`
|
Most frequent pair: `"aa"` → replace with `"Z"`
|
||||||
|
|
||||||
```
|
```text
|
||||||
ZabdZabac
|
ZabdZabac
|
||||||
Z = aa
|
Z = aa
|
||||||
```
|
```
|
||||||
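To see why `"aa"` is chosen first, the adjacent pairs can be counted directly. The snippet below is a small check that assumes overlapping windows are counted, so `"aaa"` contributes two occurrences of `"aa"`. The text does not state the counting convention.

```python
from collections import Counter

data = "aaabdaaabac"

# Slide a two-character window over the data and count each pair.
pair_counts = Counter(a + b for a, b in zip(data, data[1:]))

print(pair_counts.most_common(2))  # [('aa', 4), ('ab', 2)]
```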
@ -46,9 +50,10 @@ Z = aa
|
|||||||
---
|
---
|
||||||
|
|
||||||
### Step 2: Merge `"ab"`
|
### Step 2: Merge `"ab"`
|
||||||
|
|
||||||
Most frequent pair: `"ab"` → replace with `"Y"`
|
Most frequent pair: `"ab"` → replace with `"Y"`
|
||||||
|
|
||||||
```
|
```text
|
||||||
ZYdZYac
|
ZYdZYac
|
||||||
Y = ab
|
Y = ab
|
||||||
Z = aa
|
Z = aa
|
||||||
@ -57,9 +62,10 @@ Z = aa
|
|||||||
---
|
---
|
||||||
|
|
||||||
### Step 3: Merge `"ZY"`
|
### Step 3: Merge `"ZY"`
|
||||||
|
|
||||||
Most frequent pair: `"ZY"` → replace with `"X"`
|
Most frequent pair: `"ZY"` → replace with `"X"`
|
||||||
|
|
||||||
```
|
```text
|
||||||
XdXac
|
XdXac
|
||||||
X = ZY
|
X = ZY
|
||||||
Y = ab
|
Y = ab
|
||||||
@ -73,9 +79,10 @@ At this point, no pairs occur more than once, so the process stops.
|
|||||||
---
|
---
|
||||||
|
|
||||||
## Decompression
|
## Decompression
|
||||||
|
|
||||||
To recover the original data, replacements are applied in **reverse order**:
|
To recover the original data, replacements are applied in **reverse order**:
|
||||||
|
|
||||||
```
|
```text
|
||||||
XdXac
|
XdXac
|
||||||
→ ZYdZYac
|
→ ZYdZYac
|
||||||
→ ZabdZabac
|
→ ZabdZabac
|
||||||
@ -85,13 +92,15 @@ XdXac
|
|||||||
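The reverse-order expansion can be sketched in a few lines of Python. The function name `bpe_decompress` and the `rules` list are illustrative only; the rules are copied from the replacement table built in the example. Plain `str.replace` is enough here because each replacement symbol is a single character that never occurs in the original data.

```python
def bpe_decompress(compressed, rules):
    """Undo the merges, newest rule first, so nested symbols expand correctly."""
    text = compressed
    for symbol, expansion in reversed(rules):
        text = text.replace(symbol, expansion)
    return text


# Rules in the order they were created during compression.
rules = [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]

print(bpe_decompress("XdXac", rules))  # aaabdaaabac
```

Expanding `X` first matters: `X` stands for `ZY`, so `Z` and `Y` only reappear after `X` has been undone.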
---
|
---
|
||||||
|
|
||||||
## Advantages
|
## Advantages
|
||||||
- **Efficient vocabulary building**: reduces the need for massive word lists.
|
|
||||||
- **Handles rare words**: breaks them into meaningful subword units.
|
- **Efficient vocabulary building**: reduces the need for massive word lists.
|
||||||
- **Balances character- and word-level tokenization**.
|
- **Handles rare words**: breaks them into meaningful subword units.
|
||||||
|
- **Balances character- and word-level tokenization**.
|
||||||
|
|
||||||
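As a rough illustration of the rare-word point, the sketch below applies an already learned merge list to text that was not seen during training. It assumes the tokenizer convention in which a merged token is named by the concatenation of its parts rather than by a fresh symbol. The function `apply_merges` and the `merges` list are hypothetical, with the merges taken from the compression example.

```python
def apply_merges(text, merges):
    """Tokenize `text` by replaying learned merges in the order they were learned."""
    tokens = list(text)  # start from single characters
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # fuse the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Merges from the worked example, written as concatenated strings.
merges = [("a", "a"), ("a", "b"), ("aa", "ab")]

print(apply_merges("aaabx", merges))  # ['aaab', 'x']
print(apply_merges("abd", merges))    # ['ab', 'd']
```

The unseen character `x` simply stays a single-character token, while the familiar sequence around it is still covered by learned subword units.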
---
|
---
|
||||||
|
|
||||||
## Limitations
|
## Limitations
|
||||||
- Does not consider linguistic meaning—merges are frequency-based.
|
|
||||||
- May create tokens that are not linguistically natural.
|
- Does not consider linguistic meaning—merges are frequency-based.
|
||||||
- Vocabulary is fixed after training.
|
- May create tokens that are not linguistically natural.
|
||||||
|
- Vocabulary is fixed after training.
|
||||||
|
|||||||