Typo correction in the markdown
parent 6686b47328
commit 1c715dc569
@@ -1,20 +1,22 @@
# Byte-Pair Encoding (BPE)
# Resources

## Overview
## Byte-Pair Encoding (BPE)

### Overview

Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and RoBERTa (BERT uses the closely related WordPiece algorithm).

---

## Key Idea
### Key Idea

BPE works by iteratively replacing the most frequent pair of adjacent symbols (initially characters) with a new symbol.
Over time, frequent character sequences (e.g., common morphemes, prefixes, suffixes) are merged into single tokens.

---
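A minimal sketch of one merge iteration may make the key idea concrete. The helper names `most_frequent_pair` and `merge_pair` are hypothetical, the sample string is only illustrative input, and two details are assumptions rather than part of the description above: overlapping occurrences are all counted, and ties go to the pair seen first.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    counts = Counter(zip(tokens, tokens[1:]))
    if not counts:
        return None
    # Assumption: ties are broken in favor of the pair encountered first.
    return max(counts, key=counts.get)


def merge_pair(tokens, pair, new_symbol):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2  # skip both symbols of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = list("aaabdaaabac")             # start from single characters
pair = most_frequent_pair(tokens)        # ('a', 'a')
tokens = merge_pair(tokens, pair, "Z")   # "Z" stands in for the new symbol
print("".join(tokens))                   # ZabdZabac
```

Repeating these two calls until no pair occurs more than once gives the full procedure; a real tokenizer would typically stop at a target vocabulary size instead.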
## Algorithm Steps
### Algorithm Steps

1. **Initialization**
   - Treat each character of the input text as a token.
@@ -30,7 +32,7 @@ Over time, frequent character sequences (e.g., common morphemes, prefixes, suffi

---

## Example
### Example

Suppose the data to be encoded is:
@@ -38,7 +40,7 @@ Suppose the data to be encoded is:
aaabdaaabac
```
### Step 1: Merge `"aa"`
#### Step 1: Merge `"aa"`

Most frequent pair: `"aa"` → replace with `"Z"`
@@ -49,7 +51,7 @@ Z = aa

---

### Step 2: Merge `"ab"`
#### Step 2: Merge `"ab"`

Most frequent pair: `"ab"` → replace with `"Y"`
@@ -61,7 +63,7 @@ Z = aa

---

### Step 3: Merge `"ZY"`
#### Step 3: Merge `"ZY"`

Most frequent pair: `"ZY"` → replace with `"X"`
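The three merges can be replayed with a short sketch. It applies the pairs in the order the steps name them rather than searching for the most frequent pair, because in step 2 the pairs `"Za"` and `"ab"` both occur twice and picking `"ab"` is a tie-break. The names `MERGES` and `apply_merges` are hypothetical.

```python
# Merge rules in the order the steps apply them: (pair, replacement symbol).
MERGES = [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]


def apply_merges(text, merges):
    """Apply each merge in turn and return every intermediate string."""
    history = [text]
    for pair, symbol in merges:
        # str.replace substitutes non-overlapping matches from left to right.
        text = text.replace(pair, symbol)
        history.append(text)
    return history


for state in apply_merges("aaabdaaabac", MERGES):
    print(state)
# aaabdaaabac
# ZabdZabac
# ZYdZYac
# XdXac
```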
@@ -78,7 +80,7 @@ At this point, no pairs occur more than once, so the process stops.

---

## Decompression
### Decompression

To recover the original data, replacements are applied in **reverse order**:
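As a rough sketch of that reversal, assuming the merge table from the example (`Z = aa`, `Y = ab`, `X = ZY`) and assuming the symbols `X`, `Y`, and `Z` never occur in the original data, decompression could look like this. The names `MERGES` and `decompress` are hypothetical.

```python
# Merge rules in the order they were created: (pair, replacement symbol).
MERGES = [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]


def decompress(text, merges):
    """Undo the merges, starting with the most recently created symbol."""
    for pair, symbol in reversed(merges):
        text = text.replace(symbol, pair)
    return text


print(decompress("XdXac", MERGES))  # aaabdaaabac
```

The order matters because `X` expands to `ZY`, which only becomes expandable once `X` has been replaced.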
@@ -91,7 +93,7 @@ XdXac

---

## Advantages
### Advantages

- **Efficient vocabulary building**: reduces the need for massive word lists.
- **Handles rare words**: breaks them into smaller subword units that are already in the vocabulary.
@@ -99,7 +101,7 @@ XdXac

---

## Limitations
### Limitations

- Does not consider linguistic meaning; merges are purely frequency-based.
- May create tokens that are not linguistically natural.