# Resources

## Byte-Pair Encoding (BPE)
### Overview

Byte-Pair Encoding (BPE) is a simple but powerful text compression and tokenization algorithm.
Originally introduced as a data compression method, it has been widely adopted in **Natural Language Processing (NLP)** to build subword vocabularies for models such as GPT and BERT.

---

### Key Idea

BPE works by iteratively replacing the most frequent pair of symbols (initially individual characters) with a new symbol.
Over time, frequent character sequences (e.g., common morphemes, prefixes, and suffixes) are merged into single tokens.

---

### Algorithm Steps

1. **Initialization**
   - Treat each character of the input text as a token.
2. **Find Frequent Pairs**
   - Count all adjacent token pairs in the sequence.
3. **Merge the Most Frequent Pair**
   - Replace every occurrence of the most frequent pair with a new symbol not used in the text.
4. **Repeat**
   - Continue until no pair occurs more than once or the desired vocabulary size is reached.

---
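The steps above can be sketched in Python. This is a minimal illustration, not a production tokenizer; note in particular that when two pairs tie in frequency, the choice between them is arbitrary, so the intermediate merges may differ from the hand-worked example that follows while still reaching the same compressed result.

```python
from collections import Counter


def bpe_compress(text: str, max_merges: int = 10) -> tuple[str, dict[str, str]]:
    """Repeatedly merge the most frequent adjacent pair into a fresh symbol.

    Returns the compressed string and the merge table (symbol -> merged pair).
    """
    tokens = list(text)
    merges: dict[str, str] = {}
    fresh = iter("ZYXWVUTSRQPO")  # fresh symbols assumed absent from the input
    for _ in range(max_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair occurs more than once -> stop
            break
        symbol = next(fresh)
        merges[symbol] = a + b
        # Rewrite the token stream, replacing each (a, b) occurrence left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(symbol)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return "".join(tokens), merges


compressed, merges = bpe_compress("aaabdaaabac")
print(compressed)  # → XdXac (three merges, as in the example below)
```

Real subword tokenizers work the same way, except merges are learned over a large corpus and new token IDs replace the single-letter placeholder symbols used here.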
### Example

Suppose the data to be encoded is:

```text
aaabdaaabac
```
#### Step 1: Merge `"aa"`

Most frequent pair: `"aa"` → replace with `"Z"`:

```text
ZabdZabac
Z = aa
```

---

#### Step 2: Merge `"ab"`

Most frequent pair: `"ab"` → replace with `"Y"`:

```text
ZYdZYac
Y = ab
Z = aa
```

---

#### Step 3: Merge `"ZY"`

Most frequent pair: `"ZY"` → replace with `"X"`:

```text
XdXac
X = ZY
Y = ab
Z = aa
```

---

At this point, no pairs occur more than once, so the process stops.

---
### Decompression

To recover the original data, replacements are applied in **reverse order**:

```text
XdXac
→ ZYdZYac
→ ZabdZabac
→ aaabdaaabac
```
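Since each later merge may reference an earlier symbol (e.g., `X = ZY`), expanding symbols in reverse creation order recovers the input. A minimal sketch, using the merge table from the worked example:

```python
def bpe_decompress(compressed: str, merges: dict[str, str]) -> str:
    """Undo BPE merges by expanding symbols in reverse creation order."""
    text = compressed
    # dicts preserve insertion order, so reversing gives newest merge first.
    for symbol, pair in reversed(list(merges.items())):
        text = text.replace(symbol, pair)
    return text


# Merge table from the example above (insertion order = creation order).
merges = {"Z": "aa", "Y": "ab", "X": "ZY"}
print(bpe_decompress("XdXac", merges))  # → aaabdaaabac
```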

---

### Advantages

- **Efficient vocabulary building**: reduces the need for massive word lists.
- **Handles rare words**: breaks them into meaningful subword units.
- **Balances character- and word-level tokenization**.

---

### Limitations

- Merges are purely frequency-based and do not consider linguistic meaning.
- May create tokens that are not linguistically natural.
- The vocabulary is fixed after training.