# Transformers

Transformers are very similar to [`RNNs`](./../8-Recurrent-Networks/INDEX.md)
in terms of usage (machine translation, text generation, sequence-to-sequence
tasks, sentiment analysis, word prediction, ...),
but they differ in how they process data.

While `RNNs` have a recurrent part that processes the input sequentially,
`Transformers` process it all at once, which makes them easier to parallelize
and **effectively faster despite being quadratically complex** in the
sequence length.

However, this comes at the cost of Transformers not having an *infinite
context*. They usually have no
memory[^infinite-transformer-context][^transformer-xl], meaning they need to
resort to tricks such as **autoregression** or fixed context windows.

## Basic Technologies

### Positional Encoding

Since all words are processed at once in a Transformer, they may lose their
positional information, making them less informative.

By using a Positional Encoding, we add this information back to each word.

There are several ways to add such an encoding to words; among them we find
(a small sketch of the sinusoidal variant follows the list):

- **Learnt Encoding**:\
  Use another network to learn how to add a positional encoding
  to the word embedding.
- **Sinusoidal Positional Encoding[^attention-is-all-you-need]**:\
  This comes from ***"Attention Is All You Need"***[^attention-is-all-you-need]
  and it's a fixed function that alternately adds sines and cosines
  to the word embeddings:

  $$
  \begin{aligned}
  PE_{(pos, 2i)} &= \sin{\left(\frac{pos}{10000^{2i/d_{model}}}\right)} \\
  PE_{(pos, 2i+1)} &= \cos{\left(\frac{pos}{10000^{2i/d_{model}}}\right)}
  \end{aligned}
  $$

- **RoPE[^rope-paper][^hugging-face-pe]**:\
  This algorithm uses the same function as above, but instead of adding it,
  it uses it to rotate (multiply) the vectors. The idea is that rotating a
  vector doesn't change its magnitude, and hopefully not its latent meaning
  either.

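A minimal sketch of the sinusoidal variant in NumPy; the function name
`sinusoidal_positional_encoding` and the shapes are illustrative assumptions,
not taken from the paper:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix to be added to the word embeddings."""
    pos = np.arange(seq_len)[:, None]        # token positions, shape (S, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices, shape (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get the cosine
    return pe

embeddings = np.random.randn(12, 64)         # dummy (S=12 tokens, d_model=64) embeddings
embeddings_with_position = embeddings + sinusoidal_positional_encoding(12, 64)
```
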
### Feed Forward

This is just a couple of linear layers (with a non-linearity in between):
the first one expands the dimensionality of the embedding (usually by 4
times), following the intuition of Cover's theorem, and the second one shrinks
it back to the original embedding size. A minimal sketch follows.

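A minimal sketch of this position-wise feed-forward block in NumPy; the 4x
expansion and the ReLU follow the original paper, while the function name
`feed_forward` and the random initialisation are illustrative assumptions:

```python
import numpy as np

def feed_forward(x: np.ndarray, w1: np.ndarray, b1: np.ndarray,
                 w2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """x: (S, Em) token embeddings -> (S, Em), applied independently per token."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # expand to 4*Em and apply ReLU
    return hidden @ w2 + b2                 # shrink back to Em

Em = 64
w1, b1 = np.random.randn(Em, 4 * Em) * 0.02, np.zeros(4 * Em)  # expansion
w2, b2 = np.random.randn(4 * Em, Em) * 0.02, np.zeros(Em)      # projection back
out = feed_forward(np.random.randn(12, Em), w1, b1, w2, b2)    # (12, 64)
```
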
### Self Attention

For each attention head, this layer employs 3 matrices that compute the
Query, Key and Value vectors for each word embedding.

#### Steps

- Compute the $Q_i, K_i, V_i$ matrices from the input embeddings $X$

  $$
  \begin{aligned}
  Q_{i} &= X \times W_{Qi} \in \R^{S \times H}\\
  K_{i} &= X \times W_{Ki} \in \R^{S \times H}\\
  V_{i} &= X \times W_{Vi} \in \R^{S \times H}
  \end{aligned}
  $$

- Compute the head value

  $$
  \begin{aligned}
  Head_i = softmax\left(
  \frac{Q_{i} \times K_{i}^T}{\sqrt{H}}
  \right) \times V_{i}
  \in \R^{S \times H}
  \end{aligned}
  $$

- Concatenate all heads and multiply by a learnt matrix
  (a code sketch of all three steps follows the notes below)

  $$
  \begin{aligned}
  Heads &= concat(Head_1, \dots, Head_n) \in \R^{S \times (n \cdot H)} \\
  Out &= Heads \times W_{Heads} \in \R^{S \times Em}
  \end{aligned}
  $$

> [!NOTE]
> Legend for each notation:
>
> - $X$: matrix of input embeddings ($\in \R^{S \times Em}$)
> - $Em$: Embedding dimension
> - $H$: Head dimension
> - $S$: Sentence length (number of tokens)
> - $i$: Head index
> - $n$: Number of heads

> [!TIP]
> $H$ is usually smaller than $Em$ (it makes computation faster and more
> memory efficient), however this is not strictly necessary.
>
> We have shown several separate operations here; however, instead of making
> many small tensor multiplications, it's better to perform one big
> multiplication (computationally more efficient) and then split its result
> into its components.

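A minimal NumPy sketch of the three steps above; the names (`attention`,
`softmax`) and the random weights are illustrative assumptions. The optional
`mask` argument anticipates the Masking section, and passing a different
`kv_in` turns the same function into cross-attention (used further below):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, w_q, w_k, w_v, w_heads, mask=None):
    """q_in: (S_q, Em), kv_in: (S_kv, Em); each weight tensor has one slice per head."""
    heads = []
    for w_qi, w_ki, w_vi in zip(w_q, w_k, w_v):  # loop kept for clarity, not speed
        Q, K, V = q_in @ w_qi, kv_in @ w_ki, kv_in @ w_vi      # (S, H) each
        scores = Q @ K.T / np.sqrt(Q.shape[-1])                # (S_q, S_kv)
        if mask is not None:
            scores = np.where(mask, scores, -1e9)              # blocked positions ~ -inf
        heads.append(softmax(scores) @ V)                      # (S_q, H)
    return np.concatenate(heads, axis=-1) @ w_heads            # (S_q, Em)

S, Em, H, n = 12, 64, 16, 4
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.normal(0, 0.02, (n, Em, H)) for _ in range(3))
w_heads = rng.normal(0, 0.02, (n * H, Em))
x = rng.normal(size=(S, Em))
self_attended = attention(x, x, w_q, w_k, w_v, w_heads)        # self attention: Q, K, V all from x
```
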
### Cross-Attention

It's the same as Self Attention, however $Q_{i}$ is computed from the decoder
input, while $K_i$ and $V_i$ are computed from the output of the last encoder
(a small usage example, reusing the sketch above, follows the formulas):

$$
\begin{aligned}
Q_{i} &= X_{dec} \times W_{Qi} \in \R^{S_{dec} \times H}\\
K_{i} &= X_{enc} \times W_{Ki} \in \R^{S_{enc} \times H}\\
V_{i} &= X_{enc} \times W_{Vi} \in \R^{S_{enc} \times H}
\end{aligned}
$$

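Reusing the hypothetical `attention` function sketched in the Self Attention
section (same weights and `rng`), cross-attention only changes which tensor
feeds the query projection and which one feeds the key/value projections:

```python
# Continuing the Self Attention sketch above:
decoder_x = rng.normal(size=(10, Em))   # (S_dec, Em) decoder-side embeddings
encoder_out = self_attended             # (S_enc, Em) stand-in for the last encoder output
cross_attended = attention(decoder_x, encoder_out, w_q, w_k, w_v, w_heads)  # (S_dec, Em)
```
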
### Masking

In order to make sure that a decoder doesn't attend to future information when
processing past words, we use masks that guarantee information doesn't leak
into parts of the network that shouldn't see it.

We usually implement 4 kinds of masks (a sketch of how to build them follows
the list):



- **Padding Mask**:\
  This mask is useful to avoid computing attention over padding tokens.
- **Full Attention**:\
  This mask is useful in encoders. It allows bidirectional attention, letting
  words on the right add information to words on the left and vice-versa.
- **Causal Attention**:\
  This mask is useful in decoders. It prevents words on the right from leaking
  into the ones on their left. In other words, it prevents future words from
  affecting the meaning of past ones.
- **Prefix Attention**:\
  This mask is useful for some tasks in decoders. It allows some words to add
  information over the past ones. These words, however, are not generated by
  the decoder, but are part of its initial input (the prefix).

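A minimal sketch of how these boolean masks could be built in NumPy (`True`
means "may attend"); the helper names are illustrative assumptions, and the
resulting matrices are meant to be passed as `mask` to the `attention` sketch
above:

```python
import numpy as np

def full_mask(s: int) -> np.ndarray:
    return np.ones((s, s), dtype=bool)                  # every token sees every token

def causal_mask(s: int) -> np.ndarray:
    return np.tril(np.ones((s, s), dtype=bool))         # token k sees only tokens 0..k

def prefix_mask(s: int, prefix_len: int) -> np.ndarray:
    mask = causal_mask(s)
    mask[:, :prefix_len] = True                         # the prefix is fully visible
    return mask

def padding_mask(is_real_token: np.ndarray) -> np.ndarray:
    # is_real_token: (S,) bool vector, False where the token is padding
    return is_real_token[None, :] & is_real_token[:, None]

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```
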
## Basic Blocks

### Embedder

This layer is responsible for transforming the input (usually tokens) into
word embeddings, following these steps (a small sketch follows the list):

- one-hot encoding
- matrix multiplication to get the desired embedding
- inclusion of positional info

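A minimal sketch of an embedder in NumPy; in practice the one-hot
multiplication is implemented as a plain row lookup into the embedding matrix,
which is equivalent. The names and sizes are illustrative assumptions, and
`sinusoidal_positional_encoding` is the helper sketched earlier:

```python
import numpy as np

vocab_size, d_model = 1000, 64
embedding_matrix = np.random.randn(vocab_size, d_model) * 0.02   # learnt in a real model

def embed(token_ids: np.ndarray) -> np.ndarray:
    one_hot = np.eye(vocab_size)[token_ids]   # (S, vocab_size)
    x = one_hot @ embedding_matrix            # (S, d_model), same as embedding_matrix[token_ids]
    return x + sinusoidal_positional_encoding(len(token_ids), d_model)

x = embed(np.array([5, 42, 7, 999]))          # (4, 64)
```
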
### Encoder

It extracts meaning from the embedded vectors, looking both to their right
and to their left, by stacking (a code sketch of one block follows this
section):

- Self Attention
- Residual Connection
- Layer Normalization
- Feed Forward
- Residual Connection
- Layer Normalization (sometimes normalization is done before self attention
  instead, the so-called *pre-norm* variant)



Usually it is used to condition all decoders; however, if connected to a
`De-Embedding` block or other layers, it can be used stand-alone to
generate outputs.

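A minimal sketch of one encoder block, composing the pieces sketched above
(`attention`, `feed_forward`, `full_mask`); `layer_norm`, the post-norm
ordering and the parameter bundling are illustrative assumptions:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)            # learnt scale/shift omitted for brevity

def encoder_block(x, attn_weights, ffn_weights):
    """x: (S, Em). attn_weights = (w_q, w_k, w_v, w_heads); ffn_weights = (w1, b1, w2, b2)."""
    mask = full_mask(x.shape[0])                                   # encoders attend in both directions
    x = layer_norm(x + attention(x, x, *attn_weights, mask=mask))  # self attention + residual + norm
    x = layer_norm(x + feed_forward(x, *ffn_weights))              # feed forward + residual + norm
    return x
```
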
### Decoder

It extracts meaning from the output embedded vectors, usually left to right,
and conditions them on the last encoder's output, by stacking:

- Self Attention
- Residual Connection
- Layer Normalization
- Cross Attention
- Residual Connection
- Layer Normalization
- Feed Forward
- Residual Connection
- Layer Normalization (as in the encoder, normalization is sometimes done
  before self attention instead)



Usually this block is used to generate outputs autoregressively, meaning
that at inference time we'll only take $out_{k}$ as the actual output and
append it as $in_{k+1}$ (a small sketch of this loop follows the warning
below).

> [!WARNING]
> During training, we feed it the whole expected sequence, shifted
> by a `start` token, and predict the whole sequence again.
>
> So, **it isn't trained autoregressively**

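A minimal sketch of the autoregressive inference loop, assuming a hypothetical
`decoder_model` that maps a sequence of token ids to per-position next-token
probabilities; greedy decoding is used purely for illustration:

```python
import numpy as np

def generate(decoder_model, start_id: int, end_id: int, max_len: int = 50) -> list[int]:
    """decoder_model(ids) -> (len(ids), vocab_size) probabilities, one row per position."""
    ids = [start_id]
    for _ in range(max_len):
        probs = decoder_model(np.array(ids))  # run the whole prefix through the decoder
        next_id = int(np.argmax(probs[-1]))   # out_k: greedily pick the last position's token
        if next_id == end_id:
            break
        ids.append(next_id)                   # append it as in_{k+1} and loop again
    return ids[1:]
```
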
### De-Embedding

Before having a result, we de-embed the decoder's output, bringing it back to
a known format.

Usually, for text generation, this turns the whole problem into a
classification one, as we need to predict the right token among all the
available ones.

Usually, for text, it is implemented like this (a small sketch follows the
list):

- Linear layer -> go back to the token space dimensions
- Softmax -> transform into probabilities
- Argmax -> take the most probable token

However, depending on the objectives, this is subject to change.

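A minimal sketch of this text de-embedding head in NumPy, reusing the
`softmax` helper from the attention sketch; the projection matrix `w_vocab`
is an illustrative assumption (in real models it is often tied to the
embedding matrix):

```python
import numpy as np

Em, vocab_size = 64, 1000
w_vocab = np.random.randn(Em, vocab_size) * 0.02   # learnt in a real model

def de_embed(decoder_out: np.ndarray) -> np.ndarray:
    logits = decoder_out @ w_vocab                  # (S, vocab_size): back to token space
    probs = softmax(logits)                         # (S, vocab_size): probabilities per position
    return np.argmax(probs, axis=-1)                # (S,): most probable token id per position

token_ids = de_embed(np.random.randn(12, Em))
```
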
## Basic Architectures

### Full Transformer (aka Encoder-Decoder, Encoder Conditioned)

This architecture is very powerful, but ***"with great power comes a great
energy bill"***. While it has been successfully used in models like `T5`[^t5],
it comes with additional complexity, both on the coding side and on the
computational one.



This is the basic architecture proposed in ***"Attention Is All You Need"***
[^attention-is-all-you-need], but nowadays it has been superseded by
decoder-only architectures for tasks such as text generation.

### Encoder Only

This architecture only employs encoders at its base. A model using
this architecture is `BERT`[^bert], which is capable of tasks such as Masked
Language Modelling, Sentiment Analysis, Feature Extraction and General
Classification (e.g. for e-mails).



However, this architecture comes at the cost of not being good at text and
sequence generation, while having the advantage of being able to process
everything in one step.

### Decoder Only

This architecture employs only decoders, which are modified to get rid of
cross-attention. It is usually at the base of modern LLMs such as
`GPT`[^gpt-2]. This architecture is capable of generating text, summarizing,
and even generating music (converted into MIDI format).



However, this architecture needs time to generate, due to its autoregressive
nature.

## Curiosities

> [!NOTE]
> We call $Q$ the `Query`, $K$ the `Key` and $V$ the `Value`. Their names
> come from an interpretation of how they interact together, as if we were
> looking up a `Query` against a set of `Key`s and the matched item were the
> `Value`.
>
> However, this is only an analogy and not the actual process.

<!--
MARK: Footnotes
-->

[^infinite-transformer-context]: [Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention](https://arxiv.org/pdf/2404.07143)

[^transformer-xl]: [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860)

[^attention-is-all-you-need]: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)

[^rope-paper]: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864)

[^hugging-face-pe]: [Hugging Face | Positional Encoding | 2nd November 2025](https://huggingface.co/blog/designing-positional-encoding)

[^t5]: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683)

[^bert]: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)

[^gpt-2]: [Release Strategies and the Social Impacts of Language Models | GPT-2](https://arxiv.org/pdf/1908.09203)