# Transformers

Transformers are very similar to [`RNNs`](./../8-Recurrent-Networks/INDEX.md)
in terms of usage (machine translation, text generation, sequence-to-sequence
tasks, sentiment analysis, word prediction, ...),
but they differ in how they process data.

While `RNNs` have a recurrent part that processes the input sequentially,
`Transformers` process it all at once, which makes them easier to parallelize
and **effectively faster despite being quadratically complex** in the
sequence length.

However, this comes at the cost of Transformers not having an *infinite
context*. They usually have no
memory[^infinite-transformer-context][^transformer-xl], meaning they need to
resort to tricks such as **autoregression** or fixed context windows.

## Basic Technologies

### Positional Encoding

Since all words are processed at once in a Transformer, they may lose their
positional information, making them less informative.

By using a Positional Encoding, we add this information back to each word.

There are several ways to add such an encoding to words; among them we find
(a small sketch of the sinusoidal variant follows the list):

- **Learnt Encoding**:\
  Use another network to learn how to add a positional encoding
  to the word embedding.
- **Sinusoidal Positional Encoding[^attention-is-all-you-need]**:\
  This comes from ***"Attention Is All You Need"***[^attention-is-all-you-need]
  and it's a fixed function that alternately adds sines and cosines
  to the word embeddings:

  $$
  \begin{aligned}
  PE_{(pos, 2i)} &= \sin{\left(\frac{pos}{10000^{2i/d_{model}}}\right)} \\
  PE_{(pos, 2i+1)} &= \cos{\left(\frac{pos}{10000^{2i/d_{model}}}\right)}
  \end{aligned}
  $$

- **RoPE[^rope-paper][^hugging-face-pe]**:\
  This algorithm uses the same function as above, but instead of adding it,
  it uses it to rotate (multiply) the vectors. The idea is that rotating a
  vector doesn't change its magnitude, and hopefully not its latent meaning
  either.

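A minimal sketch of the sinusoidal variant in NumPy; the function name
`sinusoidal_positional_encoding` and the shapes are illustrative assumptions,
not taken from the paper:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix to be added to the word embeddings."""
    pos = np.arange(seq_len)[:, None]        # token positions, shape (S, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices, shape (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get the cosine
    return pe

embeddings = np.random.randn(12, 64)         # dummy (S=12 tokens, d_model=64) embeddings
embeddings_with_position = embeddings + sinusoidal_positional_encoding(12, 64)
```
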
### Feed Forward

This is just a couple of linear layers (with a non-linearity in between):
the first one expands the dimensionality of the embedding (usually by 4
times), following the intuition of Cover's theorem, and the second one shrinks
it back to the original embedding size. A minimal sketch follows.

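A minimal sketch of this position-wise feed-forward block in NumPy; the 4x
expansion and the ReLU follow the original paper, while the function name
`feed_forward` and the random initialisation are illustrative assumptions:

```python
import numpy as np

def feed_forward(x: np.ndarray, w1: np.ndarray, b1: np.ndarray,
                 w2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """x: (S, Em) token embeddings -> (S, Em), applied independently per token."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # expand to 4*Em and apply ReLU
    return hidden @ w2 + b2                 # shrink back to Em

Em = 64
w1, b1 = np.random.randn(Em, 4 * Em) * 0.02, np.zeros(4 * Em)  # expansion
w2, b2 = np.random.randn(4 * Em, Em) * 0.02, np.zeros(Em)      # projection back
out = feed_forward(np.random.randn(12, Em), w1, b1, w2, b2)    # (12, 64)
```
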
### Self Attention

For each attention head, this layer employs 3 matrices that compute the
Query, Key and Value vectors for each word embedding.

#### Steps

- Compute the $Q_i, K_i, V_i$ matrices from the input embeddings $X$

  $$
  \begin{aligned}
  Q_{i} &= X \times W_{Qi} \in \R^{S \times H}\\
  K_{i} &= X \times W_{Ki} \in \R^{S \times H}\\
  V_{i} &= X \times W_{Vi} \in \R^{S \times H}
  \end{aligned}
  $$

- Compute the head value

  $$
  \begin{aligned}
  Head_i = softmax\left(
  \frac{Q_{i} \times K_{i}^T}{\sqrt{H}}
  \right) \times V_{i}
  \in \R^{S \times H}
  \end{aligned}
  $$

- Concatenate all heads and multiply by a learnt matrix
  (a code sketch of all three steps follows the notes below)

  $$
  \begin{aligned}
  Heads &= concat(Head_1, \dots, Head_n) \in \R^{S \times (n \cdot H)} \\
  Out &= Heads \times W_{Heads} \in \R^{S \times Em}
  \end{aligned}
  $$

> [!NOTE]
> Legend for each notation:
>
> - $X$: matrix of input embeddings ($\in \R^{S \times Em}$)
> - $Em$: Embedding dimension
> - $H$: Head dimension
> - $S$: Sentence length (number of tokens)
> - $i$: Head index
> - $n$: Number of heads

> [!TIP]
> $H$ is usually smaller than $Em$ (it makes computation faster and more
> memory efficient), however this is not strictly necessary.
>
> We have shown several separate operations here; however, instead of making
> many small tensor multiplications, it's better to perform one big
> multiplication (computationally more efficient) and then split its result
> into its components.

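A minimal NumPy sketch of the three steps above; the names (`attention`,
`softmax`) and the random weights are illustrative assumptions. The optional
`mask` argument anticipates the Masking section, and passing a different
`kv_in` turns the same function into cross-attention (used further below):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, w_q, w_k, w_v, w_heads, mask=None):
    """q_in: (S_q, Em), kv_in: (S_kv, Em); each weight tensor has one slice per head."""
    heads = []
    for w_qi, w_ki, w_vi in zip(w_q, w_k, w_v):  # loop kept for clarity, not speed
        Q, K, V = q_in @ w_qi, kv_in @ w_ki, kv_in @ w_vi      # (S, H) each
        scores = Q @ K.T / np.sqrt(Q.shape[-1])                # (S_q, S_kv)
        if mask is not None:
            scores = np.where(mask, scores, -1e9)              # blocked positions ~ -inf
        heads.append(softmax(scores) @ V)                      # (S_q, H)
    return np.concatenate(heads, axis=-1) @ w_heads            # (S_q, Em)

S, Em, H, n = 12, 64, 16, 4
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.normal(0, 0.02, (n, Em, H)) for _ in range(3))
w_heads = rng.normal(0, 0.02, (n * H, Em))
x = rng.normal(size=(S, Em))
self_attended = attention(x, x, w_q, w_k, w_v, w_heads)        # self attention: Q, K, V all from x
```
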
### Cross-Attention

It's the same as Self Attention, however $Q_{i}$ is computed from the decoder
input, while $K_i$ and $V_i$ are computed from the output of the last encoder
(a small usage example, reusing the sketch above, follows the formulas):

$$
\begin{aligned}
Q_{i} &= X_{dec} \times W_{Qi} \in \R^{S_{dec} \times H}\\
K_{i} &= X_{enc} \times W_{Ki} \in \R^{S_{enc} \times H}\\
V_{i} &= X_{enc} \times W_{Vi} \in \R^{S_{enc} \times H}
\end{aligned}
$$

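Reusing the hypothetical `attention` function sketched in the Self Attention
section (same weights and `rng`), cross-attention only changes which tensor
feeds the query projection and which one feeds the key/value projections:

```python
# Continuing the Self Attention sketch above:
decoder_x = rng.normal(size=(10, Em))   # (S_dec, Em) decoder-side embeddings
encoder_out = self_attended             # (S_enc, Em) stand-in for the last encoder output
cross_attended = attention(decoder_x, encoder_out, w_q, w_k, w_v, w_heads)  # (S_dec, Em)
```
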
### Masking

In order to make sure that a decoder doesn't attend to future information when
processing past words, we use masks that guarantee information doesn't leak
into parts of the network that shouldn't see it.

We usually implement 4 kinds of masks (a sketch of how to build them follows
the list):



- **Padding Mask**:\
  This mask is useful to avoid computing attention over padding tokens.
- **Full Attention**:\
  This mask is useful in encoders. It allows bidirectional attention, letting
  words on the right add information to words on the left and vice-versa.
- **Causal Attention**:\
  This mask is useful in decoders. It prevents words on the right from leaking
  into the ones on their left. In other words, it prevents future words from
  affecting the meaning of past ones.
- **Prefix Attention**:\
  This mask is useful for some tasks in decoders. It allows some words to add
  information over the past ones. These words, however, are not generated by
  the decoder, but are part of its initial input (the prefix).

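A minimal sketch of how these boolean masks could be built in NumPy (`True`
means "may attend"); the helper names are illustrative assumptions, and the
resulting matrices are meant to be passed as `mask` to the `attention` sketch
above:

```python
import numpy as np

def full_mask(s: int) -> np.ndarray:
    return np.ones((s, s), dtype=bool)                  # every token sees every token

def causal_mask(s: int) -> np.ndarray:
    return np.tril(np.ones((s, s), dtype=bool))         # token k sees only tokens 0..k

def prefix_mask(s: int, prefix_len: int) -> np.ndarray:
    mask = causal_mask(s)
    mask[:, :prefix_len] = True                         # the prefix is fully visible
    return mask

def padding_mask(is_real_token: np.ndarray) -> np.ndarray:
    # is_real_token: (S,) bool vector, False where the token is padding
    return is_real_token[None, :] & is_real_token[:, None]

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```
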
## Basic Blocks

### Embedder

This layer is responsible for transforming the input (usually tokens) into
word embeddings, following these steps (a small sketch follows the list):

- one-hot encoding
- matrix multiplication to get the desired embedding
- inclusion of positional info

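A minimal sketch of an embedder in NumPy; in practice the one-hot
multiplication is implemented as a plain row lookup into the embedding matrix,
which is equivalent. The names and sizes are illustrative assumptions, and
`sinusoidal_positional_encoding` is the helper sketched earlier:

```python
import numpy as np

vocab_size, d_model = 1000, 64
embedding_matrix = np.random.randn(vocab_size, d_model) * 0.02   # learnt in a real model

def embed(token_ids: np.ndarray) -> np.ndarray:
    one_hot = np.eye(vocab_size)[token_ids]   # (S, vocab_size)
    x = one_hot @ embedding_matrix            # (S, d_model), same as embedding_matrix[token_ids]
    return x + sinusoidal_positional_encoding(len(token_ids), d_model)

x = embed(np.array([5, 42, 7, 999]))          # (4, 64)
```
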
### Encoder

It extracts meaning from the embedded vectors, looking both to their right
and to their left, by stacking (a code sketch of one block follows this
section):

- Self Attention
- Residual Connection
- Layer Normalization
- Feed Forward
- Residual Connection
- Layer Normalization (sometimes normalization is done before self attention
  instead, the so-called *pre-norm* variant)



Usually it is used to condition all decoders; however, if connected to a
`De-Embedding` block or other layers, it can be used stand-alone to
generate outputs.

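A minimal sketch of one encoder block, composing the pieces sketched above
(`attention`, `feed_forward`, `full_mask`); `layer_norm`, the post-norm
ordering and the parameter bundling are illustrative assumptions:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)            # learnt scale/shift omitted for brevity

def encoder_block(x, attn_weights, ffn_weights):
    """x: (S, Em). attn_weights = (w_q, w_k, w_v, w_heads); ffn_weights = (w1, b1, w2, b2)."""
    mask = full_mask(x.shape[0])                                   # encoders attend in both directions
    x = layer_norm(x + attention(x, x, *attn_weights, mask=mask))  # self attention + residual + norm
    x = layer_norm(x + feed_forward(x, *ffn_weights))              # feed forward + residual + norm
    return x
```
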
### Decoder

It extracts meaning from the output embedded vectors, usually left to right,
and conditions them on the last encoder's output, by stacking:

- Self Attention
- Residual Connection
- Layer Normalization
- Cross Attention
- Residual Connection
- Layer Normalization
- Feed Forward
- Residual Connection
- Layer Normalization (as in the encoder, normalization is sometimes done
  before self attention instead)



Usually this block is used to generate outputs autoregressively, meaning
that at inference time we'll only take $out_{k}$ as the actual output and
append it as $in_{k+1}$ (a small sketch of this loop follows the warning
below).

> [!WARNING]
> During training, we feed it the whole expected sequence, shifted
> by a `start` token, and predict the whole sequence again.
>
> So, **it isn't trained autoregressively**

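A minimal sketch of the autoregressive inference loop, assuming a hypothetical
`decoder_model` that maps a sequence of token ids to per-position next-token
probabilities; greedy decoding is used purely for illustration:

```python
import numpy as np

def generate(decoder_model, start_id: int, end_id: int, max_len: int = 50) -> list[int]:
    """decoder_model(ids) -> (len(ids), vocab_size) probabilities, one row per position."""
    ids = [start_id]
    for _ in range(max_len):
        probs = decoder_model(np.array(ids))  # run the whole prefix through the decoder
        next_id = int(np.argmax(probs[-1]))   # out_k: greedily pick the last position's token
        if next_id == end_id:
            break
        ids.append(next_id)                   # append it as in_{k+1} and loop again
    return ids[1:]
```
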
### De-Embedding

Before having a result, we de-embed the decoder's output, bringing it back to
a known format.

Usually, for text generation, this turns the whole problem into a
classification one, as we need to predict the right token among all the
available ones.

Usually, for text, it is implemented like this (a small sketch follows the
list):

- Linear layer -> go back to the token space dimensions
- Softmax -> transform into probabilities
- Argmax -> take the most probable token

However, depending on the objectives, this is subject to change.

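A minimal sketch of this text de-embedding head in NumPy, reusing the
`softmax` helper from the attention sketch; the projection matrix `w_vocab`
is an illustrative assumption (in real models it is often tied to the
embedding matrix):

```python
import numpy as np

Em, vocab_size = 64, 1000
w_vocab = np.random.randn(Em, vocab_size) * 0.02   # learnt in a real model

def de_embed(decoder_out: np.ndarray) -> np.ndarray:
    logits = decoder_out @ w_vocab                  # (S, vocab_size): back to token space
    probs = softmax(logits)                         # (S, vocab_size): probabilities per position
    return np.argmax(probs, axis=-1)                # (S,): most probable token id per position

token_ids = de_embed(np.random.randn(12, Em))
```
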
## Basic Architectures

### Full Transformer (aka Encoder-Decoder, Encoder Conditioned)

This architecture is very powerful, but ***"with great power comes a great
energy bill"***. While it has been successfully used in models like `T5`[^t5],
it comes with additional complexity, both on the coding side and on the
computational one.



This is the basic architecture proposed in ***"Attention Is All You Need"***
[^attention-is-all-you-need], but nowadays it has been superseded by
decoder-only architectures for tasks such as text generation.

### Encoder Only

This architecture only employs encoders at its base. A model using
this architecture is `BERT`[^bert], which is capable of tasks such as Masked
Language Modelling, Sentiment Analysis, Feature Extraction and General
Classification (e.g. for e-mails).



However, this architecture comes at the cost of not being good at text and
sequence generation, while having the advantage of being able to process
everything in one step.

### Decoder Only

This architecture employs only decoders, which are modified to get rid of
cross-attention. It is usually at the base of modern LLMs such as
`GPT`[^gpt-2]. This architecture is capable of generating text, summarizing,
and even generating music (converted into MIDI format).



However, this architecture needs time to generate, due to its autoregressive
nature.

## Curiosities

> [!NOTE]
> We call $Q$ the `Query`, $K$ the `Key` and $V$ the `Value`. Their names
> come from an interpretation of how they interact together, as if we were
> looking up a `Query` against a set of `Key`s and the matched item were the
> `Value`.
>
> However, this is only an analogy and not the actual process.

<!--
MARK: Footnotes
-->

[^infinite-transformer-context]: [Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention](https://arxiv.org/pdf/2404.07143)

[^transformer-xl]: [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860)

[^attention-is-all-you-need]: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)

[^rope-paper]: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864)

[^hugging-face-pe]: [Hugging Face | Positional Encoding | 2nd November 2025](https://huggingface.co/blog/designing-positional-encoding)

[^t5]: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683)

[^bert]: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)

[^gpt-2]: [Release Strategies and the Social Impacts of Language Models | GPT-2](https://arxiv.org/pdf/1908.09203)