# Transformers
## Block Components
Transformers are very similar to [`RNNs`](./../8-Recurrent-Networks/INDEX.md)
in terms of usage (machine translation, text generation, sequence to
sequence, sentiment analysis, word prediction, ...),
but differ in how they process data.
The idea is that each Transformer is made of the **same number** of [`Encoders`](#encoder) and
[`Decoders`](#decoder).
While `RNNs` have a recurrent part that computes the input
sequentially, `Transformers` compute it all at once, making them easier to
parallelize and **effectively faster despite being quadratically
complex** in the sequence length.
![image](./Images/PNGs/transformer-high-level.png)
However, this comes at the cost of `Transformers` not having an *infinite context*.
They usually have no memory[^infinite-transformer-context][^transformer-xl],
meaning that they need to resort to tricks such as **autoregression** or
fixed context windows.
## Basic Technologies
### Positional Encoding
Since words are processed all at once in our Transformer, they would lose
their positional information, making them less informative.
By using a Positional Encoding, we add this information back to each word.
There are several ways to add such an encoding to words; among these we find:
- **Learnt One**:\
Use another network to learn how to add a positional encoding
to the word embedding
- **Positional Encoding[^attention-is-all-you-need]**:\
This comes from ***"Attention Is All You Need"***[^attention-is-all-you-need]
and it's a fixed function that alternately adds the sine and cosine below
to the word embedding dimensions (a minimal sketch follows this list)
$$
\begin{aligned}
PE_{(pos, 2i)} &= \sin{\left(
\frac{
pos
}{
10000^{2i/d_{model}}
}
\right)}
\\
PE_{(pos, 2i+1)} &= \cos{\left(
\frac{
pos
}{
10000^{2i/d_{model}}
}
\right)}
\end{aligned}
$$
- **RoPE[^rope-paper][^hugging-face-pe]**:\
This algorithm uses the same sine and cosine functions as above, but instead of adding them
it uses them to rotate (multiply) the vectors. The idea is that rotating a
vector doesn't change its magnitude, and hopefully not its latent meaning either.
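A minimal PyTorch sketch of the fixed sinusoidal encoding above, assuming an even `d_model`; the `max_len` and `d_model` values are illustrative rather than prescribed by the paper.

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build a (max_len, d_model) table of the fixed sin/cos encodings."""
    positions = torch.arange(max_len).unsqueeze(1)    # (max_len, 1)
    dims = torch.arange(0, d_model, 2).unsqueeze(0)   # even dimension indices 2i
    angles = positions / 10000 ** (dims / d_model)    # pos / 10000^(2i / d_model)

    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                   # even dimensions get the sine
    pe[:, 1::2] = torch.cos(angles)                   # odd dimensions get the cosine
    return pe

# Added to the word embeddings before the first encoder/decoder block
pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
```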
### Feed Forward
This is just a couple of linear layers: the first one expands the
dimensionality of the embedding (usually by 4 times), motivated by
Cover's Theorem, and the second shrinks it back to the original embedding
size.
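A minimal PyTorch sketch of such a position-wise feed-forward block, assuming the usual 4x expansion; the class and parameter names are illustrative.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Two linear layers: expand the embedding (typically 4x), then project back."""
    def __init__(self, d_model: int = 512, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),  # expand to a higher-dimensional space
            nn.ReLU(),
            nn.Linear(expansion * d_model, d_model),  # shrink back to the embedding size
        )

    def forward(self, x):                             # x: (batch, seq_len, d_model)
        return self.net(x)
```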
### Self Attention
This layer employs 3 matrices per attention head, which compute the
Query, Key and Value vectors for each word embedding.
#### Steps
- Compute $Q, K, V$ matrices for each embedding
$$
\begin{aligned}
Q_{i} &= S \times W_{Qi} \in \R^{S \times H}\\
K_{i} &= S \times W_{Ki} \in \R^{S \times H}\\
V_{i} &= S \times W_{Vi} \in \R^{S \times H}
\end{aligned}
$$
- Compute the head value
$$
\begin{aligned}
Head_i = softmax\left(
\frac{
Q_{i} \times K_{i}^{\top}
}{
\sqrt{H}
}
\right) \times V_{i}
\in \R^{S \times H}
\end{aligned}
$$
- Concatenate all heads and multiply by a learnt matrix
$$
\begin{aligned}
Heads &= concat(Head_1, \dots, Head_n) \in \R^{S \times (n \cdot H)} \\
Out &= Heads \times W_{Heads} \in \R^{S \times Em}
\end{aligned}
$$
> [!NOTE]
> Input and output are vectors of **fixed size**, padded when necessary.
>
> Legend for the notation:
>
> - $H$: Head dimension
> - $S$: Sentence length (number of tokens); in the formulas above it also denotes the $S \times Em$ matrix of input embeddings
> - $Em$: Embedding size
> - $i$: Head index
Before feeding our input, we split it into words and embed each one into a vector of fixed size;
sentences are also padded to a fixed length, which depends on the longest sentence in our training set.
> [!TIP]
> $H$ is usually smaller than $Em$ (it makes computation faster and more memory efficient), but
> it doesn't have to be.
>
> We showed several separate operations here; however, instead of making many small tensor
> multiplications, it's better to perform
> one (computationally more efficient) and then split its result into
> its components
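Following the tip above, here is a compact PyTorch sketch of multi-headed self attention that performs one big $QKV$ projection and then splits the result; shapes follow the legend above and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # one big multiplication, split afterwards
        self.out = nn.Linear(d_model, d_model)       # the learnt W_Heads

    def forward(self, x):                            # x: (batch, S, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (batch, heads, S, H)
        q, k, v = (t.reshape(b, s, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (batch, heads, S, S)
        heads = F.softmax(scores, dim=-1) @ v                  # (batch, heads, S, H)
        heads = heads.transpose(1, 2).reshape(b, s, -1)        # concatenate all heads
        return self.out(heads)                                 # (batch, S, d_model)
```

Cross-attention (next section) reuses the same computation; only the tensors that $Q$ and $K$, $V$ are projected from change.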
### Cross-Attention
It's the same as Self Attention, however we compute $Q_{i}$ from what
comes from the decoder, while $K_i$ and $V_i$ are computed from the
output of the last encoder:
$$
\begin{aligned}
Q_{i} &= S_{dec} \times W_{Qi} \in \R^{S_{dec} \times H}\\
K_{i} &= S_{enc} \times W_{Ki} \in \R^{S_{enc} \times H}\\
V_{i} &= S_{enc} \times W_{Vi} \in \R^{S_{enc} \times H}
\end{aligned}
$$
### Masking
In order to make sure that a decoder doesn't attend to future information when
processing past words, we use masks that prevent information from leaking
to parts of the network where it shouldn't.
We usually implement 4 kinds of masks (a sketch of how the padding and causal masks can be built follows the list)
![masks](./pngs/masks.png)
- **Padding Mask**:\
This mask is useful to avoid computing attention over padding tokens
- **Full Attention**:\
This mask is useful in encoders. It allows bidirectional attention:
words on the right can add information to words on the left and
vice versa.
- **Causal Attention**:\
This mask is useful in decoders. It prevents words on the right from
leaking information to words on their left. In other words, it prevents
future words from affecting past meanings.
- **Prefix Attention**:\
This mask is useful for some tasks in decoders. It allows some words to
add information to earlier ones. These words, however, are not generated by the decoder,
but are part of its initial input (the prefix).
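A minimal PyTorch sketch of the padding and causal masks, assuming `pad_id = 0` and illustrative shapes.

```python
import torch

def padding_mask(token_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """True where attention is allowed; padding positions are masked out."""
    return (token_ids != pad_id)[:, None, :]          # (batch, 1, S), broadcast over queries

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Disallowed positions are usually set to -inf before the softmax
scores = torch.randn(2, 5, 5)                         # toy (batch, S, S) attention scores
scores = scores.masked_fill(~causal_mask(5), float("-inf"))
```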
## Basic Blocks
### Embedder
While this is not a real component per se, this is the first phase before even reaching
the first `encoder` or `decoder`.
This layer is responsible for transforming the input (usually tokens) into
word embeddings following these steps:
- one-hot encoding
- matrix multiplication to get the desired embedding
- inclusion of positional info

In other words, we transform each word of the input into an ***embedding*** and add a vector to account for its
position. This positional encoding can either be learnt or follow this formula (a minimal sketch of the whole
block comes after the formulas):
- Even dimension index ($2i$):
$$
\text{positional\_encoding}_{(position,\ 2i)} = \sin\left(
\frac{
position
}{
10000^{2i / \text{model\_depth}}
}
\right)
$$
- Odd dimension index ($2i + 1$):
$$
\text{positional\_encoding}_{(position,\ 2i + 1)} = \cos\left(
\frac{
position
}{
10000^{2i / \text{model\_depth}}
}
\right)
$$
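A minimal sketch of such an embedder in PyTorch, using a learnt positional encoding for brevity (the fixed sinusoidal table shown earlier works the same way); `vocab_size`, `d_model` and `max_len` are illustrative.

```python
import torch
import torch.nn as nn

class Embedder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, max_len: int = 512):
        super().__init__()
        # one-hot encoding + matrix multiplication, fused into a single lookup table
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # learnt positional encoding (a fixed sinusoidal table is the other option)
        self.position_embedding = nn.Embedding(max_len, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:   # (batch, S)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # embedding lookup + inclusion of positional information
        return self.token_embedding(token_ids) + self.position_embedding(positions)
```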
### Encoder
> [!CAUTION]
> Weights are not shared between `encoders` or `decoders`
It extracts meaning from the embedded vectors, looking at both the left and the right context.
Each phase happens for each token position: if our (padded) sentence has 512 tokens, each `encoder` runs 512 parallel
`Self Attention` and `Feed Forward NN` paths. The sub-layers are (a sketch follows the picture):
- Self Attention
- Residual Connection
- Layer Normalization
- Feed Forward
- Residual Connection
- Layer Normalization (sometimes it is done before going to self attention)
![encoder picture](./pngs/encoder.png)
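A sketch of a single `encoder` block wiring the sub-layers listed above (post-norm variant), using PyTorch's built-in multi-head attention; all sizes are illustrative.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):        # x: (batch, S, d_model)
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + attn_out)                    # residual connection + layer norm
        x = self.norm2(x + self.ff(x))                  # residual connection + layer norm
        return x
```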
#### Encoder Self Attention
> [!WARNING]
> This step is the most expensive one as it involves many computations
Self Attention is a step in which each ***token*** gets knowledge of the other tokens in the sentence.
During this step, we produce 3 vectors that are **usually smaller**, for example 64 instead of 512:
- **Queries** $\rightarrow q_{i}$
- **Keys** $\rightarrow k_{i}$
- **Values** $\rightarrow v_{i}$
We use these values to compute a **score** that will tell us **how much to focus on certain parts of the sentence
while encoding a token**
In order to compute the final encoding we do the following for each word $i$ being encoded:
- Compute score for each word $j$ : $\text{score}_{j} = q_{i} \cdot k_{j}$
- Divide each score by the square root of the size of these *helping vectors*:
$\text{score}_{j} = \frac{\text{score}_{j}}{\sqrt{\text{size}}}$
- Compute softmax of all scores
- Multiply each softmaxed score by its value: $\text{score}_{j} = \text{score}_{j} \cdot v_{j}$
- Sum them all: $\text{encoding}_{i} = \sum_{j=1}^{N} \text{score}_{j}$
> [!NOTE]
> These steps will be done with matrices, not in this sequential way
##### Multi-Headed Attention
Instead of doing the Attention operation once, we do it several times, each with different matrices to produce
our *helping vectors*.
This produces N encodings for each ***token***, or N matrices of encodings.
The trick here is to **concatenate all encoding matrices** and **learn a new weight matrix** that will
**combine them**
#### Residuals
In order not to lose information along the path, after each `Feed Forward` and `Self-Attention`
we add each ***sublayer***'s `input` to its `output` and then apply a `Layer Normalization`
#### Encoder Feed Forward NN
> [!TIP]
> This step is mostly **parallel** as there's no dependency between *neighbour vectors*
Usually its output is used to condition all decoders; however, if connected to a
`De-Embedding` block or other layers, it can be used standalone to
generate outputs
### Decoder
It takes meaning from the output embedded vectors, usually left to right, and
conditions them on the last encoder's output:
- Self Attention
- Residual Connection
- Layer Normalization
- Cross Attention
- Residual Connection
- Layer Normalization
- Feed Forward
- Residual Connection
- Layer Normalization (sometimes it is done before going to self attention)
![decoder picture](./pngs/decoder.png)
Usually this block is used to generate outputs autoregressively, meaning
that we'll only take $out_{k}$ as the actual output and append it as $in_{k+1}$
at inference time (a sketch follows the note below).
> [!WARNING]
> During training, we feed it the whole expected sequence, shifted right
> by a `start` token, and predict the whole sequence again.
>
> So, **it isn't trained autoregressively**
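A sketch of this autoregressive loop at inference time, assuming a hypothetical `model(tokens)` that returns logits over the vocabulary for every position; `bos_id`, `eos_id` and `max_new_tokens` are illustrative.

```python
import torch

@torch.no_grad()
def generate(model, bos_id: int, eos_id: int, max_new_tokens: int = 50) -> list[int]:
    tokens = torch.tensor([[bos_id]])                       # start with the `start` token
    for _ in range(max_new_tokens):
        logits = model(tokens)                              # (1, current_len, vocab_size)
        next_id = logits[0, -1].argmax().item()             # out_k: keep only the last prediction
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)  # append as in_{k+1}
        if next_id == eos_id:                               # <eos> tells the decoder to stop
            break
    return tokens[0].tolist()
```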
### De-Embedding
Before obtaining a result, we de-embed the outputs, bringing them back to a known format.
Usually, for text generation, this turns the whole problem into a classification one,
as we need to predict the right token among all available ones.
Usually, for text, it is implemented like this:
- Linear layer -> Go back to token space dimensions
- Softmax -> Transform into probabilities
- Argmax -> Take the most probable one
However, depending on the objectives, this is subject to change
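For text, a minimal sketch of this de-embedding head could look like the following; `vocab_size` is an illustrative assumption, and greedy `argmax` is only one possible final step.

```python
import torch
import torch.nn as nn

class DeEmbedding(nn.Module):
    def __init__(self, d_model: int = 512, vocab_size: int = 32000):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)      # back to token-space dimensions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.proj(x)                           # (batch, S, vocab_size)
        probs = torch.softmax(logits, dim=-1)           # transform into probabilities
        return probs.argmax(dim=-1)                     # take the most probable token id
```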
## Basic Architectures
### Full Transformer (aka Encoder Decoder - Encoder Conditioned)
This architecture is very powerful, but ***"with great power, comes a great
energy bill"***. While it has been successfully used in models like `T5`[^t5],
it comes with additional complexity, both over the coding part and the
computational one.
![full architecture picture](./pngs/full-transformer.png)
This is the basic architecture proposed in ***"Attention Is All You Need"***
[^attention-is-all-you-need], but nowadays it has been superseded by decoder-only
architectures for tasks such as text generation.
### Encoder Only
This architecture employs only encoders. A model using
it is `BERT`[^bert], which is capable of tasks such as Masked
Language Modelling, Sentiment Analysis, Feature Extraction and general
classification (as for e-mails).
![encoder only picture](./pngs/encoder-only.png)
However, this architecture is not good at text and
sequence generation, though it has the advantage of being able to process everything
in one step.
### Decoder Only
This architecture employs only decoders, which are modified to get rid of
cross-attention. It is usually at the base of modern LLMs such as
`GPT`[^gpt-2]. This architecture is capable of generating text, summaries and
music (converted into MIDI format).
![decoder only picture](./pngs/decoder-only.png)
However this architecture needs time to generate, due to its autoregressive
nature.
## Curiosities
> [!NOTE]
> The decoding phase is slower than the encoding one, as it is sequential, producing one token per iteration.
> However it can be sped up by producing several tokens at once

> [!NOTE]
> We call $Q$ the `Query`, $K$ the `Key` and $V$ the `Value`. Their names
> come from an interpretation of how they interact together, as if we
> searched with a `Query` against a box of `Keys` and the matched item
> were the `Value`.
>
> However this is only an analogy and not the actual process.
After the **last `Encoder`** has produced its `output`, its $K$ and $V$ vectors are then used by
**all `Decoders`** during their cross-attention step, meaning they are **shared** among all `Decoders`.
All `Decoders` steps are then **repeated until we get an `<eos>` token**, which tells the decoder to stop.
#### Decoder Self Attention
It's almost the same as in the `encoding` phase, though here, since we have no future `outputs`, we can only take into
account previous ***tokens***, by setting future ones to `-inf`.
Moreover, in the following cross-attention step the `Key` and `Value` mappings come from the `encoder` output, while the
`Query` mapping is computed from the decoder.
#### Final Steps
##### Linear Layer
Produces a vector of ***logits***, one for each ***known word***.
##### Softmax Layer
We then pass these ***logits*** through a `SoftMax` to get probabilities and, usually, take the highest one.
If we implement ***Temperature***, though, we can pick some less probable `tokens`, trading predictability for
results that feel more natural.
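A small sketch of how ***Temperature*** can be applied before sampling: the logits are scaled by `1/temperature`, so lower values sharpen the distribution (more predictable) and higher values flatten it (more varied). The logits below are toy values.

```python
import torch

def sample_with_temperature(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Scale logits by 1/temperature, softmax, then sample instead of taking the argmax."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])            # toy logits over 4 tokens
token = sample_with_temperature(logits, temperature=0.8)
```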
## Training a Transformer
<!-- TODO: See PDF 12 pg. 58 to 65 -->
## Known Transformers
### BERT (Bidirectional Encoder Representations from Transformers)
Differently from other `Transformers`, it uses only `Encoder` blocks.
It can be used as a classifier and can be fine-tuned.
Pre-training happens by **masking** the input and **predicting** the **masked word**
(a sketch of the masking scheme follows the list):
- 15% of total words in the input are selected; of these:
  - 80% will become a `[MASK]` token
  - 10% will become random words
  - 10% will remain unchanged
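A rough sketch of this masking scheme; `mask_id` and `vocab_size` are assumptions, and the `-100` label follows PyTorch's default `ignore_index` convention.

```python
import random

def bert_mask(token_ids: list[int], mask_id: int, vocab_size: int, p_select: float = 0.15):
    """Return (possibly corrupted tokens, labels); labels are -100 where no prediction is needed."""
    tokens, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= p_select:                # only 15% of tokens are selected
            continue
        labels[i] = tok                                # the model must predict the original token
        r = random.random()
        if r < 0.8:
            tokens[i] = mask_id                        # 80%: replace with the [MASK] token
        elif r < 0.9:
            tokens[i] = random.randrange(vocab_size)   # 10%: replace with a random word
        # remaining 10%: leave the token unchanged
    return tokens, labels
```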
#### Bert tasks
- **Classification**
- **Fine Tuning**
- **2-sentence tasks**
  - **Are they paraphrases?**
  - **Does one sentence follow from the other?**
- **Feature Extraction**: Allows us to extract features to use in our own model
### GPT-2
Differently from other `Transformers`, it uses only `Decoder` blocks.
Since it has no `encoders`, `GPT-2` takes its `outputs` and appends them to the original `input`. This is called **autoregression**.
This, however, limits how much context `GPT-2` can learn from the `input`, because of causal `masking`.
During `evaluation`, `GPT-2` does not recompute `K` and `V` for previous tokens, but holds on to their previous values (see the sketch below).
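A conceptual sketch of that caching idea: only the newest token's `K`/`V` projections are computed each step and appended to what was already stored. Everything here is illustrative rather than GPT-2's actual implementation.

```python
import torch

class KVCache:
    """Store K and V for already-generated tokens so only the new token is projected each step."""
    def __init__(self):
        self.k = None
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (batch, 1, d_head) projections for the newest token only
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        return self.k, self.v                          # all keys/values seen so far

cache = KVCache()
k_all, v_all = cache.append(torch.randn(1, 1, 64), torch.randn(1, 1, 64))
```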
<!--
MARK: Footnotes
-->
[^infinite-transformer-context]: [Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention](https://arxiv.org/pdf/2404.07143)
[^transformer-xl]: [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860)
[^attention-is-all-you-need]: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)
[^rope-paper]: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864)
[^hugging-face-pe]: [Hugging Face | Positional Encoding | 2nd November 2025](https://huggingface.co/blog/designing-positional-encoding)
[^t5]: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683)
[^bert]: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)
[^gpt-2]: [Release Strategies and the Social Impacts of Language Models | GPT-2](https://arxiv.org/pdf/1908.09203)