Revised notes

2025-11-02 16:49:01 +01:00
parent 8768f128fe
commit 65288793ce
1 changed files with 272 additions and 159 deletions
--- a/Chapters/12-Transformers/INDEX.md
+++ b/Chapters/12-Transformers/INDEX.md
@@ -1,188 +1,301 @@
 # Transformers
-## Block Components
+Transformers are very similar to [`RNNs`](./../8-Recurrent-Networks/INDEX.md)
 in terms of usage (machine translation, text generation, sequence to
 sequence, sentiment analysis, word prediction, ...),
 but differ in how they process data.
-The idea is that each Transformer block is made of the **same number** of [`Encoders`](#encoder) and
+While `RNNs` have a recurrent part that computes the input
-[`Decoders`](#decoder)
+sequentially, `Transformers` compute it all at once, making them easier to
 parallelize and **effectively faster despite having quadratic complexity in
 the sequence length**.
-![image](./Images/PNGs/transformer-high-level.png)
+However, this comes at the cost of not having an *infinite context* for
 transformers. They usually have no
 memory[^infinite-transformer-context][^transformer-xl],
 meaning that they need to resort to tricks such as **autoregressiveness** or
 fixed context windows.
 ## Basic Technologies
 ### Positional Encoding
 When words are processed in our Transformer, since they are all processed at
 once, they lose their positional information, making them less informative.
 By using a Positional Encoding, we add this information back to each word.
 There are several ways to add such encoding to words, but among these we find:
 - **Learnt One**:\
    Use another network to learn how to add a positional encoding
    to the word embedding
 - **Positional Encoding[^attention-is-all-you-need]**:\
    This comes from ***"Attention Is All You Need"***[^attention-is-all-you-need]
    and it's a fixed function that adds sine and cosine values, alternating
    between even and odd embedding dimensions, to word embeddings
    $$
    \begin{aligned}
        PE_{(pos, 2i)} &= \sin{\left(
            \frac{
                pos
            }{
                10000^{2i/d_{model}}
            }
        \right)}
        \\
        PE_{(pos, 2i+1)} &= \cos{\left(
            \frac{
                pos
            }{
                10000^{2i/d_{model}}
            }
        \right)}
    \end{aligned}
    $$
 - **RoPE[^rope-paper][^hugging-face-pe]**:\
    This algorithm uses the same function as above, but instead of adding it,
    it uses it to rotate (multiply) vectors. The idea is that rotating a
    vector doesn't change its magnitude and, possibly, its latent meaning.
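The fixed sine/cosine encoding above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names and the toy sizes are assumptions, and the encoding is simply added element-wise to the embeddings.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Build a seq_len x d_model table following the sine/cosine formula."""
    table = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(d_model):
            # Dimensions 2i (sine) and 2i+1 (cosine) share the same frequency.
            i = k // 2
            angle = pos / (10000 ** (2 * i / d_model))
            table[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return table


def add_positional_encoding(embeddings):
    """Add the encoding element-wise to a list of embedding vectors."""
    table = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [
        [value + offset for value, offset in zip(row, table_row)]
        for row, table_row in zip(embeddings, table)
    ]


# Toy usage (hypothetical sizes): 4 tokens with 6-dimensional embeddings.
encoded = add_positional_encoding([[0.0] * 6 for _ in range(4)])
```

Position 0 always maps to alternating 0 and 1 values, since `sin(0) = 0` and `cos(0) = 1`.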
 ### Feed Forward
 This is just a couple of linear layers (typically with a non-linearity in
 between), where the first one expands the dimensionality of the embedding
 (usually by 4 times), motivated by Cover's theorem, and the second one
 shrinks it back to the original embedding size.
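A minimal sketch of this expand-then-shrink block, in plain Python. The ReLU activation, the random initialization and all names here are assumptions made for illustration; the notes above only fix the two linear layers and the 4x expansion.

```python
import random


def linear(x, weights, bias):
    """Apply y = x @ W + b to one vector x (W has shape in_dim x out_dim)."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def relu(x):
    """Assumed non-linearity between the two layers."""
    return [max(0.0, value) for value in x]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (illustrative initialization)."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    return weights, [0.0] * out_dim


def feed_forward(x, w1, b1, w2, b2):
    hidden = relu(linear(x, w1, b1))  # expand: embed_size -> 4 * embed_size
    return linear(hidden, w2, b2)     # shrink: 4 * embed_size -> embed_size


rng = random.Random(0)
embed_size = 8  # toy value
w1, b1 = init_linear(embed_size, 4 * embed_size, rng)
w2, b2 = init_linear(4 * embed_size, embed_size, rng)
token = [rng.random() for _ in range(embed_size)]
out = feed_forward(token, w1, b1, w2, b2)
```

The same weights are applied to every token independently, which is why this step parallelizes well.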
 ### Self Attention
 This layer employs 3 matrices for each attention head, which compute the
 Query, Key and Value vectors for each word embedding.
 #### Steps
 - Compute $Q, K, V$ matrices for each embedding
 $$
 \begin{aligned}
    Q_{i} &= S \times W_{Qi} \in \R^{S \times H}\\
    K_{i} &= S \times W_{Ki} \in \R^{S \times H}\\
    V_{i} &= S \times W_{Vi} \in \R^{S \times H}
 \end{aligned}
 $$
 - Compute the head value
 $$
 \begin{aligned}
    Head_i = softmax\left(
        \frac{
            Q_{i} \times K_{i}^{T}
        }{
            \sqrt{H}
        }
    \right) \times V_{i}
    \in \R^{S \times H}
 \end{aligned}
 $$
 - Concatenate all heads and multiply by a learnt matrix
 $$
 \begin{aligned}
    Heads &= concat(Head_1, \dots, Head_n) \in \R^{S \times (n \cdot H)} \\
    Out &= Heads \times W_{Heads} \in \R^{S \times Em}
 \end{aligned}
 $$
 > [!NOTE]
-> Input and output are vectors of **fixed size** with padding
+> Legend for each notation:
 >
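 > - $H$: Head dimension
 > - $S$: Sentence length (number of tokens); in the products above it also
 >   denotes the $S \times Em$ matrix of input embeddings
 > - $Em$: Embedding size
 > - $n$: number of heads
 > - $i$: head index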
-Before feeding our input, we split and embed each word into a fixed vector size. This size depends on the length of
+> [!TIP]
-longest sentence in our training set
+> $H$ is usually smaller than the embedding size (which makes computation
 > faster and more memory efficient); however, this is not required.
 >
 > Here we showed several separate operations; however, instead of making many
 > small tensor multiplications, it's better to perform a single large one
 > (computationally more efficient) and then split its result into
 > its components.
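The self-attention steps can be sketched in plain Python, without any tensor library. This is a deliberately unoptimized sketch: the helper names, the random weights and the toy sizes (`S`, `Em`, `H`, `n`) are assumptions for illustration, and each head is computed separately rather than with one fused multiplication.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    result = []
    for row in a:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def attention_head(x, w_q, w_k, w_v):
    """One head: softmax(Q K^T / sqrt(H)) V, with output shape S x H."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    head_dim = len(q[0])
    scores = matmul(q, transpose(k))  # S x S
    scaled = [[s / math.sqrt(head_dim) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


def multi_head_attention(x, heads, w_out):
    """Concatenate all heads token by token, then project with w_out."""
    outputs = [attention_head(x, *weights) for weights in heads]
    concat = [sum((o[t] for o in outputs), []) for t in range(len(x))]
    return matmul(concat, w_out)  # S x Em


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
S, Em, H, n = 3, 8, 4, 2  # toy sizes
x = random_matrix(S, Em, rng)
heads = [tuple(random_matrix(Em, H, rng) for _ in range(3)) for _ in range(n)]
w_out = random_matrix(n * H, Em, rng)
out = multi_head_attention(x, heads, w_out)
```

The output has the same `S x Em` shape as the input, which is what lets residual connections add the two together.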
 ### Cross-Attention
 It's the same as the Self Attention, however we compute $Q_{i}$ only from what
 comes from the decoder, while $K_i$ and $V_i$ are computed from the output
 of the last encoder:
 $$
 \begin{aligned}
    Q_{i} &= S_{dec} \times W_{Qi} \in \R^{S \times H}\\
    K_{i} &= S_{enc} \times W_{Ki} \in \R^{S \times H}\\
    V_{i} &= S_{enc} \times W_{Vi} \in \R^{S \times H}
 \end{aligned}
 $$
 ### Masking
 In order to make sure that a decoder doesn't attend to future info when
 processing past words, we use masks that ensure that information doesn't leak
 to parts of the network where it shouldn't.
 We usually implement 4 kinds of masks:
 ![masks](./pngs/masks.png)
 - **Padding Mask**:\
    This mask is useful to avoid computing attention for paddings
 - **Full Attention**:\
    This mask is useful in encoders. It allows bidirectional attention,
    letting words on the right add info to words on
    the left and vice-versa.
 - **Causal Attention**:\
    This mask is useful in decoders. It prevents the attention of words on
    the right from leaking over to words on the left. In other words, it
    prevents future words from affecting the meaning of past ones.
 - **Prefix Attention**:\
    This mask is useful for some tasks in decoders. It allows some words to
    add info over the past. These words, however, are not generated by the decoder,
    but are part of its initial input.
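The four masks listed above can be sketched as boolean matrices, where entry `[i][j]` says whether token `i` may attend to token `j`. The convention (`True` means "allowed"), the function names and the use of `-inf` before the softmax are assumptions of this sketch.

```python
def full_mask(n):
    """Every token may attend to every other token (encoder style)."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Token i may attend only to tokens j <= i (decoder style)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Causal mask, except that the first prefix_len tokens are visible to all."""
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]


def padding_mask(is_pad):
    """Hide padding positions (given as a list of booleans) from every token."""
    return [[not pad for pad in is_pad] for _ in is_pad]


def apply_mask(scores, mask):
    """Set forbidden scores to -inf so that the softmax gives them zero weight."""
    return [
        [score if allowed else float("-inf") for score, allowed in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


masked = apply_mask([[1.0, 2.0], [3.0, 4.0]], causal_mask(2))
# masked == [[1.0, -inf], [3.0, 4.0]]
```

Masks can be combined with an element-wise `and`, for example a causal mask together with a padding mask.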
 ## Basic Blocks
 ### Embedder
-While this is not a real component per se, this is the first phase before even coming
+This layer is responsible for transforming the input (usually tokens) into
-to the first `encoder` and `decoder`.
+word embeddings following these steps:
-Here we transform each word of the input into an ***embedding*** and add a vector to account for
+- one-hot encoding
-position. This positional encoding can either be learnt or can follow this formula:
+- matrix multiplication to get the desired embedding
-
+- inclusion of positional info
 - Even size:
 $$
 \text{positional\_encoding}_{
    (pos, 2\text{size})
 } = \sin\left(
        \frac{
            pos
        }{
            10000^{
                \frac{
                    2\text{size}
                }{
                    \text{model\_depth}
                }
            }
        }
    \right)
 $$
 - Odd size:
 $$
 \text{positional\_encoding}_{
    (pos, 2\text{size} + 1)
 } = \cos\left(
        \frac{
            pos
        }{
            10000^{
                \frac{
                    2\text{size}
                }{
                    \text{model\_depth}
                }
            }
        }
    \right)
 $$
 ### Encoder
-> [!CAUTION]
+It takes meaning from embedded vectors on both the left and the right:
 > Weights are not shared between `encoders` or `decoders`
-Each phase happens for each word. In other words, if our embed size is 512, we have 512 `Self Attentions` and
+- Self Attention
-512 `Feed Forward NN` **per `encoder`**
+- Residual Connection
 - Layer Normalization
 - Feed Forward
 - Residual Connection
 - Layer Normalization (sometimes it is done before going to self attention)
-![Image](./Images/PNGs/encoder.png)
+![encoder picture](./pngs/encoder.png)
-#### Encoder Self Attention
+Usually it is used to condition all decoders; however, if connected to a
-
+`De-Embedding` block or other layers, it can be used standalone to
-> [!WARNING]
+generate outputs.
 > This step is the most expensive one as it involves many computations
 Self Attention is a step in which each ***token*** gets knowledge of the other ones in the sequence.
 During this step, we produce 3 vectors that are **usually smaller**, for example 64 instead of 512:
 - **Queries** $\rightarrow q_{i}$
 - **Keys** $\rightarrow k_{i}$
 - **Values** $\rightarrow v_{i}$
 We use these values to compute a **score** that will tell us **how much to focus on certain parts of the sentence
 while encoding a token**
 In order to compute the final encoding, we do the following for each word $i$ being encoded:
 - Compute score for each word $j$ : $\text{score}_{j} = q_{i} \cdot k_{j}$
 - Divide each score by the square root of the size of these *helping vectors*:
    $\text{score}_{j} = \frac{\text{score}_{j}}{\sqrt{\text{size}}}$
 - Compute softmax of all scores
 - Multiply each softmax score by its value: $\text{score}_{j} = \text{score}_{j} \cdot v_{j}$
 - Sum them all: $\text{encoding}_{i} = \sum_{j=1}^{N} \text{score}_{j}$
 > [!NOTE]
 > These steps will be done with matrices, not in this sequential way
 ##### Multi-Headed Attention
 Instead of doing the Attention operation once, we do it several times, by having different matrices to produce
 our *helping vectors*.
 This produces N encodings for each ***token***, or N matrices of encodings.
 The trick here is to **concatenate all encoding matrices** and **learn a new weight matrix** that will
 **combine them**
 #### Residuals
 In order not to lose information along the path, after each `Feed Forward` and `Self-Attention`
 we add each ***sublayer***'s input to its `output` and then apply a `Layer Normalization`.
 #### Encoder Feed Forward NN
 > [!TIP]
 > This step is mostly **parallel** as there's no dependency between *neighbour vectors*
 ### Decoder
 It takes meaning from output embedded vectors, usually left to right, and
 conditions them on the last encoder's output.
 - Self Attention
 - Residual Connection
 - Layer Normalization
 - Cross Attention
 - Residual Connection
 - Layer Normalization
 - Feed Forward
 - Residual Connection
 - Layer Normalization (sometimes it is done before going to self attention)
 ![decoder picture](./pngs/decoder.png)
 Usually this block is used to generate outputs autoregressively, meaning
 that we'll only take $out_{k}$ as the actual output and append it as $in_{k+1}$
 during inference time.
 > [!WARNING]
 > During training, we feed it the whole expected sequence, shifted right
 > by a `start` token, and it predicts the whole sequence in a single pass.
 >
 > So, **it isn't trained autoregressively**
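The autoregressive inference loop described above can be sketched as follows. The `toy_model` function is a hypothetical stand-in for a real decoder (it just prefers "last token + 1"), and greedy selection, the token ids and the `max_len` cut-off are assumptions of this sketch.

```python
def greedy_decode(next_token_logits, start_token, eos_token, max_len):
    """Repeatedly take the most likely token and append it to the input."""
    tokens = [start_token]
    while len(tokens) < max_len:
        logits = next_token_logits(tokens)  # only the last position's output is used
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)           # out_k becomes in_{k+1}
        if next_token == eos_token:
            break
    return tokens


def toy_model(tokens):
    """Hypothetical decoder: over a vocabulary of 5 ids, prefers last id + 1."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if token == target else 0.0 for token in range(vocab_size)]


print(greedy_decode(toy_model, start_token=0, eos_token=4, max_len=10))
# [0, 1, 2, 3, 4]
```

Each iteration needs the previous one to finish first, which is where the sequential cost of generation comes from.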
 ### De-Embedding
 Before having a result, we de-embed the block outputs, mapping them back to a known format.
 Usually, for text generation, this makes the whole problem a classification one
 as we need to predict the right token among all available ones.
 Usually, for text, it is implemented as follows:
 - Linear layer -> Go back to token space dimensions
 - Softmax -> Transform into probabilities
 - Argmax -> Take the most probable one
 However, depending on the objectives, this is subject to change
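The three steps (linear layer, softmax, argmax) can be sketched like this. The function name, the toy weights and the 3-token vocabulary are assumptions for illustration only.

```python
import math


def de_embed(hidden, w_vocab, bias):
    """Map one hidden vector to (most probable token id, probabilities)."""
    # Linear layer: go back to token space (one logit per known token).
    logits = [
        sum(h * w_vocab[i][j] for i, h in enumerate(hidden)) + bias[j]
        for j in range(len(bias))
    ]
    # Softmax: transform logits into probabilities (shifted for stability).
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Argmax: take the most probable token.
    token_id = max(range(len(probs)), key=probs.__getitem__)
    return token_id, probs


# Toy usage: a 2-dimensional hidden state and a vocabulary of 3 tokens.
hidden = [1.0, 0.0]
w_vocab = [[0.1, 2.0, -1.0], [0.0, 0.0, 0.0]]
token_id, probs = de_embed(hidden, w_vocab, [0.0, 0.0, 0.0])
# token_id == 1, since the logits are [0.1, 2.0, -1.0]
```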
 ## Basic Architectures
 ### Full Transformer (aka Encoder Decoder - Encoder Conditioned)
 This architecture is very powerful, but ***"with great power, comes a great
 energy bill"***. While it has been successfully used in models like `T5`[^t5],
 it comes with additional complexity, in both the implementation and the
 computation.
 ![full architecture picture](./pngs/full-transformer.png)
 This is the basic architecture proposed in ***"Attention Is All You Need"***
 [^attention-is-all-you-need], but nowadays it has been superseded by decoder-only
 architectures for tasks such as text generation.
 ### Encoder Only
 This architecture is built by employing only encoders at its base. A model using
 this architecture is `BERT`[^bert], which is capable of tasks such as Masked
 Language Modeling, Sentiment Analysis, Feature Extraction and General
 Classification (as for e-mails).
 ![encoder only picture](./pngs/encoder-only.png)
 However, this architecture is not good at text and sequence generation;
 on the other hand, it has the advantage of being able to process everything
 in one step.
 ### Decoder Only
 This architecture employs only decoders, which are modified to get rid of
 cross-attention. Usually this is at the base of modern LLMs such as
 `GPT`[^gpt-2]. This architecture is capable of generating text, summarizing, and
 generating music (by converting it into MIDI format).
 ![decoder only picture](./pngs/decoder-only.png)
 However this architecture needs time to generate, due to its autoregressive
 nature.
 ## Curiosities
 > [!NOTE]
-> The decoding phase is slower than the encoding one, as it is sequential, producing a token for each iteration.
+> We call $Q$ `Query`, $K$ `Key` and $V$ `Value`. Their names
-> However it can be sped up by producing several tokerns at once
+> come from an interpretation given to how they interact together, like
 > if we were to search for a `Query` among the `Keys` of a box, and the found
 > items were the `Values`.
 >
 > However this is only an analogy and not the actual process.
-After the **last `Encoder`** has produced its `output`, $K$ and $V$ vectors, these are then used by
+<!--
-**all `Decoders`** during their self attention step, meaning they are **shared** among all `Decoders`.
+    MARK: Footnotes
 -->
 [^infinite-transformer-context]: [Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention](https://arxiv.org/pdf/2404.07143)
-All `Decoders` steps are then **repeated until we get a `<eos>` token** which will tell the decoder to stop.
+[^transformer-xl]: [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860)
-#### Decoder Self Attention
+[^attention-is-all-you-need]: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)
-It's almost the same as in the `encoding` phase, though here, since we have no future `outputs`, we can only take into
+[^rope-paper]: [ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING](https://arxiv.org/pdf/2104.09864)
 account only previous ***tokens***, by setting future ones to `-inf`.
 Moreover, here the `Key` and `Value` mappings come from the `encoder` phase, while the
 `Query` mapping is learnt here.
-#### Final Steps
+[^hugging-face-pe]: [Hugging Face | Positional Encoding | 2nd November 2025](https://huggingface.co/blog/designing-positional-encoding)
-##### Linear Layer
+[^t5]: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683)
-Produces a vector of ***logits***, one per each ***known words***.
+[^bert]: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)
-##### Softmax Layer
+[^gpt-2]: [Release Strategies and the Social Impacts of Language Models | GPT 2](https://arxiv.org/pdf/1908.09203)
 We then pass these ***logits*** through a `SoftMax` to get probabilities. We then usually take the highest one.
 If we implement ***Temperature***, though, we can sometimes pick `tokens` that are less probable, trading predictability
 for results that feel more natural.
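A minimal sketch of temperature sampling, assuming the common formulation in which logits are divided by the temperature before the softmax. The function names and the toy logits are illustrative assumptions.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; a higher temperature flattens the result."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token id at random, weighted by the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # close to always taking the top token
flat = softmax_with_temperature(logits, 2.0)   # less probable tokens get more weight
token = sample_token(logits, 1.0, random.Random(0))
```

With a temperature below 1 the distribution concentrates on the most probable token; above 1 it spreads out, making the output less predictable.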
 ## Training a Transformer
 <!-- TODO: See PDF 12 pg. 58 to 65 -->
 ## Known Transformers
 ### BERT (Bidirectional Encoder Representations from Transformers)
 Differently from other `Transformers`, it uses only `Encoder` blocks.
 It can be used as a classifier and can be fine tuned.
 The pre-training happens by **masking** part of the input and **predicting** the **masked words**:
 - 15% of total words in input are selected for masking
  - 80% will become a `[MASK]` token
  - 10% will become random words
  - 10% will remain unchanged
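The 15% / 80-10-10 rule above can be sketched as follows. The function name, the mask token string, the toy vocabulary and the per-token random selection are assumptions of this sketch, not the exact procedure of any specific library.

```python
import random


def mask_for_mlm(tokens, vocab, rng, mask_token="[MASK]", select_prob=0.15):
    """Return (corrupted tokens, {position: original token to predict})."""
    corrupted = list(tokens)
    targets = {}
    for idx, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # about 85% of the tokens are left alone and not predicted
        targets[idx] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[idx] = mask_token         # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[idx] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word unchanged
    return corrupted, targets


rng = random.Random(0)
tokens = ["the", "cat", "sat"] * 100
corrupted, targets = mask_for_mlm(tokens, ["dog", "ran"], rng)
```

The model is then trained to predict the original token only at the positions stored in `targets`.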
 #### Bert tasks
 - **Classification**
 - **Fine Tuning**
 - **2 sentences tasks**
  - **Are they paraphrases?**
  - **Does one sentence follow from this other one?**
 - **Feature Extraction**: allows us to extract features to use in our model
 ### GPT-2
 Differently from other `Transformers`, it uses only `Decoder` blocks.
 Since it has no `encoders`, `GPT-2` takes its `outputs` and appends them to the original `input`. This is called **autoregression**.
 This, however, limits how `GPT-2` can learn context from the `input`, because of `masking`.
 During `evaluation`, `GPT-2` does not recompute `K` and `V` for previous tokens, but reuses their cached values (the `Q` of previous tokens is no longer needed).