Added Transformers

chris-admin 2025-09-08 17:54:01 +02:00
parent f6960c109b
commit 191dc0ff12
5 changed files with 3534 additions and 0 deletions

# Transformers
## Block Components
The idea is that each Transformer model is built from the **same number** of stacked [`Encoders`](#encoder) and
[`Decoders`](#decoder).
![image](./Images/PNGs/transformer-high-level.png)
> [!NOTE]
> Input and output are vectors of **fixed size** with padding
Before feeding our input, we split it into words and embed each one into a vector of fixed size. The padded sequence
length is also fixed and depends on the length of the longest sentence in our training set.
### Embedder
While this is not a real component per se, this is the first phase before even reaching
the first `encoder` and `decoder`.
Here we transform each word of the input into an ***embedding*** and add a vector to account for its
position. This positional encoding can either be learnt or can follow these formulas (a code sketch follows them):
- Even dimension index ($2i$):
$$
\text{positional\_encoding}_{(pos,\ 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{\text{model\_depth}}}}\right)
$$
- Odd dimension index ($2i + 1$):
$$
\text{positional\_encoding}_{(pos,\ 2i + 1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{\text{model\_depth}}}}\right)
$$
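
As a concrete illustration, here is a minimal NumPy sketch of these two formulas; `max_len` and `model_depth` are illustrative parameters, not values prescribed by the notes.

```python
import numpy as np

def positional_encoding(max_len: int, model_depth: int) -> np.ndarray:
    """Return a (max_len, model_depth) matrix of sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]                # positions 0 .. max_len-1
    i = np.arange(model_depth // 2)[None, :]         # index of each (sin, cos) pair
    angle = pos / 10000 ** (2 * i / model_depth)     # pos / 10000^(2i / model_depth)
    encoding = np.zeros((max_len, model_depth))
    encoding[:, 0::2] = np.sin(angle)                # even dimensions use sin
    encoding[:, 1::2] = np.cos(angle)                # odd dimensions use cos
    return encoding

# Example: encodings for a 10-token (padded) sentence with model_depth 512
pe = positional_encoding(max_len=10, model_depth=512)
```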
### Encoder
> [!CAUTION]
> Weights are not shared between `encoders` or `decoders`
Each phase happens for each token. In other words, if our padded input is 512 tokens long, 512 vectors flow in
parallel through the `Self Attention` and the `Feed Forward NN` **per `encoder`**
![Image](./Images/PNGs/encoder.png)
#### Encoder Self Attention
> [!WARNING]
> This step is the most expensive one as it involves many computations
Self Attention is a step in which each ***token*** gets knowledge of the other ***tokens*** in the sentence.
During this step, we produce 3 vectors per token that are **usually smaller**, for example of size 64 instead of 512:
- **Queries** $\rightarrow q_{i}$
- **Keys** $\rightarrow k_{i}$
- **Values** $\rightarrow v_{i}$
We use these vectors to compute a **score** that tells us **how much to focus on other parts of the sentence
while encoding a token**.
In order to compute the final encoding, we do the following for each word $i$ being encoded:
- Compute score for each word $j$ : $\text{score}_{j} = q_{i} \cdot k_{j}$
- Divide each score by the square root of the size of these *helping vectors*:
$\text{score}_{j} = \frac{\text{score}_{j}}{\sqrt{\text{size}}}$
- Compute softmax of all scores
- Multiply each softmaxed score by its value vector: $\text{score}_{j} = \text{score}_{j} \cdot v_{j}$
- Sum them all: $\text{encoding}_{i} = \sum_{j}^{N} \text{score}_{j}$
> [!NOTE]
> These steps are actually done with matrices, not in this sequential way, as in the sketch below
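
A minimal NumPy sketch of those steps in matrix form for a single head; the weight matrices here are random stand-ins for learnt parameters, and the sizes are the example ones from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (tokens, embed_size); w_q, w_k, w_v: (embed_size, helper_size)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # q_i . k_j / sqrt(helper size)
    weights = softmax(scores, axis=-1)               # softmax over each row of scores
    return weights @ v                               # weighted sum of the value vectors

# Toy example: 4 tokens, embed size 512, helper size 64, random weights
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))
w_q, w_k, w_v = (rng.normal(size=(512, 64)) for _ in range(3))
encodings = self_attention(x, w_q, w_k, w_v)         # shape (4, 64)
```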
##### Multi-Headed Attention
Instead of doing the Attention operation once, we do it several times, each time with different matrices to produce
our *helping vectors*.
This produces N encodings for each ***token***, or N matrices of encodings.
The trick here is to **concatenate all the encoding matrices** and **learn an additional weight matrix** that will
**combine them**
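
Continuing the previous sketch (it reuses `self_attention`, `x` and `rng` from the block above), this shows the concatenate-and-combine trick; the number of heads and the sizes are arbitrary choices for illustration.

```python
def multi_head_attention(x, heads, w_combine):
    """heads: list of (w_q, w_k, w_v) tuples; w_combine: the learnt combining matrix."""
    per_head = [self_attention(x, w_q, w_k, w_v) for (w_q, w_k, w_v) in heads]
    concatenated = np.concatenate(per_head, axis=-1)   # (tokens, n_heads * helper_size)
    return concatenated @ w_combine                    # combined back to embed size

# 8 heads of size 64 combined back into a 512-wide encoding per token
heads = [tuple(rng.normal(size=(512, 64)) for _ in range(3)) for _ in range(8)]
w_combine = rng.normal(size=(8 * 64, 512))
combined = multi_head_attention(x, heads, w_combine)   # shape (4, 512)
```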
#### Residuals
In order not to lose information along the path, after each `Feed Forward` and `Self-Attention`
we add each ***sublayer***'s `inputs` to its `outputs` and then apply a `Layer Normalization`
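
A small sketch of this add-and-normalize pattern around a generic sublayer; `layer_norm` here is a bare-bones version without the usual learnable gain and bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection: add the sublayer's input to its output, then normalize."""
    return layer_norm(x + sublayer(x))
```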
#### Encoder Feed Forward NN
> [!TIP]
> This step is mostly **parallel** as there is no dependency between *neighbouring vectors*
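
A sketch of that per-position independence, assuming the common two-linear-layers-with-a-ReLU form of the feed forward network; the shapes are illustrative.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """x: (tokens, embed_size). Each row is transformed independently of the others."""
    hidden = np.maximum(0, x @ w1 + b1)   # first linear layer followed by a ReLU
    return hidden @ w2 + b2               # project back to embed_size

# Illustrative shapes: 512 -> 2048 -> 512
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(512, 2048)), np.zeros(2048)
w2, b2 = rng.normal(size=(2048, 512)), np.zeros(512)
```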
### Decoder
> [!NOTE]
> The decoding phase is slower than the encoding one, as it is sequential, producing one token per iteration.
> However, it can be sped up by producing several tokens at once
After the **last `Encoder`** has produced its `output`, the $K$ and $V$ vectors derived from it are then used by
**all `Decoders`** during their encoder-decoder attention step, meaning they are **shared** among all `Decoders`.
The `Decoder` steps are then **repeated until we get an `<eos>` token**, which tells the decoder to stop.
#### Decoder Self Attention
It's almost the same as in the `encoding` phase, though here, since we have no future `outputs`, we can only take
previous ***tokens*** into account, by setting the scores of future ones to `-inf` before the softmax.
Moreover, in the `encoder-decoder attention` sublayer, the `Keys` and `Values` Mappings come from the `encoder`
phase, while the `Queries` Mapping is learnt here.
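
A sketch of the masking trick, reusing the `softmax` helper from the encoder self-attention sketch: scores of future positions are set to `-inf` before the softmax, so they receive zero weight.

```python
def masked_self_attention(x, w_q, w_k, w_v):
    """Like self_attention above, but each token only attends to itself and earlier ones."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)               # hide future tokens
    return softmax(scores, axis=-1) @ v
```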
#### Final Steps
##### Linear Layer
Produces a vector of ***logits***, one for each ***known word*** in the vocabulary.
##### Softmax Layer
We then pass these ***logits*** through a `SoftMax` to get probabilities and, usually, take the highest one.
If we implement ***Temperature***, though, we can sample `tokens` that are less probable, trading some predictability
for results that feel more natural.
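
A self-contained sketch of temperature applied to the logits before the softmax; the `temperature` values in the comments are illustrative.

```python
import numpy as np

def sample_token(logits, temperature=1.0, seed=None):
    """temperature < 1 sharpens the distribution (safer picks); > 1 flattens it (more variety)."""
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())       # softmax, shifted for numerical stability
    probs /= probs.sum()
    return np.random.default_rng(seed).choice(len(probs), p=probs)

# A temperature close to 0 approaches greedy decoding (always the highest logit)
next_token = sample_token([2.0, 1.0, 0.1], temperature=0.8, seed=0)
```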
## Training a Transformer
<!-- TODO: See PDF 12 pg. 58 to 65 -->
## Known Transformers
### BERT (Bidirectional Encoder Representations from Transformers)
Unlike other `Transformers`, it uses only `Encoder` blocks.
It can be used as a classifier and can be fine-tuned.
The pre-training happens by **masking** part of the input and **predicting** the **masked words** (a sketch follows
the list):
- 15% of the total words in the input are masked
  - 80% of them will become a `[MASK]` token
  - 10% will become random words
  - 10% will remain unchanged
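
A sketch of that masking recipe on a plain token list; the `[MASK]` string and the toy `vocab` are stand-ins for the real tokenizer vocabulary.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original word at selected positions."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:            # ~15% of positions are selected
            labels[i] = token                   # the model must predict this word
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"            # 80%: replace with the [MASK] token
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)   # 10%: replace with a random word
            # remaining 10%: leave the original word in place
    return masked, labels

masked, labels = mask_tokens("the cat sat on the mat".split(), vocab=["dog", "ran", "hat"])
```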
#### BERT tasks
- **Classification**
- **Fine Tuning**
- **2-sentence tasks**
  - **Are they paraphrases?**
  - **Does one sentence follow from the other one?**
- **Feature Extraction**: allows us to extract features (contextual embeddings) to use in our own model
### GPT-2
Unlike other `Transformers`, it uses only `Decoder` blocks.
Since it has no `encoders`, `GPT-2` takes its `outputs` and appends them to the original `input`. This is called
**autoregression**.
This, however, limits how `GPT-2` can learn context from the `input`, because of `masking`: each token can only
attend to previous ones.
During `evaluation`, `GPT-2` does not recompute `V`, `K` and `Q` for previous tokens, but keeps their previously
computed values.
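
A minimal sketch of that caching idea (often called a `KV cache`); the class, shapes and weights are toy stand-ins, not GPT-2's actual implementation.

```python
import numpy as np

class CachedAttention:
    """Keep K and V from earlier steps so only the newest token is projected each iteration."""
    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys, self.values = [], []

    def step(self, x_new):
        """x_new: (1, embed_size) embedding of the newest token only."""
        q = x_new @ self.w_q
        self.keys.append(x_new @ self.w_k)        # previous K and V are reused as-is
        self.values.append(x_new @ self.w_v)
        k, v = np.vstack(self.keys), np.vstack(self.values)
        scores = q @ k.T / np.sqrt(k.shape[-1])   # the new token attends to all cached ones
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ v                        # shape (1, helper_size)
```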
