Added Chapters 12 and 13
This commit is contained in:
parent
f5bc43c59b
commit
ed35a6196d
@ -21,8 +21,8 @@ to the first `encoder` and `decoder`.
Here we transform each word of the input into an ***embedding*** and add a vector to account for
position. This positional encoding can either be learnt or can follow this formula:

- Even size:

$$
\text{positional\_encoding}_{(position, 2\text{size})} = \sin\left( \frac{position}{10000^{2\text{size}/d_{\text{model}}}} \right)
$$

- Odd size:

$$
\text{positional\_encoding}_{(position, 2\text{size} + 1)} = \cos\left( \frac{position}{10000^{2\text{size}/d_{\text{model}}}} \right)
$$
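
A minimal PyTorch sketch of the two formulas above, assuming `d_model` (the embedding width) is even; the names are illustrative:

```python
import torch

def positional_encoding(max_position: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding: sin on even dimensions, cos on odd ones."""
    position = torch.arange(max_position).unsqueeze(1)  # (max_position, 1)
    two_size = torch.arange(0, d_model, 2)              # the 2*size indices
    angle = position / 10000 ** (two_size / d_model)    # (max_position, d_model/2)
    pe = torch.zeros(max_position, d_model)
    pe[:, 0::2] = torch.sin(angle)                      # even dimensions
    pe[:, 1::2] = torch.cos(angle)                      # odd dimensions
    return pe

# Added to the word embeddings before the first encoder layer
embeddings = torch.randn(10, 512) + positional_encoding(10, 512)
```
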
### Encoder

> [!CAUTION]

@ -164,17 +165,17 @@ It can be used as a classifier and can be fine tuned.
The fine tuning happens by **masking** the input and **predicting** the **masked word** (a toy sketch follows the list):

- 15% of total words in the input are masked
  - 80% will become a `[MASK]` token
  - 10% will become random words
  - 10% will remain unchanged
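
A toy sketch of the 80/10/10 rule above; the function and `vocab` list are illustrative, not BERT's actual implementation:

```python
import random

def mask_for_mlm(tokens, vocab, mask_rate=0.15):
    """Select ~15% of tokens as prediction targets, then corrupt them 80/10/10."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            labels.append(tok)                          # the model must predict this
            roll = random.random()
            if roll < 0.8:
                corrupted.append("[MASK]")              # 80%: mask token
            elif roll < 0.9:
                corrupted.append(random.choice(vocab))  # 10%: random word
            else:
                corrupted.append(tok)                   # 10%: unchanged
        else:
            corrupted.append(tok)
            labels.append(None)                         # ignored by the loss
    return corrupted, labels
```
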
#### BERT tasks

- **Classification**
- **Fine Tuning**
- **2 sentences tasks**
  - **Are they paraphrases?**
  - **Does one sentence follow from the other one?**
- **Feature Extraction**: allows us to extract features to use in our own model (see the sketch below)
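
A sketch of feature extraction with the Hugging Face `transformers` library; the checkpoint name is one common choice, not necessarily the one meant here:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("A sentence to embed", return_tensors="pt")
features = model(**inputs).last_hidden_state  # (1, seq_len, 768), usable downstream
```
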
### GPT-2

@ -0,0 +1,56 @@

# Diffusion

<!-- TODO: Read Imagen Paper -->
<!-- TODO: Read Hypernetwork Paper -->
<!-- TODO: Read PerceiverIO Paper -->

## Text2image Architecture

### Conditioner (usually Text Encoder)

Encodes text into several tokens (e.g. 77) over many dimensions (e.g. 768); however, it could also be an encoder of other formats such as inpainting masks, silhouettes, depth maps, etc.

Usually, for text, we use a transformer such as CLIP-Text or BERT.

> [!NOTE]
> CLIP-Text is trained to associate images with captions such that encoded images lie close to their captions in latent space
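
For example, a sketch using Hugging Face's CLIP text encoder; the checkpoint is an illustrative choice that happens to produce 77 tokens over 768 dimensions:

```python
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("an astronaut riding a horse", padding="max_length",
                   max_length=77, return_tensors="pt")
embeddings = text_encoder(**tokens).last_hidden_state  # shape: (1, 77, 768)
```
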
### Image Information Creator

This component creates information directly in the **latent space** and has a parameter called `steps` which controls the number of denoising steps, typically 50 or 100.

At its core it usually has a U-Net neural network and a scheduling algorithm. It takes what the [Conditioner](#conditioner-usually-text-encoder) produces, plus a noise tensor.

In order to work with text embeddings, we need to interleave ResNet blocks with attention blocks and use some residual connections (a schematic loop is sketched below).
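
A schematic sketch of how these pieces fit together; `unet` and `scheduler` are hypothetical stand-ins, not a real library API:

```python
import torch

def create_image_information(unet, scheduler, text_embeddings, steps=50):
    """Schematic diffusion loop: start from pure noise and denoise step by step."""
    latents = torch.randn(1, 4, 64, 64)                   # the noise tensor, in latent space
    for t in scheduler.timesteps(steps):                  # hypothetical scheduler API
        noise_pred = unet(latents, t, text_embeddings)    # predict the noise at step t
        latents = scheduler.step(noise_pred, t, latents)  # remove a fraction of it
    return latents
```
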
#### Creator U-Net

> [!TIP]
> To speed up the process, operate over a **compressed** version of the image made with a **variational autoencoder**

At each step we add some noise to an image and the core needs to predict which noise was added. This works because we discretize the amount of noise we add at each step (a toy training step is sketched below).

> [!CAUTION]
> We can use this method even in latent space, which is 48 times smaller than the pixel one.
>
> If you train at a given resolution, that will be your maximum resolution; go beyond it and you'll get artifacts

When in latent space, however, we must have a [VAE](../10-Autoencoders/INDEX.md#variational-autoencoders) that should be fine tuned at the `decoding` step to paint fine details.

When we have text, its encoded version will control how our U-Net predicts the noise.
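
A toy sketch of that objective; `unet` and `add_noise` are hypothetical, and the forward-noising details depend on the schedule:

```python
import torch
import torch.nn.functional as F

def training_step(unet, add_noise, clean_latents, num_steps=1000):
    """Pick a random discretized noise level, add that noise, predict it back."""
    t = torch.randint(0, num_steps, (1,))       # which step's noise level to use
    noise = torch.randn_like(clean_latents)
    noisy = add_noise(clean_latents, noise, t)  # hypothetical forward process
    noise_pred = unet(noisy, t)
    return F.mse_loss(noise_pred, noise)        # learn to predict the added noise
```
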
#### Noise Scheduling

We can tell the model how much noise we want at each step during the decoding phase.
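
For instance, a minimal linear schedule; the linear shape and the values are illustrative assumptions (cosine and other schedules are also common):

```python
import torch

def linear_beta_schedule(steps=50, beta_start=1e-4, beta_end=0.02):
    """Amount of noise injected at each step, smallest first."""
    return torch.linspace(beta_start, beta_end, steps)
```
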
#### Cross Attention

We take 2 sequences: one produces the `Query` vectors and the other generates the `Keys` and `Values` (see the sketch below).
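
A minimal single-head sketch; the weight matrices are plain tensors with illustrative names:

```python
import torch

def cross_attention(x, context, w_q, w_k, w_v):
    """x (e.g. image latents) produces the Query; context (e.g. text
    embeddings) produces the Keys and Values."""
    q = x @ w_q                                            # (n, d)
    k = context @ w_k                                      # (m, d)
    v = context @ w_v                                      # (m, d)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (n, m)
    return torch.softmax(scores, dim=-1) @ v               # (n, d)
```
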
### Image Decoder

Takes the output of the [Image Information Creator](#image-information-creator) and uses it to create an image.
### Classifier Free Guidance

It is a parameter that controls how much the conditioning guides the model towards a specific objective.
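
A one-line sketch of how such a guidance scale is typically applied to the conditioned and unconditioned noise predictions; the default value is illustrative:

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale=7.5):
    # Push the prediction away from the unconditional output, towards the conditioned one
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```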