# Autoencoders
Here we are trying to make a `model` learn an **identity** function without
making it learn the **actual identity function**:

$$
h_{\theta} (x) \approx x
$$
The innovation comes from the fact that we can ***compress*** `data` by using
an `NN` that has **fewer `neurons` per layer than the `input` dimension**, or one that has
**fewer `connections` (sparse)**.

![autoencoder](./images/autoencoder.png)
## Compression
In a very simple fashion, we train a network to compress $\vec{x}$ into a more **dense**
representation.
## Layerwise Training
To train an `autoencoder` we train `layer` by `layer`,
minimizing the `vanishing gradients` problem.

The trick is to train one `layer`, then use its output as the input for the next `layer`
and train over it as if it were our $x$. Rinse and repeat for approximately 3 `layers`.
If you want, you can **at the end** add one more `layer` that you train over the `data` to
**fine tune** the whole stack.
> [!TIP]
> This method works because, even though the gradient vanishes, we **already** trained
> our upper layers, so they **already** have working weights.
<!-- TODO: See Deep Belief Networks and Deep Boltzmann Machines-->
<!-- TODO: See Deep autoencoders training-->
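A sketch of this greedy scheme, assuming PyTorch and fully-connected layers (sizes, optimizer, and epoch count are illustrative, not taken from these notes): each new layer is trained to reconstruct the codes produced by the already-trained, frozen stack.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_layerwise(data: torch.Tensor, layer_sizes, epochs: int = 10) -> nn.Sequential:
    """Greedy layer-wise pretraining: every stage sees the previous codes as its 'x'."""
    trained = []                    # already-trained (and now frozen) encoder layers
    codes = data                    # current input for the layer being trained
    for in_dim, out_dim in zip(layer_sizes, layer_sizes[1:]):
        encoder = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        decoder = nn.Linear(out_dim, in_dim)
        opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
        for _ in range(epochs):
            opt.zero_grad()
            loss = F.mse_loss(decoder(encoder(codes)), codes)  # reconstruct this layer's input
            loss.backward()
            opt.step()
        trained.append(encoder)
        with torch.no_grad():       # upper layers keep their already-working weights
            codes = encoder(codes)
    return nn.Sequential(*trained)  # optionally fine-tune this stack end-to-end afterwards

stack = pretrain_layerwise(torch.randn(256, 784), [784, 256, 64, 16])
```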
## U-Net

It was developed to analyze medical images and perform segmentation, the step in which we
add a classification to each pixel. To train these segmentation models we use **target maps**
that contain the desired class for every pixel.

![unet](./images/unet.png)
> [!TIP]
> During a `skip-connection`, if the dimension resulting from the `upsampling` is smaller,
> it is possible to **crop** and then concatenate.
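A small sketch of that crop-then-concatenate step, assuming PyTorch tensors shaped `(batch, channels, height, width)`; `crop_and_concat` is just an illustrative helper name:

```python
import torch

def crop_and_concat(skip: torch.Tensor, upsampled: torch.Tensor) -> torch.Tensor:
    # Center-crop the (larger) skip feature map to the upsampled spatial size,
    # then concatenate along the channel dimension.
    _, _, h, w = upsampled.shape
    _, _, H, W = skip.shape
    top, left = (H - h) // 2, (W - w) // 2
    cropped = skip[:, :, top:top + h, left:left + w]
    return torch.cat([cropped, upsampled], dim=1)

# e.g. a 68x68 skip map cropped to match a 64x64 upsampled map:
merged = crop_and_concat(torch.randn(1, 64, 68, 68), torch.randn(1, 64, 64, 64))
print(merged.shape)  # torch.Size([1, 128, 64, 64])
```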
### Architecture
- **Encoder**:\
  came from. Basically we concatenate a previous convolutional block with the
  current one and make a convolution over these concatenated layers.
### Pseudo Algorithm
```python
IMAGE = [[...]]

skip_stack = []
result = IMAGE[:]

# Encode 4 times
for _ in range(4):

    # Convolve 2 times
    for _ in range(2):
        result = conv(result)

    # Downsample
    skip_stack.append(result[:])
    result = max_pool(result)

# Middle convolution
for _ in range(2):
    result = conv(result)

# Decode 4 times
for _ in range(4):

    # Upsample
    result = upsample(result)

    # Skip Connection
    skip_connection = skip_stack.pop()
    result = concat(skip_connection, result)

    # Convolve 2 times
    for _ in range(2):
        result = conv(result)

# Last convolution
RESULT = conv(result)
```
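For reference, a runnable version of the same structure, assuming PyTorch; channel counts, the `padding=1` convolutions, and the transposed-convolution upsampling are illustrative choices (the original U-Net uses unpadded convolutions plus the cropping mentioned above).

```python
import torch
import torch.nn as nn

def double_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    # "Convolve 2 times": two 3x3 convolutions with ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self, in_ch: int = 1, n_classes: int = 2, base: int = 16):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.encoders = nn.ModuleList()                    # "Encode 4 times"
        prev = in_ch
        for c in chs:
            self.encoders.append(double_conv(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)                        # "Downsample"
        self.middle = double_conv(chs[-1], chs[-1] * 2)    # "Middle convolution"
        self.ups = nn.ModuleList()                         # "Decode 4 times"
        self.decoders = nn.ModuleList()
        prev = chs[-1] * 2
        for c in reversed(chs):
            self.ups.append(nn.ConvTranspose2d(prev, c, 2, stride=2))  # "Upsample"
            self.decoders.append(double_conv(c * 2, c))    # convolve after the concat
            prev = c
        self.head = nn.Conv2d(prev, n_classes, 1)          # "Last convolution"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip_stack = []
        for enc in self.encoders:
            x = enc(x)
            skip_stack.append(x)                           # save for the skip connection
            x = self.pool(x)
        x = self.middle(x)
        for up, dec in zip(self.ups, self.decoders):
            x = up(x)
            x = dec(torch.cat([skip_stack.pop(), x], dim=1))  # "Skip Connection" + convs
        return self.head(x)

out = MiniUNet()(torch.randn(1, 1, 64, 64))   # -> torch.Size([1, 2, 64, 64])
```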
<!-- TODO: See PDF anelli 10 to see complete architecture -->
## Variational Autoencoders
To achieve this, our **point** will become a **distribution** over the `latent-space`,
and then we'll sample from there and decode the point. We then operate as usual by
backpropagating the error.

![vae](./images/vae.png)
### Regularization Term
We use the `Kullback-Leibler` divergence to measure the difference between distributions. This has a
**closed form** in terms of the **mean** and **covariance matrices**.
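For reference, the standard closed form against a standard normal prior, for a $k$-dimensional Gaussian (stated here as a known result, not something derived in these notes):

$$
KL\left[\mathcal{N}(\mu_{x}, \Sigma_{x}), \mathcal{N}(0, I)\right] =
\frac{1}{2}\left(\mathrm{tr}(\Sigma_{x}) + \mu_{x}^{T}\mu_{x} - k - \log\det\Sigma_{x}\right)
$$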
The importance of regularization is that it makes these encoders both continuous and
complete (each point is meaningful). Without it we would have results that are too similar within our
regions. This also ensures that we don't have regions ***too concentrated and
similar to a point, nor too far apart from each other***

$$
L(x) = ||x - \hat{x}||^{2}_{2} + KL\left[\mathcal{N}(\mu_{x}, \Sigma_{x}), \mathcal{N}(0, I)\right]
$$
> [!TIP]
> The reason behind `KL` as a regularization term is that we **don't want our encoder
> to cheat by mapping different inputs in a specific region of the latent space**.
>
> This regularization makes it so that all points are mapped to a gaussian over the latent space.
>
> On a side note, `KL` is not the only regularization term available, but it is the most common.
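Putting the two terms together, a minimal sketch of this loss, assuming PyTorch and the usual diagonal-Gaussian encoder that outputs a mean and a log-variance (names are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_loss(x: torch.Tensor, x_hat: torch.Tensor,
             mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # Reconstruction term: ||x - x_hat||^2
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL[N(mu, diag(exp(log_var))), N(0, I)], summed over batch and dimensions
    kl = 0.5 * torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var)
    return recon + kl
```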
### Probabilistic View
- $\mathcal{X}$: Set of our data

Assuming our data is independent and identically distributed:

- $p(y) = \mathcal{N}(0, I) \rightarrow p(x|y) = \mathcal{N}(f(y), cI)$

Since we need an integral over the denominator, we use
**approximate techniques** such as **Variational Inference**, which are easier to compute.
### Variational Inference
<!-- TODO: See PDF 10 pgs 59 to 65-->
This approach tries to approximate the **goal distribution with one that is very close** to it.

Let's find $p(z|x)$, the **probability of having that latent vector given the input**, by using
a gaussian distribution $q_x(z)$, **defined by 2 functions dependent on $x$**:

$$
q_x(z) = \mathcal{N}(g(x), h(x))
$$
These functions, $g(x)$ and $h(x)$, belong to the function families $g(x) \in G$
and $h(x) \in H$. Our final target is then to find the optimal $g$ and $h$ over these
sets, and this is why we add the `KL` divergence to the loss:

$$
L(x) = ||x - \hat{x}||^{2}_{2} + KL\left[q_x(z), \mathcal{N}(0, I)\right]
$$
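In practice $g$ and $h$ are not searched over abstract families directly; they are parametrized by the encoder network. A minimal sketch, assuming PyTorch and illustrative sizes, with one head for the mean $g(x)$ and one for a log-variance standing in for $h(x)$:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Outputs the parameters of q_x(z) = N(g(x), h(x))."""
    def __init__(self, input_dim: int = 784, latent_dim: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.g = nn.Linear(128, latent_dim)   # mean head, g(x)
        self.h = nn.Linear(128, latent_dim)   # log-variance head, standing in for h(x)

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)
        return self.g(feats), self.h(feats)   # (mu_x, log_var_x)

mu, log_var = GaussianEncoder()(torch.randn(32, 784))
```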
### Reparametrization Trick
Since $y$ (from which $\hat{x}$ is decoded) is **technically sampled**, this makes it impossible
to backpropagate through the `mean` and `std-dev`, thus we add another
variable $\zeta$, sampled from a *standard gaussian*, so that
we have

$$
y = \sigma_x \cdot \zeta + \mu_x
$$

For $\zeta$ we don't need any backpropagation, thus we can easily backpropagate for both the `mean`
and the `std-dev`.
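A short sketch of the trick itself, assuming PyTorch: `zeta` needs no gradient, so the sampled $y$ stays differentiable with respect to both the `mean` and the `std-dev`.

```python
import torch

mu = torch.zeros(16, requires_grad=True)       # mean produced by the encoder
log_var = torch.zeros(16, requires_grad=True)  # log-variance produced by the encoder
sigma = torch.exp(0.5 * log_var)               # std-dev

zeta = torch.randn(16)                         # zeta ~ N(0, 1), no gradient required
y = sigma * zeta + mu                          # y = sigma_x * zeta + mu_x

y.sum().backward()
print(mu.grad is not None, log_var.grad is not None)  # True True
```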