diff --git a/Chapters/10-Autoencoders/INDEX.md b/Chapters/10-Autoencoders/INDEX.md
index 0b79c4f..fb3bea1 100644
--- a/Chapters/10-Autoencoders/INDEX.md
+++ b/Chapters/10-Autoencoders/INDEX.md
@@ -1,6 +1,7 @@
 # Autoencoders
 
-Here we are trying to make a `model` to learn an **identity** function
+Here we are trying to make a `model` learn an **identity** function without
+making it learn the **actual identity function**
 
 $$
 h_{\theta} (x) \approx x
@@ -13,6 +14,8 @@ The innovation comes from the fact that we can ***compress*** `data` by using
 an `NN` that has **fewer `neurons` per layer than the `input` dimension**, or that has
 **fewer `connections` (sparse)**
 
+![autoencoder image](./pngs/autoencoder.png)
+
 ## Compression
 
 In a very simple fashion, we train a network to compress $\vec{x}$ into a more **dense**
@@ -49,7 +52,8 @@
 
 ## Layerwise Training
 
-To train an `autoencoder` we train `layer` by `layer`, minimizing `vanishing gradients`.
+To train an `autoencoder` we train it `layer` by `layer`,
+which mitigates the `vanishing gradients` problem.
 The trick is to train one `layer`, then use its output as the input for the next `layer`
 and train over it as if it were our $x$. Rinse and repeat for approximately 3 `layers`.
 
@@ -57,6 +61,10 @@ and training over it as if it were our $x$. Rinse and repeat for 3 `layers` approx
 If you want, **at the end**, you can put another `layer` that you train over `data` to
 **fine tune**
 
+> [!TIP]
+> This method works because, even though the gradient vanishes, the layers we **already**
+> trained **already** have working weights
+
 
@@ -66,6 +74,12 @@ It was developed to analyze medical images and segmentation, step in which we add
 classification to pixels. To train these segmentation models we use **target maps**
 that have the desired classification maps.
 
+![u-net picture](./pngs/u-net.png)
+
+> [!TIP]
+> During a `skip-connection`, if the `upsampled` feature map is smaller than the encoder one,
+> it is possible to **crop** the encoder feature map and then concatenate.
+
 ### Architecture
 
 - **Encoder**:\
@@ -79,6 +93,49 @@ that have the desired classification maps.
 came from. Basically we concatenate a previous convolutional block with the convoluted
 one and then we convolve these concatenated layers.
 
+### Pseudo Algorithm
+
+```python
+
+IMAGE = [[...]]
+
+skip_stack = []
+result = IMAGE[:]
+
+# Encode 4 times
+for _ in range(4):
+
+    # Convolve 2 times
+    for _ in range(2):
+        result = conv(result)
+
+    # Save the skip connection, then downsample
+    skip_stack.append(result[:])
+    result = max_pool(result)
+
+# Middle convolution
+for _ in range(2):
+    result = conv(result)
+
+# Decode 4 times
+for _ in range(4):
+
+    # Upsample
+    result = upsample(result)
+
+    # Skip connection: concatenate the saved encoder block
+    skip_connection = skip_stack.pop()
+    result = concat(skip_connection, result)
+
+    # Convolve 2 times
+    for _ in range(2):
+        result = conv(result)
+
+# Last convolution
+RESULT = conv(result)
+
+```
+
 ## Variational Autoencoders
 
@@ -96,13 +153,15 @@ To achieve this, our **point** will become a **distribution** over the `latent-s
 and then we'll sample from there and decode the point. We then operate as usual
 by backpropagating the error.
 
+![vae picture](./pngs/vae.png)
+
 ### Regularization Term
 
 We use the `Kullback-Leibler` divergence to measure the difference between distributions.
 This has a **closed form** in terms of **mean** and **covariance matrices**
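+
+As a sketch, assuming the usual **diagonal** covariance of a `VAE` encoder, this closed form fits
+in a few lines of `NumPy` (the helper name `kl_to_standard_normal` is only illustrative):
+
+```python
+import numpy as np
+
+def kl_to_standard_normal(mu: np.ndarray, log_var: np.ndarray) -> float:
+    """Closed-form KL[N(mu, diag(exp(log_var))), N(0, I)]."""
+    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))
+
+# Example with a 2-dimensional latent point
+print(kl_to_standard_normal(np.array([0.5, -1.0]), np.array([0.0, 0.2])))
+```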
 
 The importance of regularization makes it so that these encoders are both continuous and
-complete (each point is meaningfull). Without it we would have too similar results in our
+complete (each point is meaningful). Without it we would have results that are too similar within our
 regions. Also this makes it so that we don't have regions ***too concentrated and similar to
 a point, nor too far apart from each other***
 
@@ -112,6 +171,14 @@ $$
 L(x) = ||x - \hat{x}||^{2}_{2} + KL[N(\mu_{x}, \Sigma_{x}), N(0, 1)]
 $$
 
+> [!TIP]
+> The reason behind `KL` as a regularization term is that we **don't want our encoder
+> to cheat by mapping each input to its own isolated region of the latent space**.
+>
+> This regularization pushes every encoded point towards a standard gaussian over the latent space
+>
+> On a side note, `KL` is not the only regularization term available, but it is the most common.
+
 ### Probabilistic View
 
 - $\mathcal{X}$: Set of our data
@@ -144,16 +211,31 @@ and identically distributed:
 
 - $p(y) = \mathcal{N}(0, I) \rightarrow p(x|y) = \mathcal{N}(f(y), cI)$
 
-Since we technically need an integral over the denominator, we use
-**approximate techniques** such as **Variational INference**
+Since we need an integral over the denominator, we use
+**approximate techniques** such as **Variational Inference**, which are easier to compute.
 
 ### Variational Inference
 
-
+This approach tries to approximate the **target distribution with a simpler one that is very close**.
+
+Let's approximate $p(z|x)$, **the probability of a latent vector given the input**, by using
+a gaussian distribution $q_x(z)$, **defined by 2 functions that depend on $x$**
+
+$$
+q_x(z) = \mathcal{N}(g(x), h(x))
+$$
+
+These functions, $g(x)$ and $h(x)$, belong to the function families $g \in G$
+and $h \in H$. Our final target is then to find the optimal $g$ and $h$ over these
+families, and this is why we add the `KL` divergence to the loss:
+
+$$
+L(x) = ||x - \hat{x}||^{2}_{2} + KL\left[q_x(z), \mathcal{N}(0, 1)\right]
+$$
 
 ### Reparametrization Trick
 
-Since $y$ is **technically sampled**, this makes it impossible
+Since $y$ (the `latent` vector) is **technically sampled**, it is impossible
 to backpropagate through the `mean` and `std-dev`, thus we add another
 variable, sampled from a *standard gaussian* $\zeta$, so that
 we have
@@ -161,3 +243,6 @@
 $$
 y = \sigma_x \cdot \zeta + \mu_x
 $$
+
+For $\zeta$ we don't need any gradient, thus we can now easily backpropagate to both the `mean`
+and the `std-dev`
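+
+As a minimal sketch of why this works (using `PyTorch` here; `mu` and `sigma` are just stand-ins
+for the `encoder` outputs), the gradient reaches both parameters even though $y$ is sampled:
+
+```python
+import torch
+
+# Stand-ins for the encoder outputs on a 2-dimensional latent space
+mu = torch.tensor([0.5, -1.0], requires_grad=True)
+sigma = torch.tensor([1.0, 0.8], requires_grad=True)
+
+# Reparametrization trick: y = sigma * zeta + mu, with zeta ~ N(0, I)
+zeta = torch.randn_like(mu)  # zeta itself needs no gradient
+y = sigma * zeta + mu
+
+# Any loss built on y now backpropagates into both mu and sigma
+y.sum().backward()
+print(mu.grad, sigma.grad)
+```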