diff --git a/Chapters/10-Autoencoders/INDEX.md b/Chapters/10-Autoencoders/INDEX.md
index e69de29..3474685 100644
--- a/Chapters/10-Autoencoders/INDEX.md
+++ b/Chapters/10-Autoencoders/INDEX.md
@@ -0,0 +1,113 @@

# Autoencoders

Here we are trying to make a `model` learn an **identity** function

$$
h_{\theta} (x) \approx x
$$

Now, if we were just to do this, it would be very simple: just pass
the `input` directly to the `output`.

The innovation comes from the fact that we can ***compress*** `data` by using
an `NN` that has **fewer `neurons` per layer than the `input` dimension**, or that has
**fewer `connections` (sparse)**.

## Compression

In a very simple fashion, we train a network to compress $\vec{x}$ into a **denser**
(lower-dimensional) vector $\vec{y}$ and then later **expand** it into $\vec{z}$, also called the
**prediction** of $\vec{x}$:

$$
\begin{aligned}
  \vec{x} &\in \mathbb{R}^{d_x} \\
  \vec{y} &= g(W_{0}\vec{x} + \vec{b}_{0}), \quad \vec{y} \in \mathbb{R}^{d_y},\ d_y < d_x \\
  \vec{z} &= g(W_{1}\vec{y} + \vec{b}_{1}), \quad \vec{z} \in \mathbb{R}^{d_x} \\
  \vec{z} &\approx \vec{x}
\end{aligned}
$$

## Sparse Training

A sparse hidden representation comes from penalizing the activation values $a_i$ of the hidden
`neurons`:

$$
\min_{\theta}
\underbrace{||h_{\theta}(x) - x ||^{2}}_{\text{
  Reconstruction Error
}} +
\underbrace{\lambda \sum_{i}|a_i|}_{\text{
  L1 sparsity
}}
$$

The reason why we want **sparsity** is that we want the **best** representation
in the `latent space`; that is, we want to **prevent** our `network` from **learning the
identity mapping**.
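As a concrete illustration, here is a minimal sketch of an undercomplete autoencoder trained with
the objective above (reconstruction error plus L1 sparsity on the code). It assumes PyTorch; the
class name `SparseAutoencoder`, the dimensions `d_x = 784` / `d_y = 64`, and the weight `lam = 1e-4`
are illustrative choices, not values from these notes.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_x: int = 784, d_y: int = 64):
        super().__init__()
        # Encoder: compress x (dimension d_x) into the code y (d_y < d_x)
        self.encoder = nn.Sequential(nn.Linear(d_x, d_y), nn.Sigmoid())
        # Decoder: expand y back into z, the prediction of x
        self.decoder = nn.Sequential(nn.Linear(d_y, d_x), nn.Sigmoid())

    def forward(self, x):
        y = self.encoder(x)
        z = self.decoder(y)
        return z, y

model = SparseAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1e-4                        # lambda, the L1 sparsity weight (assumed value)

x = torch.rand(32, 784)           # stand-in batch; replace with real data
z, y = model(x)
# Reconstruction error + L1 penalty on the hidden activations a_i (here: y)
loss = ((z - x) ** 2).sum(dim=1).mean() + lam * y.abs().sum(dim=1).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The L1 term pushes most entries of the code `y` towards zero, which is exactly what keeps the
network from getting away with a trivial identity-like mapping.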
## Layerwise Training

To train a deep `autoencoder` we train it `layer` by `layer`, which mitigates `vanishing gradients`.

The trick is to train one `layer`, then use its output as the input for the next `layer`
and train that one as if its input were our $x$. Rinse and repeat, for approximately 3 `layers`.

If you want, **at the end**, you can put another `layer` on top and train it over the `data` to
**fine-tune** the whole stack.

## U-Net

It was developed to analyze medical images and to perform segmentation, the step in which we
assign a classification to each pixel. To train these segmentation models we use **target maps**
that contain the desired classification for every pixel.

### Architecture

- **Encoder**:\
  We have several convolutional and pooling layers that make the representation progressively
  smaller. Once it is small enough, we have a `FCNN`.
- **Decoder**:\
  In this phase we restore the representation to the original dimension (`up-sampling`).
  Here we have many **deconvolution** layers; unlike fixed interpolation, these are learnt functions.
- **Skip Connection**:\
  These are connections used to tell the **deconvolutional** layers where a feature
  came from. Basically we concatenate a previous convolutional block with the
  up-sampled one and run a convolution over the concatenation.

## Variational Autoencoders

Until now, we were decoding single points of the latent space into points in the
**target space**.

However, this means that the **immediate neighbours of a data point** in the latent space are
**meaningless**.

The idea is to make it so that all **immediate neighbour regions of our data point**
will be decoded as (something very close to) our **data point**.

To achieve this, our **point** becomes a **distribution** over the `latent space`;
we then sample from it and decode the sample. We then operate as usual by
backpropagating the error.

### Regularization Term

We use the `Kullback-Leibler` divergence to measure the difference between distributions.
This has a **closed form** in terms of the **means** and **covariance matrices** of the two
Gaussians.

The regularization term is what makes these encoders both continuous and complete (each point of
the latent space is meaningful). Without it, the encoded regions could become ***too concentrated,
collapsing down to a point, or too far apart from each other***, and nearby points would no longer
decode to similar results.

### Loss

$$
L(x) = ||x - \hat{x}||^{2}_{2} + KL\left[N(\mu_{x}, \Sigma_{x}) \,\|\, N(0, I)\right]
$$
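Below is a minimal sketch of this loss, again assuming PyTorch and a **diagonal** covariance
$\Sigma_x$ (so the closed-form KL reduces to a sum over dimensions); the network sizes and the
names `VAE` / `vae_loss` are illustrative. The reparameterization trick,
$z = \mu_x + \sigma_x \odot \epsilon$ with $\epsilon \sim N(0, I)$, is what lets the error
backpropagate through the sampling step described above.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_x: int = 784, d_z: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x, 256), nn.ReLU())
        self.mu = nn.Linear(256, d_z)        # mean of the latent distribution
        self.logvar = nn.Linear(256, d_z)    # log of its diagonal covariance
        self.dec = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(),
                                 nn.Linear(256, d_x), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow into mu and logvar despite the sampling
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1)
    # Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)]
    kl = 0.5 * (torch.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(dim=1)
    return (recon + kl).mean()

x = torch.rand(32, 784)                   # stand-in batch; replace with real data
x_hat, mu, logvar = VAE()(x)
loss = vae_loss(x, x_hat, mu, logvar)     # ready for loss.backward()
```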