diff --git a/Chapters/10-Autoencoders/INDEX.md b/Chapters/10-Autoencoders/INDEX.md
index e69de29..3474685 100644
--- a/Chapters/10-Autoencoders/INDEX.md
+++ b/Chapters/10-Autoencoders/INDEX.md
@@ -0,0 +1,113 @@

# Autoencoders

Here we are trying to make a `model` learn an **identity** function

$$
h_{\theta} (x) \approx x
$$

Now, if we were just to do this, it would be very simple: just pass
the `input` directly to the `output`.

The innovation comes from the fact that we can ***compress*** `data` by using
an `NN` that has **fewer `neurons` per layer than the `input` dimension**, or that has
**fewer `connections` (sparse)**.

## Compression

In a very simple fashion, we train a network to compress $\vec{x}$ into a **denser**
(lower-dimensional) vector $\vec{y}$ and then later **expand** it into $\vec{z}$, also called the
**prediction** of $\vec{x}$:

$$
\begin{aligned}
  \vec{x} &\in \mathbb{R}^{d_x} \\
  \vec{y} &= g(W_{0}\vec{x} + \vec{b}_{0}), \quad \vec{y} \in \mathbb{R}^{d_y},\ d_y < d_x \\
  \vec{z} &= g(W_{1}\vec{y} + \vec{b}_{1}), \quad \vec{z} \in \mathbb{R}^{d_x} \\
  \vec{z} &\approx \vec{x}
\end{aligned}
$$

## Sparse Training

A sparse hidden representation comes from penalizing the activation values $a_i$ of the hidden
`neurons`:

$$
\min_{\theta}
\underbrace{||h_{\theta}(x) - x ||^{2}}_{\text{
  Reconstruction Error
}} +
\underbrace{\lambda \sum_{i}|a_i|}_{\text{
  L1 sparsity
}}
$$

The reason why we want **sparsity** is that we want the **best** representation
in the `latent space`; that is, we want to **prevent** our `network` from **learning the
identity mapping**.
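As a concrete illustration, here is a minimal sketch of an undercomplete autoencoder trained with
the objective above (reconstruction error plus L1 sparsity on the code). It assumes PyTorch; the
class name `SparseAutoencoder`, the dimensions `d_x = 784` / `d_y = 64`, and the weight `lam = 1e-4`
are illustrative choices, not values from these notes.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_x: int = 784, d_y: int = 64):
        super().__init__()
        # Encoder: compress x (dimension d_x) into the code y (d_y < d_x)
        self.encoder = nn.Sequential(nn.Linear(d_x, d_y), nn.Sigmoid())
        # Decoder: expand y back into z, the prediction of x
        self.decoder = nn.Sequential(nn.Linear(d_y, d_x), nn.Sigmoid())

    def forward(self, x):
        y = self.encoder(x)
        z = self.decoder(y)
        return z, y

model = SparseAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1e-4                        # lambda, the L1 sparsity weight (assumed value)

x = torch.rand(32, 784)           # stand-in batch; replace with real data
z, y = model(x)
# Reconstruction error + L1 penalty on the hidden activations a_i (here: y)
loss = ((z - x) ** 2).sum(dim=1).mean() + lam * y.abs().sum(dim=1).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The L1 term pushes most entries of the code `y` towards zero, which is exactly what keeps the
network from getting away with a trivial identity-like mapping.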
## Layerwise Training

To train a deep `autoencoder` we train it `layer` by `layer`, which mitigates `vanishing gradients`.

The trick is to train one `layer`, then use its output as the input for the next `layer`
and train that one as if its input were our $x$. Rinse and repeat, for approximately 3 `layers`.

If you want, **at the end**, you can put another `layer` on top and train it over the `data` to
**fine-tune** the whole stack.

## U-Net

It was developed to analyze medical images and to perform segmentation, the step in which we
assign a classification to each pixel. To train these segmentation models we use **target maps**
that contain the desired classification for every pixel.

### Architecture

- **Encoder**:\
  We have several convolutional and pooling layers that make the representation progressively
  smaller. Once it is small enough, we have a `FCNN`.
- **Decoder**:\
  In this phase we restore the representation to the original dimension (`up-sampling`).
  Here we have many **deconvolution** layers; unlike fixed interpolation, these are learnt functions.
- **Skip Connection**:\
  These are connections used to tell the **deconvolutional** layers where a feature
  came from. Basically we concatenate a previous convolutional block with the
  up-sampled one and run a convolution over the concatenation.

## Variational Autoencoders

Until now, we were decoding single points of the latent space into points in the
**target space**.

However, this means that the **immediate neighbours of a data point** in the latent space are
**meaningless**.

The idea is to make it so that all **immediate neighbour regions of our data point**
will be decoded as (something very close to) our **data point**.

To achieve this, our **point** becomes a **distribution** over the `latent space`;
we then sample from it and decode the sample. We then operate as usual by
backpropagating the error.

### Regularization Term

We use the `Kullback-Leibler` divergence to measure the difference between distributions.
This has a **closed form** in terms of the **means** and **covariance matrices** of the two
Gaussians.

The regularization term is what makes these encoders both continuous and complete (each point of
the latent space is meaningful). Without it, the encoded regions could become ***too concentrated,
collapsing down to a point, or too far apart from each other***, and nearby points would no longer
decode to similar results.

### Loss

$$
L(x) = ||x - \hat{x}||^{2}_{2} + KL\left[N(\mu_{x}, \Sigma_{x}) \,\|\, N(0, I)\right]
$$
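Below is a minimal sketch of this loss, again assuming PyTorch and a **diagonal** covariance
$\Sigma_x$ (so the closed-form KL reduces to a sum over dimensions); the network sizes and the
names `VAE` / `vae_loss` are illustrative. The reparameterization trick,
$z = \mu_x + \sigma_x \odot \epsilon$ with $\epsilon \sim N(0, I)$, is what lets the error
backpropagate through the sampling step described above.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_x: int = 784, d_z: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x, 256), nn.ReLU())
        self.mu = nn.Linear(256, d_z)        # mean of the latent distribution
        self.logvar = nn.Linear(256, d_z)    # log of its diagonal covariance
        self.dec = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(),
                                 nn.Linear(256, d_x), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow into mu and logvar despite the sampling
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1)
    # Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)]
    kl = 0.5 * (torch.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(dim=1)
    return (recon + kl).mean()

x = torch.rand(32, 784)                   # stand-in batch; replace with real data
x_hat, mu, logvar = VAE()(x)
loss = vae_loss(x, x_hat, mu, logvar)     # ready for loss.backward()
```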