Autoencoders

Here we train a model to learn the identity function:


h_{\theta} (x) \approx x

Taken literally, this would be trivial: just pass the input directly to the output.

The innovation comes from the fact that we can compress data by using an NN that has fewer neurons per layer than the input dimension, or that has fewer connections (sparse).

Compression

In its simplest form, we train a network to compress \vec{x} into a smaller, denser vector \vec{y} and then expand it into \vec{z}, also called the prediction (reconstruction) of \vec{x}.


\begin{aligned}
    \vec{x} &\in [a, b]^{d_x} \\
    \vec{y} &= g(\vec{W_{0}}\vec{x} + b_{0}) \rightarrow \vec{y} \in [a_1, b_1]^{d_y} \\
    \vec{z} &= g(\vec{W_{1}}\vec{y} + b_{1}) \rightarrow \vec{z} \in [a, b]^{d_x} \\
    \vec{z} &\approx \vec{x}
\end{aligned}
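As a rough illustration of the equations above, here is a minimal NumPy sketch of the forward pass only (no training). The sizes `d_x = 8` and `d_y = 3`, the sigmoid used for g, and the random initialisation are all assumptions made for the example, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

d_x, d_y = 8, 3  # hypothetical sizes: input dimension and smaller code dimension


def g(v):
    """Sigmoid activation, chosen here only for illustration."""
    return 1.0 / (1.0 + np.exp(-v))


# Randomly initialised, untrained parameters.
W0, b0 = rng.normal(size=(d_y, d_x)), np.zeros(d_y)  # encoder
W1, b1 = rng.normal(size=(d_x, d_y)), np.zeros(d_x)  # decoder


def encode(x):
    """Compress x (dimension d_x) into y (dimension d_y)."""
    return g(W0 @ x + b0)


def decode(y):
    """Expand y (dimension d_y) back into z (dimension d_x)."""
    return g(W1 @ y + b1)


x = rng.uniform(0.0, 1.0, size=d_x)  # x in [0, 1]^{d_x}
y = encode(x)
z = decode(y)

# Training would adjust W0, b0, W1, b1 to make this quantity small.
reconstruction_error = np.sum((z - x) ** 2)
```

Because the sigmoid maps into (0, 1), the reconstruction z lives in the same range as x, which mirrors the requirement that \vec{z} and \vec{x} share the same domain.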

Sparse Training

A sparse hidden representation is obtained by penalizing the values a_i taken by the hidden neurons (their activations).


\min_{\theta}
\underbrace{||h_{\theta}(x) - x ||^{2}}_{\text{Reconstruction Error}} +
\underbrace{\lambda \sum_{i}|a_i|}_{\text{L1 sparsity}}

We want sparsity because we want the best representation in the latent space, so we must prevent the network from simply learning the identity mapping.
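The objective above can be evaluated with a few lines of NumPy. This is only a sketch: the function name `sparse_autoencoder_loss`, the toy vectors and the value `lam=0.1` are illustrative assumptions.

```python
import numpy as np


def sparse_autoencoder_loss(x, x_hat, activations, lam):
    """Squared reconstruction error plus an L1 penalty on the hidden activations."""
    reconstruction = np.sum((x_hat - x) ** 2)
    sparsity = lam * np.sum(np.abs(activations))
    return reconstruction + sparsity


x = np.array([1.0, 0.0, 2.0])
x_hat = np.array([1.0, 1.0, 2.0])            # same reconstruction in both cases

dense_code = np.array([0.5, 0.5, 0.5, 0.5])  # every hidden neuron is active
sparse_code = np.array([0.0, 0.0, 1.0, 0.0]) # a single hidden neuron is active

loss_dense = sparse_autoencoder_loss(x, x_hat, dense_code, lam=0.1)
loss_sparse = sparse_autoencoder_loss(x, x_hat, sparse_code, lam=0.1)
```

With an identical reconstruction error, the sparse code receives the lower loss, which is the pressure that pushes the network toward sparse representations.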

Layerwise Training

A deep autoencoder can be trained layer by layer, which mitigates vanishing gradients.

The trick is to train one layer, then use its output as the input for the next layer, training on it as if it were our x. Rinse and repeat for approximately 3 layers.

Optionally, at the end you can add one more layer and train it on the data to fine-tune.
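The greedy procedure can be sketched as follows. Everything specific here is an assumption made for illustration: the helper `train_layer`, the tanh encoder with a linear decoder, plain gradient descent, the toy Gaussian data and the layer sizes `[6, 4, 2]`.

```python
import numpy as np

rng = np.random.default_rng(0)


def train_layer(data, hidden_dim, epochs=200, lr=0.05):
    """Fit one small autoencoder (tanh encoder, linear decoder) to `data`.

    Uses plain gradient descent on half the mean squared reconstruction error
    and returns the encoder parameters plus the final reconstruction error.
    """
    n, d = data.shape
    W = rng.normal(scale=0.1, size=(d, hidden_dim))  # encoder weights
    b = np.zeros(hidden_dim)
    V = rng.normal(scale=0.1, size=(hidden_dim, d))  # decoder weights
    c = np.zeros(d)
    for _ in range(epochs):
        h = np.tanh(data @ W + b)                     # hidden code
        err = (h @ V + c) - data                      # reconstruction minus target
        grad_out = err / n
        grad_pre = (grad_out @ V.T) * (1.0 - h ** 2)  # backprop through tanh
        V -= lr * (h.T @ grad_out)
        c -= lr * grad_out.sum(axis=0)
        W -= lr * (data.T @ grad_pre)
        b -= lr * grad_pre.sum(axis=0)
    h = np.tanh(data @ W + b)
    final_error = np.mean(np.sum((h @ V + c - data) ** 2, axis=1))
    return W, b, final_error


X = rng.normal(size=(256, 8))  # toy data standing in for a real dataset
layer_sizes = [6, 4, 2]        # three stacked layers

encoders, errors = [], []
inputs = X
for hidden_dim in layer_sizes:
    W, b, final_error = train_layer(inputs, hidden_dim)
    encoders.append((W, b))
    errors.append(final_error)
    inputs = np.tanh(inputs @ W + b)  # this layer's code becomes the next layer's "x"

codes = inputs  # output of the full stacked encoder
```

Each call to `train_layer` only ever backpropagates through a single hidden layer, which is why this scheme sidesteps the vanishing gradients of a deep end-to-end pass.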

U-Net

It was developed for medical image analysis and segmentation, the step in which we assign a class to each pixel. To train these segmentation models we use target maps that contain the desired per-pixel classification.

Architecture

Encoder:
We have several convolutional and pooling layers that make the representation progressively smaller. Once it is small enough, we'll have an FCNN.
Decoder:
In this phase we restore the representation to its original dimension (up-sampling). Here we have many deconvolution (transposed convolution) layers; unlike fixed interpolation, these are learned functions.
Skip Connection:
These are connections used to tell the deconvolutional layers where a feature came from. Basically, we concatenate the output of an earlier convolutional block with the up-sampled representation of the same size and apply a convolution to the result.
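The shape bookkeeping of a skip connection can be sketched in NumPy. This is not U-Net itself: the feature maps are random, nearest-neighbour up-sampling stands in for a learned deconvolution, and a 1x1 convolution stands in for the usual larger kernels. All names and sizes are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps in (channels, height, width) layout.
encoder_features = rng.normal(size=(16, 32, 32))  # saved from an encoder block
bottleneck = rng.normal(size=(32, 16, 16))        # deeper, smaller representation


def upsample_nearest(feature_map, factor=2):
    """Nearest-neighbour up-sampling, standing in for a learned deconvolution."""
    return feature_map.repeat(factor, axis=1).repeat(factor, axis=2)


def conv1x1(feature_map, weights):
    """1x1 convolution: a per-pixel linear mix of the input channels."""
    return np.einsum("oc,chw->ohw", weights, feature_map)


upsampled = upsample_nearest(bottleneck)                        # (32, 32, 32)

# Skip connection: stack encoder and decoder features along the channel axis.
merged = np.concatenate([encoder_features, upsampled], axis=0)  # (48, 32, 32)

# A convolution then mixes both sources back down to 16 channels.
weights = rng.normal(size=(16, merged.shape[0]))
decoder_features = conv1x1(merged, weights)                     # (16, 32, 32)
```

The concatenation only works because the up-sampled map has the same height and width as the saved encoder map, which is why each decoder stage is paired with the encoder stage of matching resolution.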

Variational Autoencoders

Until now we were decoding individual points in the latent space into points in the target space.

However, this means that the immediate neighbours of a data point in the latent space are meaningless.

The idea is to make the whole immediate neighbourhood of a data point decode to (approximately) that data point.

To achieve this, each point is encoded as a distribution over the latent space; we then sample from that distribution and decode the sample. After that we backpropagate the error as usual.

Regularization Term

We use the Kullback-Leibler divergence to measure the difference between distributions. For Gaussians it has a closed form in terms of the means and covariance matrices.
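For reference, this is the standard closed form when the second distribution is a standard Gaussian, written with d for the latent dimension (a symbol introduced here, not in the notes):

KL[N(\mu_{x}, \Sigma_{x}), N(0, I)] = \frac{1}{2} \left( \text{tr}(\Sigma_{x}) + \mu_{x}^{T}\mu_{x} - d - \log \det \Sigma_{x} \right)

If we further assume a diagonal covariance \Sigma_{x} = \text{diag}(\sigma_{1}^{2}, \dots, \sigma_{d}^{2}), this reduces to a sum over latent dimensions:

KL[N(\mu_{x}, \Sigma_{x}), N(0, I)] = \frac{1}{2} \sum_{j=1}^{d} \left( \sigma_{j}^{2} + \mu_{j}^{2} - 1 - \log \sigma_{j}^{2} \right)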

Regularization is what makes the latent space both continuous (nearby points decode to similar outputs) and complete (each point is meaningful). Without it, the model could fall back to behaving like a plain autoencoder, with each region carrying no more information than a single point. It also prevents the encoded distributions from becoming so concentrated that they resemble single points, and from drifting too far apart from each other.

Loss


L(x) = ||x - \hat{x}||^{2}_{2} + KL[N(\mu_{x}, \Sigma_{x}), N(0, I)]
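A minimal NumPy sketch of this loss, assuming a diagonal covariance so that the encoder outputs a mean vector `mu` and a standard-deviation vector `sigma` (both names, and the function names, are illustrative):

```python
import numpy as np


def kl_to_standard_normal(mu, sigma):
    """KL divergence from N(mu, diag(sigma^2)) to N(0, I).

    Closed form: 0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2)).
    """
    var = sigma ** 2
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))


def vae_loss(x, x_hat, mu, sigma):
    """Squared reconstruction error plus the KL regularization term."""
    reconstruction = np.sum((x - x_hat) ** 2)
    return reconstruction + kl_to_standard_normal(mu, sigma)


# The KL term vanishes when the encoded distribution already is N(0, I).
kl_zero = kl_to_standard_normal(np.zeros(2), np.ones(2))

# Moving the mean away from the origin is penalized.
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0]), np.ones(2))

loss = vae_loss(
    x=np.array([1.0, 2.0]),
    x_hat=np.array([1.0, 1.0]),
    mu=np.array([1.0, 0.0]),
    sigma=np.ones(2),
)
```

The two terms pull in opposite directions: the reconstruction term favours tight, well-separated codes, while the KL term pulls every encoded distribution toward the same standard Gaussian.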

Probabilistic View

\mathcal{X}: Set of our data
\mathcal{Y}: Latent variable set
p(x|y): Probabilistic decoder, tells us the distribution of x given y
p(y|x): Probabilistic encoder, tells us the distribution of y given x

Note

Bayesian a Posteriori Probability
  \underbrace{p(A|B)}_{\text{Posterior}} = \frac{
      \overbrace{p(B|A)}^{\text{Likelihood}}
      \overbrace{\cdot p(A)}^{\text{Prior}}
  }{
      \underbrace{p(B)}_{\text{Marginalization}}
  }
  = \frac{p(B|A) \cdot p(A)}{\int{p(B|u)p(u)du}}
Posterior: Probability of A being true given B

Likelihood: Probability of B being true given A

Prior: Probability of A being true (knowledge)

Marginalization: Probability of B being true
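A small worked example of the formula, for a binary A so that the integral in the denominator becomes a two-term sum. The numbers (a 1% prior, a 90% likelihood, and a 5% chance of B when A is false) are made up purely for illustration.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """p(A|B) from Bayes' theorem for a binary event A."""
    # Marginalization: p(B) = p(B|A) p(A) + p(B|not A) p(not A)
    evidence = (
        likelihood_b_given_a * prior_a
        + likelihood_b_given_not_a * (1.0 - prior_a)
    )
    return likelihood_b_given_a * prior_a / evidence


p_a_given_b = posterior(
    prior_a=0.01,
    likelihood_b_given_a=0.9,
    likelihood_b_given_not_a=0.05,
)
```

Here the posterior comes out at roughly 0.15: even with a strong likelihood, a small prior keeps the posterior low, because the denominator also counts all the ways B can occur without A.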

We assume that the prior on y is a Gaussian with zero mean and identity covariance, that x given y is a Gaussian centred on f(y) with covariance cI, and that the samples are independent and identically distributed:

p(y) = \mathcal{N}(0, I) \rightarrow p(x|y) = \mathcal{N}(f(y), cI)

Since the denominator requires an integral over the whole latent space, which is generally intractable, we use approximate techniques such as Variational Inference.

Variational Inference
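As a hedged sketch of the idea: instead of computing the exact posterior p(y|x), we introduce an approximate one, written here as q(y|x) (notation not used elsewhere in these notes), and maximize a lower bound on \log p(x) that needs no integral over the denominator:

\log p(x) \geq \mathbb{E}_{q(y|x)}[\log p(x|y)] - KL[q(y|x), p(y)]

Under the Gaussian assumption p(x|y) = \mathcal{N}(f(y), cI), the first term is, up to an additive constant, a negative squared reconstruction error:

\log p(x|y) = -\frac{||x - f(y)||^{2}_{2}}{2c} + \text{const}

So maximizing the bound amounts to minimizing a reconstruction error plus a KL regularization term, with c setting their relative weight.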

Reparametrization Trick

Since y is sampled, we cannot backpropagate through the sampling step to the mean and standard deviation. We therefore introduce another variable \zeta, sampled from a standard Gaussian, so that we have


y = \sigma_x \cdot \zeta + \mu_x
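A minimal NumPy sketch of the trick, with made-up values for `mu` and `sigma`. It checks two things: the reparametrized samples follow the intended distribution, and for a fixed \zeta the sample y is an ordinary differentiable function of the mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, -1.0])    # hypothetical encoder output: mean
sigma = np.array([0.1, 2.0])  # hypothetical encoder output: standard deviation

# All randomness lives in zeta, which does not depend on mu or sigma.
zeta = rng.standard_normal(size=mu.shape)
y = sigma * zeta + mu

# For fixed zeta, y is deterministic in (mu, sigma), so the partial
# derivatives exist: dy/dmu = 1 and dy/dsigma = zeta.
eps = 1e-6
dy_dmu = ((sigma * zeta + (mu + eps)) - y) / eps
dy_dsigma = (((sigma + eps) * zeta + mu) - y) / eps

# Many reparametrized samples still follow N(mu, sigma^2).
samples = mu + sigma * rng.standard_normal(size=(100_000, 2))
empirical_mean = samples.mean(axis=0)
empirical_std = samples.std(axis=0)
```

In an automatic-differentiation framework the same structure lets gradients flow from the decoder loss back into the encoder outputs, since the sampling node is now just an input.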
