Revised material and added images
parent
a04c9bcde1
commit
2821a2eaf5
@ -1,6 +1,7 @@
|
|||||||
# Autoencoders
|
# Autoencoders
|
||||||
|
|
||||||
Here we are trying to make a `model` to learn an **identity** function
|
Here we are trying to make a `model` to learn an **identity** function without
|
||||||
|
making it learn the **actual identity function**
|
||||||
|
|
||||||
$$
|
$$
|
||||||
h_{\theta} (x) \approx x
|
h_{\theta} (x) \approx x
|
||||||
@ -13,6 +14,8 @@ The innovation comes from the fact that we can ***compress*** `data` by using
|
|||||||
an `NN` that has **fewer `neurons` per layer than the `input` dimension**, or that has
|
an `NN` that has **fewer `neurons` per layer than the `input` dimension**, or that has
|
||||||
**fewer `connections` (sparse)**
|
**fewer `connections` (sparse)**
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
## Compression
|
## Compression
|
||||||
|
|
||||||
In a very simple fashion, we train a network to compress $\vec{x}$ in a more **dense**
|
In a very simple fashion, we train a network to compress $\vec{x}$ in a more **dense**
|
||||||
@ -49,7 +52,8 @@ identity mapping**
|
|||||||
|
|
||||||
## Layerwise Training
|
## Layerwise Training
|
||||||
|
|
||||||
To train an `autoencoder` we train `layer` by `layer`, minimizing `vanishing gradients`.
|
To train an `autoencoder` we train `layer` by `layer`,
|
||||||
|
minimizing the `vanishing gradients` problem.
|
||||||
|
|
||||||
The trick is to train one `layer`, then use its output as the input for the next `layer`,
|
The trick is to train one `layer`, then use its output as the input for the next `layer`,
|
||||||
training over it as if it were our $x$. Rinse and repeat for approximately 3 `layers`.
|
training over it as if it were our $x$. Rinse and repeat for approximately 3 `layers`.
|
||||||
@ -57,6 +61,10 @@ and training over it as if it were our $x$. Rinse and repeat for 3 `layers` appr
|
|||||||
If you want, **at last**, you can put another `layer` that you train over `data` to
|
If you want, **at last**, you can put another `layer` that you train over `data` to
|
||||||
**fine tune**
|
**fine tune**
|
||||||
|
|
||||||
|
> [!TIP]
|
||||||
|
> This method works because, even though the gradient vanishes, our upper layers were
|
||||||
|
> **already** trained, so they **already** have working weights.
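
A minimal sketch of the layer-by-layer procedure described above, in plain Python. The helper names (`encode`, `train_autoencoder`, `greedy_pretrain`), the linear layers, the learning rate, and the toy data are all assumptions made for illustration; they are not taken from these notes.

```python
import random


def encode(W_enc, x):
    """Apply a linear encoder: one output per row of W_enc."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_enc]


def train_autoencoder(data, hidden, epochs=100, lr=0.05, seed=0):
    """Train one linear autoencoder layer with plain SGD on the squared error.

    Returns only the encoder weights; the decoder is discarded after training.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    W_enc = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]
    for _ in range(epochs):
        for x in data:
            h = encode(W_enc, x)
            x_hat = [sum(w * hj for w, hj in zip(row, h)) for row in W_dec]
            err = [xh - xi for xh, xi in zip(x_hat, x)]
            # Gradient reaching the hidden code (computed before W_dec changes)
            grad_h = [sum(W_dec[i][j] * err[i] for i in range(dim)) for j in range(hidden)]
            for i in range(dim):
                for j in range(hidden):
                    W_dec[i][j] -= lr * err[i] * h[j]
            for j in range(hidden):
                for k in range(dim):
                    W_enc[j][k] -= lr * grad_h[j] * x[k]
    return W_enc


def greedy_pretrain(data, hidden_sizes):
    """Train layer by layer: each new layer sees the previous codes as its x."""
    encoders, current = [], data
    for hidden in hidden_sizes:
        W_enc = train_autoencoder(current, hidden)
        encoders.append(W_enc)
        current = [encode(W_enc, x) for x in current]  # becomes the next "x"
    return encoders, current


rng = random.Random(1)
toy_data = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(20)]
encoders, codes = greedy_pretrain(toy_data, hidden_sizes=[3, 2])
```

Each call to `train_autoencoder` only ever backpropagates through a single encoder and decoder pair, which is why the depth of the final stack does not make the gradient vanish during this phase.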
|
||||||
|
|
||||||
<!-- TODO: See Deep Belief Networks and Deep Boltzmann Machines-->
|
<!-- TODO: See Deep Belief Networks and Deep Boltzmann Machines-->
|
||||||
<!-- TODO: See Deep autoencoders training-->
|
<!-- TODO: See Deep autoencoders training-->
|
||||||
|
|
||||||
@ -66,6 +74,12 @@ It was developed to analyze medical images and segmentation, step in which we
|
|||||||
add classification to pixels. To train these segmentation models we use **target maps**
|
add classification to pixels. To train these segmentation models we use **target maps**
|
||||||
that have the desired classification maps.
|
that have the desired classification maps.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
> [!TIP]
|
||||||
|
> During a `skip-connection`, if the feature map resulting from the `upsampling` is smaller
|
||||||
|
> than the encoder one, it is possible to **crop** the encoder feature map and then concatenate.
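
A small sketch of the crop-then-concatenate step from the tip above, in plain Python. The channels-first nested-list layout, the center crop, and the names `center_crop` and `crop_and_concat` are assumptions chosen for illustration.

```python
def center_crop(feature_map, target_h, target_w):
    """Keep the central target_h x target_w window of a 2D feature map."""
    h, w = len(feature_map), len(feature_map[0])
    top, left = (h - target_h) // 2, (w - target_w) // 2
    return [row[left:left + target_w] for row in feature_map[top:top + target_h]]


def crop_and_concat(skip, upsampled):
    """Crop each skip channel to the upsampled size, then stack the channels."""
    target_h, target_w = len(upsampled[0]), len(upsampled[0][0])
    cropped = [center_crop(channel, target_h, target_w) for channel in skip]
    return cropped + upsampled  # all channels now share one spatial size


# One 4x4 encoder channel and one 2x2 upsampled channel
skip = [[[r * 4 + c for c in range(4)] for r in range(4)]]
upsampled = [[[100, 101], [102, 103]]]
merged = crop_and_concat(skip, upsampled)
```

Here `merged` holds two 2x2 channels: the central window of the encoder map followed by the upsampled map.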
|
||||||
|
|
||||||
### Architecture
|
### Architecture
|
||||||
|
|
||||||
- **Encoder**:\
|
- **Encoder**:\
|
||||||
@ -79,6 +93,49 @@ that have the desired classification maps.
|
|||||||
came from. Basically we concatenate a previous convolutional block with the
|
came from. Basically we concatenate a previous convolutional block with the
|
||||||
convoluted one and we make a convolution of these layers.
|
convoluted one and we make a convolution of these layers.
|
||||||
|
|
||||||
|
### Pseudo Algorithm:
|
||||||
|
|
||||||
|
```python
# conv, max_pool, upsample and concat are placeholders for the real layers

IMAGE = [[...]]

skip_stack = []
result = IMAGE[:]

# Encode 4 times
for _ in range(4):

    # Convolve 2 times
    for _ in range(2):
        result = conv(result)

    # Save the features for the skip connection, then downsample
    skip_stack.append(result[:])
    result = max_pool(result)

# Middle convolution
for _ in range(2):
    result = conv(result)

# Decode 4 times
for _ in range(4):

    # Upsample
    result = upsample(result)

    # Skip Connection
    skip_connection = skip_stack.pop()
    result = concat(skip_connection, result)

    # Convolve 2 times
    for _ in range(2):
        result = conv(result)

# Last convolution
RESULT = conv(result)
```
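
To see where cropping becomes necessary, the loop above can be traced on spatial sizes alone. This is a sketch under stated assumptions: unpadded 3x3 convolutions (each removes 2 pixels per side length), 2x2 max pooling, upsampling by a factor of 2, and a final 1x1 convolution that leaves the size unchanged. The function name `unet_sizes` and the 572-pixel input are illustrative choices.

```python
def unet_sizes(size, depth=4, convs_per_block=2, kernel=3):
    """Trace the spatial size through the encoder/decoder loop.

    Assumes unpadded convolutions (each removes kernel - 1 pixels),
    2x2 max pooling, and upsampling by a factor of 2.
    """
    shrink = convs_per_block * (kernel - 1)
    skip_stack, crops = [], []

    for _ in range(depth):          # encoder
        size -= shrink              # two convolutions
        skip_stack.append(size)     # saved for the skip connection
        size //= 2                  # max pooling

    size -= shrink                  # middle convolutions

    for _ in range(depth):          # decoder
        size *= 2                   # upsampling
        skip = skip_stack.pop()
        crops.append((skip, size))  # skip map is cropped from `skip` to `size`
        size -= shrink              # two convolutions after the concat

    return size, crops


final_size, crops = unet_sizes(572)
```

Under these assumptions every skip connection is larger than the upsampled map it meets, so each one needs a crop before the concatenation.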
|
||||||
|
|
||||||
<!-- TODO: See PDF anelli 10 to see complete architecture -->
|
<!-- TODO: See PDF anelli 10 to see complete architecture -->
|
||||||
|
|
||||||
## Variational Autoencoders
|
## Variational Autoencoders
|
||||||
@ -96,13 +153,15 @@ To achieve this, our **point** will become a **distribution** over the `latent-s
|
|||||||
and then we'll sample from there and decode the point. We then operate as normally by
|
and then we'll sample from there and decode the point. We then operate as normally by
|
||||||
backpropagating the error.
|
backpropagating the error.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
### Regularization Term
|
### Regularization Term
|
||||||
|
|
||||||
We use `Kullback-Leibler` to see the difference in distributions. This has a
|
We use `Kullback-Leibler` to see the difference in distributions. This has a
|
||||||
**closed form** in terms of **mean** and **covariance matrices**
|
**closed form** in terms of **mean** and **covariance matrices**
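
For reference, a sketch of that closed form, assuming a $k$-dimensional latent space and a standard normal target $N(0, I)$ (the symbol $k$ is introduced here only for illustration):

$$
KL[N(\mu_{x}, \Sigma_{x}), N(0, I)] = \frac{1}{2} \left( tr(\Sigma_{x}) + \mu_{x}^{T} \mu_{x} - k - \ln \det \Sigma_{x} \right)
$$

With a diagonal covariance $\Sigma_{x} = diag(\sigma_{1}^{2}, \dots, \sigma_{k}^{2})$ this reduces to $\frac{1}{2} \sum_{i} \left( \sigma_{i}^{2} + \mu_{i}^{2} - 1 - \ln \sigma_{i}^{2} \right)$.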
|
||||||
|
|
||||||
The importance of regularization makes it so that these encoders are both continuous and
|
The importance of regularization makes it so that these encoders are both continuous and
|
||||||
complete (each point is meaningfull). Without it we would have too similar results in our
|
complete (each point is meaningful). Without it we would have too similar results in our
|
||||||
regions. Also this makes it so that we don't have regions ***too concentrated and
|
regions. Also this makes it so that we don't have regions ***too concentrated and
|
||||||
similar to a point, nor too far apart from each other***
|
similar to a point, nor too far apart from each other***
|
||||||
|
|
||||||
@ -112,6 +171,14 @@ $$
|
|||||||
L(x) = ||x - \hat{x}||^{2}_{2} + KL[N(\mu_{x}, \Sigma_{x}), N(0, 1)]
|
L(x) = ||x - \hat{x}||^{2}_{2} + KL[N(\mu_{x}, \Sigma_{x}), N(0, 1)]
|
||||||
$$
|
$$
|
||||||
|
|
||||||
|
> [!TIP]
|
||||||
|
> The reason behind `KL` as a regularization term is that we **don't want our encoder
|
||||||
|
> to cheat by mapping different inputs in a specific region of the latent space**.
|
||||||
|
>
|
||||||
|
> This regularization makes it so that all points are mapped to a Gaussian over the latent space.
|
||||||
|
>
|
||||||
|
> On a side note, `KL` is not the only regularization term available, but it is the most common.
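
The loss above can be sketched in plain Python, assuming a diagonal covariance so that the `KL` term against a standard normal has a simple per-dimension closed form. The function names and the example values are assumptions made for illustration.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL[N(mu, diag(sigma^2)), N(0, I)] using the diagonal closed form."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))


def vae_loss(x, x_hat, mu, sigma):
    """Squared reconstruction error plus the KL regularization term."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_to_standard_normal(mu, sigma)


# A code that already matches N(0, I) pays no regularization penalty
kl_matched = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# Moving the mean away from the origin is penalized
kl_shifted = kl_to_standard_normal([2.0, 0.0], [1.0, 1.0])
# Collapsing the variance toward a single point is penalized as well
kl_collapsed = kl_to_standard_normal([0.0, 0.0], [0.01, 0.01])
```

The last two values are both positive, which matches the two failure modes the regularization guards against: regions that drift far apart and regions that shrink to a point.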
|
||||||
|
|
||||||
### Probabilistic View
|
### Probabilistic View
|
||||||
|
|
||||||
- $\mathcal{X}$: Set of our data
|
- $\mathcal{X}$: Set of our data
|
||||||
@ -144,16 +211,31 @@ and identically distributed:
|
|||||||
|
|
||||||
- $p(y) = \mathcal{N}(0, I) \rightarrow p(x|y) = \mathcal{N}(f(y), cI)$
|
- $p(y) = \mathcal{N}(0, I) \rightarrow p(x|y) = \mathcal{N}(f(y), cI)$
|
||||||
|
|
||||||
Since we technically need an integral over the denominator, we use
|
Since the denominator requires an integral that is generally intractable, we use
|
||||||
**approximate techniques** such as **Variational INference**
|
**approximate techniques** such as **Variational Inference**, which are easier to compute.
|
||||||
|
|
||||||
### Variational Inference
|
### Variational Inference
|
||||||
|
|
||||||
<!-- TODO: See PDF 10 pgs 59 to 65-->
|
This approach approximates the **target distribution with a simpler one that is as close to it as possible**.
|
||||||
|
|
||||||
|
Let's approximate $p(z|x)$, the **probability of a latent vector given the input**, by using
|
||||||
|
a Gaussian distribution $q_x(z)$, **defined by 2 functions that depend on $x$**:
|
||||||
|
|
||||||
|
$$
|
||||||
|
q_x(z) = \mathcal{N}(g(x), h(x))
|
||||||
|
$$
|
||||||
|
|
||||||
|
These functions belong to two function families, $g \in G$ and $h \in H$.
|
||||||
|
Our final target is then to find the optimal $g$ and $h$ over these
|
||||||
|
sets, and this is why we add the `KL` divergence to the loss:
|
||||||
|
|
||||||
|
$$
|
||||||
|
L(x) = ||x - \hat{x}||^{2}_{2} + KL\left[q_x(z), \mathcal{N}(0, 1)\right]
|
||||||
|
$$
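
A sketch of why the `KL` term shows up, assuming the latent variable $z$ here plays the role of $y$ in the probabilistic view, with prior $p(z) = \mathcal{N}(0, I)$ and likelihood $p(x|z) = \mathcal{N}(f(z), cI)$. Choosing $g$ and $h$ so that $q_x(z)$ is as close as possible to $p(z|x)$ gives:

$$
\begin{aligned}
(g^*, h^*) &= \underset{(g, h) \in G \times H}{\arg\min} \; KL\left[q_x(z), p(z|x)\right] \\
&= \underset{(g, h) \in G \times H}{\arg\max} \; \mathbb{E}_{z \sim q_x}\left[\log p(x|z)\right] - KL\left[q_x(z), p(z)\right] \\
&= \underset{(g, h) \in G \times H}{\arg\max} \; \mathbb{E}_{z \sim q_x}\left[-\frac{||x - f(z)||^{2}_{2}}{2c}\right] - KL\left[q_x(z), \mathcal{N}(0, I)\right]
\end{aligned}
$$

The second line follows from Bayes' rule, since $\log p(x)$ does not depend on $g$ or $h$. Up to the weighting set by $c$, maximizing the last line is the same as minimizing the reconstruction error plus the `KL` term, with $\hat{x} = f(z)$.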
|
||||||
|
|
||||||
### Reparametrization Trick
|
### Reparametrization Trick
|
||||||
|
|
||||||
Since $y$ is **technically sampled**, this makes it impossible
|
Since $y$ ($\hat{x}$) is **technically sampled**, this makes it impossible
|
||||||
to backpropagate the `mean` and `std-dev`, thus we add another
|
to backpropagate the `mean` and `std-dev`, thus we add another
|
||||||
variable, sampled from a *standard gaussian* $\zeta$, so that
|
variable, sampled from a *standard gaussian* $\zeta$, so that
|
||||||
we have
|
we have
|
||||||
@ -161,3 +243,6 @@ we have
|
|||||||
$$
|
$$
|
||||||
y = \sigma_x \cdot \zeta + \mu_x
|
y = \sigma_x \cdot \zeta + \mu_x
|
||||||
$$
|
$$
|
||||||
|
|
||||||
|
Since $\zeta$ has no parameters to learn, it needs no backpropagation, and the gradient can
|
||||||
|
flow through both `mean` and `std-dev`.
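
A minimal sketch of the trick in plain Python, for a single scalar latent. The function name `sample_latent`, the seed, and the values of `mu` and `sigma` are assumptions made for illustration; the finite-difference check at the end only confirms the derivatives written in the function.

```python
import random


def sample_latent(mu, sigma, rng):
    """Reparametrized sample: the randomness lives only in zeta."""
    zeta = rng.gauss(0.0, 1.0)   # standard gaussian, no parameters to learn
    y = sigma * zeta + mu
    # Derivatives of y with respect to the parameters, with zeta held fixed
    dy_dmu, dy_dsigma = 1.0, zeta
    return y, zeta, dy_dmu, dy_dsigma


rng = random.Random(0)
mu, sigma = 0.5, 2.0
y, zeta, dy_dmu, dy_dsigma = sample_latent(mu, sigma, rng)

# Finite-difference check, reusing the same zeta
eps = 1e-6
numeric_dmu = ((sigma * zeta + mu + eps) - y) / eps
numeric_dsigma = (((sigma + eps) * zeta + mu) - y) / eps
```

Once $\zeta$ is drawn, $y$ is an ordinary differentiable function of the `mean` and `std-dev`, so the usual chain rule applies.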
|
||||||
|
|||||||