# Autoencoders
Here we are trying to make a `model` to learn an **identity** function without
making it learn the **actual identity function**
$$
h_{\theta} (x) \approx x
$$
The innovation comes from the fact that we can ***compress*** `data` by using
an `NN` that has **fewer `neurons` per layer than the `input` dimension**, or that has
**fewer `connections` (sparse)**.
![autoencoder image](./pngs/autoencoder.png)
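
To make this concrete, here is a minimal sketch of an *undercomplete* autoencoder (PyTorch, the class name and the layer sizes are illustrative assumptions, not part of the original notes): the bottleneck has fewer `neurons` than the `input` dimension and the network is trained to reproduce its own input.

```python
import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, code_dim=32):
        super().__init__()
        # Bottleneck (code_dim) is smaller than the input dimension
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)      # dummy batch standing in for real data
x_hat = model(x)
loss = loss_fn(x_hat, x)     # the target is the input itself: h(x) ≈ x
loss.backward()
optimizer.step()
```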
## Compression
In a very simple fashion, we train a network to compress $\vec{x}$ into a more **dense** representation.
## Layerwise Training
To train an `autoencoder` we train `layer` by `layer`,
mitigating the `vanishing gradients` problem.
The trick is to train one `layer`, then use its output as the input for the next `layer`
and train over it as if it were our $x$. Rinse and repeat for roughly 3 `layers`.
If you want, **at the end**, you can add another `layer` that you train over the `data` to
**fine-tune**.
> [!TIP]
> This method works because, even though the gradient vanishes, we have **already** trained
> our upper layers, so they **already** have working weights.
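
A rough sketch of the layerwise procedure under the same assumptions (PyTorch; `train_autoencoder`, the layer sizes and the dummy data are illustrative, not from the original notes): each `layer` is trained as a small autoencoder on the codes produced by the layers already trained.

```python
import torch
from torch import nn

def train_autoencoder(encoder, decoder, data, epochs=10, lr=1e-3):
    """Illustrative helper: minimize ||decoder(encoder(x)) - x||^2 over `data`."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x in data:
            optimizer.zero_grad()
            loss = loss_fn(decoder(encoder(x)), x)
            loss.backward()
            optimizer.step()

layer_sizes = [784, 256, 64, 32]   # illustrative dimensions
data = [torch.rand(64, 784)]       # dummy batches standing in for real data
trained_layers = []

# Greedy layer-wise training: each new layer learns to reconstruct
# the codes produced by the already-trained stack below it.
for in_dim, out_dim in zip(layer_sizes, layer_sizes[1:]):
    encoder = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
    decoder = nn.Linear(out_dim, in_dim)
    train_autoencoder(encoder, decoder, data)
    trained_layers.append(encoder)
    # The codes of this layer become the "x" for the next one
    with torch.no_grad():
        data = [encoder(x) for x in data]

stacked_encoder = nn.Sequential(*trained_layers)  # optionally fine-tune end to end
```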
<!-- TODO: See Deep Belief Networks and Deep Boltzmann Machines-->
<!-- TODO: See Deep autoencoders training-->
## U-Net
It was developed to analyze medical images and to perform segmentation, a step in which we
add a classification to each pixel. To train these segmentation models we use **target maps**
that contain the desired classification for each pixel.
![u-net picture](./pngs/u-net.png)
> [!TIP]
> During a `skip-connection`, if the dimension resulting from the `upsampling` is smaller than
> the corresponding encoder feature map, it is possible to **crop** the encoder feature map and then concatenate.
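
A minimal sketch of this crop-and-concatenate step, assuming PyTorch tensors in `NCHW` layout (the sizes are illustrative):

```python
import torch

def crop_and_concat(encoder_feat, upsampled):
    """Center-crop the (larger) encoder feature map to the upsampled size, then concatenate."""
    _, _, h, w = upsampled.shape
    _, _, H, W = encoder_feat.shape
    top, left = (H - h) // 2, (W - w) // 2
    cropped = encoder_feat[:, :, top:top + h, left:left + w]
    return torch.cat([cropped, upsampled], dim=1)   # concatenate along the channel axis

skip = torch.rand(1, 64, 64, 64)        # encoder feature map (larger spatial size)
up = torch.rand(1, 64, 56, 56)          # upsampled decoder feature map
print(crop_and_concat(skip, up).shape)  # torch.Size([1, 128, 56, 56])
```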
### Architecture
- **Encoder**:\
came from. Basically we concatenate a previous convolutional block with the
convoluted one and we make a convolution of these layers.
### Pseudo Algorithm
```python
IMAGE = [[...]]

skip_stack = []
result = IMAGE[:]

# Encode 4 times
for _ in range(4):

    # Convolve 2 times
    for _ in range(2):
        result = conv(result)

    # Downsample
    skip_stack.append(result[:])
    result = max_pool(result)

# Middle convolution
for _ in range(2):
    result = conv(result)

# Decode 4 times
for _ in range(4):

    # Upsample
    result = upsample(result)

    # Skip Connection
    skip_connection = skip_stack.pop()
    result = concat(skip_connection, result)

    # Convolve 2 times
    for _ in range(2):
        result = conv(result)

# Last convolution
RESULT = conv(result)
```
<!-- TODO: See PDF anelli 10 to see complete architecture -->
## Variational Autoencoders
To achieve this, our **point** will become a **distribution** over the `latent-space`
and then we'll sample from there and decode the point. We then operate as usual by
backpropagating the error.
![vae picture](./pngs/vae.png)
### Regularization Term
We use the `Kullback-Leibler` divergence to measure the difference between the distributions. This has a
**closed form** in terms of the **mean** and **covariance matrices**.
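
As a reference (this formula is not spelled out in the notes), assuming a diagonal $\Sigma_{x} = \mathrm{diag}(\sigma_{x,j}^{2})$, the standard closed form against the standard normal is:

$$
KL\left[\mathcal{N}(\mu_{x}, \Sigma_{x}), \mathcal{N}(0, I)\right] =
\frac{1}{2} \sum_{j} \left( \sigma_{x,j}^{2} + \mu_{x,j}^{2} - 1 - \log \sigma_{x,j}^{2} \right)
$$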
The importance of regularization is that it makes these encoders both continuous and
complete (each point is meaningful). Without it we would get results that are too similar within our
regions. It also ensures that we don't have regions that are ***too concentrated and
similar to a point, nor too far apart from each other***.
$$
L(x) = ||x - \hat{x}||^{2}_{2} + KL[N(\mu_{x}, \Sigma_{x}), N(0, 1)]
$$
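
A sketch of how this loss is typically computed, assuming PyTorch and an encoder that outputs `mu` and `logvar` of a diagonal Gaussian (names are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction term plus closed-form KL against the standard normal N(0, I)."""
    recon = F.mse_loss(x_hat, x, reduction="sum")                 # ||x - x_hat||^2
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL[N(mu, sigma^2), N(0, I)]
    return recon + kl
```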
> [!TIP]
> The reason behind `KL` as a regularization term is that we **don't want our encoder
> to cheat by mapping different inputs into specific regions of the latent space**.
>
> This regularization makes it so that all points are mapped to a Gaussian over the latent space.
>
> On a side note, `KL` is not the only regularization term available, but it is the most common.
### Probabilistic View
- $\mathcal{X}$: Set of our data
Our data are assumed to be independent and identically distributed:
- $p(y) = \mathcal{N}(0, I) \rightarrow p(x|y) = \mathcal{N}(f(y), cI)$
Since we need an integral over the denominator, we use
**approximate techniques** such as **Variational Inference**, which are easier to compute.
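
To spell out where that integral comes from (this intermediate step is implicit in the notes): by Bayes' rule the posterior over the latent variable is

$$
p(y|x) = \frac{p(x|y)\, p(y)}{\int p(x|u)\, p(u)\, du}
$$

and the integral in the denominator is generally intractable.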
### Variational Inference
<!-- TODO: See PDF 10 pgs 59 to 65-->
This approach tries to approximate the **target distribution with one that is very close to it**.
Let's approximate $p(z|x)$, the **probability of having that latent vector given the input**, by using
a Gaussian distribution $q_x(z)$, **defined by 2 functions that depend on $x$**:
$$
q_x(z) = \mathcal{N}(g(x), h(x))
$$
These functions, $g(x)$ and $h(x)$, belong to the function families $G$ and $H$, that is $g(x) \in G$
and $h(x) \in H$. Our final target is then to find the optimal $g$ and $h$ over these
sets, and this is why we add the `KL` divergence to the loss:
$$
L(x) = ||x - \hat{x}||^{2}_{2} + KL\left[q_x(z), \mathcal{N}(0, 1)\right]
$$
### Reparametrization Trick
Since $y$ ($\hat{x}$) is **technically sampled**, it is impossible
to backpropagate through the `mean` and `std-dev`, thus we add another
variable $\zeta$, sampled from a *standard gaussian*, so that
we have
$$
y = \sigma_x \cdot \zeta + \mu_x
$$
For $\zeta$ we don't need any backpropagation, thus we can easily backpropagate to both the `mean`
and the `std-dev`.
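
A minimal sketch of the trick, assuming PyTorch and an encoder that outputs `mu` and `logvar` (illustrative names):

```python
import torch

def reparameterize(mu, logvar):
    """Sample y = sigma * zeta + mu with zeta ~ N(0, I), keeping mu and sigma differentiable."""
    sigma = torch.exp(0.5 * logvar)   # std-dev from the log-variance
    zeta = torch.randn_like(sigma)    # zeta needs no gradient: it is just noise
    return mu + sigma * zeta          # gradients flow to mu and sigma

mu = torch.zeros(4, 2, requires_grad=True)
logvar = torch.zeros(4, 2, requires_grad=True)
y = reparameterize(mu, logvar)
y.sum().backward()                    # gradients reach mu and logvar through the sample
```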