# Diffusion, Text2Image Architecture

<!-- TODO: Read Imagen Paper -->
<!-- TODO: Read Hypernetwork Paper -->
<!-- TODO: Read PerceiverIO Paper -->
<!-- TODO: Read SelfDoc Paper -->
## Conditioner (usually a Text Encoder)

The conditioner encodes text into a fixed number of tokens (e.g. 77), each spanning many dimensions (e.g. 768). However, it could also be an encoder of other formats such as inpainting masks, silhouettes, depth maps, etc...

Usually, for text, we use a transformer such as CLIP's text encoder or BERT.

> [!NOTE]
> CLIP's text encoder is trained to associate images with their captions, so that an encoded image and its encoded caption end up close together in latent space.
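Below is a minimal sketch of this step with the Hugging Face `transformers` library, assuming the CLIP text encoder used by Stable Diffusion v1 (`openai/clip-vit-large-patch14`); the 77-token, 768-dimension shapes mentioned above come from this model.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a girl in the park"
tokens = tokenizer(
    prompt,
    padding="max_length",  # always pad up to the full context length
    max_length=77,         # the 77 tokens mentioned above
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    # One embedding per token: shape (1, 77, 768)
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(text_embeddings.shape)  # torch.Size([1, 77, 768])
```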
## Image Information Creator

This component creates information directly in the **latent space** and has a parameter called `steps`, which controls the number of denoising steps, typically 50 or 100.

At its core it usually has a U-Net and a scheduling algorithm. It takes what the [Conditioner](#conditioner-usually-a-text-encoder) produces, plus a noise tensor.

In order to work with text embeddings, we need to interleave ResNet blocks with attention blocks and use some residual connections.
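As a sketch of how these pieces fit together, here is a bare denoising loop written against the `diffusers` library, assuming a Stable Diffusion v1 checkpoint (`CompVis/stable-diffusion-v1-4`) and the `text_embeddings` from the conditioner sketch above; a real pipeline adds guidance, safety checks and VAE decoding on top of this.

```python
import torch
from diffusers import UNet2DConditionModel, DDIMScheduler

repo = "CompVis/stable-diffusion-v1-4"  # assumed checkpoint; any SD v1 model works
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")

scheduler.set_timesteps(50)  # the `steps` parameter
# The noise tensor the loop starts from, already in latent space (see below).
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma

for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        # The U-Net predicts the noise, conditioned on the text embeddings.
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    # The scheduler removes a bit of that noise at each step.
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```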
## Creator U-Net

<!-- TODO: add images -->

> [!TIP]
> To speed up the process, operate over a **compressed** version of the image made with a **variational autoencoder**.
At each step we add some noise to an image and the core network needs to predict which noise was added. We can do this because we discretize the amount of noise added at each step.

To create training examples, we generate some noise and then multiply it by a **noise amount level** that determines how much noise we add to our training image. This step is called **Forward Diffusion** and is done only for training.
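Here is a toy version of such a training example in plain PyTorch; the linear schedule, the image size and the scalar noise level are illustrative assumptions, not the exact ones used by Stable Diffusion.

```python
import torch

num_levels = 1000
noise_levels = torch.linspace(1e-4, 1.0, num_levels)  # discretized "noise amount levels"

clean_image = torch.rand(1, 3, 64, 64)      # stand-in for a training image
level = torch.randint(0, num_levels, (1,))  # pick one of the discrete steps at random
noise = torch.randn_like(clean_image)       # the noise the network must later predict

# Forward Diffusion: disturb the image by the chosen amount of noise.
noisy_image = clean_image + noise_levels[level].view(-1, 1, 1, 1) * noise
```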
We'll pass both the disturbed picture and its noise level as inputs to the Image Information Creator, while the noise itself will be the desired output, so the U-Net is trained to predict the noise pattern present in a noisy image. We then subtract this predicted noise from the noisy picture, effectively denoising it. This step is called **Reverse Diffusion** and is what we actually do during inference.
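Continuing the toy snippet above, training then reduces to regressing the added noise; `TinyDenoiser` is a deliberately trivial stand-in for the real U-Net, which would also receive the text embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Trivial stand-in for the U-Net: maps (noisy image, noise level) -> predicted noise."""

    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3 + 1, 3, kernel_size=3, padding=1)

    def forward(self, noisy, level_value):
        # Broadcast the scalar noise level as an extra input channel.
        level_map = level_value.view(-1, 1, 1, 1).expand(-1, 1, *noisy.shape[-2:])
        return self.net(torch.cat([noisy, level_map], dim=1))

model = TinyDenoiser()
predicted_noise = model(noisy_image, noise_levels[level])
loss = F.mse_loss(predicted_noise, noise)  # learn to recover the noise that was added
loss.backward()

# One Reverse Diffusion step at inference would then be:
denoised = noisy_image - noise_levels[level].view(-1, 1, 1, 1) * predicted_noise
```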
While during training we pass a real picture and denoise it, at inference time we produce a random image and ask the model to denoise it. In this case, however, the image will be completely generated by our network, as there is no real underlying image.
> [!CAUTION]
> We can use this method even in latent space, which is 48 times smaller than the pixel one.
>
> If you train at a given resolution, that will be your maximum resolution: go beyond it and you'll get artifacts.
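The "48 times" figure follows from Stable Diffusion v1's default shapes, where a 3×512×512 RGB image is encoded into a 4×64×64 latent tensor:

$$
\frac{3 \times 512 \times 512}{4 \times 64 \times 64} = \frac{786432}{16384} = 48
$$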
### Speeding up Creation

Instead of operating over pixel space, we can operate over a latent space by employing a [VAE](../10-Autoencoders/INDEX.md#variational-autoencoders).

All forward and reverse diffusion operations will be done here. Now the noise will be a random tensor that we will call **latent noise**.

The reason why we can compress images into a latent space is that images are highly regular from a probabilistic point of view. However, this means losing fine details, which will be recovered at the decoding step by fine-tuning the decoder.
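A sketch of the round trip with the `diffusers` VAE, again assuming a Stable Diffusion v1 checkpoint; the 0.18215 scaling factor is the one stored in that model's config.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
scale = 0.18215  # vae.config.scaling_factor for SD v1

image = torch.randn(1, 3, 512, 512)  # stand-in for a normalized RGB image in [-1, 1]

with torch.no_grad():
    # Encode: 3x512x512 pixels -> 4x64x64 latent (the 48x compression above).
    latents = vae.encode(image).latent_dist.sample() * scale
    # Decode: back to pixel space; fine details are reconstructed by the decoder.
    reconstruction = vae.decode(latents / scale).sample

print(latents.shape)         # torch.Size([1, 4, 64, 64])
print(reconstruction.shape)  # torch.Size([1, 3, 512, 512])
```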
> [!NOTE]
> You'll probably hear of **VAE Files**. These are models that were trained
> on specific objectives to be able to recover finer details, but only
> over those specific types of images, as per the *no free lunch theorem*.
### Input Conditioning

To condition our output on another input, we use **Cross-Attention**, just like Transformers do.

The idea is that we take an intermediate U-Net representation, $\varphi_i(\vec{z})$, and the output of a Transformer or another network, $\tau_\theta(\vec{y})$.

We then use attention over these inputs:
$$
\begin{aligned}
&Q = W_Q \, \varphi_i(\vec{z})\\
&K = W_K \, \tau_\theta(\vec{y})\\
&V = W_V \, \tau_\theta(\vec{y})\\
&\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V
\end{aligned}
$$
> [!NOTE]
> While these sequences may all have different dimensions, and notations are
> a bit all over the place, the important thing is that the output has the
> same dimensions as $\varphi_i(\vec{z})$, which is our (latent) image tensor.

We take two sequences: one produces the `Query` vectors, the other generates the `Keys` and `Values`.
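A minimal single-head version of this cross-attention in PyTorch; the dimensions are illustrative assumptions (320 latent channels, 77×768 text embeddings), and the final projection back to `dim_z` is what makes the output match the dimensions of $\varphi_i(\vec{z})$.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: image features attend over conditioner tokens."""

    def __init__(self, dim_z, dim_y, dim_attn):
        super().__init__()
        self.w_q = nn.Linear(dim_z, dim_attn, bias=False)  # Q from the U-Net features
        self.w_k = nn.Linear(dim_y, dim_attn, bias=False)  # K from the conditioner
        self.w_v = nn.Linear(dim_y, dim_attn, bias=False)  # V from the conditioner
        self.out = nn.Linear(dim_attn, dim_z, bias=False)  # project back to image dims

    def forward(self, phi_z, tau_y):
        q = self.w_q(phi_z)                                  # (B, N_image, dim_attn)
        k = self.w_k(tau_y)                                  # (B, N_text,  dim_attn)
        v = self.w_v(tau_y)                                  # (B, N_text,  dim_attn)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        attn = scores.softmax(dim=-1)                        # attend over the text tokens
        return self.out(attn @ v)                            # (B, N_image, dim_z)

phi_z = torch.randn(1, 64 * 64, 320)  # flattened latent image features
tau_y = torch.randn(1, 77, 768)       # text embeddings from the conditioner
out = CrossAttention(320, 768, 320)(phi_z, tau_y)
print(out.shape)                      # torch.Size([1, 4096, 320])
```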
## Other Conditioning

### Image to image

Here $\tau_\theta(\vec{y})$ comes from $y$ being an image.

### Inpainting

Same as Img2Img, with the difference that we add noise only to the part we want to inpaint.
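A toy illustration of that masked-noising idea in latent space; the mask region and shapes are arbitrary assumptions.

```python
import torch

image_latents = torch.randn(1, 4, 64, 64)  # stand-in for the encoded source image
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0              # hypothetical square region to repaint

noise = torch.randn_like(image_latents)
# Noise only where the mask is 1; the rest of the image is left untouched.
noisy_latents = image_latents + noise * mask
```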
### Depth Conditioning

We use an image, from which a model extracts a depth map, plus a text prompt. The generated image will resemble the original one, but with a different style conditioned by the prompt.
## Guiding Diffusion

By tuning a parameter called **`classifier guidance`** we can tell our model how closely it should follow the conditioning.

Without this, the model may generate something that satisfies several possible conditionings: for example, given the prompt *"a girl in the park"*, the model could generate a girl in a park holding hands with a boy.
Historically this required another model, a classifier guide, that could steer our generative model. Nowadays, however, we use the architecture above to achieve the same result. This is called **Classifier-Free Guidance**.

Here, the parameter is just a scaling factor over the conditioning: the stronger it is, the fewer ambiguities in the image.
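At sampling time, classifier-free guidance usually boils down to running the U-Net twice and mixing the two predictions; this sketch reuses the `unet`, `latents`, `t` and `text_embeddings` from the denoising loop above, plus an `uncond_embeddings` tensor obtained by encoding an empty prompt.

```python
import torch

guidance_scale = 7.5  # the "classifier guidance" knob; higher = follow the prompt more closely

with torch.no_grad():
    # Unconditional prediction (empty prompt) and conditional prediction (our prompt).
    noise_uncond = unet(latents, t, encoder_hidden_states=uncond_embeddings).sample
    noise_cond = unet(latents, t, encoder_hidden_states=text_embeddings).sample

# Push the prediction away from the unconditional one, towards the conditioned one.
noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```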
<!-- TODO: add stable diffusion 1 vs 2 -->