Diffusion
Text2image Architecture
Conditioner (usually Text Encoder)
Encodes the text prompt into a fixed number of tokens (e.g. 77), each embedded over many dimensions (e.g. 768). The conditioner could also be an encoder for other formats, such as inpainting masks, silhouettes, or depth maps.
Usually, for text, we use a transformer such as CLIP's text encoder or BERT
Note
CLIP's text encoder is trained to associate images with their captions, so that an encoded image lies close to its caption's encoding in a shared latent space
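As a concrete illustration, here is a minimal sketch of text encoding with Hugging Face's transformers library, assuming the openai/clip-vit-large-patch14 checkpoint (the text encoder used by Stable Diffusion v1):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photograph of an astronaut riding a horse"
# Pad/truncate to CLIP's fixed context length of 77 tokens.
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
# One 768-dim embedding per token: shape (1, 77, 768).
embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 77, 768])
```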
Image Information Creator
This component creates the image information directly in latent space. It exposes a steps parameter that controls the number of denoising iterations, typically 50 or 100.
At its core it usually has a U-Net neural network and a scheduling algorithm. It takes what the Conditioner produces, plus a noise tensor.
In order to work with text embeddings, we interleave ResNet blocks with attention blocks and use some residual connections
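A sketch of the resulting denoising loop, written in the spirit of the diffusers API; here `unet` and `scheduler` are assumed stand-ins for a loaded conditional U-Net and noise scheduler:

```python
import torch

def create_image_information(text_embeddings, unet, scheduler, steps=50):
    # Start from a pure Gaussian noise tensor in latent space (4 x 64 x 64 here).
    latents = torch.randn(1, 4, 64, 64)
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        # The U-Net predicts the noise present at step t, conditioned on
        # the text embeddings via its cross-attention blocks.
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        # The scheduler removes the predicted noise to get the next latent.
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```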
Creator U-Net
Tip
To speed up the process, operate on a compressed version of the image produced by a variational autoencoder
For each step we add some noise to an image and the core network needs to predict which noise was added. This works because we discretize the amount of noise added at each step, so every step corresponds to a known noise level.
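A minimal sketch of that training step, assuming a DDPM-style schedule; `unet` and `alphas_cumprod` are hypothetical stand-ins for the network and the cumulative noise schedule:

```python
import torch
import torch.nn.functional as F

def training_step(unet, clean_latents, text_embeddings, alphas_cumprod):
    # Pick a random discretized noise level for each sample in the batch.
    t = torch.randint(0, len(alphas_cumprod), (clean_latents.shape[0],))
    noise = torch.randn_like(clean_latents)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    # DDPM forward process: mix the clean latent with noise at level t.
    noisy_latents = a.sqrt() * clean_latents + (1 - a).sqrt() * noise
    # The network must recover which noise was added at step t.
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample
    return F.mse_loss(noise_pred, noise)
```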
Caution
We can use this method even in latent space, which is 48 times smaller than pixel space.
If you train at a given resolution, that becomes your maximum resolution; generate beyond it and you'll get artifacts
When in latent space, however, we must have a VAE whose decoder is fine-tuned to paint fine details.
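A sketch of the compression step, assuming diffusers' AutoencoderKL and the stabilityai/sd-vae-ft-mse checkpoint (a decoder fine-tuned exactly for this detail-painting purpose); shapes are illustrative:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.randn(1, 3, 512, 512)            # 512*512*3 = 786,432 values
latents = vae.encode(image).latent_dist.sample()
print(latents.shape)                            # (1, 4, 64, 64) = 16,384 values
# 786,432 / 16,384 = 48 -> the "48 times smaller" latent space.
decoded = vae.decode(latents).sample            # back to (1, 3, 512, 512)
```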
When we have text, its encoded version steers the noise that our U-Net predicts at each step
Noise Scheduling
We can tell the model how much noise to expect, and remove, at each step during the decoding phase.
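For example, a minimal linear beta schedule in the DDPM style (the exact values and schedule shape vary by model; cosine and scaled-linear variants also exist):

```python
import torch

def linear_beta_schedule(steps=1000, beta_start=1e-4, beta_end=0.02):
    # Per-step noise variances, increasing linearly over the steps.
    betas = torch.linspace(beta_start, beta_end, steps)
    # alphas_cumprod[t] tells how much of the original signal survives at step t.
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    return betas, alphas_cumprod

betas, alphas_cumprod = linear_beta_schedule()
```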
Cross Attention
We take two sequences: one produces the Query vectors and the other generates the Keys and Values
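A minimal sketch of such a layer; dimensions are illustrative, and real implementations add multi-head splitting and an output projection:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, latent_dim, bias=False)
        self.to_k = nn.Linear(text_dim, latent_dim, bias=False)
        self.to_v = nn.Linear(text_dim, latent_dim, bias=False)

    def forward(self, latents, text_embeddings):
        q = self.to_q(latents)              # sequence 1 -> Queries
        k = self.to_k(text_embeddings)      # sequence 2 -> Keys
        v = self.to_v(text_embeddings)      # sequence 2 -> Values
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return scores.softmax(dim=-1) @ v
```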
Image Decoder
Takes the output of the Image Information Creator (the final latent) and decodes it into an image.
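Continuing the VAE sketch above, a hedged example of turning the final latents into a saveable image; the 0.18215 scaling factor is Stable Diffusion's convention:

```python
import torch
from PIL import Image

# Assume `latents` is the Image Information Creator's output and `vae`
# the autoencoder from the sketch above. Stable Diffusion pipelines
# divide the latents by a scaling factor (0.18215) before decoding.
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample   # (1, 3, 512, 512), ~[-1, 1]
image = (image / 2 + 0.5).clamp(0, 1)              # rescale to [0, 1]
array = (image[0] * 255).permute(1, 2, 0).byte().cpu().numpy()
Image.fromarray(array).save("output.png")
```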
Classifier Free Guidance
It is a parameter that controls how strongly the guidance (e.g. the text prompt) steers the model towards a specific objective
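A sketch of how that parameter enters the sampling loop; `unet` is a hypothetical stand-in as above, and 7.5 is a common default in Stable Diffusion:

```python
def cfg_noise_pred(unet, latents, t, cond_emb, uncond_emb, guidance_scale=7.5):
    # Run the U-Net twice: once with the text conditioning, once with an
    # unconditional (empty-prompt) embedding.
    noise_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
    noise_cond = unet(latents, t, encoder_hidden_states=cond_emb).sample
    # guidance_scale = 1 means no guidance; larger values follow the
    # prompt more closely at the cost of diversity.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```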