Deep-Learning/Chapters/17-Flow-Generative-Models/INDEX.md

# Flow Models

In generative modelling, what we are trying to achive is a remap of a known
distribution, like a Guassian, to another distribution, like colors of pixels in
face images.

le't imagine now that all distributions live in the same space, a bit like cities.
Our objective is to make our citizen move from a city to another, thus we need
to find the right mapping.

## Vector field of a flow

> [!CAUTION]
> Here we talk about speed, position and velocity. This is just an analogy to
> make the learning process smoother.

Let's sary that we have a [flow](./../15-Appendix-A/INDEX.md#flow), which in this
context can be seen as the position function of $\vec{x}$ at time $t$. The
associated vector field is simply:

$$
v_{t}(\vec{x}) = \frac{d \,\varphi_t(\vec{x})}{d \, t}\bigg \rvert_{t= 0}
$$

This means that our vector field, read here as velocity, is the derivative of
the flow, read as position, over time and evaluated at time $t = 0$. This means
that we just need to know $x_0$, as the position at $t = 0$ is the initial
position.

In particular, for the flow properties, we have that:

$$
\begin{aligned}
    \varphi_t{(\vec{x})} &= \vec{x}(t) \\
    \varphi_0{(\vec{x})} &= \vec{x}_o \\
    \frac{d \,\varphi_t(\vec{x}_0)}{d \, t} &= f(\varphi_t(\vec{x}_0))
\end{aligned}
$$

The last holds as the vector field is in function of t, thus even $\vec{x}$
changes in function of time, and is equal to the flow.

![flow and vector field of the flow evolution](./pngs/diffusion-mit.png)

> Image taken from [An Introduction to Flow Matching and Diffusion Models](https://arxiv.org/pdf/2506.02070)

> [!NOTE]
> As it is more evident by the image, the flow is basically a map of position of
> points. Since the position of each point changes over time, a flow warps
> space.
>
> An interesting thing is that it gives us a snapshot of points are at time $t$.
>
> Instead, the vector field $v$ gives us how these position will be modified
> in the next pictures, giving us the instant velocity of all points, **allowing
> us to predict where points will move before taking the next snapshot**.

## Mapping distributions via flow

It follows that a velocity function $v$ can take $x$ to a probability $p$ only
if it's flow goes there:

$$
v_t \text{ generates } p_t \iff \mathcal{X}_t = \varphi_t(\mathcal{X}_0) \sim p_t
$$

in other words, taken a random variable $\mathcal{X_0}$, we can say that is sampled
from $p_t$ at time $t$ only if it goes into its probability boundaries.

## Normalizing Flows

It is a technique in which we try to find several simple flows that will
create a map of a distribution $q$ to a distribution $p$.

Let's say that we want to learn a map $T_W : q \rightarrow p$, that we will call
$T_W\#q$ pushforward of $q$ by $T$, we would need to minimize a Kullback-Leibler
divergence:

$$
\begin{aligned}
    D_{KL}(T_W\#q || p) &= \int_x \frac{\log(p(x))}{\log(T_W\#q(x))} p(x) dx = \\
        &= \underbrace{\int_x \log(p(x))p(x)\,dx}_\text{static, not controllable} -
        \underbrace{\int_x \log(T_W\#q(x))p(x)\,dx}_\text{modifiable}
\end{aligned}
$$

Since we can ignore the static part, conditioned by only the true distribution,
we take only the second part, which is equal to the expected value. We
discretize it and we get

$$
\mathbb{E_{x \sim p}}[\log(\,T_W\#q(x)\,)] = \sum_{x} \log(\,T_W\#q(x)\,)p(x)
$$

Guess what, we have all these values, and this is similar to the
[Negative Log Likelihood](./../4-Loss-Functions/INDEX.md#nllloss), so, with a
bit of notation changing, we get:

$$
Loss(X, Y) = - \sum_{i}^{N} \log(T_W(\vec{x})^{(i)})
$$

The thing is that we can't compute this directly, as there's no ground truth
we can use to guide this. So, let's rewrite this in terms by applying the
[change of variables](./../15-Appendix-A/INDEX.md#change-of-variables-in-probability) we get that:

$$
p(\vec{y}) = q(T_W^{-1}(\vec{y})) \cdot \bigg | J_{T_W^{-1}}(\vec{y})\bigg |
$$

Since it's complicated to derive $T_W$ because it must be *invertible* and
*differential*, this is simpler to achieve by composing $T_W$ as
$T_W = \varphi_K \circ \dots \varphi_1$. However these flows lose expressivity
and they are complex to evaluate as $K$ goes bigger.

## Continuous Normalizing Flows

To solve the problem of expressivity, a solution is to solve the ODE with the
euler method

## Flow Matching

The idea is to train a velocity function that will bring us to the right
distribution, and this is our flow matching. By treating it like an energy based
model, we get:

$$
E_t(\mathcal{X}_0, \mathcal{X}_1) =
    || v^W_t(\mathcal{X}_t) - v_t(\mathcal{X}_t) ||^2
$$

Since we need our point to travel how we want it, we influence its velocity.
Thus, by minimizing this error, we get the target trajectory.

However, there's a problem... We don't know $v_t(\mathcal{X}_t)$ and we must
find a method to derive it

### Midpoint Method (aka Modified Euler's method)

The actual algorithm from [wikipedia](https://en.wikipedia.org/wiki/Midpoint_method) is:

$$
y_{n + 1} = y_{n} + hf\left(y_n + \frac{h}{2}f(y_n, t_n), t_n + \frac{h}{2}\right)
$$

translated to code becomes:

```python
# Technically its velocity, not speed
#   velocity:   vector
#   speed:      magnitude
def compute_speed(old_position: list[], current_time: float) -> list[]:
    # this function is what
    #   we want to find
    pass

def compute_new_position(
    old_position: list[],
    old_speed: list[],
    current_time: float,
    time_step: float
):
    step = time_step / 2

    half_point_speed = compute_speed(
        old_position + step * old_speed,
        time + step
    )

    new_pos = old_position + step * half_point_speed
    return new_pos
```

Obviously we don't know what the speed function is, thus we just need to **learn**
it. Now, inverting the formula we get that (with some abuse of notations) our
target velocity in that point is:

$$
f_{t + \frac{h}{2}} = \frac{y_{n + 1} - y_n}{h}
$$

And since during training time we can compute this, as we have both
$y_n$ and $y_{n+1}$, we just need to set h to an arbitrary value, say $1$ and we
can easily compute this.

Now, we just need to learn the velocity function, and this is just the gradient
descent of the difference between our learnt function and the computed one.
Then, if we want to use an energy model, we get:

$$
E_t(\mathcal{X}_0, \mathcal{X}_1) =
    || v^W_t(\mathcal{X}_t) - (\mathcal{X}_1 - \mathcal{X}_0) ||^2
$$

> [!NOTE]
> There's another method, applying Markov chains, though it's
> basically a flow matching with Euler's standard method where
> $h \rightarrow \infty$ and so has multiple steps

> [!CAUTION]
> This method is perfectly equivalent with using a conditional flow matching where
> $z$ is sampled from a dirac distribution (basically a Normal distribution with
> $\sigma = 0$)

## Conditional Flow Matching

Now, let's say that we want to describe the velocity $v$ in other ways, the problem
is that there isn't a unique path to go from $q$ to $p$

![infinite probability paths](./pngs/infinite-paths.gif)

> GIF taken from https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/

So the problem is now to describe this probability path over time. However we don't
know $p$ analytic function, but just a bunch of data points sampled from it.

The idea to solve this is to define a probability path, read velocity, that is
conditioned by a variable $z$, conditioning variable, so that it is $p(x | t, z)$,
so that, once we choose a particular $z$, we get a $p(x | t)$ that brings $x$
from $q$ to $p$ and that has an analytic form.

The trick here is to find the probability $p(x, t)$ by marginalizing over
$p(x |t, z)$. In practice this means that we just have to sum (or integrate) over
all possible $z$ values.

Since $z$ is a random variable, a couple of techniques are one of sampling
from a Linear Interpolation, or from a conical gaussian path[^a-visual-dive]

$$
\text{Linear Interpolation} \\
\begin{cases}
    \mu = (1 - t)\cdot x_q + t \cdot x_p \\
    \sigma = 0 \\
    p(x | t, z = (x_q, x_p)) = \mathcal{N}(\mu, \sigma^2 I) \\
\end{cases}
\\
\text{ }
\\
\text{Velocity} \rightarrow v(x, t) = x_p - x_q
\\
\text{ }
\\
\text{ }
\\
\text{Conical Gaussian Path} \\
\begin{cases}
    \mu =  t \cdot x_p \\
    \sigma = (1 - t)^2 \\
    p(x | t, z = (x_p)) = \mathcal{N}(\mu, \sigma^2 I) \\
\end{cases}
\\
\text{ }
\\
\text{Velocity} \rightarrow v(x, t) = \frac{x - x_p}{1 - t}
$$

To find the velocity, it's possible to use the [continuity equation](https://en.wikipedia.org/wiki/Continuity_equation)

[^a-visual-dive]: [A Visual Dive into Conditional Flow Matching | 26th November 2025](https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/)

## Conditional Optimal Flow

One thing that could happen during the construction of our paths is that they cross.
To solve this problem, we may choose couples of data such that they are coupled.
This means that they won't cross.

In practice we only use minibatch Optimal Transport as it's costly to compute

## ReFlow

Technically speaking a reflow is nothing more than what we already saw in flow
matching[^flow-with]. The difference lies in how we retrain the model and how
we sample points.

We first train a model like in flow matching, and then we train another model
where true labels are not provided. In fact the taget lables provided will come
from our original model.

While we normally would think that our original model learnt corssing trajectories,
this is not the case. Plus, a better sampling provided by chaning euler to
4th-order Runge-Kutta, we provide a better integration method.

By combining both techniques, our reflowed model will learn straighter trajectories.

> [!WARNING]
> It's not necessary that the 2 models are different or equal, nor you need to
> re-use the former model weights.

[^flow-with]: [Flow With What You Know | 26th November 2025](https://drscotthawley.github.io/blog/posts/FlowModels.html#abstract)

## Guidance for generation

As for diffusion, we now know how to generate the target distribution, but
what if the target distribution is the distribution of valid images and we want
a picture of a dog?

### Classifier Guidance

We could take our unconditioned model and use a classifier, plus another input used
to guide, to change our model parameters (so, our velocity). However this means
that we need a classifier that tells us if our generated output matches the label
or not:

$$
v_{W, t}(x | y) = v(x) + wb_t \nabla \log p_{Y |t}(y | x)
$$

This means that our new velocity will be influenced by the weighted
conditioned probability of obtaining $y$ starting from $x$. Now we don't need to
retrain our model from 0.

Since we are tuning 2 different models together, their magnitudes do not combine
well, leading to potential problems. Moreover the classifier is not *"perfect"*,
so we need to consider its errors as well

### Classifier Free Guidance

Instead of using a classifier, we can fine retrain our model to consider conditioning.
Starting from the classifier equation for the velocity, we can demonstrate that:

$$
u_{W, t}(x | y) = (1 - w)v(x) + w \cdot v(x | y)
$$

Now, we can reuse the previous model and we will retrain it by passing
$y = \emptyset \text{ or } y \in \mathcal{Y}$ with probability $\eta$ for
being empty.

Once retrained, during inference can sample by using this formula, where
$y = \emptyset$ if it comes from an unguided sampling and $y \in \mathcal{Y}$ if
it's unguided.

$$
d X_t = [(1 - w)v_{W, t}(X_t | \emptyset) + wv_{W, t}(X_t | y)] dt \\
w > 1
$$

As you can notice, since $w > 1$, we are slowing the conditioned velocity with
the unconditioned velocity, that is dampened. Moreover, if the prompt is unconditioned,
this would be equal to the unmodified unconditional velocity.

## Stable Diffusion 3

## References

- [An Introduction to Flow Matching and Diffusion Models](https://arxiv.org/pdf/2506.02070)
- [Introduction to Flow Matching and Diffusion Models](https://diffusion.csail.mit.edu/)
- [A Visual Dive into Conditional Flow Matching](https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/)
- [Rectified Flow](https://www.cs.utexas.edu/~lqiang/rectflow/html/intro.html)
- [An introduction to Flow Matching ](https://mlg.eng.cam.ac.uk/blog/2024/01/20/flow-matching.html#coupling)
- [Flow With What You Know](https://drscotthawley.github.io/blog/posts/FlowModels.html#abstract)
- [Diffusion Meets Flow Matching](https://diffusionflow.github.io/)