Flow Models

In generative modelling, what we are trying to achieve is a remapping of a known distribution, like a Gaussian, to another distribution, like the colors of pixels in face images.

Let's imagine now that all distributions live in the same space, a bit like cities. Our objective is to make our citizens move from one city to another, so we need to find the right mapping.

Vector field of a flow

Caution

Here we talk about speed, position and velocity. This is just an analogy to make the learning process smoother.

Let's say that we have a flow, which in this context can be seen as the position function of \vec{x} at time t. The associated vector field is simply:


v_{t}(\vec{x}) = \frac{d \,\varphi_t(\vec{x})}{d \, t}\bigg \rvert_{t= 0}

This means that our vector field, read here as velocity, is the derivative of the flow, read as position, with respect to time, evaluated at t = 0. In practice we just need to know \vec{x}_0, since the position at t = 0 is the initial position.

In particular, from the flow properties, we have that:


\begin{aligned}
    \varphi_t{(\vec{x})} &= \vec{x}(t) \\
    \varphi_0{(\vec{x})} &= \vec{x}_0 \\
    \frac{d \,\varphi_t(\vec{x}_0)}{d \, t} &= f(\varphi_t(\vec{x}_0))
\end{aligned}

The last line holds because the vector field is a function of t, so \vec{x} itself also changes with time, and its time derivative equals the vector field evaluated along the flow.
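
As a tiny worked example (not from the source, just for intuition), take the one-dimensional flow \varphi_t(x) = x_0 \, e^{t}. Then:


\begin{aligned}
    \varphi_0(x) &= x_0 \\
    \frac{d \,\varphi_t(x_0)}{d \, t} &= x_0 \, e^{t} = \varphi_t(x_0)
\end{aligned}

so here f is the identity: each point moves with a velocity equal to its current position, and the flow stretches space away from the origin.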

flow and vector field of the flow evolution

Image taken from An Introduction to Flow Matching and Diffusion Models

Note

As is more evident from the image, the flow is basically a map of the positions of points. Since the position of each point changes over time, a flow warps space.

An interesting thing is that it gives us a snapshot of where the points are at time t.

Instead, the vector field v tells us how these positions will change in the next picture: it gives the instantaneous velocity of all points, allowing us to predict where they will move before taking the next snapshot.

Mapping distributions via flow

It follows that a velocity function v can take x to a probability distribution p only if its flow goes there:


v_t \text{ generates } p_t \iff \mathcal{X}_t = \varphi_t(\mathcal{X}_0) \sim p_t

in other words, given a random variable \mathcal{X}_0, we can say that \mathcal{X}_t = \varphi_t(\mathcal{X}_0) is sampled from p_t at time t only if the flow actually carries it into that distribution.

Normalizing Flows

Normalizing flows are a technique in which we try to find several simple flows whose composition maps a distribution q to a distribution p.

Let's say that we want to learn a map T_W : q \rightarrow p, whose induced distribution T_W\#q we call the pushforward of q by T_W. To fit it, we need to minimize a Kullback-Leibler divergence:


\begin{aligned}
    D_{KL}(p \,||\, T_W\#q) &= \int_x \log\left(\frac{p(x)}{T_W\#q(x)}\right) p(x)\, dx = \\
        &= \underbrace{\int_x \log(p(x))\,p(x)\,dx}_\text{static, not controllable} -
        \underbrace{\int_x \log(T_W\#q(x))\,p(x)\,dx}_\text{modifiable}
\end{aligned}

Since the static part depends only on the true distribution and is not controllable, we can ignore it and keep only the second term, which is an expected value. Discretizing it, we get


\mathbb{E}_{x \sim p}[\log(\,T_W\#q(x)\,)] = \sum_{x} \log(\,T_W\#q(x)\,)p(x)

Guess what: we have all these values, and this is just the Negative Log Likelihood (up to sign), so, with a bit of notation change, we get:


Loss(X) = - \sum_{i = 1}^{N} \log(\,T_W\#q(\vec{x}^{(i)})\,)

The thing is that we can't evaluate T_W\#q(x) directly. However, by applying the change of variables formula we get that:


T_W\#q(\vec{y}) = q(T_W^{-1}(\vec{y})) \cdot \bigg | \det J_{T_W^{-1}}(\vec{y})\bigg |

Since it's complicated to derive T_W directly, because it must be invertible and differentiable, it is simpler to build it by composing T_W = \varphi_K \circ \dots \circ \varphi_1. However these composed flows lose expressivity, and they become expensive to evaluate as K grows.
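
As a minimal sketch (my own toy example, not the notes' construction), here is how the change of variables formula is used with a composition of simple invertible maps; the 1D affine maps and the standard-normal base q are illustrative assumptions:

import numpy as np

# Each simple flow is an invertible 1D affine map phi_k(x) = a*x + b,
# so |det J_{phi_k^{-1}}| = 1/|a| and its log is -log|a|.
class AffineFlow:
    def __init__(self, a: float, b: float):
        assert a != 0.0, "the map must be invertible"
        self.a, self.b = a, b

    def inverse(self, y: float) -> float:          # phi_k^{-1}(y)
        return (y - self.b) / self.a

    def log_abs_det_jacobian_inv(self) -> float:   # log |det J_{phi_k^{-1}}|
        return -np.log(abs(self.a))

def log_pushforward(y: float, flows: list, log_q) -> float:
    # log T_W#q(y) = log q(T_W^{-1}(y)) + sum_k log |det J_{phi_k^{-1}}|
    x, log_det = y, 0.0
    for flow in reversed(flows):   # invert T_W = phi_K o ... o phi_1 from the outside in
        x = flow.inverse(x)
        log_det += flow.log_abs_det_jacobian_inv()
    return log_q(x) + log_det

log_q = lambda x: -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)   # base density: standard normal
flows = [AffineFlow(2.0, 1.0), AffineFlow(0.5, -3.0)]       # T_W = phi_2 o phi_1
print(log_pushforward(0.0, flows, log_q))

The negative of this quantity, summed over the dataset, is exactly the loss above.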

Continuous Normalizing Flows

To solve the expressivity problem, a solution is to make the flow continuous in time, defined by an ODE, and to solve that ODE numerically, for example with Euler's method.
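
For instance, a sketch of Euler integration of the flow ODE; the linear vector field here is just a stand-in for a learnt one:

import numpy as np

def velocity(x: np.ndarray, t: float) -> np.ndarray:
    # placeholder vector field: every point is pulled towards the origin
    return -x

def euler_integrate(x0: np.ndarray, n_steps: int = 100) -> np.ndarray:
    # integrate dx/dt = v(x, t) from t = 0 to t = 1 with step h
    x, h = x0.copy(), 1.0 / n_steps
    for n in range(n_steps):
        x = x + h * velocity(x, n * h)   # x_{n+1} = x_n + h * v(x_n, t_n)
    return x

print(euler_integrate(np.array([2.0, -1.0])))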

Flow Matching

The idea is to train a velocity function that will bring us to the right distribution; this is flow matching. By treating it like an energy-based model, we get:


E_t(\mathcal{X}_0, \mathcal{X}_1) =
    || v^W_t(\mathcal{X}_t) - v_t(\mathcal{X}_t) ||^2

Since we need our points to travel the way we want, we act on their velocity. Thus, by minimizing this error, we recover the target trajectory.

However, there's a problem... We don't know v_t(\mathcal{X}_t) and we must find a method to derive it.

Midpoint Method (aka Modified Euler's method)

The actual algorithm from Wikipedia is:


y_{n + 1} = y_{n} + hf\left(y_n + \frac{h}{2}f(y_n, t_n), t_n + \frac{h}{2}\right)

translated to code becomes:

import numpy as np

# Technically it's velocity, not speed
#   velocity:   vector
#   speed:      magnitude
def compute_speed(old_position: np.ndarray, current_time: float) -> np.ndarray:
    # this function is what
    #   we want to find (i.e. learn)
    pass

def compute_new_position(
    old_position: np.ndarray,
    old_speed: np.ndarray,
    current_time: float,
    time_step: float
) -> np.ndarray:
    step = time_step / 2

    # velocity evaluated at the half-way point
    half_point_speed = compute_speed(
        old_position + step * old_speed,
        current_time + step
    )

    # the full step is taken with the mid-point velocity
    new_pos = old_position + time_step * half_point_speed
    return new_pos

Obviously we don't know what the speed function is, so we just need to learn it. Now, inverting the formula, we get that (with some abuse of notation) our target velocity at that point is:


f_{t + \frac{h}{2}} = \frac{y_{n + 1} - y_n}{h}

And since at training time we have both y_n and y_{n+1}, we just need to set h to an arbitrary value, say 1, and we can easily compute this target.

Now, we just need to learn the velocity function, which amounts to gradient descent on the difference between our learnt function and the computed target. Then, if we want to use an energy-based model, we get:


E_t(\mathcal{X}_0, \mathcal{X}_1) =
    || v^W_t(\mathcal{X}_t) - (\mathcal{X}_1 - \mathcal{X}_0) ||^2
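
Putting it together, a minimal training-step sketch of this objective; the tiny MLP, the toy data, and the straight path x_t = (1 - t)\mathcal{X}_0 + t\mathcal{X}_1 are my own illustrative assumptions:

import torch
import torch.nn as nn

# velocity network v^W_t(x): takes (x, t) concatenated, returns a velocity
model = nn.Sequential(nn.Linear(2 + 1, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    t = torch.rand(x0.shape[0], 1)              # sample a time in [0, 1]
    xt = (1 - t) * x0 + t * x1                  # point along the straight path
    pred_v = model(torch.cat([xt, t], dim=1))   # v^W_t(x_t)
    loss = ((pred_v - (x1 - x0)) ** 2).mean()   # || v^W_t(x_t) - (x_1 - x_0) ||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss

x0 = torch.randn(128, 2)                        # samples from the source q
x1 = torch.randn(128, 2) + 5.0                  # samples from the target p (toy)
print(training_step(x0, x1).item())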

Note

There's another method, applying Markov chains, though it's basically flow matching with Euler's standard method where the step size h \rightarrow 0, and so it takes multiple steps.

Caution

This method is perfectly equivalent to using conditional flow matching where z is sampled from a Dirac distribution (basically a normal distribution with \sigma = 0).

Conditional Flow Matching

Now, let's say that we want to describe the velocity v in other ways. The problem is that there isn't a unique path going from q to p.

infinite probability paths

GIF taken from https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/

So the problem is now to describe this probability path over time. However, we don't know the analytic form of p, just a bunch of data points sampled from it.

The idea to solve this is to define a probability path (read: velocity) conditioned on a variable z, the conditioning variable, giving p(x | t, z). Once we choose a particular z, we get a p(x | t) that brings x from q to p and that has an analytic form.

The trick here is to find the marginal probability p(x | t) by marginalizing p(x | t, z) over z. In practice this means that we just have to sum (or integrate) over all possible z values, as written below.
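
Spelled out (standard marginalization, matching the notation above):


p(x | t) = \int_z p(x | t, z)\, p(z)\, dz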

Since z is a random variable, a couple of common techniques are sampling from a Linear Interpolation path or from a conical Gaussian path1


\text{Linear Interpolation} \\
\begin{cases}
    \mu = (1 - t)\cdot x_q + t \cdot x_p \\
    \sigma = 0 \\
    p(x | t, z = (x_q, x_p)) = \mathcal{N}(\mu, \sigma^2 I) \\
\end{cases}
\\
\text{ }
\\
\text{Velocity} \rightarrow v(x, t) = x_p - x_q
\\
\text{ }
\\
\text{ }
\\
\text{Conical Gaussian Path} \\
\begin{cases}
    \mu =  t \cdot x_p \\
    \sigma = 1 - t \\
    p(x | t, z = (x_p)) = \mathcal{N}(\mu, \sigma^2 I) \\
\end{cases}
\\
\text{ }
\\
\text{Velocity} \rightarrow v(x, t) = \frac{x_p - x}{1 - t}

To derive the velocity of a given probability path, it's possible to use the continuity equation.
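
As a minimal sketch of the two conditional paths above (function names are mine; the conical case assumes t < 1 to avoid dividing by zero):

import numpy as np

rng = np.random.default_rng(0)

def linear_interpolation_path(x_q: np.ndarray, x_p: np.ndarray, t: float):
    # z = (x_q, x_p): mu = (1 - t) x_q + t x_p, sigma = 0 (a Dirac)
    x_t = (1 - t) * x_q + t * x_p
    v_t = x_p - x_q                    # constant velocity along the segment
    return x_t, v_t

def conical_gaussian_path(x_p: np.ndarray, t: float):
    # z = x_p: mu = t x_p, sigma = 1 - t, so x_t = t x_p + (1 - t) eps
    eps = rng.standard_normal(x_p.shape)
    x_t = t * x_p + (1 - t) * eps
    v_t = (x_p - x_t) / (1 - t)        # velocity pointing towards x_p
    return x_t, v_t

In conditional flow matching these pairs (x_t, v_t) become the regression targets for the learnt velocity, exactly as in the loss seen earlier.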

Conditional Optimal Flow

One thing that could happen during the construction of our paths is that they cross. To avoid this, we can choose the pairs of data points via an Optimal Transport coupling, so that the resulting paths don't cross.

In practice we only use minibatch Optimal Transport, as the full coupling is costly to compute.
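
A minimal sketch of the minibatch coupling, assuming a squared Euclidean cost and using the Hungarian algorithm from SciPy; the toy data is illustrative:

import numpy as np
from scipy.optimize import linear_sum_assignment

def minibatch_ot_pairs(x0: np.ndarray, x1: np.ndarray):
    # pairwise squared distances between source and target samples
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
    row, col = linear_sum_assignment(cost)   # optimal one-to-one matching
    return x0[row], x1[col]

x0 = np.random.randn(64, 2)        # minibatch from the source q
x1 = np.random.randn(64, 2) + 5.0  # minibatch from the target p (toy)
x0_paired, x1_paired = minibatch_ot_pairs(x0, x1)

The paired samples are then fed to the usual flow matching loss, so the straight paths drawn between them mostly avoid crossing.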

ReFlow

Technically speaking a reflow is nothing more than what we already saw in flow matching2. The difference lies in how we retrain the model and how we sample points.

We first train a model as in flow matching, and then we train another model for which the true targets are not provided: the target labels will instead come from our original model.

While we would normally think that our original model learnt crossing trajectories, this is not the case. On top of that, switching the sampler from Euler to 4th-order Runge-Kutta provides a better integration method.

By combining both techniques, our reflowed model will learn straighter trajectories.
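
A minimal sketch of the reflow step, assuming both models take the concatenation of (x, t) and return a velocity, as in the earlier training sketch; the Euler sampler and the names are illustrative:

import torch

@torch.no_grad()
def generate_synthetic_pairs(teacher, x0: torch.Tensor, n_steps: int = 100):
    # integrate the teacher's flow to turn noise x0 into generated samples x1
    x, h = x0.clone(), 1.0 / n_steps
    for n in range(n_steps):
        t = torch.full((x.shape[0], 1), n * h)
        x = x + h * teacher(torch.cat([x, t], dim=1))   # Euler step along the teacher flow
    return x0, x                                        # (noise, teacher-generated sample)

def reflow_step(student, opt, x0: torch.Tensor, x1: torch.Tensor):
    # same flow matching loss, but on (noise, generated sample) pairs
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    loss = ((student(torch.cat([xt, t], dim=1)) - (x1 - x0)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss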

Warning

The two models don't need to be different (or the same), nor do you need to reuse the former model's weights.

Guidance for generation

As with diffusion, we now know how to generate the target distribution, but what if the target distribution is the distribution of valid images and we want a picture of a dog?

Classifier Guidance

We could take our unconditioned model and use a classifier, plus another input used as guidance, to change our model's velocity at sampling time. However, this means that we need a classifier that tells us whether our generated output matches the label or not:


v_{W, t}(x | y) = v(x) + wb_t \nabla \log p_{Y |t}(y | x)

This means that our new velocity will be influenced by the weighted gradient of the log-probability of obtaining y starting from x. The nice part is that we don't need to retrain our model from zero.
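
A minimal sketch of the guided velocity, assuming a velocity model and a classifier that both take (x, t); both modules and the guidance weights are placeholders:

import torch

def guided_velocity(velocity_model, classifier, x, t, y, w=2.0, b_t=1.0):
    # gradient of the classifier's log-probability for the desired labels y
    x = x.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x, t), dim=-1)
    log_p_y = log_probs[torch.arange(x.shape[0]), y]
    grad = torch.autograd.grad(log_p_y.sum(), x)[0]   # grad_x log p(y | x, t)
    # unconditioned velocity nudged towards the label
    return velocity_model(x, t) + w * b_t * grad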

Since we are combining 2 different models, their magnitudes do not combine well, leading to potential problems. Moreover, the classifier is not "perfect", so we need to account for its errors as well.

Classifier Free Guidance

Instead of using a classifier, we can retrain (fine-tune) our model to take the conditioning into account. Starting from the classifier-guidance equation for the velocity, we can show that:


u_{W, t}(x | y) = (1 - w)v(x) + w \cdot v(x | y)

Now, we can reuse the previous model and retrain it by passing y = \emptyset \text{ or } y \in \mathcal{Y}, where y is set to \emptyset with probability \eta.

Once retrained, during inference we can sample by using this formula, where y = \emptyset if the sampling is unguided and y \in \mathcal{Y} if it's guided.


d X_t = [(1 - w)v_{W, t}(X_t | \emptyset) + wv_{W, t}(X_t | y)] dt \\
w > 1

As you can notice, since w > 1, the unconditioned velocity gets a negative weight (1 - w): it is subtracted from the amplified conditioned velocity, pushing the sample further in the direction implied by y. Moreover, if the prompt is unconditioned (y = \emptyset), the two terms collapse and this reduces to the unmodified unconditional velocity.
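
A minimal sketch of one guided Euler sampling step, assuming the retrained model takes (x, t, y) and accepts a reserved "empty" label for the unconditional branch; names are illustrative:

import torch

def cfg_euler_step(model, x, t, y, empty_y, w=3.0, h=0.01):
    v_uncond = model(x, t, empty_y)        # v_{W,t}(X_t | empty)
    v_cond = model(x, t, y)                # v_{W,t}(X_t | y)
    v = (1 - w) * v_uncond + w * v_cond    # guided velocity, w > 1
    return x + h * v                       # Euler update for dX_t = v dt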

Stable Diffusion 3

References