# Flow Models

In generative modelling, what we are trying to achieve is a remap of a known
distribution, like a Gaussian, to another distribution, like the colors of pixels in
face images.

Let's imagine now that all distributions live in the same space, a bit like cities.
Our objective is to make our citizens move from one city to another, thus we need
to find the right mapping.

## Vector field of a flow

> [!CAUTION]
> Here we talk about speed, position and velocity. This is just an analogy to
> make the learning process smoother.

Let's say that we have a [flow](./../15-Appendix-A/INDEX.md#flow), which in this
context can be seen as the position function of $\vec{x}$ at time $t$. The
associated vector field is simply:

$$
v_{t}(\vec{x}) = \frac{d \,\varphi_t(\vec{x})}{d \, t}\bigg \rvert_{t= 0}
$$

This means that our vector field, read here as velocity, is the derivative of
the flow, read as position, with respect to time, evaluated at time $t = 0$. This means
that we just need to know $\vec{x}_0$, as the position at $t = 0$ is the initial
position.

In particular, from the flow properties, we have that:

$$
\begin{aligned}
\varphi_t{(\vec{x})} &= \vec{x}(t) \\
\varphi_0{(\vec{x})} &= \vec{x}_0 \\
\frac{d \,\varphi_t(\vec{x}_0)}{d \, t} &= v_t(\varphi_t(\vec{x}_0))
\end{aligned}
$$

The last property holds because the vector field is a function of $t$ and of the
current position: $\vec{x}$ itself changes over time, and its value at time $t$
is given by the flow.

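As a minimal sketch of the relation between a flow and its vector field, assume the 1D field $v(x) = -x$ (a hypothetical example, not taken from the sources), whose flow has the closed form $\varphi_t(x_0) = x_0 e^{-t}$. The code below checks numerically that the time derivative of the flow matches the vector field evaluated along the flow.

```python
import math


def vector_field(x: float) -> float:
    # Hypothetical example field chosen for illustration: v(x) = -x
    return -x


def flow(x0: float, t: float) -> float:
    # Closed-form flow of v(x) = -x: phi_t(x0) = x0 * exp(-t)
    return x0 * math.exp(-t)


def flow_time_derivative(x0: float, t: float, eps: float = 1e-6) -> float:
    # Central finite-difference approximation of d/dt phi_t(x0)
    return (flow(x0, t + eps) - flow(x0, t - eps)) / (2 * eps)


x0 = 2.0
for t in (0.0, 0.5, 1.0):
    lhs = flow_time_derivative(x0, t)  # d/dt phi_t(x0)
    rhs = vector_field(flow(x0, t))    # v(phi_t(x0))
    print(f"t={t}: d/dt phi = {lhs:.5f}, v(phi) = {rhs:.5f}")
```

The two printed columns should agree up to the finite-difference error, and `flow(x0, 0.0)` returns the initial position.
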

|
|
|
|
> Image taken from [An Introduction to Flow Matching and Diffusion Models](https://arxiv.org/pdf/2506.02070)
|
|
|
|
> [!NOTE]
|
|
> As it is more evident by the image, the flow is basically a map of position of
|
|
> points. Since the position of each point changes over time, a flow warps
|
|
> space.
|
|
>
|
|
> An interesting thing is that it gives us a snapshot of points are at time $t$.
|
|
>
|
|
> Instead, the vector field $v$ gives us how these position will be modified
|
|
> in the next pictures, giving us the instant velocity of all points, **allowing
|
|
> us to predict where points will move before taking the next snapshot**.
|
|
|
|
## Mapping distributions via flow

It follows that a velocity function $v$ can take $x$ to a probability distribution $p$ only
if its flow goes there:

$$
v_t \text{ generates } p_t \iff \mathcal{X}_t = \varphi_t(\mathcal{X}_0) \sim p_t
$$

In other words, taking a random variable $\mathcal{X}_0$ and moving it along the flow,
we can say that $v_t$ generates $p_t$ only if the transported variable
$\mathcal{X}_t$ is distributed according to $p_t$ at time $t$.

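To make the condition $\varphi_t(\mathcal{X}_0) \sim p_t$ concrete, here is a small sketch under stated assumptions: $p_0 = \mathcal{N}(0, 1)$ and a hypothetical flow $\varphi_t(x) = (1 + t)x + 3t$, for which $p_t = \mathcal{N}(3t, (1 + t)^2)$. Pushing samples through the flow and measuring their mean and standard deviation should recover those values approximately.

```python
import random
import statistics


def flow(x0: float, t: float) -> float:
    # Hypothetical flow chosen for illustration: scale by (1 + t), shift by 3t
    return (1.0 + t) * x0 + 3.0 * t


random.seed(0)
# X_0 ~ p_0 = N(0, 1)
x0_samples = [random.gauss(0.0, 1.0) for _ in range(20_000)]

t = 1.0
# X_t = phi_t(X_0); for this flow p_t = N(3t, (1 + t)^2)
xt_samples = [flow(x0, t) for x0 in x0_samples]

mean_t = statistics.fmean(xt_samples)
std_t = statistics.pstdev(xt_samples)
print(f"mean ~ {mean_t:.2f} (expected {3.0 * t}), std ~ {std_t:.2f} (expected {1.0 + t})")
```
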
## Normalizing Flows

It is a technique in which we try to find several simple flows that, composed
together, create a map from a distribution $q$ to a distribution $p$.

Let's say that we want to learn a map $T_W$ that takes $q$ to $p$. We call
$T_W\#q$ the pushforward of $q$ by $T_W$, and we need to minimize a Kullback-Leibler
divergence:

$$
\begin{aligned}
D_{KL}(p \,||\, T_W\#q) &= \int_x \log\left(\frac{p(x)}{T_W\#q(x)}\right) p(x)\, dx = \\
&= \underbrace{\int_x \log(p(x))p(x)\,dx}_\text{static, not controllable} -
\underbrace{\int_x \log(T_W\#q(x))p(x)\,dx}_\text{modifiable}
\end{aligned}
$$

Since we can ignore the static part, which depends only on the true distribution,
we take only the second part, which is an expected value. We
discretize it and we get

$$
\mathbb{E}_{x \sim p}[\log(\,T_W\#q(x)\,)] = \sum_{x} \log(\,T_W\#q(x)\,)p(x)
$$

Guess what, we have samples of $p$, and this is similar to the
[Negative Log Likelihood](./../4-Loss-Functions/INDEX.md#nllloss), so, with a
bit of notation changing, we get:

$$
Loss(X) = - \sum_{i = 1}^{N} \log(T_W\#q(\vec{x}^{(i)}))
$$

The thing is that we can't compute this directly, as we don't have the
pushforward density $T_W\#q$ in closed form. So, let's rewrite it by applying the
[change of variables](./../15-Appendix-A/INDEX.md#change-of-variables-in-probability), which gives:

$$
T_W\#q(\vec{y}) = q(T_W^{-1}(\vec{y})) \cdot \bigg | \det J_{T_W^{-1}}(\vec{y})\bigg |
$$

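The change of variables can be sketched in 1D, assuming a base distribution $q = \mathcal{N}(0, 1)$ and a hypothetical affine map $T_W(x) = a x + b$ with $W = (a, b)$. Both choices are illustrative assumptions, not the model used by the sources. In this case the Jacobian determinant of the inverse is just $|1 / a|$, and the loss is the negative log-likelihood of the data under the pushforward density.

```python
import math


def standard_normal_pdf(x: float) -> float:
    # Density of the assumed base distribution q = N(0, 1)
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def pushforward_pdf(y: float, a: float, b: float) -> float:
    # Hypothetical map T_W(x) = a * x + b, with a != 0
    x = (y - b) / a          # T_W^{-1}(y)
    jacobian = abs(1.0 / a)  # |det J_{T_W^{-1}}(y)| in 1D
    return standard_normal_pdf(x) * jacobian


def negative_log_likelihood(data: list[float], a: float, b: float) -> float:
    # Loss = - sum_i log T_W#q(x_i)
    return -sum(math.log(pushforward_pdf(y, a, b)) for y in data)


data = [1.8, 2.1, 2.4, 1.9, 2.3]
nll_identity = negative_log_likelihood(data, a=1.0, b=0.0)  # base density left untouched
nll_fitted = negative_log_likelihood(data, a=0.25, b=2.1)   # mass moved close to the data
print(nll_identity, nll_fitted)
```

The second value is lower than the first, since the map with `a=0.25, b=2.1` concentrates the density around the data points.
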
Since it's complicated to derive $T_W$ directly, because it must be *invertible* and
*differentiable*, it is simpler to build it as a composition of simple flows,
$T_W = \varphi_K \circ \dots \circ \varphi_1$. However, these flows have limited expressivity
and the composition becomes more complex to evaluate as $K$ grows.

## Continuous Normalizing Flows

To solve the problem of expressivity, a solution is to replace the discrete
composition with a flow defined by an ODE, and to solve that ODE numerically,
for example with the Euler method.

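A minimal sketch of the Euler method for an ODE $\frac{dx}{dt} = v_t(x)$ follows. The velocity field $v_t(x) = -x$ is a hypothetical stand-in for a learnt network, chosen because its exact solution $x_0 e^{-t}$ lets us measure the integration error.

```python
import math
from typing import Callable


def euler_integrate(
    velocity: Callable[[float, float], float],
    x0: float,
    t0: float,
    t1: float,
    num_steps: int,
) -> float:
    # Integrate dx/dt = velocity(x, t) from t0 to t1 with fixed-size Euler steps
    h = (t1 - t0) / num_steps
    x, t = x0, t0
    for _ in range(num_steps):
        x = x + h * velocity(x, t)
        t = t + h
    return x


def example_velocity(x: float, t: float) -> float:
    # Hypothetical velocity field for illustration; exact flow is x0 * exp(-t)
    return -x


exact = 2.0 * math.exp(-1.0)
for steps in (10, 100, 1000):
    approx = euler_integrate(example_velocity, 2.0, 0.0, 1.0, steps)
    print(f"{steps:>5} steps: x(1) ~ {approx:.5f}, error = {abs(approx - exact):.5f}")
```

The error shrinks as the number of steps grows, which is the usual trade-off between accuracy and the number of velocity evaluations.
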
## Flow Matching

The idea is to train a velocity function that will bring us to the right
distribution, and this is our flow matching. By treating it like an energy based
model, we get:

$$
E_t(\mathcal{X}_0, \mathcal{X}_1) =
\| v^W_t(\mathcal{X}_t) - v_t(\mathcal{X}_t) \|^2
$$

Since we need our point to travel the way we want, we influence its velocity.
Thus, by minimizing this error, we get the target trajectory.

However, there's a problem... We don't know $v_t(\mathcal{X}_t)$ and we must
find a method to derive it.

### Midpoint Method (aka Modified Euler's method)

The actual algorithm from [Wikipedia](https://en.wikipedia.org/wiki/Midpoint_method) is:

$$
y_{n + 1} = y_{n} + hf\left(y_n + \frac{h}{2}f(y_n, t_n), t_n + \frac{h}{2}\right)
$$

Translated to code, it becomes:

```python
# Technically it's velocity, not speed
# velocity: vector
# speed: magnitude
def compute_speed(old_position: list[float], current_time: float) -> list[float]:
    # this function is what
    # we want to find
    raise NotImplementedError


def compute_new_position(
    old_position: list[float],
    old_speed: list[float],
    current_time: float,
    time_step: float,
) -> list[float]:
    half_step = time_step / 2

    # Euler half step: estimate the position at the midpoint of the interval
    half_point_position = [
        x + half_step * v for x, v in zip(old_position, old_speed)
    ]
    half_point_speed = compute_speed(half_point_position, current_time + half_step)

    # Full step, using the velocity evaluated at the midpoint
    new_pos = [
        x + time_step * v for x, v in zip(old_position, half_point_speed)
    ]
    return new_pos
```

Obviously we don't know what the speed function is, thus we just need to **learn**
it. Now, inverting the formula we get that (with some abuse of notation) our
target velocity at that point is:

$$
f_{t + \frac{h}{2}} = \frac{y_{n + 1} - y_n}{h}
$$

Since during training we have both $y_n$ and $y_{n+1}$, we just need to set $h$
to an arbitrary value, say $1$, and we can easily compute this.

Now, we just need to learn the velocity function, which is done by gradient
descent on the difference between our learnt function and the computed one.
Then, if we want to use an energy model, we get:

$$
E_t(\mathcal{X}_0, \mathcal{X}_1) =
\| v^W_t(\mathcal{X}_t) - (\mathcal{X}_1 - \mathcal{X}_0) \|^2
$$

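The energy above can be sketched as a batch loss in 1D. This is an illustration under assumptions: $t$ is drawn uniformly in $[0, 1)$, the position $\mathcal{X}_t$ is taken on the straight line between $\mathcal{X}_0$ and $\mathcal{X}_1$, and `model_velocity` is a hypothetical callable standing in for the network $v^W_t$.

```python
import random
from typing import Callable


def flow_matching_loss(
    model_velocity: Callable[[float, float], float],
    x0_batch: list[float],
    x1_batch: list[float],
    rng: random.Random,
) -> float:
    # Mean of || v^W_t(X_t) - (X_1 - X_0) ||^2 over the batch (1D case)
    total = 0.0
    for x0, x1 in zip(x0_batch, x1_batch):
        t = rng.random()              # t ~ U[0, 1)
        xt = (1.0 - t) * x0 + t * x1  # assumed straight-line position at time t
        target = x1 - x0              # target velocity with h = 1
        total += (model_velocity(xt, t) - target) ** 2
    return total / len(x0_batch)


rng = random.Random(0)
x0_batch = [rng.gauss(0.0, 1.0) for _ in range(256)]  # source samples
x1_batch = [x0 + 3.0 for x0 in x0_batch]              # toy target: source shifted by 3


def zero_model(x: float, t: float) -> float:
    return 0.0  # untrained guess


def shift_model(x: float, t: float) -> float:
    return 3.0  # the true velocity for this toy pairing


loss_zero = flow_matching_loss(zero_model, x0_batch, x1_batch, rng)
loss_shift = flow_matching_loss(shift_model, x0_batch, x1_batch, rng)
print(loss_zero, loss_shift)  # close to 9 and close to 0
```

In a real training loop this loss would be minimized by gradient descent over the parameters $W$; here the two fixed models only show which velocity the loss rewards.
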
> [!NOTE]
> There's another method, applying Markov chains, though it's
> basically flow matching with the standard Euler method where
> $h \rightarrow 0$, and so it takes multiple steps.

> [!CAUTION]
> This method is perfectly equivalent to using a conditional flow matching where
> $z$ is sampled from a Dirac distribution (basically a Normal distribution with
> $\sigma = 0$).

## Conditional Flow Matching

Now, let's say that we want to describe the velocity $v$ in other ways. The problem
is that there isn't a unique path to go from $q$ to $p$:

![gif of different paths](./gifs/probability-paths.gif)

> GIF taken from https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/

So the problem is now to describe this probability path over time. However, we don't
know the analytic form of $p$, but just a bunch of data points sampled from it.

The idea to solve this is to define a probability path, read velocity, that is
conditioned on a variable $z$, the conditioning variable, written $p(x | t, z)$.
Once we choose a particular $z$, we get a path that brings $x$
from $q$ to $p$ and that has an analytic form.

The trick here is to find the probability $p(x | t)$ by marginalizing
$p(x | t, z)$ over $z$. In practice this means that we just have to sum (or integrate) over
all possible $z$ values.

Since $z$ is a random variable, we need a way to sample it. A couple of techniques
are sampling from a Linear Interpolation or from a conical Gaussian path[^a-visual-dive].

Linear Interpolation:

$$
\begin{cases}
\mu = (1 - t)\cdot x_q + t \cdot x_p \\
\sigma = 0 \\
p(x | t, z = (x_q, x_p)) = \mathcal{N}(\mu, \sigma^2 I)
\end{cases}
\qquad
\text{Velocity} \rightarrow v(x, t) = x_p - x_q
$$

Conical Gaussian Path:

$$
\begin{cases}
\mu = t \cdot x_p \\
\sigma = 1 - t \\
p(x | t, z = (x_p)) = \mathcal{N}(\mu, \sigma^2 I)
\end{cases}
\qquad
\text{Velocity} \rightarrow v(x, t) = \frac{x_p - x}{1 - t}
$$

To find the velocity, it's possible to use the [continuity equation](https://en.wikipedia.org/wiki/Continuity_equation).

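The two conditional paths can be sketched in 1D as follows. The sketch assumes the linear path has $\sigma = 0$ with velocity $x_p - x_q$, and that the conical path is $\mathcal{N}(t \cdot x_p, (1 - t)^2)$ with conditional velocity $\frac{x_p - x}{1 - t}$. Function names are hypothetical. In both cases, moving at the returned velocity for the remaining time $1 - t$ lands exactly on $x_p$.

```python
import random


def sample_linear_path(x_q: float, x_p: float, t: float) -> tuple[float, float]:
    # Linear interpolation: sigma = 0, so the sample is exactly the mean
    x = (1.0 - t) * x_q + t * x_p
    velocity = x_p - x_q
    return x, velocity


def sample_conical_path(x_p: float, t: float, rng: random.Random) -> tuple[float, float]:
    # Conical Gaussian path: x ~ N(t * x_p, (1 - t)^2), valid for t < 1
    sigma = 1.0 - t
    x = rng.gauss(t * x_p, sigma)
    velocity = (x_p - x) / (1.0 - t)
    return x, velocity


rng = random.Random(0)
x_q, x_p, t = rng.gauss(0.0, 1.0), 4.0, 0.3

samples = {
    "linear": sample_linear_path(x_q, x_p, t),
    "conical": sample_conical_path(x_p, t, rng),
}
for name, (x, v) in samples.items():
    # Constant velocity v for the remaining time (1 - t) reaches x_p
    print(name, x + (1.0 - t) * v)
```
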
[^a-visual-dive]: [A Visual Dive into Conditional Flow Matching | 26th November 2025](https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/)

## Conditional Optimal Flow

One thing that could happen during the construction of our paths is that they cross.
To solve this problem, we may choose the pairs $(x_q, x_p)$ through an Optimal
Transport coupling instead of pairing them at random, so that their paths
won't cross.

In practice we only use minibatch Optimal Transport, as the full coupling is
costly to compute.

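A minimal sketch of minibatch pairing, restricted to the 1D case with squared-distance cost: there, the optimal coupling of two equal-size batches is obtained by sorting both and matching them in order. This is a special case chosen for illustration; in higher dimensions an assignment solver would be needed. All function names are hypothetical.

```python
import random


def minibatch_ot_pairs_1d(
    source_batch: list[float], target_batch: list[float]
) -> list[tuple[float, float]]:
    # 1D, squared-distance cost: match the i-th smallest source
    # with the i-th smallest target
    return list(zip(sorted(source_batch), sorted(target_batch)))


def transport_cost(pairs: list[tuple[float, float]]) -> float:
    return sum((x_p - x_q) ** 2 for x_q, x_p in pairs)


def count_crossings(pairs: list[tuple[float, float]]) -> int:
    # Two straight paths cross when their order at t = 0 and at t = 1 differs
    crossings = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            (q_i, p_i), (q_j, p_j) = pairs[i], pairs[j]
            if (q_i - q_j) * (p_i - p_j) < 0:
                crossings += 1
    return crossings


rng = random.Random(0)
source_batch = [rng.gauss(0.0, 1.0) for _ in range(64)]
target_batch = [rng.gauss(4.0, 1.0) for _ in range(64)]

random_pairs = list(zip(source_batch, target_batch))
ot_pairs = minibatch_ot_pairs_1d(source_batch, target_batch)

print("random pairing:", count_crossings(random_pairs), "crossings")
print("sorted pairing:", count_crossings(ot_pairs), "crossings")
```

The sorted pairing has no crossings inside the batch and a transport cost no larger than the random pairing.
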
## ReFlow

Technically speaking, a reflow is nothing more than what we already saw in flow
matching[^flow-with]. The difference lies in how we retrain the model and how
we sample points.

We first train a model like in flow matching, and then we train another model
where the true labels are not provided. In fact, the target labels provided will come
from our original model.

While we would normally think that our original model learnt crossing trajectories,
this is not the case. In addition, by changing the sampler from Euler to
4th-order Runge-Kutta, we get a better integration method.

By combining both techniques, our reflowed model will learn straighter trajectories.

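The data generation step of a reflow can be sketched as follows, under assumptions: `first_model_velocity` is a hypothetical stand-in for the already trained model, time runs from $0$ to $1$, and sampling uses the classic 4th-order Runge-Kutta scheme. The pairs $(x_0, x_1)$ produced this way are what the second model would be trained on, with the straight-line velocity $x_1 - x_0$ as target.

```python
import math
from typing import Callable

Velocity = Callable[[float, float], float]


def rk4_integrate(velocity: Velocity, x0: float, num_steps: int) -> float:
    # Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with classic RK4
    h = 1.0 / num_steps
    x, t = x0, 0.0
    for _ in range(num_steps):
        k1 = velocity(x, t)
        k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = velocity(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = velocity(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t = t + h
    return x


def first_model_velocity(x: float, t: float) -> float:
    # Hypothetical stand-in for the first trained model
    return 2.0 - x


def make_reflow_pairs(
    velocity: Velocity, sources: list[float], num_steps: int = 20
) -> list[tuple[float, float]]:
    # The "labels" for the second model come from the first model's own samples
    return [(x0, rk4_integrate(velocity, x0, num_steps)) for x0 in sources]


pairs = make_reflow_pairs(first_model_velocity, [-1.0, 0.0, 1.0])
for x0, x1 in pairs:
    # Straight-line target velocity the second model would regress onto
    print(f"x0={x0:+.1f} -> x1={x1:.4f}, reflow target velocity={x1 - x0:.4f}")
```
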
> [!WARNING]
> The 2 models may be the same or different, and you don't need to
> re-use the former model weights.

[^flow-with]: [Flow With What You Know | 26th November 2025](https://drscotthawley.github.io/blog/posts/FlowModels.html#abstract)

## Guidance for generation

As for diffusion, we now know how to generate the target distribution, but
what if the target distribution is the distribution of valid images and we want
a picture of a dog?

### Classifier Guidance

We could take our unconditioned model and use a classifier, plus another input used
to guide, to steer our model (so, our velocity). However, this means
that we need a classifier that tells us if our generated output matches the label
or not:

$$
v_{W, t}(x | y) = v(x) + w \, b_t \nabla \log p_{Y |t}(y | x)
$$

This means that our new velocity will be influenced by the weighted
conditional probability of obtaining $y$ starting from $x$. This way we don't need to
retrain our model from scratch.

Since we are combining 2 different models, their magnitudes may not combine
well, leading to potential problems. Moreover the classifier is not *"perfect"*,
so we need to consider its errors as well.

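A 1D sketch of the guided velocity, under several assumptions: the unconditional model is the hypothetical field $v(x) = -x$, the classifier is a hypothetical logistic model $p(y = 1 | x) = \mathrm{sigmoid}(a x + c)$ whose log-gradient is known in closed form, and the schedule $b_t$ is fixed to $1$.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def unconditional_velocity(x: float, t: float) -> float:
    # Hypothetical stand-in for the pretrained unconditional model v(x)
    return -x


def grad_log_classifier(x: float, a: float = 2.0, c: float = 0.0) -> float:
    # Hypothetical classifier p(y = 1 | x) = sigmoid(a * x + c);
    # d/dx log sigmoid(a * x + c) = a * (1 - sigmoid(a * x + c))
    return a * (1.0 - sigmoid(a * x + c))


def guided_velocity(x: float, t: float, w: float, b_t: float = 1.0) -> float:
    # v(x | y) = v(x) + w * b_t * grad_x log p(y | x)
    return unconditional_velocity(x, t) + w * b_t * grad_log_classifier(x)


for w in (0.0, 1.0, 2.0):
    print(f"w={w}: guided velocity at x=0 is {guided_velocity(0.0, 0.5, w):.3f}")
```

With `w = 0` the guided velocity coincides with the unconditional one; larger `w` pushes the point toward the region the classifier assigns to the label.
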
### Classifier Free Guidance

Instead of using a classifier, we can retrain (fine-tune) our model to take the
conditioning into account. Starting from the classifier guidance equation for the
velocity, we can show that:

$$
u_{W, t}(x | y) = (1 - w)v(x) + w \cdot v(x | y)
$$

Now, we can reuse the previous model and retrain it by passing
$y = \emptyset \text{ or } y \in \mathcal{Y}$, where $\eta$ is the probability of
$y$ being empty.

Once retrained, during inference we can sample by using this formula, where
$y = \emptyset$ if it comes from an unguided sampling and $y \in \mathcal{Y}$ if
it's guided.

$$
\begin{aligned}
d X_t &= [(1 - w)v_{W, t}(X_t | \emptyset) + wv_{W, t}(X_t | y)]\, dt \\
w &> 1
\end{aligned}
$$

As you can notice, since $w > 1$, the coefficient $(1 - w)$ is negative: the
conditioned velocity is amplified, while the unconditioned velocity is
subtracted from it. Moreover, if the prompt is unconditioned ($y = \emptyset$),
this would be equal to the unmodified unconditional velocity.

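The combination of the two velocities, and the label dropping used during retraining, can be sketched as below. `model_velocity` is a hypothetical stand-in for a single network that accepts either a label or the empty label (represented here by `None`), and its toy behaviour is an assumption made only so the example runs.

```python
import random
from typing import Optional


def model_velocity(x: float, t: float, y: Optional[int]) -> float:
    # Hypothetical stand-in for one network trained with and without labels:
    # y = None plays the role of the empty label
    target = 0.0 if y is None else float(y)
    return target - x


def cfg_velocity(x: float, t: float, y: Optional[int], w: float) -> float:
    # (1 - w) * v(x | empty) + w * v(x | y)
    return (1.0 - w) * model_velocity(x, t, None) + w * model_velocity(x, t, y)


def maybe_drop_label(y: int, eta: float, rng: random.Random) -> Optional[int]:
    # During retraining, the label becomes the empty label with probability eta
    return None if rng.random() < eta else y


rng = random.Random(0)
dropped = sum(maybe_drop_label(2, 0.1, rng) is None for _ in range(10_000))
print("fraction of empty labels:", dropped / 10_000)

print("unguided:", cfg_velocity(0.5, 0.0, None, w=3.0))
print("guided:  ", cfg_velocity(0.5, 0.0, 2, w=3.0))
```

With the empty label the result equals the plain unconditional velocity for any `w`, while with a label and `w > 1` the result goes beyond the conditioned velocity.
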
## Stable Diffusion 3

## References

- [An Introduction to Flow Matching and Diffusion Models](https://arxiv.org/pdf/2506.02070)
- [Introduction to Flow Matching and Diffusion Models](https://diffusion.csail.mit.edu/)
- [A Visual Dive into Conditional Flow Matching](https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/)
- [Rectified Flow](https://www.cs.utexas.edu/~lqiang/rectflow/html/intro.html)
- [An introduction to Flow Matching](https://mlg.eng.cam.ac.uk/blog/2024/01/20/flow-matching.html#coupling)
- [Flow With What You Know](https://drscotthawley.github.io/blog/posts/FlowModels.html#abstract)
- [Diffusion Meets Flow Matching](https://diffusionflow.github.io/)