Compare commits
5 Commits
gape_01-pa
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| | 3a7f2efa3e | | |
| | 9640ae1898 | | |
| | 5d5ecff103 | | |
| | 55a1d38b63 | | |
| | 490beb316c | | |
@ -120,6 +120,66 @@ H(f) = \begin{bmatrix}
|
||||
\end{bmatrix}
|
||||
$$
|
||||
|
||||
## [Flow](https://en.wikipedia.org/wiki/Flow_(mathematics))[^wiki-flow]
|
||||
|
||||
A flow on a set $A$ is an action of the real numbers $\R$ on $A$, i.e. a mapping $\varphi: A \times \R \rightarrow A$:

$$
a \in A, t \in \R \\
\varphi(a, t) \in A
$$
|
||||
|
||||
Moreover, since $\varphi(a, t) \in A$, the flow can be applied again to its own result, and the following properties hold:

$$
a \in A, t \in \R, s \in \R \\
\begin{aligned}
    \varphi(\varphi(a, t), s) &= \varphi(a, t + s) \in A \\
    \varphi(a, 0) &= a
\end{aligned}
$$
|
||||
|
||||
In other words, applying a flow to the result of a flow is the same as applying
the flow once to the variable with the two real numbers summed (think of summing times).

Also, $0$ is the neutral element of a flow: flowing for time $0$ leaves the point unchanged.
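The two properties above can be checked numerically on a toy flow. This is only a sketch: the choice $\varphi(a, t) = a \cdot e^{t}$ on the real line is an assumption made for illustration and is not taken from the referenced article.

```python
import math


def flow(a: float, t: float) -> float:
    """Hypothetical flow on the real line: phi(a, t) = a * exp(t)."""
    return a * math.exp(t)


a, t, s = 2.0, 0.3, 1.1

# Composition property: phi(phi(a, t), s) == phi(a, t + s)
composed = flow(flow(a, t), s)
direct = flow(a, t + s)

# Neutral element: phi(a, 0) == a
identity = flow(a, 0.0)

print(composed, direct, identity)
```

Any other map with these two properties would do just as well; the exponential is convenient only because the sum of times shows up directly in the exponent.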
|
||||
|
||||
## [Vector Field](https://en.wikipedia.org/wiki/Vector_field)[^wiki-vector-field]
|
||||
|
||||
A vector field is a mapping from a set $A \subset \R^n$ to $\R^n$:

$$
V: A \rightarrow \R^n
$$
|
||||
|
||||
This means that each element of $A$, which we can consider a point, is
associated with a vector, which we may consider a velocity (although it can also be read as another point).

So, in a way, it can be seen as the amount of movement of that point in space.
|
||||
|
||||
## Change of Variables in probability[^stack-change-var]
|
||||
|
||||
Let's take 2 random variables, $X$ and $Y$, where $X$ has CDF $F_X$,
$Y = g(X)$ and $g$ is monotonically increasing. Writing $x = g^{-1}(y)$:

$$
\begin{aligned}
    P(Y \leq y) &= P(g(X) \leq y) = P(g^{-1}(g(X)) \leq g^{-1}(y)) = \\
    &= P(X \leq x) \rightarrow \\
    \rightarrow F_Y(y) &= F_X(x) = F_X(g^{-1}(y))
\end{aligned}
$$
|
||||
|
||||
Now, let's differentiate both sides of the equation with respect to $y$ (for a decreasing $g$ the derivative is taken in absolute value):

$$
f_Y(y) = f_X(g^{-1}(y)) \cdot \frac{d\, g^{-1}(y)}{d \, y}
$$
|
||||
|
||||
> [!NOTE]
> In case $x$ and $y$ are in higher dimensions, the last term becomes the absolute
> value of the determinant of the Jacobian matrix of $g^{-1}$ (often just called the Jacobian).
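
As a sanity check of the formula, the sketch below takes $X$ standard normal and $g(x) = e^x$ (both choices are assumptions made only for this example) and compares the probability obtained by integrating the transformed density with the one obtained from the CDF of $X$.

```python
import math


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def density_y(y: float) -> float:
    """f_Y(y) = f_X(g^{-1}(y)) * d g^{-1}(y) / dy, with g^{-1}(y) = log(y)."""
    return normal_pdf(math.log(y)) * (1.0 / y)


def integrate(f, lo: float, hi: float, n: int = 20000) -> float:
    """Midpoint-rule numerical integration of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h


lo, hi = 0.5, 3.0
# P(lo <= Y <= hi) computed in two independent ways
prob_from_density = integrate(density_y, lo, hi)
prob_from_cdf = normal_cdf(math.log(hi)) - normal_cdf(math.log(lo))

print(prob_from_density, prob_from_cdf)
```

The two numbers agree up to the integration error, which is what the change of variables predicts.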
|
||||
|
||||
[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)
|
||||
|
||||
[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)
|
||||
@ -127,3 +187,9 @@ $$
|
||||
[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))
|
||||
|
||||
[^wiki-pca]: [Wikipedia | Principal Component Analysis | 18th November 2025](https://en.wikipedia.org/wiki/Principal_component_analysis#Computation_using_the_covariance_method)
|
||||
|
||||
[^wiki-flow]: [Wikipedia | Flow (Mathematics) | 23rd November 2025](https://en.wikipedia.org/wiki/Flow_(mathematics))
|
||||
|
||||
[^wiki-vector-field]: [Wikipedia | Vector Field | 23rd November 2025](https://en.wikipedia.org/wiki/Vector_field)
|
||||
|
||||
[^stack-change-var]: [StackExchange | Derivation of change of variables of a probability density function? | 25th November 2025](https://stats.stackexchange.com/questions/239588/derivation-of-change-of-variables-of-a-probability-density-function)
|
||||
252
Chapters/16-Energy-Based-Models/INDEX.md
Normal file
@ -0,0 +1,252 @@
|
||||
# Energy Based Models
|
||||
|
||||
These models take 2 inputs, $x$ and $y$. They are then
passed through another function that outputs their compatibility, or
***"goodness"***[^stanford-lecun]. The lower the value, the better the
compatibility.
|
||||
|
||||
The objective of this model is to find the value $y$ that is most compatible
with $x$, i.e. the one with the lowest energy across all possible values:

$$
\hat{y} = \argmin_{y \in \mathcal{Y}}(E(x, y)) \\
x \coloneqq \text{ network output | } \mathcal{Y} \coloneqq
\text{ set of possible labels}
$$
|
||||
|
||||
The difficult part here is choosing said $\hat{y}$ as the space $\mathcal{Y}$ may
|
||||
be too large, or even infinite, and there could be multiple $y$ that may have the
|
||||
same energy, so multiple solutions.
|
||||
|
||||
Since we want to model our function $E(x, y)$, it needs to change according to
|
||||
some weights $W$. Since each time we change $W$ we are changing also the output
|
||||
of $E(x, y)$, we are technically defining a family of energy functions.
|
||||
|
||||
However, not all of them give us what we need, thus, by tuning $W$, we explore
|
||||
this space to find a $E_W(x, y)$ that gives acceptable results.
|
||||
|
||||
During inference, since training has shaped $E_W$ so that good answers have low
energy, we generate a $y^*_0$ randomly and then update it through gradient descent.
|
||||
|
||||
> [!TIP]
|
||||
> If this thing is unclear, think that you have fixed your function $E_{W}$ and
|
||||
> notice that $x$ is unchangeable too. During gradient descent, you can update
|
||||
> the value of $y^*$ by the same algorithm used for $W$.
|
||||
>
|
||||
> Now, you are trying to find a minimum for $E_W$, meaning that $y^*$ will be
|
||||
> the optimal solution when the energy becomes 0 or around it.
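
A minimal sketch of this inference loop, assuming a scalar $y$ and the quadratic energy $E(x, y) = \frac{1}{2}(G(x) - y)^2$. The predictor `G`, the learning rate and the number of steps are hypothetical choices made for illustration.

```python
def G(x: float) -> float:
    """Hypothetical predictor G_W(x); W is frozen at inference time."""
    return 3.0 * x + 1.0


def energy(x: float, y: float) -> float:
    return 0.5 * (G(x) - y) ** 2


def energy_grad_y(x: float, y: float) -> float:
    # dE/dy = -(G(x) - y) = y - G(x)
    return y - G(x)


def infer(x: float, y0: float, lr: float = 0.1, steps: int = 200) -> float:
    """Gradient descent on y only: x and W never change."""
    y = y0
    for _ in range(steps):
        y -= lr * energy_grad_y(x, y)
    return y


y_star = infer(x=2.0, y0=-5.0)
print(y_star, energy(2.0, y_star))
```

Here the starting point `y0` is fixed for reproducibility; in the text it would be drawn at random.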
|
||||
|
||||
## Designing a good Loss[^stanford-lecun]
|
||||
|
||||
> [!NOTE]
|
||||
>
|
||||
> - $y_i$: correct label for $x_i$
|
||||
> - $y^*_i$: lowest energy label
|
||||
> - $\bar{y}_i$: lowest energy incorrect label
|
||||
|
||||
### Using the energy function
|
||||
|
||||
Since we want the energy for $y_i$ to be 0 for $x_i$, everything else is a loss for
|
||||
us:
|
||||
|
||||
$$
|
||||
L_i(y_i, E_{W}(x_i, y_i)) = E_{W}(x_i, y_i)
|
||||
$$
|
||||
|
||||
However, this function does not increase the energy of the other examples, possibly
resulting in the constant function:

$$
E_W(x_i, y_j) = 0 \,\, \forall x_i \in X, \forall y_j \in \mathcal{Y}
$$

In other words, it gives 0 for all possible labels, resulting in a collapsed plane
where all energies are equal.
|
||||
|
||||
### Generalized Perceptron Loss
|
||||
|
||||
Another way is to lower-bound the energy of $y_i$ with the minimum energy over all
labels, pushing down the correct answer and pushing up the answer currently preferred by the model:

$$
L_i(y_i, E_{W}(x_i, y_i)) = E_{W}(x_i, y_i) - \min_{y \in \mathcal{Y}}E_{W}(x_i, y)
$$

When the loss becomes 0, it means that both terms are equal, meaning that
$E(x_i, y_i)$ has the lowest value. However, this doesn't imply that there's not
another $y_j$ so that $E(x_i, y_i) = E(x_i, y_j)$, implying that this method is
still susceptible to flat planes.
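The behaviour described above can be seen on a tiny discrete label set. The energies below are made-up numbers, used only to show that the loss is 0 both for a well-separated energy function and for a fully collapsed one.

```python
def perceptron_loss(energies: dict[str, float], correct: str) -> float:
    """Generalized perceptron loss: E(x, y_correct) - min_y E(x, y)."""
    return energies[correct] - min(energies.values())


# Hypothetical energies E_W(x_i, y) for three candidate labels
separated = {"cat": 0.2, "dog": 1.5, "bird": 2.0}   # correct label is the minimum
wrong = {"cat": 1.0, "dog": 0.4, "bird": 2.0}       # another label has lower energy
collapsed = {"cat": 0.0, "dog": 0.0, "bird": 0.0}   # flat energy surface

loss_separated = perceptron_loss(separated, "cat")
loss_wrong = perceptron_loss(wrong, "cat")
loss_collapsed = perceptron_loss(collapsed, "cat")

print(loss_separated, loss_wrong, loss_collapsed)
```

The collapsed case gets the same loss as the well-separated one, which is exactly the weakness the next section addresses.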
|
||||
|
||||
### Good losses
|
||||
|
||||
To avoid the problem of a collapsed plane, we can use several losses:
|
||||
|
||||
- [**Hinge Loss**](../4-Loss-Functions/INDEX.md#hingeembeddingloss)
|
||||
- **Log Loss**
|
||||
- **MCE Loss**
|
||||
- **Square-Square Loss**
|
||||
- **Square-Exponential Loss**
|
||||
- [**Negative Log-Likelihood Loss**](../4-Loss-Functions/INDEX.md#nllloss)
|
||||
- **Minimum Empirical Error Loss**
|
||||
|
||||
All of these operate to increase the distance between $y_i$ and $\bar{y}_i$ so
that the *"best incorrect answer"* is at least $margin$ away from the correct one(s).
|
||||
|
||||
> [!WARNING]
|
||||
> Negative Log Likelihood makes the plane way too harsh, making it like a ravine
|
||||
> where good values are in the ravine and bad ones are on top[^anelli-15]
|
||||
|
||||
## Energy Based Model Architectures
|
||||
|
||||
> [!CAUTION]
|
||||
> Here $x$, $y$ may be scalars or vectors
|
||||
|
||||

|
||||
|
||||
### Regression
|
||||
|
||||
The energy function for a regression is simply:
|
||||
|
||||
$$
|
||||
E_W(x, y) = \frac{1}{2}|| G_W(x) - y ||_2^2
|
||||
$$
|
||||
|
||||
During training, this architecture will modify $W$ so that $G_W(x_i) \sim y_i$ and
way different from all the other $y_j$.
During inference, $y^*_i$ will be the one that is the most similar to $G_W(x_i)$.
|
||||
|
||||
### Implicit Regression
|
||||
|
||||
We usually use this architecture when we want more than one possible answer.
The trick is to map each admissible $y$ to the same transformation of $x_i$:
|
||||
|
||||
$$
|
||||
E_W(x, y) = \frac{1}{2}|| G_{W_X}(x) - G_{W_Y}(y) ||_2^2
|
||||
$$
|
||||
|
||||
During training, $G_{W_X}(x_i)$ and $G_{W_Y}(y)$ are brought together for every admissible
$y \in \mathcal{Y}_i$, while during inference we will choose the value from $\mathcal{Y}^*_i$
that has the least energy.
|
||||
|
||||
### Binary Classification
|
||||
|
||||
For binary classification problems, we have:
|
||||
|
||||
$$
|
||||
E_W(x, y) = - yG_W(x)
|
||||
$$
|
||||
|
||||
During training it will make $G_W(x_i) \gt 0$ for $y_i = 1$ and $G_W(x_i) \lt 0$ for $y_i = -1$. During
inference, $y_i^*$ will have the same sign as $G_W(x_i)$.
|
||||
|
||||
### Multi-Class Classification
|
||||
|
||||
For multiclass classification we will have:
|
||||
|
||||
$$
E_W(x_i, y) = \sum_{k = 0}^{C} \delta(y - k) \cdot G_W(x_i)_{[k]} \\
\delta(u) \coloneqq \text{ Kronecker delta - } \begin{cases}
    1 \rightarrow u = 0 \\
    0 \rightarrow u \neq 0
\end{cases}
$$
|
||||
|
||||
This is just a way to say that $G_W(x_i)$ will produce several scores, an
array of values, and the energy will be equal to the one for the class $y_i$.

During training, $G_W(x_i)_{[k]} \sim 0$ for $y_i = k$, while all the others will
become high. During inference, $y^*_i = k$ where $k$ is the index of the lowest value of
$G_W(x_i)$.
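A small sketch of this selection mechanism, where the list `scores` stands in for the output $G_W(x_i)$ (the numbers are made up for illustration): the Kronecker delta simply picks the entry of the score array that corresponds to the candidate class.

```python
def energy(scores: list[float], y: int) -> float:
    """E_W(x, y) = sum_k delta(y - k) * G_W(x)[k]."""
    return sum((1.0 if y == k else 0.0) * s for k, s in enumerate(scores))


def predict(scores: list[float]) -> int:
    """Inference: the class with the lowest energy."""
    return min(range(len(scores)), key=lambda k: energy(scores, k))


scores = [2.3, 0.1, 1.7]   # hypothetical G_W(x_i), one score per class

print(energy(scores, 1), predict(scores))
```

In practice the sum is never computed explicitly: indexing `scores[y]` gives the same value.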
|
||||
|
||||
## Latent-Variable Architectures
|
||||
|
||||
We introduce into the system a *"latent"* variable $z$ that will help our model
to get more details. We never receive this value, nor is it generated from the
inputs[^yt-week-7]:
|
||||
|
||||
$$
|
||||
\hat{y}, \hat{z} = \argmin_{y,z} E(x, y, z)
|
||||
$$
|
||||
|
||||
However, if you like to find it only by looking at $y$ and using a probabilistic
|
||||
approach, we get:
|
||||
|
||||
$$
\hat{y} = \argmin_y \lim_{\beta \rightarrow \infty} - \frac{1}{\beta}
    \log \int_{z} e^{-\beta E(x, y, z)} \, dz
$$
|
||||
|
||||
An advantage is that if we operate over a set $\mathcal{Z}$, we get a set
$\hat{\mathcal{Y}}$ of predictions. However, we need to limit the informativeness
of our variable $z$, otherwise it is possible to perfectly predict $y$.
|
||||
|
||||
## Relation between probabilities and Energy
|
||||
|
||||
We can think of energy and probability as being the same thing, seen from
2 different points of view:

$$
P(y^* | x) = \frac{
        \underbrace{e^{-\beta E(x, y^*)}}_{\text{make energy small}}
    }{
        \int_{y \in \mathcal{Y}} \underbrace{e^{-\beta E(x, y)}}_{
            \text{make energy big}} \, dy
    }
\\
L(y^*, W) = \underbrace{E_W(x, y^*)}_{\text{make small}} + \frac{1}{\beta}
    \log \int_{y} \underbrace{e^{-\beta E_W(x, y)}}_{\text{make energy big}} \, dy
$$
|
||||
|
||||
> [!NOTE]
|
||||
>
|
||||
> - $\text{Gibbs distribution} \rightarrow e^{-\beta E(x, y^*)}$
|
||||
> - $\beta$: akin to an inverse temperature
|
||||
|
||||
If you can avoid it, never work with probabilities, as they are often intractable
|
||||
and give less customization over the scoring function.
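
To make the relation between energies and probabilities concrete, here is a sketch of the Gibbs formula over a finite label set, where the integral becomes a sum. The energies are made-up values and the two values of $\beta$ are arbitrary.

```python
import math


def gibbs(energies: list[float], beta: float) -> list[float]:
    """P(y | x) = exp(-beta * E(x, y)) / sum_y' exp(-beta * E(x, y'))."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)   # normalization term, the expensive part in general
    return [w / z for w in weights]


energies = [0.5, 1.0, 3.0]           # hypothetical E(x, y) for three labels
soft = gibbs(energies, beta=1.0)     # high temperature: smooth distribution
sharp = gibbs(energies, beta=20.0)   # low temperature: close to an argmin

print(soft, sharp)
```

With only three labels the normalization is trivial; the intractability mentioned above appears when $\mathcal{Y}$ is huge or continuous.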
|
||||
|
||||
> [!WARNING]
|
||||
> There may be reasons on why you would prefer having actual probabilities,
|
||||
> rather than scores, and that's when you need 2 agents that need to interact
|
||||
> with each other.
|
||||
>
|
||||
> Since scores are calibrated only over the model we are working with, values
|
||||
> across agents will differ, thus meaning different things.
|
||||
>
|
||||
> However, this problem does not exist if the agents are trained end-to-end.
|
||||
|
||||
## Contrastive-Methods
|
||||
|
||||
These methods have the objective of **<ins>increasing energy for negative
examples and lowering it for positive examples</ins>**.

Basically, we measure the distance, usually with a cosine similarity, and that
becomes our energy function. Then we take a loss function that maximizes
similarity.
|
||||
|
||||
To make this method work, we need to feed it negative examples as well, otherwise
|
||||
we would not widen the dissimilarity between positive and negative examples.[^atcold-1]
|
||||
|
||||
> [!NOTE]
|
||||
> There are also non-contrastive methods that only use positive examples,
> eliminating the need to get negative examples.
|
||||
|
||||
## Self Supervised Learning
|
||||
|
||||
It's a method of learning parts of the data from other parts of the data. Examples are BERT
and T5, which are trained on masked language tasks that involve predicting pieces
of missing text by looking at the remaining data.
|
||||
|
||||
|
||||
<!--
|
||||
MARK: Footnotes
|
||||
-->
|
||||
|
||||
[^stanford-lecun]: [A Tutorial on Energy-Based Learning](https://web.stanford.edu/class/cs379c/archive/2012/suggested_reading_list/documents/LeCunetal06.pdf)
|
||||
|
||||
[^yt-week-7]: [Youtube | Week 7 – Lecture: Energy based models and self-supervised learning | 22nd November 2025](https://www.youtube.com/watch?v=tVwV14YkbYs&t=1411s)
|
||||
|
||||
[^anelli-15]: Vito Walter Anelli | Ch. 15 pg. 17 | 2025-205
|
||||
|
||||
[^atcold-1]: [DEEP LEARNING | Week 8 Ch. 8.1 | 22nd November 2025](https://atcold.github.io/NYU-DLSP20/en/week08/08-1/)
|
||||
File diff suppressed because it is too large
BIN
Chapters/16-Energy-Based-Models/pngs/emb-architectures.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 416 KiB |
352
Chapters/17-Flow-Generative-Models/INDEX.md
Normal file
@ -0,0 +1,352 @@
|
||||
# Flow Models
|
||||
|
||||
In generative modelling, what we are trying to achieve is a remap of a known
distribution, like a Gaussian, to another distribution, like the colors of pixels in
face images.

Let's imagine now that all distributions live in the same space, a bit like cities.
Our objective is to make our citizens move from one city to another, thus we need
to find the right mapping.
|
||||
|
||||
## Vector field of a flow
|
||||
|
||||
> [!CAUTION]
|
||||
> Here we talk about speed, position and velocity. This is just an analogy to
|
||||
> make the learning process smoother.
|
||||
|
||||
Let's say that we have a [flow](./../15-Appendix-A/INDEX.md#flow), which in this
context can be seen as the position function of $\vec{x}$ at time $t$. The
associated vector field is simply:
|
||||
|
||||
$$
|
||||
v_{t}(\vec{x}) = \frac{d \,\varphi_t(\vec{x})}{d \, t}\bigg \rvert_{t= 0}
|
||||
$$
|
||||
|
||||
This means that our vector field, read here as velocity, is the derivative of
|
||||
the flow, read as position, over time and evaluated at time $t = 0$. This means
|
||||
that we just need to know $x_0$, as the position at $t = 0$ is the initial
|
||||
position.
|
||||
|
||||
In particular, from the flow properties, we have that:

$$
\begin{aligned}
    \varphi_t{(\vec{x})} &= \vec{x}(t) \\
    \varphi_0{(\vec{x})} &= \vec{x}_0 \\
    \frac{d \,\varphi_t(\vec{x}_0)}{d \, t} &= f(\varphi_t(\vec{x}_0))
\end{aligned}
$$

The last one holds because the position $\vec{x}$ is itself a function of time and
coincides with the flow, so the vector field $f$ is evaluated at the current position $\varphi_t(\vec{x}_0)$.
|
||||
|
||||

|
||||
|
||||
> Image taken from [An Introduction to Flow Matching and Diffusion Models](https://arxiv.org/pdf/2506.02070)
|
||||
|
||||
> [!NOTE]
> As the image makes more evident, the flow is basically a map of the positions of
> points. Since the position of each point changes over time, a flow warps
> space.
>
> An interesting thing is that it gives us a snapshot of where points are at time $t$.
>
> Instead, the vector field $v$ tells us how these positions will be modified
> in the next pictures, giving us the instantaneous velocity of all points, **allowing
> us to predict where points will move before taking the next snapshot**.
|
||||
|
||||
## Mapping distributions via flow
|
||||
|
||||
It follows that a velocity function $v$ can take $x$ to a probability distribution $p$ only
if its flow goes there:

$$
v_t \text{ generates } p_t \iff \mathcal{X}_t = \varphi_t(\mathcal{X}_0) \sim p_t
$$

In other words, given a random variable $\mathcal{X}_0$, we can say that $\mathcal{X}_t$ is sampled
from $p_t$ at time $t$ only if the flow carries $\mathcal{X}_0$ into a variable distributed as $p_t$.
|
||||
|
||||
## Normalizing Flows
|
||||
|
||||
It is a technique in which we try to find several simple flows that will
|
||||
create a map of a distribution $q$ to a distribution $p$.
|
||||
|
||||
Let's say that we want to learn a map $T_W$ that transports $q$ onto $p$. Calling
$T_W\#q$ the pushforward of $q$ by $T_W$, we need to minimize a Kullback-Leibler
divergence:

$$
\begin{aligned}
    D_{KL}(p || T_W\#q) &= \int_x \log\left(\frac{p(x)}{T_W\#q(x)}\right) p(x) \, dx = \\
    &= \underbrace{\int_x \log(p(x))p(x)\,dx}_\text{static, not controllable} -
    \underbrace{\int_x \log(T_W\#q(x))p(x)\,dx}_\text{modifiable}
\end{aligned}
$$
|
||||
|
||||
Since we can ignore the static part, which depends only on the true distribution,
we keep only the second part, which is an expected value under $p$. Discretizing
it, we get:

$$
\mathbb{E}_{x \sim p}[\log(\,T_W\#q(x)\,)] = \sum_{x} \log(\,T_W\#q(x)\,)p(x)
$$
|
||||
|
||||
Guess what, we have all these values, and this is similar to the
[Negative Log Likelihood](./../4-Loss-Functions/INDEX.md#nllloss), so, with a
bit of notation changing, we get:

$$
Loss(X, Y) = - \sum_{i = 1}^{N} \log(T_W\#q(\vec{x}^{(i)}))
$$
|
||||
|
||||
The thing is that we can't compute this directly, as there's no ground truth
we can use to guide this. So, by applying the
[change of variables](./../15-Appendix-A/INDEX.md#change-of-variables-in-probability), we rewrite the pushforward density as:

$$
T_W\#q(\vec{y}) = q(T_W^{-1}(\vec{y})) \cdot \bigg | \det J_{T_W^{-1}}(\vec{y})\bigg |
$$
|
||||
|
||||
Since it's complicated to derive $T_W$ because it must be *invertible* and
*differentiable*, this is simpler to achieve by composing $T_W$ as
$T_W = \varphi_K \circ \dots \circ \varphi_1$. However, these flows lose expressivity
and they become complex to evaluate as $K$ grows.
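The composition of simple invertible maps and the resulting likelihood can be sketched in one dimension. Everything here is an assumption made for illustration: the base distribution $q$ is a standard normal, each $\varphi_k$ is an affine map with fixed (not learnt) parameters, and `layers`, `log_density` and `nll` are hypothetical names.

```python
import math


def base_log_density(z: float) -> float:
    """log q(z) for a standard normal base distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


# Each layer is an invertible affine map x -> scale * x + shift
layers = [(2.0, 1.0), (0.5, -3.0)]   # phi_1, then phi_2


def log_density(y: float) -> float:
    """log of the pushforward density: log q(T^{-1}(y)) + log |d T^{-1} / dy|."""
    log_det = 0.0
    for scale, shift in reversed(layers):
        y = (y - shift) / scale              # invert one layer
        log_det += -math.log(abs(scale))     # its inverse has derivative 1 / scale
    return base_log_density(y) + log_det


def nll(samples: list[float]) -> float:
    """Negative log-likelihood of data under the pushforward distribution."""
    return -sum(log_density(y) for y in samples)


print(log_density(-2.5), nll([-2.0, -2.5, -3.0]))
```

With these two layers the composed map is $x \mapsto x - 2.5$, so the pushforward is a unit-variance Gaussian centred in $-2.5$. Training would adjust the scales and shifts to lower `nll` on real data.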
|
||||
|
||||
## Continuous Normalizing Flows
|
||||
|
||||
To solve the problem of expressivity, a solution is to describe the flow through an ODE
and solve it with the Euler method.
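As a rough sketch of what solving the ODE with the Euler method means, the code below integrates $\frac{dx}{dt} = v(x, t)$ from $t = 0$ to $t = 1$. The velocity field $v(x, t) = -x$ is a hypothetical stand-in for a learnt network, chosen because its exact solution $x_0 e^{-t}$ is known.

```python
import math


def velocity(x: float, t: float) -> float:
    """Hypothetical velocity field; in a CNF this would be a neural network."""
    return -x


def euler_integrate(x0: float, n_steps: int, t0: float = 0.0, t1: float = 1.0) -> float:
    """Explicit Euler: x_{n+1} = x_n + h * v(x_n, t_n)."""
    h = (t1 - t0) / n_steps
    x, t = x0, t0
    for _ in range(n_steps):
        x = x + h * velocity(x, t)
        t += h
    return x


exact = math.exp(-1.0)
coarse = euler_integrate(1.0, n_steps=10)
fine = euler_integrate(1.0, n_steps=1000)

print(coarse, fine, exact)
```

More steps give a smaller error at a higher cost, which is the trade-off every sampler based on an ODE has to manage.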
|
||||
|
||||
## Flow Matching
|
||||
|
||||
The idea is to train a velocity function that will bring us to the right
|
||||
distribution, and this is our flow matching. By treating it like an energy based
|
||||
model, we get:
|
||||
|
||||
$$
|
||||
E_t(\mathcal{X}_0, \mathcal{X}_1) =
|
||||
|| v^W_t(\mathcal{X}_t) - v_t(\mathcal{X}_t) ||^2
|
||||
$$
|
||||
|
||||
Since we need our point to travel how we want it, we influence its velocity.
|
||||
Thus, by minimizing this error, we get the target trajectory.
|
||||
|
||||
However, there's a problem: we don't know $v_t(\mathcal{X}_t)$, and we must
find a method to derive it.
|
||||
|
||||
### Midpoint Method (aka Modified Euler's method)
|
||||
|
||||
The actual algorithm from [Wikipedia](https://en.wikipedia.org/wiki/Midpoint_method) is:
|
||||
|
||||
$$
|
||||
y_{n + 1} = y_{n} + hf\left(y_n + \frac{h}{2}f(y_n, t_n), t_n + \frac{h}{2}\right)
|
||||
$$
|
||||
|
||||
translated to code becomes:
|
||||
|
||||
```python
# Technically it's velocity, not speed:
#   velocity: vector
#   speed: magnitude
def compute_speed(old_position: list[float], current_time: float) -> list[float]:
    # This function is what we want to find
    raise NotImplementedError


def compute_new_position(
    old_position: list[float],
    old_speed: list[float],
    current_time: float,
    time_step: float,
) -> list[float]:
    half_step = time_step / 2

    # Euler half-step to estimate the position at the midpoint
    half_point = [p + half_step * s for p, s in zip(old_position, old_speed)]
    half_point_speed = compute_speed(half_point, current_time + half_step)

    # Full step of size h taken with the velocity evaluated at the midpoint
    return [p + time_step * s for p, s in zip(old_position, half_point_speed)]
```
|
||||
|
||||
Obviously we don't know what the velocity function is, thus we just need to **learn**
it. Now, inverting the formula we get that (with some abuse of notation) our
target velocity at that point is:
|
||||
|
||||
$$
|
||||
f_{t + \frac{h}{2}} = \frac{y_{n + 1} - y_n}{h}
|
||||
$$
|
||||
|
||||
And since during training time we can compute this, as we have both
|
||||
$y_n$ and $y_{n+1}$, we just need to set h to an arbitrary value, say $1$ and we
|
||||
can easily compute this.
|
||||
|
||||
Now, we just need to learn the velocity function, and this is just the gradient
|
||||
descent of the difference between our learnt function and the computed one.
|
||||
Then, if we want to use an energy model, we get:
|
||||
|
||||
$$
|
||||
E_t(\mathcal{X}_0, \mathcal{X}_1) =
|
||||
|| v^W_t(\mathcal{X}_t) - (\mathcal{X}_1 - \mathcal{X}_0) ||^2
|
||||
$$
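A sketch of how one training example could be built from this energy. The linear interpolation used for $\mathcal{X}_t$ is an assumption consistent with the constant target velocity $\mathcal{X}_1 - \mathcal{X}_0$, and the `predicted` vectors stand in for the output of the learnt network $v^W_t$.

```python
def interpolate(x0: list[float], x1: list[float], t: float) -> list[float]:
    """Point on the straight line between x0 (t = 0) and x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: list[float], x1: list[float]) -> list[float]:
    """Constant velocity that brings x0 to x1 in one unit of time."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_energy(predicted: list[float], x0: list[float], x1: list[float]) -> float:
    """Squared distance between predicted and target velocity."""
    return sum((p - u) ** 2 for p, u in zip(predicted, target_velocity(x0, x1)))


x0 = [0.0, 0.0]   # sample from the source distribution
x1 = [1.0, 2.0]   # sample from the data
x_t = interpolate(x0, x1, t=0.25)   # where the network would be evaluated

perfect = flow_matching_energy([1.0, 2.0], x0, x1)
off = flow_matching_energy([0.0, 0.0], x0, x1)

print(x_t, perfect, off)
```

Averaging this energy over many sampled triples and minimizing it with gradient descent is what the training loop would do.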
|
||||
|
||||
> [!NOTE]
|
||||
> There's another method, applying Markov chains, though it's
|
||||
> basically a flow matching with Euler's standard method where
|
||||
> $h \rightarrow \infty$ and so has multiple steps
|
||||
|
||||
> [!CAUTION]
|
||||
> This method is perfectly equivalent to using a conditional flow matching where
> $z$ is sampled from a Dirac distribution (basically a Normal distribution with
> $\sigma = 0$).
|
||||
|
||||
## Conditional Flow Matching
|
||||
|
||||
Now, let's say that we want to describe the velocity $v$ in other ways. The problem
is that there isn't a unique path to go from $q$ to $p$.
|
||||
|
||||

|
||||
|
||||
> GIF taken from https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/
|
||||
|
||||
So the problem is now to describe this probability path over time. However, we don't
know the analytic form of $p$, only a set of data points sampled from it.

The idea to solve this is to define a probability path (read: velocity) that is
conditioned on a variable $z$, the conditioning variable, written $p(x | t, z)$.
Once we choose a particular $z$, we get a path that brings $x$
from $q$ to $p$ and that has an analytic form.

The trick here is to find the probability $p(x | t)$ by marginalizing
$p(x | t, z)$ over $z$. In practice this means that we just have to sum (or integrate) over
all possible $z$ values.

Since $z$ is a random variable, two common choices are sampling
from a linear interpolation or from a conical Gaussian path[^a-visual-dive]:
|
||||
|
||||
$$
\text{Linear Interpolation} \\
\begin{cases}
    \mu = (1 - t)\cdot x_q + t \cdot x_p \\
    \sigma = 0 \\
    p(x | t, z = (x_q, x_p)) = \mathcal{N}(\mu, \sigma^2 I) \\
\end{cases}
\\
\text{ }
\\
\text{Velocity} \rightarrow v(x, t) = x_p - x_q
\\
\text{ }
\\
\text{ }
\\
\text{Conical Gaussian Path} \\
\begin{cases}
    \mu = t \cdot x_p \\
    \sigma = 1 - t \\
    p(x | t, z = (x_p)) = \mathcal{N}(\mu, \sigma^2 I) \\
\end{cases}
\\
\text{ }
\\
\text{Velocity} \rightarrow v(x, t) = \frac{x_p - x}{1 - t}
$$
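A sketch of sampling from the conical Gaussian path, under the assumption that a point on the path is written as $x_t = t \cdot x_p + (1 - t)\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, i.e. with standard deviation $1 - t$. Differentiating this expression with respect to $t$ gives $x_p - \varepsilon$, which is the same as $\frac{x_p - x_t}{1 - t}$; the code checks that identity. The function name and the seed are hypothetical.

```python
import random


def sample_conical(x_p: list[float], t: float, rng: random.Random):
    """Draw x_t ~ N(t * x_p, (1 - t)^2 I) and the conditional velocity at x_t."""
    eps = [rng.gauss(0.0, 1.0) for _ in x_p]
    x_t = [t * p + (1.0 - t) * e for p, e in zip(x_p, eps)]
    # Velocity of the path through x_t, valid for t < 1
    v = [(p - x) / (1.0 - t) for p, x in zip(x_p, x_t)]
    return eps, x_t, v


rng = random.Random(0)      # fixed seed, only for reproducibility
x_p = [1.0, -2.0]           # a data point, playing the role of z
eps, x_t, v = sample_conical(x_p, t=0.3, rng=rng)

print(x_t, v)
```

At $t = 0$ the sample is pure noise and at $t \rightarrow 1$ it collapses onto $x_p$, which is where the cone shape comes from.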
|
||||
|
||||
To find the velocity, it's possible to use the [continuity equation](https://en.wikipedia.org/wiki/Continuity_equation).
|
||||
|
||||
[^a-visual-dive]: [A Visual Dive into Conditional Flow Matching | 26th November 2025](https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/)
|
||||
|
||||
## Conditional Optimal Flow
|
||||
|
||||
One thing that could happen during the construction of our paths is that they cross.
To solve this problem, we may choose the pairs of source and target points through an
Optimal Transport coupling instead of pairing them at random, so that their paths won't cross.

In practice we only use minibatch Optimal Transport, as the full coupling is costly to compute.
|
||||
|
||||
## ReFlow
|
||||
|
||||
Technically speaking, a reflow is nothing more than what we already saw in flow
matching[^flow-with]. The difference lies in how we retrain the model and how
we sample points.

We first train a model as in flow matching, and then we train another model
without providing the true labels: the target labels come
from our original model instead.

While we would normally expect our original model to have learnt crossing trajectories,
this is not the case. In addition, changing the integration method used for sampling
from Euler to 4th-order Runge-Kutta gives better samples.

By combining both techniques, our reflowed model will learn straighter trajectories.
|
||||
|
||||
> [!WARNING]
|
||||
> It's not necessary that the 2 models are different or equal, nor do you need to
> re-use the former model's weights.
|
||||
|
||||
[^flow-with]: [Flow With What You Know | 26th November 2025](https://drscotthawley.github.io/blog/posts/FlowModels.html#abstract)
|
||||
|
||||
## Guidance for generation
|
||||
|
||||
As for diffusion, we now know how to generate the target distribution, but
|
||||
what if the target distribution is the distribution of valid images and we want
|
||||
a picture of a dog?
|
||||
|
||||
### Classifier Guidance
|
||||
|
||||
We could take our unconditioned model and use a classifier, plus another input used
|
||||
to guide, to change our model parameters (so, our velocity). However this means
|
||||
that we need a classifier that tells us if our generated output matches the label
|
||||
or not:
|
||||
|
||||
$$
|
||||
v_{W, t}(x | y) = v(x) + wb_t \nabla \log p_{Y |t}(y | x)
|
||||
$$
|
||||
|
||||
This means that our new velocity will be influenced by the weighted gradient of the
log-probability of obtaining $y$ starting from $x$. Now we don't need to
retrain our model from scratch.

Since we are combining 2 different models, their magnitudes do not combine
well, leading to potential problems. Moreover, the classifier is not *"perfect"*,
so we need to consider its errors as well.
|
||||
|
||||
### Classifier Free Guidance
|
||||
|
||||
Instead of using a classifier, we can fine-tune our model so that it takes the conditioning into account.
Starting from the classifier equation for the velocity, we can demonstrate that:
|
||||
|
||||
$$
|
||||
u_{W, t}(x | y) = (1 - w)v(x) + w \cdot v(x | y)
|
||||
$$
|
||||
|
||||
Now, we can reuse the previous model and we will retrain it by passing
|
||||
$y = \emptyset \text{ or } y \in \mathcal{Y}$ with probability $\eta$ for
|
||||
being empty.
|
||||
|
||||
Once retrained, during inference we can sample by using this formula, where
$y = \emptyset$ if it comes from an unguided sampling and $y \in \mathcal{Y}$ if
it's guided.
|
||||
|
||||
$$
|
||||
d X_t = [(1 - w)v_{W, t}(X_t | \emptyset) + wv_{W, t}(X_t | y)] dt \\
|
||||
w > 1
|
||||
$$
|
||||
|
||||
As you can notice, since $w > 1$, the coefficient $(1 - w)$ is negative: the
unconditioned velocity is subtracted, which pushes the result further towards the
conditioned velocity. Moreover, if the prompt is unconditioned ($y = \emptyset$), both
terms use the same velocity and the result is equal to the unmodified unconditional velocity.
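The combination of the two velocities is a one-liner. The sketch below uses made-up vectors in place of the two network outputs $v_{W, t}(X_t | \emptyset)$ and $v_{W, t}(X_t | y)$, only to show how the weight $w$ acts.

```python
def guided_velocity(v_uncond: list[float], v_cond: list[float], w: float) -> list[float]:
    """Classifier-free guidance: (1 - w) * v(x | empty) + w * v(x | y)."""
    return [(1.0 - w) * u + w * c for u, c in zip(v_uncond, v_cond)]


v_uncond = [1.0, 0.0]   # hypothetical unconditional velocity
v_cond = [2.0, 1.0]     # hypothetical conditional velocity

plain = guided_velocity(v_uncond, v_cond, w=1.0)     # just the conditional velocity
guided = guided_velocity(v_uncond, v_cond, w=3.0)    # extrapolates past it
no_prompt = guided_velocity(v_uncond, v_uncond, w=3.0)

print(plain, guided, no_prompt)
```

With $w = 3$ the result moves away from the unconditional velocity by three times the difference between the two, while an empty prompt leaves the unconditional velocity untouched.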
|
||||
|
||||
## Stable Diffusion 3
|
||||
|
||||
## References
|
||||
|
||||
- [An Introduction to Flow Matching and Diffusion Models](https://arxiv.org/pdf/2506.02070)
|
||||
- [Introduction to Flow Matching and Diffusion Models](https://diffusion.csail.mit.edu/)
|
||||
- [A Visual Dive into Conditional Flow Matching](https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/)
|
||||
- [Rectified Flow](https://www.cs.utexas.edu/~lqiang/rectflow/html/intro.html)
|
||||
- [An introduction to Flow Matching ](https://mlg.eng.cam.ac.uk/blog/2024/01/20/flow-matching.html#coupling)
|
||||
- [Flow With What You Know](https://drscotthawley.github.io/blog/posts/FlowModels.html#abstract)
|
||||
- [Diffusion Meets Flow Matching](https://diffusionflow.github.io/)
|
||||
BIN
Chapters/17-Flow-Generative-Models/pngs/diffusion-mit.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 191 KiB |
BIN
Chapters/17-Flow-Generative-Models/pngs/infinite-paths.gif
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 17 MiB |
176
Chapters/18-Advanced-Attention/INDEX.md
Normal file
@ -0,0 +1,176 @@
|
||||
# Advanced Attention
|
||||
|
||||
## KV Caching
|
||||
|
||||
The idea behind this is that during generation in autoregressive transformers,
such as GPT, values for $K$ and $V$ are recomputed at every step, wasting computing power.
|
||||
|
||||

|
||||
> GIF taken from https://medium.com/@joaolages/kv-caching-explained-276520203249
|
||||
|
||||
As we can notice, all tokens from previous steps get recomputed, as well as their
$K$ and $V$ values. So, a solution is to cache all keys and values up to step $n$
and compute only the ones for the new token.
|
||||
|
||||

|
||||
> GIF taken from https://medium.com/@joaolages/kv-caching-explained-276520203249
|
||||
|
||||
Moreover, if we skip computing $QK^T$ for previous steps, we obtain just
the token we are interested in.
|
||||
|
||||
To compute the size needed to have a $KV$ cache, let's go step by step:
|
||||
|
||||
- For each layer, we need to store both $K$ and $V$, which have the same
  dimensions (in this context). Their size depends on the number of heads (so
  the *"number"* of $K$ matrices), the head dimension, the number of incoming
  tokens (the sequence length) and the number of bytes of `d_type`, usually
  `float16`, thus 2 Bytes:

  $$
  \text{Layer\_Space} = 2 \times \text{n\_heads} \times \text{d\_heads} \times
      \text{seq\_len} \times \text{d\_type}
  $$

- Now, when we pass a minibatch, we will have a tensor of
  dimensions $N \times \text{seq\_length} \times \text{d\_model}$. When they
  are processed by $W_K$ and $W_V$, we will have to store $N$ times more
  values:

  $$
  \text{Batch\_Layer\_Space} = \text{Layer\_Space} \times N
  $$

- If you have $L$ layers, at the end you'll need space
  equivalent to:

  $$
  \text{Model\_Space} = \text{Batch\_Layer\_Space} \times L
  $$
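The three formulas above chain into a single computation. The sketch below plugs in a hypothetical configuration (32 layers, 32 heads of dimension 128, 4096 tokens, batch of 1, `float16`); the numbers are illustrative and not taken from a specific model.

```python
def kv_cache_bytes(
    n_layers: int,
    n_heads: int,
    d_head: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,   # float16
) -> int:
    """Total KV cache size following the three steps above."""
    # 2 accounts for storing both K and V
    layer_space = 2 * n_heads * d_head * seq_len * bytes_per_value
    batch_layer_space = layer_space * batch_size
    return batch_layer_space * n_layers


total = kv_cache_bytes(n_layers=32, n_heads=32, d_head=128, seq_len=4096, batch_size=1)
print(total / 1024**3, "GiB")
```

Since the size is linear in every factor, doubling the sequence length or the batch size doubles the cache.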
|
||||
|
||||
## Multi-Query and Grouped-Query Attention
|
||||
|
||||
The idea here is that we don't need different keys and values for each head, but only
different queries. These approaches drastically reduce memory consumption at
a slight cost in accuracy.

In the **multi-query** approach, we have only one $K$ and one $V$, with a number of
queries equal to the number of heads. In the **grouped-query** approach, we have
a hyperparameter $G$ that determines how many $K$ and $V$ matrices are
in the layer, while the number of queries remains equal to the number of attention
heads. Then, each query is assigned to a group with its own $K_g$ and $V_g$.
|
||||
|
||||

|
||||
> Image taken from [Multi-Query & Grouped-Query Attention](https://tinkerd.net/blog/machine-learning/multi-query-attention/#multi-query-attention-mqa)
|
||||
|
||||
Now, the new layer size becomes:

$$
\text{Layer\_Space} = 2 \times G \times \text{d\_heads} \times
\text{seq\_len} \times \text{d\_type}
$$

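To get a feeling for the saving, the sketch below evaluates the layer size for different values of $G$ and shows one possible way of assigning query heads to groups. The helper names and the numbers are hypothetical, and the contiguous, equally sized grouping is an assumption: the text above does not fix how queries are grouped.

```python
def grouped_layer_bytes(n_groups, d_head, seq_len, bytes_per_value=2):
    """Per-layer cache size when only n_groups K/V pairs are stored."""
    return 2 * n_groups * d_head * seq_len * bytes_per_value


def kv_group_of(head, n_heads, n_groups):
    """Index of the K_g/V_g pair used by a query head (assumed contiguous groups)."""
    if n_heads % n_groups != 0:
        raise ValueError("n_heads must be a multiple of n_groups")
    return head // (n_heads // n_groups)


# G = n_heads is plain multi-head attention, G = 1 is multi-query attention.
for name, groups in [("multi-head", 32), ("grouped-query", 8), ("multi-query", 1)]:
    size = grouped_layer_bytes(groups, d_head=128, seq_len=4096)
    print(f"{name}: {size / 2**20:.0f} MiB per layer")

print(kv_group_of(head=5, n_heads=32, n_groups=8))  # 1
```

With these made-up values the layer shrinks from 64 MiB to 16 MiB with 8 groups and to 2 MiB with a single shared $K$ and $V$, that is, by a factor of $\text{n\_heads} / G$.
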
## Multi-Head Latent Attention

The idea is to reduce memory consumption through a low-rank factorization of
the $Q$, $K$, and $V$ computation. This means that each matrix is factorized
into 2 matrices so that:

$$
A = M_l \times M_r
\\
A \in \R^{in \times out}, M_l \in \R^{in \times rank},
M_r \in \R^{rank \times out}
$$

The problem, though, is that this method introduces compression: the lower
$rank$ is, the more compression artifacts appear.

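To see when the factorization actually pays off, we can count the stored values on both sides of $A = M_l \times M_r$. This is a minimal sketch with hypothetical helper names and sizes:

```python
def full_params(d_in, d_out):
    """Values stored by the full matrix A."""
    return d_in * d_out


def factored_params(d_in, d_out, rank):
    """Values stored by M_l (d_in x rank) plus M_r (rank x d_out)."""
    return rank * (d_in + d_out)


# Hypothetical sizes, only for illustration.
d_in, d_out, rank = 4096, 4096, 512
print(full_params(d_in, d_out))            # 16777216
print(factored_params(d_in, d_out, rank))  # 4194304
```

The two factors are smaller than $A$ only when $rank < \frac{in \times out}{in + out}$, which is 2048 for the sizes above.
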
What we are going to compress are the weight matrices for $Q$, $K$ and $V$:

$$
\begin{aligned}
    Q = X \times W_Q^L \times W_Q^R
\end{aligned} \\
X \in \R^{N\times S \times d_{\text{model}}},
W_Q \in \R^{d_{\text{model}} \times (n_\text{head} \cdot d_\text{head})}
\\
W_Q^L\in \R^{d_{\text{model}} \times rank},
W_Q^R\in \R^{rank \times (n_\text{head} \cdot d_\text{head})}
\\
\text{}
\\
W_Q \simeq W_Q^L \times W_Q^R
$$

For simplicity, we didn't write the equations for $K$ and $V$, which are
basically the same. At this point, it may seem that we have just increased the
number of operations from 1 matmul to 2 for each matrix. However, the real
power shows when we take a look at the actual computation:

$$
\begin{aligned}
    H_i &= \text{softmax}\left(
        \frac{
            XW_Q^LW_{Q,i}^R \times (XW_{KV}^LW_{K,i}^R)^T
        }{
            \sqrt{\text{d\_model}}
        }
    \right) \times X W_{KV}^L W_{V,i}^R \\
    &= \text{softmax}\left(
        \frac{
            XW_Q^LW_{Q,i}^R \times W_{K,i}^{R^T} W_{KV}^{L^T} X^T
        }{
            \sqrt{\text{d\_model}}
        }
    \right) \times X W_{KV}^L W_{V,i}^R \\
    &= \text{softmax}\left(
        \frac{
            C_{Q} \times W_{Q,i}^R W_{K,i}^{R^T} \times C_{KV}^T
        }{
            \sqrt{\text{d\_model}}
        }
    \right) \times C_{KV} W_{V,i}^R \\
\end{aligned}
$$

As can be seen, $C_Q = X W_Q^L$ and $C_{KV} = X W_{KV}^L$ do not depend on the
head, so they can be computed once and shared across all heads. The product
$W_{Q,i}^R \times W_{K,i}^{R^T}$ does depend on the head number, but it can be
computed ahead of time, as it does not depend on the input.

So, for each attention head we need just 3 matmuls plus another 2 happening
at runtime. Moreover, if we want to, we can still apply caching over $C_{KV}$.

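The rewriting above can be checked numerically. The following NumPy sketch uses random matrices for a single head $i$ and a single sequence (no batch dimension); all sizes and variable names are assumptions made up for illustration. It only compares the attention scores before the softmax, since scaling and softmax are identical in both forms.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, rank, d_head = 6, 32, 8, 4  # hypothetical sizes

X = rng.standard_normal((seq_len, d_model))
W_Q_L = rng.standard_normal((d_model, rank))   # shared by all heads
W_KV_L = rng.standard_normal((d_model, rank))  # shared by all heads
W_Q_R = rng.standard_normal((rank, d_head))    # W_{Q,i}^R for one head
W_K_R = rng.standard_normal((rank, d_head))    # W_{K,i}^R for one head

# First line of the derivation: rebuild Q_i and K_i, then multiply them.
Q_i = X @ W_Q_L @ W_Q_R
K_i = X @ W_KV_L @ W_K_R
naive_scores = Q_i @ K_i.T

# Last line: the head-independent latents and the input-independent product.
C_Q = X @ W_Q_L
C_KV = X @ W_KV_L
W_QK = W_Q_R @ W_K_R.T  # (rank, rank), can be computed ahead of time
absorbed_scores = C_Q @ W_QK @ C_KV.T

print(np.allclose(naive_scores, absorbed_scores))  # True
print(C_KV.shape)  # (6, 8): seq_len x rank, shared by every head
```

Both forms give the same scores, because matrix multiplication is associative. The only input-dependent tensors left are $C_Q$ and $C_{KV}$.
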
## Decoupled RoPE for Multi-Head Latent Attention

Since `RoPE` is a positional embedding applied **during** attention, it causes
problems if we use standard Multi-Head Latent Attention. In fact, it
should be applied to both matrices $Q$ and $K$, separately.

Since they come from $C_Q \times W_{Q,i}^R$ and
$W_{K,i}^{R^T} \times C_{KV}^T$, the position-dependent rotation ends up
between the two head matrices, which means that we can't cache
$W_{Q,i}^R \times W_{K,i}^{R^T}$ anymore.

A solution is to cache the head matrices, but not their product, and to compute
new pieces that go through `RoPE` and are then concatenated
to the actual queries and keys:

$$
\begin{aligned}
    Q_{R,i} &= \text{RoPE}(C_Q \times W_{QR,i}^R) \\
    K_{R,i} &= \text{RoPE}(X \times W_{KR,i}^L)
\end{aligned}
$$

These matrices will then be concatenated with the reconstruction of $Q$ and $K$.

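As a rough sketch of the concatenation, the NumPy code below builds $Q_{R,i}$ and $K_{R,i}$ for one head and appends them to the reconstructed queries and keys. The `rope` function is a simplified implementation (it rotates consecutive feature pairs, with an assumed base of 10000), and all sizes and variable names are hypothetical.

```python
import numpy as np


def rope(x, base=10000.0):
    """Rotate consecutive feature pairs of x (seq_len, d) by position-dependent angles."""
    seq_len, d = x.shape
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)  # (d / 2,)
    angles = positions * freqs                 # (seq_len, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out


rng = np.random.default_rng(0)
seq_len, d_model, rank, d_head, d_rope = 6, 32, 8, 4, 4  # hypothetical sizes

X = rng.standard_normal((seq_len, d_model))
C_Q = X @ rng.standard_normal((d_model, rank))   # X W_Q^L
C_KV = X @ rng.standard_normal((d_model, rank))  # X W_KV^L
W_Q_R = rng.standard_normal((rank, d_head))      # W_{Q,i}^R
W_K_R = rng.standard_normal((rank, d_head))      # W_{K,i}^R
W_QR_R = rng.standard_normal((rank, d_rope))     # W_{QR,i}^R
W_KR_L = rng.standard_normal((d_model, d_rope))  # W_{KR,i}^L

Q_R = rope(C_Q @ W_QR_R)
K_R = rope(X @ W_KR_L)

# Concatenate the RoPE pieces to the reconstructed queries and keys.
Q_full = np.concatenate([C_Q @ W_Q_R, Q_R], axis=1)
K_full = np.concatenate([C_KV @ W_K_R, K_R], axis=1)

scores = Q_full @ K_full.T
content_scores = (C_Q @ W_Q_R) @ (C_KV @ W_K_R).T
rope_scores = Q_R @ K_R.T
print(np.allclose(scores, content_scores + rope_scores))  # True
```

Because of the concatenation, the scores split into a sum of two terms: one that never sees `RoPE` and one that carries all the positional information.
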
## References

- [KV Caching Explained: Optimizing Transformer Inference Efficiency](https://huggingface.co/blog/not-lain/kv-caching)
- [Transformers KV Caching Explained](https://medium.com/@joaolages/kv-caching-explained-276520203249)
- [How to calculate size of KV cache](https://www.rohan-paul.com/p/how-to-calculate-size-of-kv-cache)
- [Multi-Query & Grouped-Query Attention](https://tinkerd.net/blog/machine-learning/multi-query-attention/#multi-query-attention-mqa)
- [Understanding Multi-Head Latent Attention](https://planetbanatt.net/articles/mla.html)
- [A Gentle Introduction to Multi-Head Latent Attention (MLA)](https://machinelearningmastery.com/a-gentle-introduction-to-multi-head-latent-attention-mla/)
- [DeepSeek-V3 Explained 1: Multi-head Latent Attention](https://medium.com/data-science/deepseek-v3-explained-1-multi-head-latent-attention-ed6bee2a67c4)