Compare commits

...

5 Commits

Author SHA1 Message Date
Christian Risi
3a7f2efa3e Added Advanced Attention Methods 2025-11-29 21:02:47 +01:00
Christian Risi
9640ae1898 Added Self-Supervised Learning 2025-11-29 21:02:01 +01:00
Christian Risi
5d5ecff103 Added Energy Based Models 2025-11-27 21:50:25 +01:00
Christian Risi
55a1d38b63 Added Flow and Vector Fields 2025-11-27 21:50:03 +01:00
Christian Risi
490beb316c Added Flow Generative Models 2025-11-27 21:49:43 +01:00
11 changed files with 4135 additions and 0 deletions

View File

@ -120,6 +120,66 @@ H(f) = \begin{bmatrix}
\end{bmatrix}
$$
## [Flow](https://en.wikipedia.org/wiki/Flow_(mathematics))[^wiki-flow]
A flow over a set $A$ is an action of the real numbers $\R$ on $A$, i.e. a mapping $\varphi: A \times \R \rightarrow A$:
$$
a \in A, t \in \R \\
\varphi(a, t) \in A
$$
Moreover, since $\varphi(a, t) \in A$, the flow can be applied again to its own output, and it satisfies:
$$
a \in A, t \in \R, s \in \R \\
\begin{aligned}
\varphi(\varphi(a, t), s) &= \varphi(a, t + s) \in A \\
\varphi(a, 0) &= a
\end{aligned}
$$
In other words, applying a flow for a time $s$ to a point that has already
flowed for a time $t$ is the same as applying the flow once for a time $t + s$
(think of summing times).
Also, $t = 0$ is the neutral element of a flow: it leaves every point where it is.
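
As a quick sanity check of the two properties above, here is a minimal sketch. The flow `flow(a, t) = a * exp(t)` is a hypothetical example chosen only for illustration (it is the solution of $\frac{da}{dt} = a$), not something taken from the referenced article.

```python
import math


def flow(a: float, t: float) -> float:
    """Hypothetical example flow on A = R: the solution of da/dt = a."""
    return a * math.exp(t)


a, t, s = 2.0, 0.3, 1.1
composed = flow(flow(a, t), s)  # flow for time t, then for time s
direct = flow(a, t + s)         # flow once for the summed time
identity = flow(a, 0.0)         # time 0 leaves the point untouched

print(composed, direct, identity)
```

Both `composed` and `direct` agree up to floating point error, while `identity` gives back the starting point.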
## [Vector Field](https://en.wikipedia.org/wiki/Vector_field)[^wiki-vector-field]
It is a mapping from a set $A \subset \R^n$ so that:
$$
V: A \rightarrow \R^n
$$
This means that to each element of $A$, which we can consider a point, it
associates a vector, which we may consider a velocity (though it lives in the
same space as the points).
So, in a way, it can be seen as the amount of movement of that point in space.
## Change of Variables in probability[^stack-change-var]
Let's take 2 random variables, $X$ and $Y$, where $X$ has a CDF $F_X$,
$Y = g(X)$ and $g$ is monotonically increasing. Calling $x = g^{-1}(y)$:
$$
\begin{aligned}
P(Y \leq y) &= P(g(X) \leq y) = P(g^{-1}(g(X)) \leq g^{-1}(y)) = \\
&= P(X \leq x) \rightarrow \\
\rightarrow F_Y(y) &= F_X(x) = F_X(g^{-1}(y))
\end{aligned}
$$
Now, let's differentiate both sides of the equation with respect to $y$:
$$
f_Y(y) = f_X(g^{-1}(y)) \cdot \frac{d\, g^{-1}(y)}{d \, y}
$$
> [!NOTE]
> In case x and y are in higher dimensions, the last term becomes the absolute
> value of the determinant of the Jacobian matrix of $g^{-1}$ (often just
> called the Jacobian)
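
The formula can be checked numerically. The sketch below assumes, purely for illustration, that $X$ is a standard Normal and $g(x) = e^x$ (so $g^{-1}(y) = \log y$); it compares the change-of-variables density with a finite-difference derivative of $F_Y$.

```python
import math


def pdf_x(x: float) -> float:
    """Density of X, assumed here to be a standard Normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def cdf_x(x: float) -> float:
    """CDF of the standard Normal."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def g_inv(y: float) -> float:
    """Inverse of the assumed increasing map g(x) = exp(x)."""
    return math.log(y)


def pdf_y(y: float) -> float:
    """Change of variables: f_X(g^-1(y)) times d g^-1(y) / dy = 1 / y."""
    return pdf_x(g_inv(y)) * (1.0 / y)


y, eps = 1.5, 1e-5
# Central finite difference of F_Y(y) = F_X(g^-1(y))
numeric = (cdf_x(g_inv(y + eps)) - cdf_x(g_inv(y - eps))) / (2 * eps)
print(pdf_y(y), numeric)
```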
[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)
[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)
@ -127,3 +187,9 @@ $$
[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))
[^wiki-pca]: [Wikipedia | Principal Component Analysis | 18th November 2025](https://en.wikipedia.org/wiki/Principal_component_analysis#Computation_using_the_covariance_method)
[^wiki-flow]: [Wikipedia | Flow (Mathematics) | 23rd November 2025](https://en.wikipedia.org/wiki/Flow_(mathematics))
[^wiki-vector-field]: [Wikipedia | Vector Field | 23rd November 2025](https://en.wikipedia.org/wiki/Vector_field)
[^stack-change-var]: [StackExchange | Derivation of change of variables of a probability density function? | 25th November 2025](https://stats.stackexchange.com/questions/239588/derivation-of-change-of-variables-of-a-probability-density-function)

View File

@ -0,0 +1,252 @@
# Energy Based Models
These models take 2 inputs, $x$ and a candidate $y$. They are then
passed through a function that outputs their compatibility, or
***"goodness"***[^stanford-lecun]. The lower the value, the better the
compatibility.
The objective of this model is finding the value with the lowest energy, that
is the most compatible one, across all possible values:
$$
\hat{y} = \argmin_{y \in \mathcal{Y}}(E(x, y)) \\
x \coloneqq \text{ network output | } \mathcal{Y} \coloneqq
\text{ set of possible labels}
$$
The difficult part here is choosing said $\hat{y}$ as the space $\mathcal{Y}$ may
be too large, or even infinite, and there could be multiple $y$ that may have the
same energy, so multiple solutions.
Since we want to model our function $E(x, y)$, it needs to change according to
some weights $W$. Since each time we change $W$ we are changing also the output
of $E(x, y)$, we are technically defining a family of energy functions.
However, not all of them give us what we need, thus, by tuning $W$, we explore
this space to find a $E_W(x, y)$ that gives acceptable results.
During inference, since the function has been built to give low energy to
compatible pairs, we generate a random $y^*_0$
and then we update it through gradient descent.
> [!TIP]
> If this is unclear, think that you have fixed your function $E_{W}$ and
> notice that $x$ is fixed too. During gradient descent, you can update
> the value of $y^*$ with the same algorithm used for $W$.
>
> Now, you are trying to find a minimum of $E_W$, meaning that $y^*$ will be
> the optimal solution when the energy reaches its minimum, ideally 0 or around it.
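
The inference loop described above can be sketched as follows. Everything here is an assumption made for illustration: `g` stands in for a frozen network $G_W$, and the energy is the quadratic $\frac{1}{2}||G_W(x) - y||^2$, whose gradient with respect to $y$ is $y - G_W(x)$.

```python
import numpy as np


def g(x: np.ndarray) -> np.ndarray:
    """Hypothetical frozen network G_W (weights are not updated here)."""
    return 2.0 * x


def energy(x: np.ndarray, y: np.ndarray) -> float:
    """Assumed energy: half the squared distance between G_W(x) and y."""
    return 0.5 * float(np.sum((g(x) - y) ** 2))


def grad_y(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Gradient of the energy with respect to y only."""
    return y - g(x)


def infer(x: np.ndarray, steps: int = 200, lr: float = 0.1, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    y = rng.normal(size=x.shape)      # random starting guess y*_0
    for _ in range(steps):
        y = y - lr * grad_y(x, y)     # same update rule used for W, applied to y
    return y


x = np.array([1.0, -0.5])
y_star = infer(x)
print(y_star, energy(x, y_star))
```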
## Designing a good Loss[^stanford-lecun]
> [!NOTE]
>
> - $y_i$: correct label for $x_i$
> - $y^*_i$: lowest energy label
> - $\bar{y}_i$: lowest energy incorrect label
### Using the energy function
Since we want the energy for $y_i$ to be 0 for $x_i$, everything else is a loss for
us:
$$
L_i(y_i, E_{W}(x_i, y_i)) = E_{W}(x_i, y_i)
$$
However, this loss does not increase the energy of the other labels, possibly
resulting in the constant function:
$$
E_W(x_i, y_j) = 0 \,\, \forall x_i \in X, \forall y_j \in \mathcal{Y}
$$
In other words, it gives 0 for all possible labels, resulting in a collapsed plane
where all energies are equal.
### Generalized Perceptron Loss
Another way is to push down the energy of $y_i$ while pulling up the lowest
energy found across all labels:
$$
L_i(y_i, E_{W}(x_i, y_i)) = E_{W}(x_i, y_i) - \min_{y \in \mathcal{Y}}E_{W}(x_i, y)
$$
When the loss becomes 0, both terms are equal, meaning that
$E(x_i, y_i)$ has the lowest value. However, this doesn't imply that there isn't
another $y_j$ such that $E(x_i, y_i) = E(x_i, y_j)$, so this method is
still susceptible to flat planes.
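
To make the collapse problem concrete, here is a small sketch over a discrete set of labels. The energy values are made up for illustration; the point is that both losses are 0 on a completely flat energy table.

```python
import numpy as np


def energy_loss(energies: np.ndarray, correct: int) -> float:
    """Energy loss: just the energy of the correct label."""
    return float(energies[correct])


def perceptron_loss(energies: np.ndarray, correct: int) -> float:
    """Generalized perceptron loss: correct energy minus the lowest energy."""
    return float(energies[correct] - energies.min())


flat = np.zeros(3)                   # collapsed: every label has energy 0
good = np.array([0.0, 2.0, 3.0])     # correct label 0 is clearly the lowest
wrong = np.array([1.0, 0.5, 2.0])    # label 1 beats the correct label 0

for name, table in [("flat", flat), ("good", good), ("wrong", wrong)]:
    print(name, energy_loss(table, 0), perceptron_loss(table, 0))
```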
### Good losses
To avoid the problem of a collapsed plane, we can use several losses:
- [**Hinge Loss**](../4-Loss-Functions/INDEX.md#hingeembeddingloss)
- **Log Loss**
- **MCE Loss**
- **Square-Square Loss**
- **Square-Exponential Loss**
- [**Negative Log-Likelihood Loss**](../4-Loss-Functions/INDEX.md#nllloss)
- **Minimum Empirical Error Loss**
All of these operate to increase the energy gap between $y_i$ and $\bar{y}_i$ so
that the *"best incorrect answer"* is at least $margin$ away from the correct one(s).
> [!WARNING]
> Negative Log Likelihood makes the plane way too harsh, making it like a ravine
> where good values are in the ravine and bad ones are on top[^anelli-15]
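
As an example of a margin-based loss, here is a minimal sketch of a hinge loss over a discrete label set, in the form $\max(0, margin + E(x_i, y_i) - E(x_i, \bar{y}_i))$. The energy tables are made up for illustration.

```python
import numpy as np


def hinge_loss(energies: np.ndarray, correct: int, margin: float = 1.0) -> float:
    """Penalize when the best incorrect label is less than `margin` above the correct one."""
    incorrect = np.delete(energies, correct)
    best_incorrect = float(incorrect.min())   # energy of the lowest-energy incorrect label
    return max(0.0, margin + float(energies[correct]) - best_incorrect)


flat = np.zeros(3)                  # collapsed plane
good = np.array([0.0, 2.0, 3.0])    # correct label 0 is well separated

print(hinge_loss(flat, 0), hinge_loss(good, 0))
```

Unlike the two losses of the previous sections, the flat table is now penalized, so a collapsed plane is no longer a minimum.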
## Energy Based Model Architectures
> [!CAUTION]
> Here $x$, $y$ may be scalars or vectors
![energy based model architectures](./pngs/emb-architectures.png)
### Regression
The energy function for a regression is simply:
$$
E_W(x, y) = \frac{1}{2}|| G_W(x) - y ||_2^2
$$
During training, this architecture modifies $W$ so that $G_W(x_i) \sim y_i$,
while staying far from all the other $y_j$.
During inference, $y^*_i$ will be the one that is the most similar to $G_W(x_i)$.
### Implicit Regression
We usually use this architecture when we want more than one possible answer.
The trick is to map each admissible $y$ to the same transformation of $x_i$:
$$
E_W(x, y) = \frac{1}{2}|| G_{W_X}(x) - G_{W_Y}(y) ||_2^2
$$
During training we make $G_{W_X}(x_i) \sim G_{W_Y}(y)$ for every admissible
$y \in \mathcal{Y}_i$, while during inference we choose the value from
$\mathcal{Y}^*_i$ that has the least energy.
### Binary Classification
For binary classification problems, we have:
$$
E_W(x, y) = - yG_W(x)
$$
During training it will make $G_W(x_i) \gt 0$ for $y_i = 1$ and vice versa. During
inference, $y_i^*$ will have the same sign as $G_W(x_i)$.
### Multi-Class Classification
For multiclass classification we will have:
$$
E_W(x_i, y) = \sum_{k = 0}^{C} \delta(y - k) \cdot G_W(x_i)_{[k]} \\
\delta(u) \coloneqq \text{ Kronecker delta - } \begin{cases}
1 \rightarrow u = 0 \\
0 \rightarrow u \neq 0
\end{cases}
$$
This is just a way to say that $G_W(x_i)$ produces an array of scores, one per
class, and the energy is equal to the score of class $y$.
During training $G_W(x_i)_{[k]} \sim 0$ for $y_i = k$ while all the others
become high. During inference $y^*_i$ is the index $k$ of the lowest value of
$G_W(x_i)$.
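
A minimal sketch of the multi-class energy follows. The `scores` array plays the role of $G_W(x_i)$ and its values are made up for illustration.

```python
import numpy as np


def multiclass_energy(scores: np.ndarray, y: int) -> float:
    """Sum over classes of delta(y - k) * scores[k], i.e. the score of class y."""
    classes = np.arange(scores.shape[0])
    delta = (classes == y).astype(float)   # Kronecker delta as a one-hot vector
    return float(np.sum(delta * scores))


scores = np.array([2.5, 0.1, 1.7])         # hypothetical output of G_W(x_i)
energies = [multiclass_energy(scores, k) for k in range(len(scores))]
y_star = int(np.argmin(energies))          # inference: lowest-energy class

print(energies, y_star)
```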
## Latent-Variable Architectures
We introduce in the system a *"latent"* variable $z$ that helps our model
capture more details. We never observe this value, nor is it generated from the
inputs[^yt-week-7]:
$$
\hat{y}, \hat{z} = \argmin_{y,z} E(x, y, z)
$$
However, if we want to find $\hat{y}$ alone, marginalizing over $z$ with a
probabilistic approach, we get:
$$
\hat{y} = \argmin_y \lim_{\beta \rightarrow \infty} - \frac{1}{\beta}
\log \int_{z} e^{-\beta E(x, y, z)} \, dz
$$
An advantage is that if we operate over a set $\mathcal{Z}$, we get a set
$\hat{\mathcal{Y}}$ of predictions. However, we need to limit the informativeness
of our variable $z$, otherwise it is possible to perfectly predict $y$.
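
The limit in the formula above can be explored numerically. This sketch assumes a discrete latent variable, so the integral over $z$ becomes a sum, and uses made-up energies; it shows the quantity $-\frac{1}{\beta}\log\sum_z e^{-\beta E}$ approaching $\min_z E$ as $\beta$ grows.

```python
import numpy as np


def free_energy(energies_over_z: np.ndarray, beta: float) -> float:
    """-1/beta * log(sum_z exp(-beta * E_z)), computed with a stable log-sum-exp."""
    a = -beta * energies_over_z
    m = a.max()
    return float(-(m + np.log(np.sum(np.exp(a - m)))) / beta)


e_z = np.array([1.2, 0.4, 3.0])   # hypothetical E(x, y, z) for three values of z

for beta in (1.0, 10.0, 1000.0):
    print(beta, free_energy(e_z, beta))
```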
## Relation between probabilities and Energy
We can think of energy and probability as being the same thing, seen from
2 different points of view:
$$
P(y^* | x) = \frac{
\underbrace{e^{-\beta E_W(x, y^*)}}_{\text{make energy small}}
}{
\int_{y \in \mathcal{Y}} \underbrace{e^{-\beta E_W(x, y)}}_{
\text{make energy big}} \, dy
}
\\
L(y^*, W) = \underbrace{E_W(x, y^*)}_{\text{make small}} + \frac{1}{\beta}
\log \int_{y \in \mathcal{Y}} \underbrace{e^{-\beta E_W(x, y)}}_{\text{make energy big}} \, dy
$$
> [!NOTE]
>
> - $\text{gibbs distribution} \rightarrow e^{-\beta E(x, y^*)}$
> - $\beta$: akin to an inverse temperature
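
Here is a minimal sketch of the link between the two views, assuming a discrete label set (so the integral becomes a sum) and made-up energies. It checks that the loss $E + \frac{1}{\beta}\log\sum_y e^{-\beta E}$ is the negative log of the Gibbs probability divided by $\beta$.

```python
import numpy as np


def gibbs(energies: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Turn energies into probabilities: exp(-beta * E) normalized over all labels."""
    logits = -beta * energies
    weights = np.exp(logits - logits.max())   # shift for numerical stability
    return weights / weights.sum()


def nll_loss(energies: np.ndarray, correct: int, beta: float = 1.0) -> float:
    """E(x, y*) + 1/beta * log(sum_y exp(-beta * E(x, y)))."""
    a = -beta * energies
    m = a.max()
    log_z = m + np.log(np.sum(np.exp(a - m)))
    return float(energies[correct] + log_z / beta)


energies = np.array([0.3, 1.5, 2.0])   # hypothetical energies for 3 labels
beta = 2.0
p = gibbs(energies, beta)
print(p, nll_loss(energies, 0, beta), -np.log(p[0]) / beta)
```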
If you can avoid it, never work with probabilities, as they are often intractable
and give less customization over the scoring function.
> [!WARNING]
> There may be reasons why you would prefer having actual probabilities
> rather than scores, for example when you have 2 agents that need to interact
> with each other.
>
> Since scores are calibrated only over the model we are working with, values
> across agents will differ, thus meaning different things.
>
> However, this problem does not exist if the agents are trained end-to-end.
## Contrastive-Methods
These methods have the objective of **<ins>increasing energy for negative
examples and lowering it for positive examples</ins>**.
Basically we measure the distance, usually with a cosine similarity, and that
becomes our energy function. Then we take a loss function that maximizes the
similarity of positive pairs while minimizing that of negative ones.
To make this method work, we need to feed it negative examples as well, otherwise
we would not widen the dissimilarity between positive and negative examples.[^atcold-1]
> [!NOTE]
> There are also non-contrastive methods that only use positive examples,
> eliminating the need to get negative examples.
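
A minimal sketch of a contrastive objective follows. The choices here are assumptions for illustration: the energy is `1 - cosine similarity`, and the loss is a margin (hinge) loss between the positive and the negative pair; the vectors are made up.

```python
import numpy as np


def cosine_energy(a: np.ndarray, b: np.ndarray) -> float:
    """Assumed energy: 1 - cosine similarity (0 when the vectors are aligned)."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def contrastive_loss(anchor, positive, negative, margin: float = 0.5) -> float:
    """Lower the positive pair's energy, raise the negative one up to a margin."""
    e_pos = cosine_energy(anchor, positive)
    e_neg = cosine_energy(anchor, negative)
    return max(0.0, margin + e_pos - e_neg)


anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])    # similar to the anchor
negative = np.array([-1.0, 0.2])   # dissimilar to the anchor

print(contrastive_loss(anchor, positive, negative))   # already well separated
print(contrastive_loss(anchor, negative, positive))   # roles swapped: penalized
```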
## Self Supervised Learning
It's a method of learning to predict parts of the data from other parts of the
same data. Examples are BERT and T5, which are trained on Masked Language Tasks
that involve predicting pieces of missing text by looking at the remaining data.
<!--
MARK: Footnotes
-->
[^stanford-lecun]: [A Tutorial on Energy-Based Learning](https://web.stanford.edu/class/cs379c/archive/2012/suggested_reading_list/documents/LeCunetal06.pdf)
[^yt-week-7]: [Youtube | Week 7 Lecture: Energy based models and self-supervised learning | 22nd November 2025](https://www.youtube.com/watch?v=tVwV14YkbYs&t=1411s)
[^anelli-15]: Vito Walter Anelli | Ch. 15 pg. 17 | 2025-205
[^atcold-1]: [DEEP LEARNING | Week 8 Ch. 8.1 | 22nd November 2025](https://atcold.github.io/NYU-DLSP20/en/week08/08-1/)

File diff suppressed because it is too large Load Diff

Binary file not shown.


View File

@ -0,0 +1,352 @@
# Flow Models
In generative modelling, what we are trying to achieve is a remap of a known
distribution, like a Gaussian, to another distribution, like colors of pixels in
face images.
Let's imagine now that all distributions live in the same space, a bit like cities.
Our objective is to make our citizens move from one city to another, thus we need
to find the right mapping.
## Vector field of a flow
> [!CAUTION]
> Here we talk about speed, position and velocity. This is just an analogy to
> make the learning process smoother.
Let's say that we have a [flow](./../15-Appendix-A/INDEX.md#flow), which in this
context can be seen as the position function of $\vec{x}$ at time $t$. The
associated vector field is simply:
$$
v_{t}(\vec{x}) = \frac{d \,\varphi_t(\vec{x})}{d \, t}\bigg \rvert_{t= 0}
$$
This means that our vector field, read here as velocity, is the derivative of
the flow, read as position, over time and evaluated at time $t = 0$. This means
that we just need to know $x_0$, as the position at $t = 0$ is the initial
position.
In particular, for the flow properties, we have that:
$$
\begin{aligned}
\varphi_t{(\vec{x})} &= \vec{x}(t) \\
\varphi_0{(\vec{x})} &= \vec{x}_0 \\
\frac{d \,\varphi_t(\vec{x}_0)}{d \, t} &= v_t(\varphi_t(\vec{x}_0))
\end{aligned}
$$
The last one holds because $\vec{x}$ changes over time following the flow, so
the vector field must be evaluated at the current position $\varphi_t(\vec{x}_0)$.
![flow and vector field of the flow evolution](./pngs/diffusion-mit.png)
> Image taken from [An Introduction to Flow Matching and Diffusion Models](https://arxiv.org/pdf/2506.02070)
> [!NOTE]
> As is evident from the image, the flow is basically a map of the positions of
> points. Since the position of each point changes over time, a flow warps
> space.
>
> An interesting thing is that it gives us a snapshot of where points are at time $t$.
>
> Instead, the vector field $v$ tells us how these positions will be modified
> in the next pictures, giving us the instantaneous velocity of all points, **allowing
> us to predict where points will move before taking the next snapshot**.
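
The relation between a flow and its vector field can be checked on a toy case. The rotation flow below is a hypothetical example chosen for illustration: it rotates a point around the origin by an angle $t$, and its velocity at a point $(a, b)$ is $(-b, a)$.

```python
import numpy as np


def flow(x0: np.ndarray, t: float) -> np.ndarray:
    """Hypothetical flow: rotate the point x0 by an angle t around the origin."""
    c, s = np.cos(t), np.sin(t)
    return np.array([c * x0[0] - s * x0[1], s * x0[0] + c * x0[1]])


def velocity(x: np.ndarray) -> np.ndarray:
    """Vector field of the rotation flow: perpendicular to the position."""
    return np.array([-x[1], x[0]])


x0 = np.array([1.0, 0.0])
t, eps = 0.7, 1e-6

# Finite-difference estimate of the time derivative of the flow at time t
numeric = (flow(x0, t + eps) - flow(x0, t - eps)) / (2 * eps)
# Vector field evaluated at the current position of the point
analytic = velocity(flow(x0, t))

print(numeric, analytic)
```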
## Mapping distributions via flow
It follows that a velocity function $v$ can take $x$ to a probability $p$ only
if its flow goes there:
$$
v_t \text{ generates } p_t \iff \mathcal{X}_t = \varphi_t(\mathcal{X}_0) \sim p_t
$$
In other words, given a random variable $\mathcal{X}_0$, $v_t$ generates $p_t$
only if the transported variable $\mathcal{X}_t = \varphi_t(\mathcal{X}_0)$ is
distributed according to $p_t$ at every time $t$.
## Normalizing Flows
It is a technique in which we try to find several simple flows that will
create a map of a distribution $q$ to a distribution $p$.
Let's say that we want to learn a map $T_W$ that transports $q$ to $p$. We call
$T_W\#q$ the pushforward of $q$ by $T_W$, and to make it match $p$ we need to
minimize a Kullback-Leibler divergence:
$$
\begin{aligned}
D_{KL}(p || T_W\#q) &= \int_x \log\left(\frac{p(x)}{T_W\#q(x)}\right) p(x) \, dx = \\
&= \underbrace{\int_x \log(p(x))p(x)\,dx}_\text{static, not controllable} -
\underbrace{\int_x \log(T_W\#q(x))p(x)\,dx}_\text{modifiable}
\end{aligned}
$$
Since we can ignore the static part, conditioned by only the true distribution,
we take only the second part, which is equal to the expected value. We
discretize it and we get
$$
\mathbb{E}_{x \sim p}[\log(\,T_W\#q(x)\,)] = \sum_{x} \log(\,T_W\#q(x)\,)p(x)
$$
Guess what, we have all these values, and this is similar to the
[Negative Log Likelihood](./../4-Loss-Functions/INDEX.md#nllloss), so, with a
bit of notation changing, we get:
$$
Loss(X) = - \sum_{i = 1}^{N} \log(T_W\#q(\vec{x}^{(i)}))
$$
The thing is that we can't compute this directly, as there's no ground truth
we can use to guide this. So, let's rewrite it by applying the
[change of variables](./../15-Appendix-A/INDEX.md#change-of-variables-in-probability), which gives:
$$
p(\vec{y}) = q(T_W^{-1}(\vec{y})) \cdot \bigg | \det J_{T_W^{-1}}(\vec{y})\bigg |
$$
Since it's complicated to find $T_W$ directly, because it must be *invertible* and
*differentiable*, this is simpler to achieve by composing several simple flows,
$T_W = \varphi_K \circ \dots \circ \varphi_1$. However these flows lose expressivity
and they become expensive to evaluate as $K$ grows.
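
To see the change of variables at work, here is a minimal 1-D sketch. It assumes, for illustration only, a standard Normal base distribution $q$ and a single affine map $T(z) = scale \cdot z + shift$; the data points are made up. The negative log-likelihood is lower when the map pushes $q$ onto the data.

```python
import math


def base_pdf(z: float) -> float:
    """Base density q, assumed here to be a standard Normal."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def pushforward_pdf(y: float, scale: float, shift: float) -> float:
    """Density of T#q for the hypothetical affine map T(z) = scale * z + shift."""
    z = (y - shift) / scale                # T^{-1}(y)
    return base_pdf(z) * abs(1.0 / scale)  # times |det J| of T^{-1}


def nll(data: list[float], scale: float, shift: float) -> float:
    """Negative log-likelihood of the data under the pushforward density."""
    return -sum(math.log(pushforward_pdf(y, scale, shift)) for y in data)


data = [1.8, 2.1, 2.4, 1.9]
print(nll(data, scale=1.0, shift=0.0))    # identity map: poor fit
print(nll(data, scale=0.25, shift=2.05))  # map centred on the data: better fit
```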
## Continuous Normalizing Flows
To solve the problem of expressivity, a solution is to describe the flow with an
ODE and solve it numerically, for example with Euler's method.
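
A minimal sketch of Euler integration of $\frac{dx}{dt} = v(x, t)$ follows. The velocity `lambda x, t: -x` is a hypothetical stand-in for a learnt vector field, chosen because its exact solution at $t = 1$ is $x_0 e^{-1}$.

```python
import numpy as np


def euler_integrate(velocity, x0, t0: float = 0.0, t1: float = 1.0, n_steps: int = 1000):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with fixed-size Euler steps."""
    x = np.asarray(x0, dtype=float)
    h = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = x + h * velocity(x, t)   # move along the current velocity
        t += h
    return x


x0 = np.array([1.0, 2.0])
x1 = euler_integrate(lambda x, t: -x, x0)   # stand-in velocity field
print(x1, x0 * np.exp(-1.0))
```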
## Flow Matching
The idea is to train a velocity function that will bring us to the right
distribution, and this is our flow matching. By treating it like an energy based
model, we get:
$$
E_t(\mathcal{X}_0, \mathcal{X}_1) =
|| v^W_t(\mathcal{X}_t) - v_t(\mathcal{X}_t) ||^2
$$
Since we need our point to travel how we want it, we influence its velocity.
Thus, by minimizing this error, we get the target trajectory.
However, there's a problem... We don't know $v_t(\mathcal{X}_t)$ and we must
find a method to derive it
### Midpoint Method (aka Modified Euler's method)
The actual algorithm from [Wikipedia](https://en.wikipedia.org/wiki/Midpoint_method) is:
$$
y_{n + 1} = y_{n} + hf\left(y_n + \frac{h}{2}f(y_n, t_n), t_n + \frac{h}{2}\right)
$$
translated to code becomes:
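```python
import numpy as np


# Technically it's a velocity, not a speed
# velocity: vector
# speed: magnitude
def compute_speed(old_position: np.ndarray, current_time: float) -> np.ndarray:
    # This function is what we want to find.
    # Placeholder (not the real function): a constant unit velocity,
    # used only so that the sketch can run.
    return np.ones_like(old_position)


def compute_new_position(
    old_position: np.ndarray,
    old_speed: np.ndarray,
    current_time: float,
    time_step: float
) -> np.ndarray:
    # Half step, used only to reach the midpoint
    step = time_step / 2
    half_point_speed = compute_speed(
        old_position + step * old_speed,
        current_time + step
    )
    # The full step h uses the velocity evaluated at the midpoint
    new_pos = old_position + time_step * half_point_speed
    return new_pos
```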
Obviously we don't know what the speed function is, thus we just need to **learn**
it. Now, inverting the formula we get that (with some abuse of notations) our
target velocity in that point is:
$$
f_{t + \frac{h}{2}} = \frac{y_{n + 1} - y_n}{h}
$$
And since during training time we can compute this, as we have both
$y_n$ and $y_{n+1}$, we just need to set h to an arbitrary value, say $1$ and we
can easily compute this.
Now, we just need to learn the velocity function, and this is just gradient
descent on the difference between our learnt function and the computed one.
Then, if we want to use an energy model, we get:
$$
E_t(\mathcal{X}_0, \mathcal{X}_1) =
|| v^W_t(\mathcal{X}_t) - (\mathcal{X}_1 - \mathcal{X}_0) ||^2
$$
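
The energy above can be sketched as a loss over a batch. The setup is an assumption for illustration: points at time $t$ are taken on the straight line $\mathcal{X}_t = (1 - t)\mathcal{X}_0 + t\mathcal{X}_1$, the data is a made-up constant shift of the source points, and `model` is any callable standing in for $v^W_t$.

```python
import numpy as np


def flow_matching_loss(model, x0: np.ndarray, x1: np.ndarray, t: np.ndarray) -> float:
    """Mean squared error between the model velocity and the target X1 - X0."""
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1   # point on the straight path
    target = x1 - x0                                  # target velocity with h = 1
    pred = model(x_t, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 2))              # samples from the source distribution
shift = np.array([3.0, -1.0])
x1 = x0 + shift                           # made-up target: a shifted copy
t = rng.uniform(size=8)

perfect = lambda x, t: np.broadcast_to(shift, x.shape)   # knows the true velocity
zero = lambda x, t: np.zeros_like(x)                     # never moves

print(flow_matching_loss(perfect, x0, x1, t), flow_matching_loss(zero, x0, x1, t))
```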
> [!NOTE]
> There's another method, applying Markov chains, though it's
> basically flow matching with Euler's standard method where
> $h \rightarrow 0$, which therefore takes multiple steps
> [!CAUTION]
> This method is perfectly equivalent to using a conditional flow matching where
> $z$ is sampled from a Dirac distribution (basically a Normal distribution with
> $\sigma = 0$)
## Conditional Flow Matching
Now, let's say that we want to describe the velocity $v$ in other ways. The problem
is that there isn't a unique path to go from $q$ to $p$.
![infinite probability paths](./pngs/infinite-paths.gif)
> GIF taken from https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/
So the problem is now to describe this probability path over time. However, we don't
know the analytic form of $p$, but just a bunch of data points sampled from it.
The idea to solve this is to define a probability path (and thus a velocity) that is
conditioned on a variable $z$, the conditioning variable, written $p(x | t, z)$.
Once we choose a particular $z$, this conditional path brings $x$
from $q$ to $p$ and has an analytic form.
The trick here is to find the probability $p(x | t)$ by marginalizing
$p(x |t, z)$ over $z$. In practice this means that we just have to sum (or integrate) over
all possible $z$ values.
Since $z$ is a random variable, a couple of common choices are sampling
from a Linear Interpolation or from a conical Gaussian path[^a-visual-dive]:
$$
\text{Linear Interpolation} \\
\begin{cases}
\mu = (1 - t)\cdot x_q + t \cdot x_p \\
\sigma = 0 \\
p(x | t, z = (x_q, x_p)) = \mathcal{N}(\mu, \sigma^2 I) \\
\end{cases}
\\
\text{ }
\\
\text{Velocity} \rightarrow v(x, t) = x_p - x_q
\\
\text{ }
\\
\text{ }
\\
\text{Conical Gaussian Path} \\
\begin{cases}
\mu = t \cdot x_p \\
\sigma = 1 - t \\
p(x | t, z = (x_p)) = \mathcal{N}(\mu, \sigma^2 I) \\
\end{cases}
\\
\text{ }
\\
\text{Velocity} \rightarrow v(x, t) = \frac{x_p - x}{1 - t}
$$
To find the velocity, it's possible to use the [continuity equation](https://en.wikipedia.org/wiki/Continuity_equation).
[^a-visual-dive]: [A Visual Dive into Conditional Flow Matching | 26th November 2025](https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/)
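
The conical Gaussian path can be checked with a small sketch. It assumes a standard deviation of $1 - t$, so that a sample can be written as $x_t = t \cdot x_p + (1 - t)\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$; its time derivative is then $x_p - \epsilon = \frac{x_p - x_t}{1 - t}$. The target point `x_p` is made up.

```python
import numpy as np


def conical_sample(x_p: np.ndarray, eps: np.ndarray, t: float) -> np.ndarray:
    """Sample of N(t * x_p, (1 - t)^2 I), written through its noise eps."""
    return t * x_p + (1.0 - t) * eps


def conical_velocity(x: np.ndarray, x_p: np.ndarray, t: float) -> np.ndarray:
    """Conditional velocity pointing from the current position to x_p."""
    return (x_p - x) / (1.0 - t)


rng = np.random.default_rng(0)
x_p = np.array([2.0, -1.0])    # hypothetical data point
eps = rng.normal(size=2)       # noise drawn once and kept fixed along the path
t, h = 0.3, 1e-6

x_t = conical_sample(x_p, eps, t)
# Finite-difference derivative of the path with respect to time
numeric = (conical_sample(x_p, eps, t + h) - conical_sample(x_p, eps, t - h)) / (2 * h)

print(conical_velocity(x_t, x_p, t), numeric)
```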
## Conditional Optimal Flow
One thing that could happen during the construction of our paths is that they cross.
To solve this problem, we may pair source and target data points through an
Optimal Transport coupling, so that the overall distance travelled is minimal.
This means that their paths won't cross.
In practice we only use minibatch Optimal Transport, as the full problem is costly to compute.
## ReFlow
Technically speaking, a reflow is nothing more than what we already saw in flow
matching[^flow-with]. The difference lies in how we retrain the model and how
we sample points.
We first train a model like in flow matching, and then we train another model
where true labels are not provided. In fact, the target labels provided will come
from our original model.
While we would normally think that our original model learnt crossing trajectories,
this is not the case. Plus, by changing Euler to
4th-order Runge-Kutta, we get a better integration method and thus better sampling.
By combining both techniques, our reflowed model will learn straighter trajectories.
> [!WARNING]
> It's not necessary that the 2 models are different or equal, nor do you need to
> re-use the former model weights.
[^flow-with]: [Flow With What You Know | 26th November 2025](https://drscotthawley.github.io/blog/posts/FlowModels.html#abstract)
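
To give an idea of why the integrator matters, here is a minimal sketch comparing one Euler step with one classic 4th-order Runge-Kutta step. The velocity `f(x, t) = -x` is a stand-in for a trained model, chosen because the exact value at $t = 1$ is $e^{-1}$.

```python
import numpy as np


def euler_step(f, x, t, h):
    """One Euler step: follow the velocity at the start of the interval."""
    return x + h * f(x, t)


def rk4_step(f, x, t, h):
    """One classic 4th-order Runge-Kutta step."""
    k1 = f(x, t)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(step, f, x0, n_steps):
    """Integrate from t = 0 to t = 1 with the given stepping rule."""
    x, t, h = x0, 0.0, 1.0 / n_steps
    for _ in range(n_steps):
        x = step(f, x, t, h)
        t += h
    return x


f = lambda x, t: -x            # stand-in velocity field
exact = np.exp(-1.0)
err_euler = abs(integrate(euler_step, f, 1.0, 10) - exact)
err_rk4 = abs(integrate(rk4_step, f, 1.0, 10) - exact)
print(err_euler, err_rk4)
```

With the same 10 steps, the Runge-Kutta error is several orders of magnitude smaller, at the price of 4 velocity evaluations per step.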
## Guidance for generation
As for diffusion, we now know how to generate the target distribution, but
what if the target distribution is the distribution of valid images and we want
a picture of a dog?
### Classifier Guidance
We could take our unconditioned model and use a classifier, plus another input used
to guide, to change our model parameters (so, our velocity). However this means
that we need a classifier that tells us if our generated output matches the label
or not:
$$
v_{W, t}(x | y) = v(x) + wb_t \nabla \log p_{Y |t}(y | x)
$$
This means that our new velocity will be influenced by the weighted
conditional probability of obtaining $y$ starting from $x$. Now we don't need to
retrain our model from scratch.
Since we are combining 2 different models, their magnitudes do not combine
well, leading to potential problems. Moreover the classifier is not *"perfect"*,
so we need to consider its errors as well.
### Classifier Free Guidance
Instead of using a classifier, we can retrain our model to take the conditioning into account.
Starting from the classifier equation for the velocity, we can demonstrate that:
$$
v_{W, t}(x | y) = (1 - w)v(x) + w \cdot v(x | y)
$$
Now, we can reuse the previous model and retrain it by passing
$y = \emptyset \text{ or } y \in \mathcal{Y}$, with probability $\eta$ of
being empty.
Once retrained, during inference we can sample by using this formula, where
$y = \emptyset$ if it comes from an unguided sampling and $y \in \mathcal{Y}$ if
it's guided.
$$
d X_t = [(1 - w)v_{W, t}(X_t | \emptyset) + wv_{W, t}(X_t | y)] dt \\
w > 1
$$
As you can notice, since $w > 1$, the coefficient $(1 - w)$ is negative: we
amplify the conditioned velocity and subtract a fraction of the unconditioned
one. Moreover, if the prompt is empty ($y = \emptyset$),
this is equal to the unmodified unconditional velocity.
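
The combination rule can be sketched in a few lines. The two velocity vectors are made-up numbers standing in for the outputs of the model with and without the label.

```python
import numpy as np


def guided_velocity(v_uncond: np.ndarray, v_cond: np.ndarray, w: float) -> np.ndarray:
    """Classifier-free guidance: (1 - w) * unconditional + w * conditional."""
    return (1.0 - w) * v_uncond + w * v_cond


v_u = np.array([1.0, 0.0])   # hypothetical model output for y = empty
v_c = np.array([1.0, 1.0])   # hypothetical model output for the chosen label

print(guided_velocity(v_u, v_c, 1.0))   # w = 1: plain conditional velocity
print(guided_velocity(v_u, v_c, 2.0))   # w > 1: pushed further along v_c - v_u
print(guided_velocity(v_u, v_u, 3.0))   # empty prompt: unconditional velocity
```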
## Stable Diffusion 3
## References
- [An Introduction to Flow Matching and Diffusion Models](https://arxiv.org/pdf/2506.02070)
- [Introduction to Flow Matching and Diffusion Models](https://diffusion.csail.mit.edu/)
- [A Visual Dive into Conditional Flow Matching](https://dl.heeere.com/conditional-flow-matching/blog/conditional-flow-matching/)
- [Rectified Flow](https://www.cs.utexas.edu/~lqiang/rectflow/html/intro.html)
- [An introduction to Flow Matching ](https://mlg.eng.cam.ac.uk/blog/2024/01/20/flow-matching.html#coupling)
- [Flow With What You Know](https://drscotthawley.github.io/blog/posts/FlowModels.html#abstract)
- [Diffusion Meets Flow Matching](https://diffusionflow.github.io/)

Binary file not shown.


Binary file not shown.


View File

@ -0,0 +1,176 @@
# Advanced Attention
## KV Caching
The idea behind this is that during generation in autoregressive transformers,
such as GPT, values for $K$ and $V$ are recomputed at every step, wasting computing power.
![autoregression without caching](./pngs/decoder-autoregression-no-cache.gif)
> GIF taken from https://medium.com/@joaolages/kv-caching-explained-276520203249
As we can notice, all tokens from previous steps get recomputed, as well as their
$K$ and $V$ values. So, a solution is to cache all keys and values up to step $n$
and compute only the ones for the new token.
![autoregression with caching](./pngs/decoder-autoregression-cache.gif)
> GIF taken from https://medium.com/@joaolages/kv-caching-explained-276520203249
Moreover, if we discard taking $QK^T$ for previous steps, we can just obtain
the token we are interested in.
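
The equivalence between the two procedures can be verified with a toy single-head attention. Weights and token embeddings are random and all sizes are arbitrary; this is only a sketch of the idea, not the code of any specific library.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
d = 4
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(5, d))   # embeddings of the 5 tokens seen so far

# Without cache: recompute K and V for the whole sequence, keep only the last row
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
full_last = softmax(Q[-1:] @ K.T / np.sqrt(d)) @ V

# With cache: at each step compute K, V and Q only for the newest token
k_cache, v_cache = [], []
for x in tokens:
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    q = x @ W_q                    # query of the newest token only
    out = softmax(q[None, :] @ np.array(k_cache).T / np.sqrt(d)) @ np.array(v_cache)

print(np.allclose(out, full_last))
```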
To compute the size needed to have a $KV$ cache, let's go step by step:
- For each layer, we need to store both $K$ and $V$, which have the same
dimensions (in this context). We need the number of heads, that is the *"number"*
of $K$ matrices, the head dimension, the number of incoming tokens
(the sequence length), and
the number of bytes for `d_type`, usually a `float16`, thus
2 Bytes:
$$
\text{Layer\_Space} = 2 \times \text{n\_heads} \times \text{d\_heads} \times
\text{seq\_len} \times \text{d\_type}
$$
- Now, during training, we pass a minibatch, so we will have a tensor of
dimensions $N \times \text{seq\_length} \times \text{d\_model}$. When they
are processed by $W_K$ and $W_V$, we will have to store $N$ times more
values:
$$
\text{Batch\_Layer\_Space} = \text{Layer\_Space} \times N
$$
- If you have $L$ layers, during training, at the end you'll need space
equivalent to:
$$
\text{Model\_Space} = \text{Batch\_Layer\_Space} \times L
$$
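
The three formulas above translate directly into a small helper. The configuration used in the example (32 layers, 32 heads, head dimension 128, 4096 tokens, `float16`) is made up for illustration and does not refer to a specific model.

```python
def kv_cache_bytes(
    n_layers: int,
    batch_size: int,
    n_heads: int,
    d_head: int,
    seq_len: int,
    dtype_bytes: int = 2,   # float16
) -> int:
    """Total KV cache size in bytes, following the three formulas step by step."""
    layer_space = 2 * n_heads * d_head * seq_len * dtype_bytes   # K and V
    batch_layer_space = layer_space * batch_size
    return batch_layer_space * n_layers


total = kv_cache_bytes(n_layers=32, batch_size=1, n_heads=32, d_head=128, seq_len=4096)
print(total / 2**30, "GiB")
```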
## Multi-Query and Grouped-Query Attention
The idea here is that we don't need different keys and values, but only
different queries. These approaches drastically reduce memory consumption at
a slight cost in accuracy.
In the **multi-query** approach, we have only one $K$ and $V$ with a number of
queries equal to the number of heads. In the **grouped-query** approach, we have
a hyperparameter $G$ that determines how many $K$ and $V$ matrices are
in the layer, while the number of queries remains equal to the number of attention
heads. Then, each group of queries shares one $K_g$ and $V_g$.
![grouped head attention](./pngs/grouped-head-attention.png)
> Image taken from [Multi-Query & Grouped-Query Attention](https://tinkerd.net/blog/machine-learning/multi-query-attention/#multi-query-attention-mqa)
Now, the new layer size becomes:
$$
\text{Layer\_Space} = 2 \times G \times \text{d\_heads} \times
\text{seq\_len} \times \text{d\_type}
$$
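
As a rough comparison, the sketch below applies the formula with different values of $G$. The numbers (32 heads, head dimension 128, 4096 tokens, `float16`) are made up for illustration.

```python
def kv_layer_bytes(n_kv_groups: int, d_head: int, seq_len: int, dtype_bytes: int = 2) -> int:
    """Per-layer KV cache size when only n_kv_groups K/V matrices are stored."""
    return 2 * n_kv_groups * d_head * seq_len * dtype_bytes


mha = kv_layer_bytes(32, 128, 4096)   # multi-head: one K/V per head (G = n_heads)
gqa = kv_layer_bytes(8, 128, 4096)    # grouped-query with G = 8
mqa = kv_layer_bytes(1, 128, 4096)    # multi-query: a single K/V (G = 1)

print(mha // gqa, mha // mqa)
```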
## Multi-Head Latent Attention
The idea is to reduce memory consumption by low-`rank` factoring the $Q$, $K$, and $V$
computation. This means that each matrix will be factorized into 2 matrices so
that:
$$
A = M_l \times M_r
\\
A \in \R^{in \times out}, M_l \in \R^{in \times rank},
M_r \in \R^{rank \times out}
$$
The problem, though, is that this method introduces compression, and the lower
$rank$ is, the more compression artifacts we get.
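
The trade-off between $rank$ and reconstruction quality can be seen with a truncated SVD. Note that this is a swapped-in technique used only for illustration: in the attention layers the two factors are learnt during training rather than obtained from an SVD, and the matrix here is random.

```python
import numpy as np


def low_rank_factors(a: np.ndarray, rank: int):
    """Best rank-`rank` factorization of `a` (in Frobenius norm) via truncated SVD."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    m_l = u[:, :rank] * s[:rank]   # shape: in x rank
    m_r = vt[:rank, :]             # shape: rank x out
    return m_l, m_r


rng = np.random.default_rng(0)
a = rng.normal(size=(8, 6))        # stand-in for a weight matrix

errors = []
for rank in (1, 3, 6):
    m_l, m_r = low_rank_factors(a, rank)
    errors.append(float(np.linalg.norm(a - m_l @ m_r)))

print(errors)   # the reconstruction error shrinks as the rank grows
```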
What we are going to compress are the weight matrices for $Q$, $K$ and $V$:
$$
\begin{aligned}
Q = X \times W_Q^L \times W_Q^R
\end{aligned} \\
X \in \R^{N\times S \times d_{\text{model}}},
W_Q \in \R^{d_{\text{model}} \times (n_\text{head} \cdot d_\text{head})}
\\
W_Q^L\in \R^{d_{\text{model}} \times rank},
W_Q^R\in \R^{rank \times (n_\text{head} \cdot d_\text{head})}
\\
\text{}
\\
W_Q \simeq W_Q^L \times W_Q^R
$$
For simplicity, we didn't write equations for $K$ and $V$ that are basically
the same. However, now we may think that we have just increased the number of
operations from 1 matmul to 2 per each matrix. But the real power lies
when we take a look at the actual computation:
$$
\begin{aligned}
H_i &= \text{softmax}\left(
\frac{
XW_Q^LW_{Q,i}^R \times (XW_{KV}^LW_{K,i}^R)^T
}{
\sqrt{d_\text{head}}
}
\right) \times X W_{KV}^L W_{V,i}^R \\
&= \text{softmax}\left(
\frac{
XW_Q^LW_{Q,i}^R \times W_{K,i}^{R^T} W_{KV}^{L^T} X^T
}{
\sqrt{d_\text{head}}
}
\right) \times X W_{KV}^L W_{V,i}^R \\
&= \text{softmax}\left(
\frac{
C_{Q} \times W_{Q,i}^R W_{K,i}^{R^T} \times C_{KV}^T
}{
\sqrt{d_\text{head}}
}
\right) \times C_{KV} W_{V,i}^R \\
\end{aligned}
$$
As can be seen, $C_Q = XW_Q^L$ and $C_{KV} = XW_{KV}^L$ do not depend on the
head, so they can be computed once and shared across all heads. Then we have
$W_{Q,i}^R \times W_{K,i}^{R^T}$ which, while it depends on the head number,
can be computed ahead of time as it does not depend on the input.
So, for each attention head we need just 3 matmuls plus another 2 happening
at runtime. Moreover, if we want to, we can still apply caching over $C_{KV}$.
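
The reordering of the products can be verified numerically for a single head. All matrices are random and the sizes are arbitrary; the variable names mirror the symbols above but are otherwise assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, rank, d_head = 5, 16, 4, 8

X = rng.normal(size=(seq_len, d_model))
W_q_l = rng.normal(size=(d_model, rank))    # W_Q^L, shared by all heads
W_kv_l = rng.normal(size=(d_model, rank))   # W_KV^L, shared by all heads
W_q_r = rng.normal(size=(rank, d_head))     # W_{Q,i}^R for one head
W_k_r = rng.normal(size=(rank, d_head))     # W_{K,i}^R for one head

# Naive order: rebuild the full Q and K of the head, then multiply them
naive = (X @ W_q_l @ W_q_r) @ (X @ W_kv_l @ W_k_r).T

# Reordered: shared compressed inputs plus a product known ahead of time
C_q, C_kv = X @ W_q_l, X @ W_kv_l           # computed once for every head
absorbed = W_q_r @ W_k_r.T                  # rank x rank, input independent
fast = C_q @ absorbed @ C_kv.T

print(np.allclose(naive, fast))
```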
## Decoupled RoPE for Multi-Latent Head Attention
Since `RoPE` is a positional embedding used **during** attention, this causes
problems if we use a standard Multi-Head Latent Attention. In fact, it
should be applied to both matrices $Q$ and $K$, separately.
Since they come from $C_Q \times W_{Q, i}^R$ and
$W_{K, i}^{R^T} \times C_{KV}^T$, this means that we can't precompute
$W_{Q,i}^R \times W_{K,i}^{R^T}$ anymore.
A solution is to cache head matrices, but not their product, and compute
new pieces that will go through `RoPE` and then be concatenated
to the actual queries and keys:
$$
\begin{aligned}
Q_{R,i} &= RoPE(C_Q \times W_{QR,i}^R) \\
K_{R,i} &= RoPE(X \times W_{KR,i}^L)
\end{aligned}
$$
These matrices will then be concatenated with the reconstruction of $Q$ and $K$.
## References
- [KV Caching Explained: Optimizing Transformer Inference Efficiency](https://huggingface.co/blog/not-lain/kv-caching)
- [Transformers KV Caching Explained](https://medium.com/@joaolages/kv-caching-explained-276520203249)
- [How to calculate size of KV cache](https://www.rohan-paul.com/p/how-to-calculate-size-of-kv-cache)
- [Multi-Query & Grouped-Query Attention](https://tinkerd.net/blog/machine-learning/multi-query-attention/#multi-query-attention-mqa)
- [Understanding Multi-Head Latent Attention](https://planetbanatt.net/articles/mla.html)
- [A Gentle Introduction to Multi-Head Latent Attention (MLA)](https://machinelearningmastery.com/a-gentle-introduction-to-multi-head-latent-attention-mla/)
- [DeepSeek-V3 Explained 1: Multi-head Latent Attention](https://medium.com/data-science/deepseek-v3-explained-1-multi-head-latent-attention-ed6bee2a67c4)

Binary file not shown.


Binary file not shown.


Binary file not shown.
