Added Energy Based Models

This commit is contained in:
Christian Risi 2025-11-27 21:50:25 +01:00
parent 55a1d38b63
commit 5d5ecff103
3 changed files with 3538 additions and 0 deletions

# Energy Based Models
These models take 2 inputs: an observed input $x$ and a candidate output $y$.
They are passed through a function that outputs their compatibility, or
***"goodness"***[^stanford-lecun]. The lower the value, the better the
compatibility.
The objective of this model is finding the most compatible value across all
possible values, i.e. the one with the lowest energy:
$$
\hat{y} = \argmin_{y \in \mathcal{Y}} E(x, y) \\
x \coloneqq \text{ observed input | } \mathcal{Y} \coloneqq
\text{ set of possible labels}
$$
The difficult part here is choosing said $\hat{y}$: the space $\mathcal{Y}$ may
be too large, or even infinite, and multiple $y$ may have the same energy, so
there may be multiple solutions.
Since we want to model our function $E(x, y)$, it needs to change according to
some weights $W$. Each time we change $W$ we also change the output of
$E(x, y)$, so we are technically defining a family of energy functions.
However, not all of them give us what we need; thus, by tuning $W$, we explore
this family to find an $E_W(x, y)$ that gives acceptable results.
During inference, since we will have built an energy function specifically
crafted to be smooth in $y$, we generate a random initial guess $y^*_0$
and then update it through gradient descent
> [!TIP]
> If this thing is unclear, think that you have fixed your function $E_{W}$ and
> notice that $x$ is unchangeable too. During gradient descent, you can update
> the value of $y^*$ by the same algorithm used for $W$.
>
> Now, you are trying to find a minimum for $E_W$, meaning that $y^*$ will be
> the optimal solution when the energy becomes 0 or around it.
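The tip above can be sketched in code. This is a minimal illustration, assuming a simple quadratic energy $E_W(x, y) = \frac{1}{2}\|G_W(x) - y\|^2$ with a hypothetical fixed linear $G_W$; only $y$ is updated:

```python
import numpy as np

np.random.seed(0)

def G_W(x):
    # Hypothetical fixed network: a simple linear map, for illustration only.
    A = np.array([[2.0, 0.0], [0.0, 3.0]])
    return A @ x

def energy(x, y):
    # E_W(x, y) = 1/2 * ||G_W(x) - y||^2
    return 0.5 * np.sum((G_W(x) - y) ** 2)

def infer_y(x, steps=200, lr=0.1):
    y = np.random.randn(2)      # random initial guess y*_0
    for _ in range(steps):
        grad = y - G_W(x)       # dE/dy for this quadratic energy
        y -= lr * grad          # gradient-descent update on y, W and x fixed
    return y

x = np.array([1.0, -1.0])
y_star = infer_y(x)             # converges toward G_W(x), where E is ~0
```

Note that only $y$ moves: $W$ (inside `G_W`) and $x$ stay fixed, exactly as in the tip.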
## Designing a good Loss[^stanford-lecun]
> [!NOTE]
>
> - $y_i$: correct label for $x_i$
> - $y^*_i$: lowest energy label
> - $\bar{y}_i$: lowest energy incorrect label
### Using the energy function
Since we want the energy for $y_i$ to be 0 for $x_i$, any remaining energy is a
loss for us:
$$
L_i(y_i, E_{W}(x_i, y_i)) = E_{W}(x_i, y_i)
$$
However, this function does not increase the energy for other examples,
possibly resulting in the constant function:
$$
E_W(x_i, y_j) = 0 \,\, \forall x_i \in \mathcal{X}, \forall y_j \in \mathcal{Y}
$$
In other words, it gives 0 for all possible labels, resulting in a collapsed plane
where all energies are equal.
### Generalized Perceptron Loss
Another way is to give a lower bound for point $y_i$ while pushing away
everything else:
$$
L_i(y_i, E_{W}(x_i, y_i)) = E_{W}(x_i, y_i) - \min_{y \in \mathcal{Y}}E_{W}(x_i, y)
$$
When the loss becomes 0, both terms are equal, meaning that
$E_W(x_i, y_i)$ has the lowest value. However, this doesn't imply that there is no
other $y_j$ such that $E_W(x_i, y_i) = E_W(x_i, y_j)$, so this method is
still susceptible to flat planes.
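As a concrete sketch, over a discrete label set the Generalized Perceptron Loss reduces to a difference of two entries of an energy array (the array of hypothetical energies below is made up for illustration):

```python
import numpy as np

def perceptron_loss(energies, correct):
    """Generalized perceptron loss over a discrete label set.

    energies: array of E_W(x_i, y) for every candidate y in Y
    correct:  index of the correct label y_i
    """
    return energies[correct] - energies.min()

E = np.array([0.7, 0.2, 0.9])   # hypothetical energies for 3 labels
perceptron_loss(E, correct=1)   # 0.0 -> y_i already has the lowest energy
perceptron_loss(E, correct=0)   # positive -> pushes E(x_i, y_0) down
```

When the loss is 0 the correct label is *a* minimum, but nothing stops other labels from sharing that same minimal energy, which is exactly the flat-plane risk above.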
### Good losses
To avoid the problem of a collapsed plane, we can use several losses:
- [**Hinge Loss**](../4-Loss-Functions/INDEX.md#hingeembeddingloss)
- **Log Loss**
- **MCE Loss**
- **Square-Square Loss**
- **Square-Exponential Loss**
- [**Negative Log-Likelihood Loss**](../4-Loss-Functions/INDEX.md#nllloss)
- **Minimum Empirical Error Loss**
All of these operate to increase the distance between $y_i$ and $\bar{y}_i$ so
that the *"best incorrect answer"* is at least a $margin$ away from the correct one(s).
> [!WARNING]
> Negative Log-Likelihood makes the energy surface way too harsh, like a ravine
> where good values lie at the bottom and bad ones on top[^anelli-15]
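Two of the margin losses listed above can be sketched as follows. This assumes we already have the energy of the correct answer, `e_good` $= E_W(x_i, y_i)$, and of the best incorrect answer, `e_bad` $= E_W(x_i, \bar{y}_i)$; the margin value is an arbitrary choice:

```python
def hinge_loss(e_good, e_bad, margin=1.0):
    # Penalize until the best incorrect answer is at least `margin`
    # above the correct one.
    return max(0.0, margin + e_good - e_bad)

def square_square_loss(e_good, e_bad, margin=1.0):
    # Push E(x_i, y_i) toward 0 and E(x_i, y_bar_i) above the margin.
    return e_good ** 2 + max(0.0, margin - e_bad) ** 2

hinge_loss(0.2, 1.5)          # 0.0 -> already separated by the margin
hinge_loss(1.0, 0.5)          # positive -> pair still too close
square_square_loss(0.2, 0.4)  # penalizes both terms at once
```

Unlike the perceptron loss, both losses keep pushing energies apart until the margin is satisfied, which prevents the collapsed plane.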
## Energy Based Model Architectures
> [!CAUTION]
> Here $x$, $y$ may be scalars or vectors
![energy based model architectures](./pngs/emb-architectures.png)
### Regression
The energy function for a regression is simply:
$$
E_W(x, y) = \frac{1}{2}|| G_W(x) - y ||_2^2
$$
During training, this architecture will modify $W$ so that $G_W(x_i) \sim y_i$
and far from all the other $y_j$.
During inference, $y^*_i$ will be the value most similar to $G_W(x_i)$; for
this quadratic energy, the minimum is simply $y^*_i = G_W(x_i)$
### Implicit Regression
We usually use this architecture when we want more than one possible answer.
The trick is to map each admissible $y$ to the same transformation of $x_i$:
$$
E_W(x, y) = \frac{1}{2}|| G_{W_X}(x) - G_{W_Y}(y) ||_2^2
$$
During training, $G_{W_X}(x_i) \sim G_{W_Y}(y)$ for each admissible
$y \in \mathcal{Y}_i$, while during inference we will choose the value from
$\mathcal{Y}^*_i$ that has the least energy
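A minimal sketch of the two-branch energy, assuming both branches $G_{W_X}$ and $G_{W_Y}$ are simple linear maps (hypothetical weights, chosen only to show that several $y$ can reach zero energy):

```python
import numpy as np

def implicit_regression_energy(x, y, Wx, Wy):
    # E_W(x, y) = 1/2 * || G_{W_X}(x) - G_{W_Y}(y) ||^2
    return 0.5 * np.sum((Wx @ x - Wy @ y) ** 2)

Wx = np.array([[1.0, 1.0]])   # hypothetical branch for x: R^2 -> R
Wy = np.array([[1.0, -1.0]])  # hypothetical branch for y: R^2 -> R
x = np.array([1.0, 1.0])

# Every y with y[0] - y[1] == 2 reaches zero energy: multiple valid answers.
implicit_regression_energy(x, np.array([3.0, 1.0]), Wx, Wy)  # 0.0
implicit_regression_energy(x, np.array([5.0, 3.0]), Wx, Wy)  # 0.0
```

Because the comparison happens in the shared embedding space, a whole set of $y$ values can be equally compatible with one $x$, which is exactly the point of this architecture.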
### Binary Classification
For binary classification problems, we have:
$$
E_W(x, y) = - yG_W(x)
$$
During training it will make $G_W(x_i) \gt 0$ for $y_i = 1$ and vice versa. During
inference, $y_i^*$ will have the same sign as $G_W(x_i)$.
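In code, with $y \in \{-1, +1\}$ and a scalar score $g = G_W(x)$ (the scores below are made up), inference is just picking the sign with the lower energy:

```python
def binary_energy(g, y):
    # E_W(x, y) = -y * G_W(x), with y in {-1, +1}
    return -y * g

def infer(g):
    # y* is whichever label gives the lower energy, i.e. sign(G_W(x))
    return min((-1, +1), key=lambda y: binary_energy(g, y))

infer(0.8)    # +1 -> positive score, positive label
infer(-2.3)   # -1 -> negative score, negative label
```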
### Multi-Class Classification
For multiclass classification we will have:
$$
E_W(x, y) = \sum_{k = 0}^{C} \delta(y - k) \cdot G_W(x)_{[k]} \\
\delta(u) \coloneqq \text{ Kronecker delta - } \begin{cases}
1 \rightarrow u = 0 \\
0 \rightarrow u \neq 0
\end{cases}
This is just a way of saying that $G_W(x_i)$ produces several scores, an
array of values, and the energy equals the score for class $y_i$.
During training, $G_W(x_i)_{[k]} \sim 0$ for $y_i = k$ while all the others
become high. During inference, $y^*_i = k$ where $k$ is the index of the lowest
value of $G_W(x_i)$
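The Kronecker-delta sum is just a one-hot selection of the $y$-th score. A minimal sketch (the score array is made up for illustration):

```python
import numpy as np

def multiclass_energy(scores, y):
    # sum_k delta(y - k) * G_W(x)[k] selects the y-th score
    onehot = np.eye(len(scores))[y]     # Kronecker delta as a one-hot vector
    return float(np.sum(onehot * scores))

scores = np.array([1.2, 0.1, 2.4])      # hypothetical G_W(x_i)
multiclass_energy(scores, 1)            # energy of class 1, i.e. scores[1]
int(np.argmin(scores))                  # inference: index of the lowest score
```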
## Latent-Variable Architectures
We introduce into the system a *"latent"* variable $z$ that helps our model
capture more details. We never observe this value, nor is it generated from the
inputs[^yt-week-7]:
$$
\hat{y}, \hat{z} = \argmin_{y,z} E(x, y, z)
$$
However, if we would rather find it by looking only at $y$, using a
probabilistic approach to marginalize $z$ out, we get:
$$
\hat{y} = \argmin_y \lim_{\beta \rightarrow \infty} - \frac{1}{\beta}
\log \int_{z} e^{-\beta E(x, y, z)}
$$
where the limit $\beta \rightarrow \infty$ turns the inner term into
$\min_z E(x, y, z)$.
An advantage is that if we sweep $z$ over a set $\mathcal{Z}$, we get a set
$\hat{\mathcal{Y}}$ of predictions. However, we need to limit the informativeness
of our variable $z$, otherwise it could predict $y$ perfectly on its own.
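The joint minimization $\hat{y}, \hat{z} = \argmin_{y,z} E(x, y, z)$ can be sketched by brute force over a grid. The energy below is entirely made up: $z$ selects one of two "modes", so sweeping $z$ yields a set of predictions:

```python
import numpy as np

def energy(x, y, z):
    # Hypothetical latent-variable energy: z picks one of two modes of y.
    modes = {0: 2.0 * x, 1: -2.0 * x}
    return (y - modes[z]) ** 2

def infer(x, ys, zs):
    # Joint argmin over (y, z), as in  y^, z^ = argmin_{y,z} E(x, y, z)
    return min(((y, z) for y in ys for z in zs),
               key=lambda p: energy(x, *p))

ys = np.linspace(-3, 3, 61)
y_hat, z_hat = infer(1.0, ys, [0, 1])
# Sweeping z over {0, 1} yields the prediction set {+2, -2}
```

Both modes reach (near-)zero energy here, which illustrates why an unconstrained $z$ is dangerous: a rich enough $z$ could encode $y$ directly.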
## Relation between probabilities and Energy
We can think of energy and probability as the same thing, seen from
2 different points of view:
$$
P(y^* | x) = \frac{
\underbrace{e^{-\beta E(x, y^*)}}_{\text{make energy small}}
}{
\int_{y \in \mathcal{Y}} \underbrace{e^{-\beta E(x, y)}}_{
\text{make energy big}}
}
\\
L(y^*, W) = \underbrace{E_W(x, y^*)}_{\text{make small}} + \frac{1}{\beta}
\log \int_{y} \underbrace{e^{-\beta E_W(x, y)}}_{\text{make energy big}}
$$
> [!NOTE]
>
> - $\text{Gibbs distribution} \rightarrow e^{-\beta E(x, y^*)}$
> - $\beta$: akin to an inverse temperature
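Numerically, turning energies into Gibbs probabilities is a softmax over $-\beta E$. A minimal sketch over a discrete $\mathcal{Y}$ (the energy values are made up):

```python
import numpy as np

def gibbs(energies, beta=1.0):
    # P(y | x) = exp(-beta * E(x, y)) / sum_y exp(-beta * E(x, y))
    w = np.exp(-beta * (energies - energies.min()))  # shift for stability
    return w / w.sum()

E = np.array([0.1, 1.0, 3.0])
gibbs(E, beta=1.0)    # low energy -> high probability
gibbs(E, beta=50.0)   # large beta (low "temperature") -> nearly one-hot
```

The $\beta \rightarrow \infty$ behavior visible here is why minimizing energy and maximizing probability pick the same $y^*$.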
If you can avoid it, never work with probabilities: they are often intractable
and give less control over the scoring function.
> [!WARNING]
> There may be reasons why you would prefer having actual probabilities
> rather than scores, such as when you have 2 agents that need to interact
> with each other.
>
> Since scores are calibrated only over the model we are working with, values
> across agents will differ, thus meaning different things.
>
> However, this problem does not exist if the agents are trained end-to-end.
## Contrastive Methods
These methods have the objective of **<ins>increasing energy for negative
examples and lowering it for positive examples</ins>**.
Basically, we measure a distance, usually with cosine similarity, and that
becomes our energy function. Then we pick a loss function that rewards
similarity for positive pairs and penalizes it for negative ones.
To make this method work, we need to feed it negative examples as well; otherwise
we would not widen the gap between positive and negative examples.[^atcold-1]
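A minimal sketch of a contrastive setup, assuming cosine-based energies and a hinge-style margin (the vectors and the margin value are made up for illustration):

```python
import numpy as np

def cosine_energy(a, b):
    # Energy as negative cosine similarity: similar pairs -> low energy
    return -np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(anchor, positive, negative, margin=0.5):
    # Lower the energy of the positive pair, raise it for the negative pair
    e_pos = cosine_energy(anchor, positive)
    e_neg = cosine_energy(anchor, negative)
    return max(0.0, margin + e_pos - e_neg)

a = np.array([1.0, 0.0])
contrastive_loss(a, np.array([0.9, 0.1]), np.array([-1.0, 0.2]))  # 0.0
```

Without the `negative` argument the loss could be driven to zero by collapsing every embedding onto one point, which is the failure mode the paragraph above warns about.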
> [!NOTE]
> There are also non-contrastive methods that use only positive examples,
> eliminating the need to gather negative examples
## Self Supervised Learning
<!--
MARK: Footnotes
-->
[^stanford-lecun]: [A Tutorial on Energy-Based Learning](https://web.stanford.edu/class/cs379c/archive/2012/suggested_reading_list/documents/LeCunetal06.pdf)
[^yt-week-7]: [Youtube | Week 7 Lecture: Energy based models and self-supervised learning | 22nd November 2025](https://www.youtube.com/watch?v=tVwV14YkbYs&t=1411s)
[^anelli-15]: Vito Walter Anelli | Ch. 15 pg. 17 | 2025-205
[^atcold-1]: [DEEP LEARNING | Week 8 Ch. 8.1 | 22nd November 2025](https://atcold.github.io/NYU-DLSP20/en/week08/08-1/)
