# Energy Based Models

These models take two inputs, one of which is the candidate label $y$. Both are passed through a function that outputs their compatibility, or ***"goodness"***[^stanford-lecun]. The lower the value, the better the compatibility.

The objective of this model is finding the value that is most compatible with $x$ across all possible labels:

$$
\hat{y} = \argmin_{y \in \mathcal{Y}} E(x, y) \\
x \coloneqq \text{ observed input | } \mathcal{Y} \coloneqq \text{ set of possible labels}
$$

The difficult part here is choosing said $\hat{y}$: the space $\mathcal{Y}$ may be too large, or even infinite, and multiple $y$ may have the same energy, giving multiple solutions.

Since we want to model our function $E(x, y)$, it needs to change according to some weights $W$. Each time we change $W$ we also change the output of $E(x, y)$, so we are technically defining a family of energy functions. However, not all of them give us what we need; thus, by tuning $W$, we explore this family to find an $E_W(x, y)$ that gives acceptable results.

During inference, since we will have built a function specifically crafted to be smooth in $y$, we can generate a $y^*_0$ randomly and then update it through gradient descent.

> [!TIP]
> If this is unclear, think that you have fixed your function $E_{W}$ and
> notice that $x$ is unchangeable too. During gradient descent, you can update
> the value of $y^*$ by the same algorithm used for $W$.
>
> Now, you are trying to find a minimum for $E_W$, meaning that $y^*$ will be
> the optimal solution when the energy becomes 0 or close to it.

## Designing a good Loss[^stanford-lecun]

> [!NOTE]
>
> - $y_i$: correct label for $x_i$
> - $y^*_i$: lowest energy label
> - $\bar{y}_i$: lowest energy incorrect label

### Using the energy function

Since we want the energy for $y_i$ to be 0 for $x_i$, everything else is a loss for us:

$$
L_i(y_i, E_{W}(x_i, y_i)) = E_{W}(x_i, y_i)
$$

However, this function does not increase the energy for other examples, possibly resulting in the constant function:

$$
E_W(x_i, y_j) = 0 \quad \forall x_i \in X, \; \forall y_j \in \mathcal{Y}
$$

In other words, it gives 0 for all possible labels, resulting in a collapsed plane where all energies are equal.

### Generalized Perceptron Loss

Another way is to push down the energy of $y_i$ while pulling up the current minimum:

$$
L_i(y_i, E_{W}(x_i, y_i)) = E_{W}(x_i, y_i) - \min_{y \in \mathcal{Y}}E_{W}(x_i, y)
$$

When the loss becomes 0, both terms are equal, meaning that $E_W(x_i, y_i)$ has the lowest value. However, this doesn't imply that there is no other $y_j$ such that $E_W(x_i, y_i) = E_W(x_i, y_j)$, so this method is still susceptible to flat planes.

### Good losses

To avoid the problem of a collapsed plane, we can use several losses:

- [**Hinge Loss**](../4-Loss-Functions/INDEX.md#hingeembeddingloss)
- **Log Loss**
- **MCE Loss**
- **Square-Square Loss**
- **Square-Exponential Loss**
- [**Negative Log-Likelihood Loss**](../4-Loss-Functions/INDEX.md#nllloss)
- **Minimum Empirical Error Loss**

All of these operate by increasing the distance between $y_i$ and $\bar{y}_i$, so that the *"best incorrect answer"* is at least a $margin$ away from the correct one(s).
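Here is a minimal PyTorch sketch of two of these margin losses (hinge and square-square); the tensor `energies` (one entry $E_W(x_i, y)$ per candidate label), the `target` index, and the `margin` value are illustrative assumptions, not code from the cited tutorial:

```python
# Minimal sketch of two margin-based EBM losses. `energies` holds E_W(x_i, y)
# for every candidate label y; `target` is the index of the correct label y_i.
import torch


def ebm_hinge_loss(energies: torch.Tensor, target: int, margin: float = 1.0) -> torch.Tensor:
    """Hinge loss: push E(x, y_i) at least `margin` below E(x, ȳ_i)."""
    e_correct = energies[target]
    # Mask out the correct label to find the best incorrect answer ȳ_i.
    mask = torch.ones_like(energies, dtype=torch.bool)
    mask[target] = False
    e_best_wrong = energies[mask].min()
    return torch.clamp(margin + e_correct - e_best_wrong, min=0.0)


def ebm_square_square_loss(energies: torch.Tensor, target: int, margin: float = 1.0) -> torch.Tensor:
    """Square-square loss: push E(x, y_i) to 0 and E(x, ȳ_i) above `margin`."""
    e_correct = energies[target]
    mask = torch.ones_like(energies, dtype=torch.bool)
    mask[target] = False
    e_best_wrong = energies[mask].min()
    return e_correct.pow(2) + torch.clamp(margin - e_best_wrong, min=0.0).pow(2)


# Example: 4 candidate labels; the correct one already has the lowest energy.
energies = torch.tensor([0.2, 1.5, 0.9, 2.0])
print(ebm_hinge_loss(energies, target=0))          # tensor(0.3000): still inside the margin
print(ebm_square_square_loss(energies, target=0))  # tensor(0.0500)
```

Note that both losses only look at the energy of the correct label and of the most offending incorrect one, which is exactly the $\bar{y}_i$ defined in the note above.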
> [!WARNING]
> Negative Log-Likelihood makes the plane way too harsh, shaping it like a
> ravine where good values sit at the bottom and bad ones stay on top[^anelli-15]

## Energy Based Model Architectures

> [!CAUTION]
> Here $x$, $y$ may be scalars or vectors

![energy based model architectures](./pngs/emb-architectures.png)

### Regression

The energy function for a regression is simply:

$$
E_W(x, y) = \frac{1}{2}|| G_W(x) - y ||_2^2
$$

During training, this architecture will modify $W$ so that $G_W(x_i) \approx y_i$ while staying far from all the other $y_j$. During inference, $y^*_i$ will be the value most similar to $G_W(x_i)$.

### Implicit Regression

We usually use this architecture when we want more than one possible answer. The trick is to map each admissible $y$ to the same transformation of $x_i$:

$$
E_W(x, y) = \frac{1}{2}|| G_{W_X}(x) - G_{W_Y}(y) ||_2^2
$$

During training, $G_{W_X}(x_i) \approx G_{W_Y}(y)$ for each admissible $y \in \mathcal{Y}_i$, while during inference we choose the value of $\mathcal{Y}^*_i$ with the least energy.

### Binary Classification

For binary classification problems, we have:

$$
E_W(x, y) = - yG_W(x)
$$

During training it will make $G_W(x_i) \gt 0$ for $y_i = 1$ and vice versa. During inference, $y_i^*$ will have the same sign as $G_W(x_i)$.

### Multi-Class Classification

For multiclass classification we will have:

$$
E_W(x, y) = \sum_{k = 0}^{C} \delta(y - k) \cdot G_W(x)_{[k]} \\
\delta(u) \coloneqq \text{ Kronecker delta: } \begin{cases} 1 & u = 0 \\ 0 & u \neq 0 \end{cases}
$$

This is just a way to say that $G_W(x_i)$ produces an array of scores, and the energy equals the score for class $y_i$. During training, $G_W(x_i)_{[k]} \approx 0$ for $y_i = k$ while all the others become high. During inference, $y^*_i$ is the index $k$ of the lowest value of $G_W(x_i)$.

## Latent-Variable Architectures

We introduce into the system a *"latent"* variable $z$ that helps our model capture more detail. We never receive this value, nor is it generated from the inputs[^yt-week-7]:

$$
\hat{y}, \hat{z} = \argmin_{y,z} E(x, y, z)
$$

However, if we want to find $\hat{y}$ by looking only at $y$, a probabilistic approach gives:

$$
\hat{y} = \argmin_y \lim_{\beta \rightarrow \infty} - \frac{1}{\beta} \log \int_{z} e^{-\beta E(x, y, z)}
$$

An advantage is that if we operate over a set $\mathcal{Z}$, we get a set $\hat{\mathcal{Y}}$ of predictions. However, we need to limit the informativeness of our variable $z$: otherwise it becomes possible to perfectly predict any $y$, and the energy surface collapses again.

## Relation between probabilities and Energy

We can think of energy and probability as the same thing, seen from two different points of view:

$$
P(y^* | x) =
\frac{
  \underbrace{e^{-\beta E_W(x, y^*)}}_{\text{make energy small}}
}{
  \int_{y \in \mathcal{Y}} \underbrace{e^{-\beta E_W(x, y)}}_{\text{make energy big}}
} \\
L(y^*, W) = \underbrace{E_W(x, y^*)}_{\text{make small}} + \frac{1}{\beta} \log \int_{y} \underbrace{e^{-\beta E_W(x, y)}}_{\text{make energy big}}
$$

> [!NOTE]
>
> - $e^{-\beta E_W(x, y^*)}$ comes from the Gibbs distribution
> - $\beta$: akin to an inverse temperature

If you can avoid it, never work with probabilities: they are often intractable and give less freedom in customizing the scoring function.

> [!WARNING]
> There may be reasons why you would prefer actual probabilities over scores,
> for example when 2 agents need to interact with each other.
>
> Since scores are calibrated only on the model we are working with, values
> across agents will differ, thus meaning different things.
>
> However, this problem does not exist if the agents are trained end-to-end.
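As a minimal sketch of this relation for a discrete label set (so the integral over $y$ becomes a sum); the tensor `energies`, the `beta` value, and the function names are illustrative assumptions:

```python
# Minimal sketch of the energy-to-probability relation above.
import torch


def gibbs_distribution(energies: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """P(y | x) = exp(-beta * E(x, y)) / sum_y' exp(-beta * E(x, y'))."""
    return torch.softmax(-beta * energies, dim=-1)


def nll_loss_from_energies(energies: torch.Tensor, target: int, beta: float = 1.0) -> torch.Tensor:
    """L(y*, W) = E(x, y*) + (1/beta) * log sum_y exp(-beta * E(x, y)).

    This is exactly -(1/beta) * log P(y* | x): minimizing it pushes the
    energy of y* down and the energies of every other label up.
    """
    log_partition = torch.logsumexp(-beta * energies, dim=-1)
    return energies[target] + log_partition / beta


energies = torch.tensor([0.1, 1.5, 3.0])
print(gibbs_distribution(energies))          # lowest energy -> highest probability
print(nll_loss_from_energies(energies, 0))   # the free-energy form of the NLL loss
```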
## Contrastive Methods

These methods have the objective of **increasing the energy of negative examples and lowering it for positive examples**.

Basically, we measure a distance, usually with cosine similarity, and that becomes our energy function; we then take a loss function that maximizes similarity for positive pairs. To make this method work, we need to feed it negative examples as well, otherwise we would not widen the dissimilarity between positive and negative examples[^atcold-1] (see the sketch at the end of this page).

> [!NOTE]
> There are also non-contrastive methods that use only positive examples,
> eliminating the need to collect negative ones

## Self Supervised Learning

[^stanford-lecun]: [A Tutorial on Energy-Based Learning](https://web.stanford.edu/class/cs379c/archive/2012/suggested_reading_list/documents/LeCunetal06.pdf)

[^yt-week-7]: [Youtube | Week 7 – Lecture: Energy based models and self-supervised learning | 22nd November 2025](https://www.youtube.com/watch?v=tVwV14YkbYs&t=1411s)

[^anelli-15]: Vito Walter Anelli | Ch. 15 pg. 17 | 2025-205

[^atcold-1]: [DEEP LEARNING | Week 8 Ch. 8.1 | 22nd November 2025](https://atcold.github.io/NYU-DLSP20/en/week08/08-1/)
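Returning to the contrastive recipe above, here is a minimal PyTorch sketch: cosine distance as the energy, pushed down for the positive pair and pushed up (until a margin) for the negatives. The embedding size, the `margin` value, and the function names are illustrative assumptions:

```python
# Minimal sketch of a contrastive objective with cosine distance as energy.
import torch
import torch.nn.functional as F


def cosine_energy(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Energy as cosine distance: 0 for identical directions, 2 for opposite."""
    return 1.0 - F.cosine_similarity(a, b, dim=-1)


def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     negatives: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """Lower the energy of the positive pair, raise negatives above `margin`."""
    pos_energy = cosine_energy(anchor, positive)                # scalar
    neg_energy = cosine_energy(anchor.unsqueeze(0), negatives)  # one per negative
    return pos_energy + torch.clamp(margin - neg_energy, min=0.0).sum()


# Example with random 16-dimensional embeddings and 4 negative examples.
anchor, positive = torch.randn(16), torch.randn(16)
negatives = torch.randn(4, 16)
print(contrastive_loss(anchor, positive, negatives))
```

Without the `negatives` term the loss could be minimized by collapsing every embedding to the same point, which is the flat-energy-surface problem discussed earlier.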