# Loss Functions

## MSELoss | AKA L2

$$
MSE(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
(\bar{y}_1 - y_1)^2 \\
(\bar{y}_2 - y_2)^2 \\
... \\
(\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T
$$

It can be reduced to a **scalar** by taking either the `sum` of all the values, or the `mean`.

## L1Loss

This measures the **M**ean **A**bsolute **E**rror

$$
L1(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
|\bar{y}_1 - y_1| \\
|\bar{y}_2 - y_2| \\
... \\
|\bar{y}_n - y_n| \\
\end{bmatrix}^T
$$

This is more **robust against outliers** as their value is not **squared**.
However, it is not ***differentiable*** around **small values** (the derivative is undefined at $0$), hence the existence of [SmoothL1Loss](#smoothl1loss--aka-huber-loss)

As with [MSELoss](#mseloss--aka-l2), it can be reduced to a **scalar**

## SmoothL1Loss | AKA Huber Loss

> [!NOTE]
> Called `Elastic Network` when used as an
> **objective function**

$$
SmoothL1Loss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\begin{cases}
\frac{0.5 \cdot (\bar{y}_n - y_n)^2}{\beta} & \text{ if } |\bar{y}_n - y_n| < \beta \\
|\bar{y}_n - y_n| - 0.5 \cdot \beta & \text{ if } |\bar{y}_n - y_n| \geq \beta
\end{cases}
$$

This behaves like [MSELoss](#mseloss--aka-l2) for values **below a threshold** $\beta$ and like [L1Loss](#l1loss) **otherwise**.
It has the **advantage** of being **differentiable** and is **very useful for `computer vision`**

As with [MSELoss](#mseloss--aka-l2), it can be reduced to a **scalar**

## L1 vs L2 For Image Classification

Usually, with the `L2` loss we get a **blurrier** image than with the `L1` loss.
This comes from the fact that `L2` averages all values and does not respect `distances`.
Moreover, since `L1` takes the absolute difference, its gradient is constant over **all values** and **does not decrease towards $0$**

## NLLLoss[^NLLLoss]

> [!CAUTION]
>
> Technically speaking, the `input` data should contain
> `log-probabilities`, e.g. coming from a
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `PyTorch`

This is basically the ***distance*** towards the real ***class tags***, optionally weighted by $\vec{w}$.

$$
NLLLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
$$

Even here there's the possibility to reduce the vector to a **scalar**:

$$
NLLLoss(\vec{\bar{y}}, \vec{y}, mode) =
\begin{cases}
\frac{ \sum^N_{n=1} l_n }{ \sum^N_{n=1} w_{y_n} } & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

In `PyTorch`, you also have the possibility to ***exclude*** some `classes` during training.
Moreover, it's possible to pass `weights`, $\vec{w}$, for the `classes`, **useful when dealing with an unbalanced training set**

> [!TIP]
>
> So, what's $\vec{\bar{y}}$?
>
> It's the `tensor` containing the probability of
> each `point` belonging to each of the `classes`.
>
> For example, let's say we have 10 `points` and 3
> `classes`, then $\vec{\bar{y}}_{p,c}$ is the
> **`probability` of `point` `p` belonging to `class`
> `c`**
>
> This is why we have
> $l_n = - w_{y_n}\cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]

> [!NOTE]
> Rather than using weights to give more importance to certain classes (e.g. a
> higher weight for less frequent classes), there's a better method.
>
> We can use circular buffers to sample an equal amount from all classes and then
> fine-tune at the end by using the actual class frequencies.
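
Here's a minimal `PyTorch` sketch of `NLLLoss` usage; the shapes, the weight values, and the ignored class are made up for illustration:

```python
import torch
import torch.nn as nn

# Made-up example: 10 points, 3 classes, with per-class weights.
log_probs = nn.LogSoftmax(dim=1)(torch.randn(10, 3))  # log-probabilities, as the CAUTION above requires
targets = torch.randint(0, 3, (10,))                  # class tags y_n
weights = torch.tensor([1.0, 2.0, 0.5])               # w, e.g. to boost rarer classes

loss = nn.NLLLoss(weight=weights, reduction="mean")   # "sum" and "none" are also available
print(loss(log_probs, targets))

# ignore_index excludes a class tag from training, as mentioned above
loss_skip = nn.NLLLoss(ignore_index=2)
print(loss_skip(log_probs, targets))
```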
## CrossEntropyLoss[^Anelli-CEL]

Check [here](./../15-Appendix-A/INDEX.md#cross-entropy-loss-derivation) to see its formal derivation

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln\left(
\frac{
e^{\bar{y}_{n, y_n}}
}{
\sum_c e^{\bar{y}_{n, c}}
}
\right)
$$

Even here there's the possibility to reduce the vector to a **scalar**:

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) =
\begin{cases}
\frac{ \sum^N_{n=1} l_n }{ \sum^N_{n=1} w_{y_n} } & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) with a built-in
> [log softmax](./../3-Activation-Functions/INDEX.md#logsoftmax), so you don't need to apply one yourself

## AdaptiveLogSoftmaxWithLoss

This is an ***approximate*** method to train models with ***large `outputs`*** on `GPUs`.
Usually used when we have ***many `classes`*** and ***[imbalances](DEALING-WITH-IMBALANCES.md)*** in our `training set`.

## BCELoss | AKA Binary Cross Entropy Loss

$$
BCELoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \left(
y_n \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1 - \bar{y}_n)}
\right)
$$

This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with just 2 `classes`.
Because of this, we use a trick to represent the loss with a single output $\bar{y}_n$ (the probability of the positive `class`) instead of 2, hence the *longer equation*.

Even here we can reduce with either the `mean` or `sum` modifiers

## KLDivLoss | AKA Kullback-Leibler Divergence Loss

$$
KLDivLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = y_n \cdot \ln{
\frac{
y_n
}{
\bar{y}_n
}
} = y_n \cdot \left(
\ln{(y_n)} - \ln{(\bar{y}_n)}
\right)
$$

This is just the ***[Kullback Leibler Divergence](./../15-Appendix-A/INDEX.md#kullback-leibler-divergence)***.
It is used because we are predicting the ***distribution*** $\vec{y}$ by using $\vec{\bar{y}}$

> [!CAUTION]
> This method assumes you have `probabilities`, but it does not enforce the use of
> [Softmax](./../3-Activation-Functions/INDEX.md#softmax) or [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax),
> leading to ***numerical instabilities***

## BCEWithLogitsLoss

$$
BCEWithLogitsLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \left(\,
y_n \ln{\sigma(\bar{y}_n)} + (1 - y_n) \cdot \ln{(1 - \sigma(\bar{y}_n))}\,
\right)
$$

This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss) with a built-in
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer, to deal with
***numerical instabilities*** and keep the numbers ***constrained to $[0, 1]$***
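
A quick sketch of this relationship, with made-up logits and labels: `BCEWithLogitsLoss` on raw logits gives the same value as `BCELoss` after a `Sigmoid`, but computes it in a more stable way.

```python
import torch
import torch.nn as nn

# Made-up logits (raw model outputs, not probabilities) and binary labels y_n.
logits = torch.tensor([2.5, -1.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])

stable = nn.BCEWithLogitsLoss()(logits, targets)        # works directly on logits
manual = nn.BCELoss()(torch.sigmoid(logits), targets)   # same value, but less stable for large |logits|
print(stable, manual)                                   # the two values match up to floating point error
```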
## HingeEmbeddingLoss

$$
HingeEmbeddingLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\begin{cases}
\bar{y}_n & y_n = 1 \\
max(0, margin - \bar{y}_n) & y_n = -1
\end{cases}
$$

In order to understand this type of `Loss`, let's reason as an actual `model`, whose ***objective*** is to reduce the `Loss`.

By observing the `Loss`, we see that if we ***predict high values*** for `positive classes` we get a ***high `loss`***.
At the same time, if we ***predict low values*** for `negative classes`, we get a ***high `loss`***.

Now, what we'll do is:

- ***predict low `outputs` for `positive classes`***
- ***predict high `outputs` for `negative classes`***

This makes these 2 `classes` ***more distant*** from each other, and makes the `points` of each `class` ***closer*** to each other

## MarginRankingLoss

$$
MarginRankingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = max\left(
0,\, -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n} ) + margin \,
\right)
$$

Here we have 2 sets of predictions over the items.
The objective is to rank positive items with high values, and vice-versa.

As before, our goal is to ***minimize*** the `loss`, thus we want $-y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n})$ to be ***negative*** and ***larger in magnitude*** than $margin$:

- $y_n = 1 \rightarrow \bar{y}_{1,n} \gg \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} \gg \bar{y}_{1,n}$

By having a margin, we ensure that the model doesn't cheat by making $\bar{y}_{1,n} = \bar{y}_{2,n}$

> [!TIP]
> If we are trying to use this for `classification`,
> we can ***cheat*** a bit to make the `model` more ***robust***
> by having ***all correct predictions*** on $\vec{\bar{y}}_1$
> and, on $\vec{\bar{y}}_2$, only the
> ***highest wrong prediction*** repeated $n$ times.

## TripletMarginLoss[^tripletmarginloss]

$$
TripletMarginLoss(\vec{a}, \vec{p}, \vec{n}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = max\left(
0,\, d(a_n, p_n) - d(a_n, n_n) + margin \,
\right)
$$

Here we have:

- $\vec{a}$: the ***anchor point*** that represents a `class`
- $\vec{p}$: a ***positive example***, i.e. a `point` in the same `class` as $\vec{a}$
- $\vec{n}$: a ***negative example***, i.e. a `point` in another `class` with respect to $\vec{a}$

Optimizing here means ***bringing similar points near to each other*** and ***pushing dissimilar points further from each other***, with the ***latter*** being the most important thing to do (as it's the only *negative* term in the equation)

> [!NOTE]
> This is how reverse image search used to work in Google

## SoftMarginLoss

$$
SoftMarginLoss(\vec{\bar{y}}, \vec{y}) =
\sum_n
\frac{
\ln \left(
1 + e^{-y_n \cdot \bar{y}_n}
\right)
}{
N
}
$$

This `loss` gives only ***positive*** results, thus the optimization consists in reducing $e^{-y_n \cdot \bar{y}_n}$.
Since $\vec{y}$ has only $1$ or $-1$ as values, our strategy is to make:

- $y_n = 1 \rightarrow \bar{y}_n \gg 0$
- $y_n = -1 \rightarrow \bar{y}_n \ll 0$

## MultiClassHingeLoss | AKA MultiLabelMarginLoss

$$
MultiClassHingeLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \sum_{i,j}
\frac{
max(0, 1 - (\bar{y}_{n,y_{n,j}} - \bar{y}_{n,i}) \,)
}{
\text{num\_of\_classes}
}
\; \;\forall i,j \text{ with } i \neq y_{n,j}
$$

Essentially, it works like the [HingeLoss](#hingeembeddingloss), but with ***multiple target classes***, as in [CrossEntropyLoss](#crossentropyloss)
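
A small `PyTorch` sketch of `MultiLabelMarginLoss`; the scores and targets are made up, and note the `-1` padding convention for the targets:

```python
import torch
import torch.nn as nn

# Made-up example: 1 sample with 4 class scores, belonging to classes 3 and 0.
scores = torch.tensor([[0.1, 0.2, 0.4, 0.8]])
# Targets are class indices padded with -1; only the entries before the first -1 count.
targets = torch.tensor([[3, 0, -1, -1]])

loss = nn.MultiLabelMarginLoss()
print(loss(scores, targets))  # averages the hinge terms over the 4 classes
```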
## CosineEmbeddingLoss

$$
CosineEmbeddingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\begin{cases}
1 - \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} & y_n = 1 \\
max(0, \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} - margin) & y_n = -1
\end{cases}
$$

With this loss we do two things:

- bring the angle between $\bar{y}_{n,1}$ and $\bar{y}_{n,2}$ to $0$ when $y_n = 1$
- bring the angle between $\bar{y}_{n,1}$ and $\bar{y}_{n,2}$ to $\frac{\pi}{2}$ when $y_n = -1$, making them ***orthogonal***

[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)
[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11
[^tripletmarginloss]: [Official Paper](https://bmva-archive.org.uk/bmvc/2016/papers/paper119/paper119.pdf)
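
As a closing illustration, here's a minimal `CosineEmbeddingLoss` sketch; the embeddings, their size, and the labels are made up:

```python
import torch
import torch.nn as nn

# Made-up example: 3 pairs of 64-dimensional embeddings,
# labelled 1 (same class: align them) or -1 (different class: make them orthogonal).
emb_1 = torch.randn(3, 64)
emb_2 = torch.randn(3, 64)
labels = torch.tensor([1, -1, 1])

loss = nn.CosineEmbeddingLoss(margin=0.0)  # a margin > 0 tolerates a slightly positive cosine when y_n = -1
print(loss(emb_1, emb_2, labels))
```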