From 47eac8ff47b8f744d57dc6a9026a577080008204 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Tue, 15 Apr 2025 17:21:47 +0200
Subject: [PATCH] Added other Loss Functions

---
 Chapters/4-Loss-Functions/INDEX.md | 244 ++++++++++++++++++++++++++++-
 1 file changed, 242 insertions(+), 2 deletions(-)

diff --git a/Chapters/4-Loss-Functions/INDEX.md b/Chapters/4-Loss-Functions/INDEX.md
index 0693a2d..bd62e29 100644
--- a/Chapters/4-Loss-Functions/INDEX.md
+++ b/Chapters/4-Loss-Functions/INDEX.md
@@ -187,24 +187,264 @@ $$

## AdaptiveLogSoftmaxWithLoss

This is an ***approximate*** method to train models with ***large output spaces*** on `GPUs`.
It is usually used when we have ***many `classes`*** and ***[imbalances](DEALING-WITH-IMBALANCES.md)*** in
our `training set`.

## BCELoss | AKA Binary Cross Entropy Loss

$$
BCELoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \left(
    y_n \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1 - \bar{y}_n)}
\right)
$$

This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with just 2 `classes`.

Here too, we can reduce with either the `mean` or `sum` modifier.

## KLDivLoss | AKA Kullback-Leibler Divergence Loss

$$
KLDivLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = y_n \cdot \ln{
    \frac{
        y_n
    }{
        \bar{y}_n
    }
} = y_n \cdot \left(
    \ln{(y_n)} - \ln{(\bar{y}_n)}
\right)
$$

This is just the ***[Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence)***.

It is used because we are predicting the ***distribution*** $\vec{y}$ by means of $\vec{\bar{y}}$.

> [!CAUTION]
> This method assumes you have `probabilities`, but it does not enforce the use of
> [Softmax](./../3-Activation-Functions/INDEX.md#softmax) or [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax),
> which can lead to ***numerical instabilities***

## BCEWithLogitsLoss

$$
BCEWithLogitsLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \left(\,
    y_n \ln{\sigma(\bar{y}_n)} + (1 - y_n)
    \cdot
    \ln{(1 - \sigma(\bar{y}_n))}\,
\right)
$$

This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
with a built-in
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer to deal with
***numerical instabilities***.

## HingeEmbeddingLoss

$$
HingeEmbeddingLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    \bar{y}_n & y_n = 1 \\
    max(0, margin - \bar{y}_n) & y_n = -1
\end{cases}
$$

To understand this type of `Loss`, let's reason as an actual `model` would:
our ***objective*** is to reduce the `Loss`.

Observing the `Loss`, we see that if we
***predict high values*** for `positive classes` we get a
***high `loss`***.

At the same time, if we ***predict low values***
for `negative classes`, we also get a ***high `loss`***.

So, what we'll do is:

- ***predict low `outputs` for `positive classes`***
- ***predict high `outputs` for `negative classes`***

This makes the 2 `classes` ***more distant*** from each other,
and makes the `points` of each `class` ***closer*** together
(in practice, $\bar{y}_n$ is typically a ***distance*** between a pair of `points`,
with $y_n = 1$ meaning "the pair is similar").
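As a quick sanity check of the points above, here is a minimal `PyTorch` sketch
(the `torch.nn` class names match the section titles; the tensor shapes, margins and
random data are illustrative assumptions, not something stated in the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# BCELoss vs BCEWithLogitsLoss: the second fuses the Sigmoid for stability
logits  = torch.randn(4)                    # raw model outputs
targets = torch.tensor([1., 0., 1., 0.])    # binary labels

loss_bce        = nn.BCELoss()(torch.sigmoid(logits), targets)  # Sigmoid applied by hand
loss_bce_logits = nn.BCEWithLogitsLoss()(logits, targets)       # Sigmoid fused in
print(loss_bce.item(), loss_bce_logits.item())   # same value, up to float error

# KLDivLoss: pass log-probabilities as input to avoid the instabilities above
scores      = torch.randn(4, 5)                      # unnormalised class scores
target_dist = F.softmax(torch.randn(4, 5), dim=-1)   # a proper distribution
loss_kl = nn.KLDivLoss(reduction="batchmean")(F.log_softmax(scores, dim=-1), target_dist)
print(loss_kl.item())

# HingeEmbeddingLoss: inputs are distances, labels are +1 (similar) / -1 (dissimilar)
distances   = torch.tensor([0.1, 2.0, 0.3, 1.5])
pair_labels = torch.tensor([1., -1., 1., -1.])
print(nn.HingeEmbeddingLoss(margin=1.0)(distances, pair_labels).item())
```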
## MarginRankingLoss

$$
MarginRankingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = max\left(
    0,\, -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n}) + margin \,
\right)
$$

$\vec{\bar{y}}_1$ and $\vec{\bar{y}}_2$ represent ***vectors***
of predictions of a `point` being `class 1` or `class 2`
(both ***positive***),
while $\vec{y}$ is the ***vector*** of `labels`.

As before, our goal is to ***minimize*** the `loss`, so we
want $y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n})$ to be ***larger*** than
$margin$:

- $y_n = 1 \rightarrow \bar{y}_{1,n} \gg \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} \gg \bar{y}_{1,n}$

> [!TIP]
> Say we are using this for `classification`:
> we can ***cheat*** a bit to make the `model` more ***robust***
> by putting ***all the correct predictions*** in $\vec{\bar{y}}_1$
> and, in $\vec{\bar{y}}_2$, only the
> ***highest wrong prediction*** repeated $n$ times.

## TripletMarginLoss[^tripletmarginloss]

$$
TripletMarginLoss(\vec{a}, \vec{p}, \vec{n}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = max\left(
    0,\,
    d(a_n, p_n) - d(a_n, n_n) + margin \,
\right)
$$

Here we have:

- $\vec{a}$: the ***anchor point*** that represents a `class`
- $\vec{p}$: a ***positive example***, i.e. a `point` in the same
  `class` as $\vec{a}$
- $\vec{n}$: a ***negative example***, i.e. a `point` in a different
  `class` from $\vec{a}$

Optimizing here means
***keeping similar points close to each other*** and
***pushing dissimilar points further apart***, with the
***latter*** being the most important part (it is
the only *negative* term in the equation).

## SoftMarginLoss

$$
SoftMarginLoss(\vec{\bar{y}}, \vec{y}) =
\sum_i \frac{
    \ln \left( 1 + e^{-y_i \cdot \bar{y}_i} \right)
}{
    N
}
$$

This `loss` gives only ***positive*** results, so the
optimization consists in reducing $e^{-y_i \cdot \bar{y}_i}$.

Since $\vec{y}$ only takes the values $1$ or $-1$, our strategy
is to make:

- $y_i = 1 \rightarrow \bar{y}_i \gg 0$
- $y_i = -1 \rightarrow \bar{y}_i \ll 0$

## MultiClassHingeLoss | AKA MultiLabelMarginLoss

$$
MultiClassHingeLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n =
    \sum_{i,j} \frac{
        max(0, 1 - (\bar{y}_{n,y_{n,j}} - \bar{y}_{n,i}) \,)
    }{
        \text{num\_of\_classes}
    }
    \quad \forall \, i,j \text{ with } i \neq y_{n,j}
$$

Essentially, it works like the [HingeEmbeddingLoss](#hingeembeddingloss), but
with ***multiple classes***, as in
[CrossEntropyLoss](#crossentropyloss).

## CosineEmbeddingLoss

$$
CosineEmbeddingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    1 - \cos{(\bar{y}_{1,n}, \bar{y}_{2,n})} & y_n = 1 \\
    max(0, \cos{(\bar{y}_{1,n}, \bar{y}_{2,n})} - margin)
    & y_n = -1
\end{cases}
$$

With this loss we do two things:

- bring the angle between $\bar{y}_{1,n}$
  and $\bar{y}_{2,n}$ towards $0$ when $y_n = 1$
- push the angle between $\bar{y}_{1,n}$
  and $\bar{y}_{2,n}$ towards $\pi$ when $y_n = -1$; with $margin = 0$
  the loss is already zero at $\frac{\pi}{2}$, i.e. as soon as the two
  vectors are ***orthogonal***

[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)

[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11

[^tripletmarginloss]: [Official Paper](https://bmva-archive.org.uk/bmvc/2016/papers/paper119/paper119.pdf)
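To wrap up, here is a minimal `PyTorch` sketch of the ranking and embedding losses
described above (`MarginRankingLoss`, `TripletMarginLoss`, `CosineEmbeddingLoss`);
the shapes, margins and random inputs are illustrative assumptions only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# MarginRankingLoss: y = 1 means "the first score should rank above the second"
scores_1 = torch.randn(8)
scores_2 = torch.randn(8)
y = torch.randint(0, 2, (8,)).float() * 2 - 1      # random labels in {-1, +1}
print(nn.MarginRankingLoss(margin=0.5)(scores_1, scores_2, y).item())

# TripletMarginLoss: pull the anchor towards the positive, push it from the negative
anchor   = torch.randn(8, 16)
positive = torch.randn(8, 16)
negative = torch.randn(8, 16)
print(nn.TripletMarginLoss(margin=1.0, p=2)(anchor, positive, negative).item())

# CosineEmbeddingLoss: angle towards 0 for y = 1, past the margin for y = -1
emb_1 = torch.randn(8, 16)
emb_2 = torch.randn(8, 16)
print(nn.CosineEmbeddingLoss(margin=0.0)(emb_1, emb_2, y).item())
```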