From 2a96deaebf633a11a5bad8a663565965a49f98f0 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Mon, 17 Nov 2025 17:04:46 +0100
Subject: [PATCH] Revised chapter 4

---
 Chapters/4-Loss-Functions/INDEX.md | 61 ++++++++++++++++++------------
 1 file changed, 37 insertions(+), 24 deletions(-)

diff --git a/Chapters/4-Loss-Functions/INDEX.md b/Chapters/4-Loss-Functions/INDEX.md
index bd62e29..6ad837b 100644
--- a/Chapters/4-Loss-Functions/INDEX.md
+++ b/Chapters/4-Loss-Functions/INDEX.md
@@ -85,8 +85,15 @@ decrease towards $0$**
 
 ## NLLLoss[^NLLLoss]
 
+> [!CAUTION]
+>
+> Technically speaking, the `input` data should be
+> log-probabilities, e.g. coming from a
+> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
+> However, this is not enforced by `Pytorch`
+
 This is basically the ***distance*** towards
-real ***class tags***.
+real ***class tags***, optionally weighted by $\vec{w}$.
 
 $$
 NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
@@ -96,7 +103,7 @@
     l_n \\
 \end{bmatrix}^T;\\
 
-l_n = - w_n \cdot \bar{y}_{n, y_n}
+l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
 $$
 
 Even here there's the possibility to reduce the vector
@@ -116,7 +123,7 @@ $$
 Technically speaking, in `Pytorch` you have the
 possibility to ***exclude*** some `classes` during
 training. Moreover it's possible to pass
-`weights` for `classes`, **useful when dealing
+`weights`, $\vec{w}$, for `classes`, **useful when dealing
 with unbalanced training set**
 
 > [!TIP]
@@ -132,23 +139,24 @@ with unbalanced training set**
 > `c`**
 >
 > This is why we have
-> $l_n = - w_n \cdot \bar{y}_{n, y_n}$.
+> $l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}$.
 > In fact, we take the error over the
 > **actual `class tag` of that `point`**.
 >
 > To get a clear idea, check this website[^NLLLoss]
 
-
-
-> [!WARNING]
+> [!NOTE]
+> While weights can be used to give more importance to certain classes (e.g. a
+> higher weight to the less frequent ones), there's a better method.
 >
-> Technically speaking the `input` data should come
-> from a `LogLikelihood` like
-> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
-> However this is not enforced by `Pytorch`
+> We can use circular buffers to sample an equal amount from all classes and then
+> fine-tune at the end using the actual class frequencies.
 
 ## CrossEntropyLoss[^Anelli-CEL]
 
+Check [here](./../15-Appendix-A/INDEX.md#cross-entropy-loss-derivation) to
+see its formal derivation
+
 $$
 CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
     l_1 \\
@@ -182,8 +190,8 @@ $$
 
 > [!NOTE]
 >
-> This is basically a **good version** of
-> [NLLLoss](#nllloss)
+> This is basically [NLLLoss](#nllloss) without needing an explicit
+> [log softmax](./../3-Activation-Functions/INDEX.md#logsoftmax), as it is applied internally
 
 ## AdaptiveLogSoftmaxWithLoss
 
@@ -208,7 +216,8 @@ l_n = - w_n \cdot \left(
 \right)
 $$
 
-This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with just 2 `classes`
+This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with just 2 `classes`. Because of this we can use a single variable
+instead of 2 to represent the prediction ($\bar{y}_n$ and $1 - \bar{y}_n$), hence the *longer equation*.
 
 Even here we can reduce with either `mean` or `sum` modifiers
 
@@ -233,7 +242,7 @@ l_n = y_n \cdot \ln{
 \right)
 $$
 
-This is just the ***[Kullback Leibler Loss](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence)***.
+This is just the ***[Kullback Leibler Divergence](./../15-Appendix-A/INDEX.md#kullback-leibler-divergence)***.
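+
+As a minimal, illustrative sketch (the tensors below are made up; the only real
+requirement of `Pytorch`'s `KLDivLoss` / `kl_div` is that `input` holds
+**log**-probabilities while `target` holds plain probabilities):
+
+```python
+import torch
+import torch.nn.functional as F
+
+# Predicted distribution: kl_div expects log-probabilities as `input`,
+# e.g. a log_softmax over the model's raw scores
+logits = torch.randn(4, 10)                    # batch of 4, 10 classes
+log_pred = F.log_softmax(logits, dim=1)
+
+# Target distribution: plain probabilities that sum to 1 per row
+target = F.softmax(torch.randn(4, 10), dim=1)
+
+# 'batchmean' divides by the batch size, which matches the mathematical
+# definition of the KL divergence better than the default 'mean'
+loss = F.kl_div(log_pred, target, reduction="batchmean")
+print(loss)
+```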
 
 This is used because we are predicting the ***distribution***
 $\vec{y}$ by using $\vec{\bar{y}}$
@@ -264,7 +273,7 @@ $$
 This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
 with a [Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic)
 layer to deal with
-***numerical instabilities***
+***numerical instabilities and to constrain predictions to $[0, 1]$***
 
 ## HingeEmbeddingLoss
 
@@ -318,10 +327,8 @@ l_n = max\left(
 \right)
 $$
 
-$\vec{\bar{y}}_1$ and $\vec{\bar{y}}_2$ represent ***vectors***
-of predictions of that `point` being `class 1` or `class 2`
-(both ***positive***),
-while $\vec{y}$ is the ***vector*** of `labels`.
+Here we have 2 predictions of items. The objective is to rank positive items
+with high values and negative items with low values.
 
 As before, our goal is to ***minimize*** the `loss`,
 thus we always want ***negative values*** that are ***larger*** than
 $margin$:
 
 - $y_n = 1 \rightarrow \bar{y}_{1,n} >> \bar{y}_{2,n}$
 - $y_n = -1 \rightarrow \bar{y}_{2,n} >> \bar{y}_{1,n}$
 
+By having a margin, we ensure that the model doesn't cheat by simply making
+$\bar{y}_{1,n} = \bar{y}_{2,n}$
 
 > [!TIP]
 > Let's say we are trying to use this for `classification`,
 > we can ***cheat*** a bit to make the `model` more ***robust***
@@ -368,6 +378,9 @@ Optimizing here means
 ***latter*** being the most important thing to do
 (as it's the only *negative* term in the equation)
 
+> [!NOTE]
+> This is how Google's reverse image search used to work
+
 ## SoftMarginLoss
 
 $$
@@ -412,7 +425,7 @@ l_n =
 $$
 
 Essentially, it works as the [HingeLoss](#hingeembeddingloss), but
-with ***multiple classes*** as
+with ***multiple target classes*** as
 [CrossEntropyLoss](#crossentropyloss)
 
 ## CosineEmbeddingLoss
 
@@ -438,11 +451,11 @@
 With this loss we make these things:
 
 - bring the angle to $0$ between $\bar{y}_{n,1}$
   and $\bar{y}_{n,2}$ when $y_n = 1$
-- bring the angle to $\pi$ between $\bar{y}_{n,1}$
-  and $\bar{y}_{n,2}$ when $y_n = -1$, or $\frac{\pi}{2}$ if
-  only positive values of $\cos$ are allowed, making them
+- bring the angle to at least $\frac{\pi}{2}$ between $\bar{y}_{n,1}$
+  and $\bar{y}_{n,2}$ when $y_n = -1$, making them
   ***orthogonal***
+
 
 [^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)
 [^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11
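+
+To make the two $y_n$ cases of `CosineEmbeddingLoss` concrete, here is a
+minimal, illustrative sketch (the vectors and `margin=0.0` below are arbitrary
+choices, not something fixed by this chapter):
+
+```python
+import torch
+import torch.nn as nn
+
+# Two batches of (hypothetical) embeddings to compare, row by row
+y1 = torch.tensor([[1.0, 0.0, 0.0],
+                   [1.0, 0.0, 0.0]])
+y2 = torch.tensor([[0.9, 0.1, 0.0],   # almost parallel to the first row of y1
+                   [0.0, 1.0, 0.0]])  # orthogonal to the second row of y1
+
+# target = 1  -> pull the pair together (cosine pushed towards 1)
+# target = -1 -> push the pair apart (cosine pushed below the margin)
+target = torch.tensor([1.0, -1.0])
+
+loss_fn = nn.CosineEmbeddingLoss(margin=0.0)
+print(loss_fn(y1, y2, target))  # small value: both pairs already behave as asked
+```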