## NLLLoss[^NLLLoss]

> [!CAUTION]
>
> Technically speaking, the `input` data should come
> from a `LogLikelihood` like
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `Pytorch`

This is basically the ***distance*** from the
real ***class tags***, optionally weighted by $\vec{w}$.

$$
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
\vdots \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
$$

Even here there's the possibility to reduce the vector
with either `mean` or `sum` modifiers

Technically speaking, in `Pytorch` you have the
possibility to ***exclude*** some `classes` during
training. Moreover, it's possible to pass
`weights`, $\vec{w}$, for `classes`, **useful when dealing
with an unbalanced training set**
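
A minimal sketch of how these options look in `Pytorch` (the class `weights`,
the batch and the ignored `class tag` below are made-up values):

```python
import torch
import torch.nn as nn

# 3 classes; the rare class 2 gets a larger (made-up) weight
weights = torch.tensor([1.0, 1.0, 3.0])

# `weight` rescales each class, `ignore_index` excludes a class tag from the loss
loss_fn = nn.NLLLoss(weight=weights, ignore_index=-100)

log_probs = nn.LogSoftmax(dim=1)(torch.randn(4, 3))  # input must be log-probabilities
targets = torch.tensor([0, 2, 1, -100])              # the last point is ignored

print(loss_fn(log_probs, targets))
```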

> [!TIP]
>
> **Each `point` `n` has an actual `class tag`
> `c`**
>
> This is why we have
> $l_n = - w_{y_n}\cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]

<!-- Comment to suppress linter -->

> [!NOTE]
> While you can use weights to give more importance to certain classes, or to
> give a higher weight to less frequent classes, there's a better method.
>
> We can use circular buffers to sample an equal amount from all classes and then
> fine-tune at the end by using the actual class frequencies.
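
One possible reading of that idea, as a rough sketch only (the buffer size and
the helper functions below are invented for illustration):

```python
import random
from collections import defaultdict, deque

# One circular buffer (deque) per class; old points are overwritten once full
buffers = defaultdict(lambda: deque(maxlen=1024))  # maxlen is an arbitrary choice

def add_point(x, class_tag):
    buffers[class_tag].append(x)

def balanced_batch(per_class=8):
    # Sampling the same number of points from every buffer gives a class-balanced batch
    batch = []
    for tag, buf in buffers.items():
        k = min(per_class, len(buf))
        batch += [(x, tag) for x in random.sample(list(buf), k)]
    random.shuffle(batch)
    return batch
```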

## CrossEntropyLoss[^Anelli-CEL]

Check [here](./../15-Appendix-A/INDEX.md#cross-entropy-loss-derivation) to
see its formal derivation

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
\vdots \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln{\left(
    \frac{e^{\bar{y}_{n, y_n}}}{\sum_{c} e^{\bar{y}_{n, c}}}
\right)}
$$

> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) without needing a
> [log softmax](./../3-Activation-Functions/INDEX.md#logsoftmax)
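
A quick sanity check of this equivalence (random tensors, purely illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)           # raw scores, no softmax applied
targets = torch.tensor([0, 2, 1, 1])

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

assert torch.allclose(ce, nll)  # CrossEntropyLoss == LogSoftmax + NLLLoss
```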

## AdaptiveLogSoftmaxWithLoss

## BCELoss | aka Binary Cross Entropy Loss

$$
l_n = - w_n \cdot \left(
    y_n \cdot \ln{\bar{y}_n} + \left(1 - y_n\right) \cdot \ln{\left(1 - \bar{y}_n\right)}
\right)
$$

This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with
just 2 `classes`. Because of this we employ a trick to use a single variable
instead of 2 to represent the loss, thus the *longer equation*.
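
To see the single-variable trick numerically (random values, illustrative only):
when $y_n = 1$ only the $\ln{\bar{y}_n}$ term survives, when $y_n = 0$ only the
$\ln{\left(1 - \bar{y}_n\right)}$ term does.

```python
import torch
import torch.nn as nn

y_hat = torch.rand(5).clamp(1e-6, 1 - 1e-6)  # predicted probabilities
y = torch.randint(0, 2, (5,)).float()        # binary class tags in {0, 1}

# the two terms select the right log-probability depending on the class tag
manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()
assert torch.allclose(nn.BCELoss()(y_hat, y), manual)
```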

Even here we can reduce with either `mean` or `sum` modifiers

## KLDivLoss

$$
l_n = y_n \cdot \ln{\left(
    \frac{y_n}{\bar{y}_n}
\right)}
$$

This is just the ***[Kullback Leibler Divergence](./../15-Appendix-A/INDEX.md#kullback-leibler-divergence)***.

This is used because we are predicting the ***distribution*** $\vec{y}$ by using $\vec{\bar{y}}$

## BCEWithLogitsLoss

This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
with a
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer to deal with
***numerical instabilities and to constrain the numbers to $[0, 1]$***
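
A minimal check of this relationship with random logits (illustrative only):

```python
import torch
import torch.nn as nn

logits = torch.randn(5)
y = torch.randint(0, 2, (5,)).float()

with_logits = nn.BCEWithLogitsLoss()(logits, y)
sigmoid_then_bce = nn.BCELoss()(torch.sigmoid(logits), y)

# same value, but the first form is numerically safer
assert torch.allclose(with_logits, sigmoid_then_bce)
```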

## HingeEmbeddingLoss

$$
l_n = max\left(
    0, - y_n \cdot \left(\bar{y}_{1,n} - \bar{y}_{2,n}\right) + margin
\right)
$$

Here we have 2 ***vectors*** of predictions, $\vec{\bar{y}}_1$ and $\vec{\bar{y}}_2$,
while $\vec{y}$ is the ***vector*** of `labels`. The objective is to rank positive
items with high values and vice-versa.

As before, our goal is to ***minimize*** the `loss`, thus we
always want ***negative values*** that are ***larger*** (in absolute value) than the
$margin$:

- $y_n = 1 \rightarrow \bar{y}_{1,n} >> \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} >> \bar{y}_{1,n}$

By having a margin we ensure that the model doesn't cheat by making
$\bar{y}_{1,n} = \bar{y}_{2,n}$
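
A tiny numeric illustration of the per-`point` rule above (the scores are invented):

```python
margin = 1.0

def l_n(y1, y2, y):
    # zero loss only when the correct item wins by at least the margin
    return max(0.0, -y * (y1 - y2) + margin)

print(l_n(3.0, 0.5, +1))  # 0.0 -> correct ranking with a comfortable gap
print(l_n(2.0, 2.0, +1))  # 1.0 -> equal scores ("cheating") still pay the margin
print(l_n(0.5, 3.0, +1))  # 3.5 -> a wrong ranking is penalised even more
```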

> [!TIP]
> Let's say we are trying to use this for `classification`,
> we can ***cheat*** a bit to make the `model` more ***robust***

The ***latter*** is the most important thing to do (as it's
the only *negative* term in the equation)

> [!NOTE]
> This is how reverse image search used to work in Google

## SoftMarginLoss

$$
l_n = \ln{\left(
    1 + e^{- y_n \cdot \bar{y}_n}
\right)}
$$
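
A quick numerical check of this formula against `Pytorch`'s `SoftMarginLoss`
(random scores, labels in $\{-1, 1\}$; illustrative only):

```python
import torch
import torch.nn as nn

y_hat = torch.randn(6)                         # raw scores
y = torch.randint(0, 2, (6,)).float() * 2 - 1  # labels in {-1, +1}

manual = torch.log(1 + torch.exp(-y * y_hat)).mean()
assert torch.allclose(nn.SoftMarginLoss()(y_hat, y), manual)
```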

Essentially, it works as the [HingeLoss](#hingeembeddingloss), but
with ***multiple target classes*** as in
[CrossEntropyLoss](#crossentropyloss)

## CosineEmbeddingLoss

With this loss we do the following:

- bring the angle to $0$ between $\bar{y}_{n,1}$
  and $\bar{y}_{n,2}$ when $y_n = 1$
- bring the angle to $\frac{\pi}{2}$ between $\bar{y}_{n,1}$
  and $\bar{y}_{n,2}$ when $y_n = -1$, making them
  ***orthogonal***
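
A minimal usage sketch, assuming `Pytorch`'s `CosineEmbeddingLoss` (shapes and
`margin` are arbitrary): for $y_n = 1$ the per-`point` loss is $1 - \cos$, for
$y_n = -1$ it is $max(0, \cos - margin)$.

```python
import torch
import torch.nn as nn

a = torch.randn(4, 16)                    # first batch of vectors
b = torch.randn(4, 16)                    # second batch of vectors
y = torch.tensor([1.0, 1.0, -1.0, -1.0])  # 1 = should be similar, -1 = should not

print(nn.CosineEmbeddingLoss(margin=0.0)(a, b, y))
```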

[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)

[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11