Revised chapter 4

commit 2a96deaebf (parent 247daf4d56)

@@ -85,8 +85,15 @@ decrease towards $0$**

## NLLLoss[^NLLLoss]

> [!CAUTION]
>
> Technically speaking, the `input` data should come
> from a `LogLikelihood` like
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `Pytorch`

This is basically the ***distance*** towards the
real ***class tags***, optionally weighted by $\vec{w}$.

$$
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
\vdots \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
$$

Even here there's the possibility to reduce the vector
with the `mean` or `sum` modifiers

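To make the `LogSoftmax` + `NLLLoss` pairing and the reduction modes concrete,
here is a minimal `Pytorch` sketch (the sizes and values below are made up):

```python
import torch
import torch.nn as nn

log_softmax = nn.LogSoftmax(dim=1)
nll = nn.NLLLoss(reduction="mean")        # "sum" and "none" are also accepted

logits = torch.randn(4, 3)                # 4 points, 3 classes (raw scores)
targets = torch.tensor([0, 2, 1, 2])      # real class tags y_n

# NLLLoss expects log-probabilities, so the predictions go through LogSoftmax first
loss = nll(log_softmax(logits), targets)
print(loss)                               # a scalar, because reduction="mean"
```
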
@@ -116,7 +123,7 @@

Technically speaking, in `Pytorch` you have the
possibility to ***exclude*** some `classes` during
training. Moreover, it's possible to pass
`weights`, $\vec{w}$, for the `classes`, **useful when dealing
with an unbalanced training set**

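These two options map to the `ignore_index` and `weight` arguments of
`NLLLoss`; a quick sketch (the class count and numbers below are arbitrary):

```python
import torch
import torch.nn as nn

weights = torch.tensor([1.0, 5.0, 1.0])           # class 1 is rare, so it weighs more
nll = nn.NLLLoss(weight=weights, ignore_index=2)  # points tagged 2 are excluded

log_probs = nn.LogSoftmax(dim=1)(torch.randn(4, 3))
targets = torch.tensor([0, 1, 2, 1])              # the third point is ignored
print(nll(log_probs, targets))
```
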
> [!TIP]

@@ -132,23 +139,24 @@

> `c`**
>
> This is why we have
> $l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]

> [!NOTE]
>
> Using weights to give more importance to certain classes, or a higher
> weight to the less frequent ones, works, but there's a better method.
>
> We can use circular buffers to sample an equal amount from all classes and then
> fine-tune at the end using the actual class frequencies.

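A toy sketch of that circular-buffer idea (plain Python, not a `Pytorch` API;
the class names and sizes are invented):

```python
from collections import deque
from itertools import islice

def balanced_batches(samples_by_class, per_class, n_batches):
    # one circular buffer per class, cycled endlessly
    buffers = {c: deque(samples) for c, samples in samples_by_class.items()}
    for _ in range(n_batches):
        batch = []
        for buf in buffers.values():
            batch.extend(islice(buf, 0, per_class))  # take the next few samples
            buf.rotate(-per_class)                   # advance the circular buffer
        yield batch

data = {"cat": list(range(100)), "dog": list(range(100, 110))}  # very unbalanced
for batch in balanced_batches(data, per_class=4, n_batches=3):
    print(batch)  # every batch holds 4 cats and 4 dogs; dogs get recycled sooner
```
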
## CrossEntropyLoss[^Anelli-CEL]

Check [here](./../15-Appendix-A/INDEX.md#cross-entropy-loss-derivation) to
see its formal derivation

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
\vdots \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln{\left( \frac{e^{\bar{y}_{n, y_n}}}{\sum_{c} e^{\bar{y}_{n, c}}} \right)}
$$

@@ -182,8 +190,8 @@

> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) without needing a
> [log softmax](./../3-Activation-Functions/INDEX.md#logsoftmax)

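A minimal sketch of that equivalence (sizes and values are made up):
`CrossEntropyLoss` on raw scores gives the same number as `LogSoftmax`
followed by `NLLLoss`.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # raw scores, no softmax applied
targets = torch.tensor([0, 2, 1, 2])

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
print(torch.allclose(ce, nll))        # True
```
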
## AdaptiveLogSoftmaxWithLoss

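A minimal usage sketch of `nn.AdaptiveLogSoftmaxWithLoss` (the feature size,
class count and cutoffs below are arbitrary): rarer classes are grouped into
clusters so the softmax over a huge output space stays cheap.

```python
import torch
import torch.nn as nn

asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=64, n_classes=10_000,
                                    cutoffs=[100, 1_000])  # frequent -> rare
features = torch.randn(8, 64)                 # 8 points, 64 features each
targets = torch.randint(0, 10_000, (8,))

out = asm(features, targets)
print(out.output.shape, out.loss)  # log-prob of each target class, plus the loss
```
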
@@ -208,7 +216,8 @@ l_n = - w_n \cdot \left(

$$
l_n = - w_n \cdot \left(
  y_n \cdot \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1 - \bar{y}_n)}
\right)
$$

This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with
just 2 `classes`. Because of this, we can use a single variable instead of 2
to represent the prediction, hence the *longer equation*.

Even here we can reduce with either the `mean` or `sum` modifiers

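A minimal sketch of that single-variable trick in `Pytorch` (the numbers are
made up): each point gets one probability, implicitly encoding both classes.

```python
import torch
import torch.nn as nn

probs = torch.tensor([0.9, 0.2, 0.7])     # P(class 1) for 3 points, already in [0, 1]
targets = torch.tensor([1.0, 0.0, 1.0])   # real tags, as floats

print(nn.BCELoss()(probs, targets))                   # mean reduction by default
print(nn.BCELoss(reduction="sum")(probs, targets))    # or "sum" / "none"
```
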
@@ -233,7 +242,7 @@ l_n = y_n \cdot \ln{

$$
l_n = y_n \cdot \ln{ \left( \frac{y_n}{\bar{y}_n} \right) }
$$

This is just the ***[Kullback Leibler Divergence](./../15-Appendix-A/INDEX.md#kullback-leibler-divergence)***.

This is used because we are predicting the ***distribution*** $\vec{y}$ by using $\vec{\bar{y}}$

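A minimal `Pytorch` sketch (the distributions are made up): `KLDivLoss` wants
the predictions as log-probabilities and the target as a proper distribution.

```python
import torch
import torch.nn as nn

pred_log = nn.LogSoftmax(dim=1)(torch.randn(2, 4))   # \bar{y}, as log-probs
target = torch.tensor([[0.10, 0.20, 0.30, 0.40],
                       [0.25, 0.25, 0.25, 0.25]])    # y, each row sums to 1

kl = nn.KLDivLoss(reduction="batchmean")             # the recommended reduction
print(kl(pred_log, target))
```
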
@@ -264,7 +273,7 @@

This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
with a
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer to deal with
***numerical instabilities*** and keep the numbers ***constrained to $[0, 1]$***

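A minimal sketch of why the built-in `Sigmoid` matters (values are made up):
feeding raw scores to `BCEWithLogitsLoss` matches `Sigmoid` + `BCELoss`, but is
computed in a numerically safer way.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.5, -1.0, 0.3])    # raw scores, not in [0, 1]
targets = torch.tensor([1.0, 0.0, 1.0])

a = nn.BCEWithLogitsLoss()(logits, targets)
b = nn.BCELoss()(torch.sigmoid(logits), targets)
print(torch.allclose(a, b))                # True (up to floating point error)
```
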
## HingeEmbeddingLoss

@@ -318,10 +327,8 @@ l_n = max\left(

$$
l_n = max\left(
  0, -y_n \cdot \left( \bar{y}_{1,n} - \bar{y}_{2,n} \right) + margin
\right)
$$

Here we have predictions for 2 items. The objective is to rank positive items
with high values and negative items with low values.

As before, our goal is to ***minimize*** the `loss`, thus we
always want ***negative values*** that are ***larger*** (in magnitude) than
$margin$:

- $y_n = 1 \rightarrow \bar{y}_{1,n} >> \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} >> \bar{y}_{1,n}$

By having a margin, we ensure that the model doesn't cheat by making
$\bar{y}_{1,n} = \bar{y}_{2,n}$

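The behaviour described above matches `Pytorch`'s `nn.MarginRankingLoss`
(treating that mapping as an assumption); a minimal sketch with made-up scores:

```python
import torch
import torch.nn as nn

y1 = torch.tensor([0.8, 0.3, 0.6])   # predictions for the first item of each pair
y2 = torch.tensor([0.2, 0.9, 0.5])   # predictions for the second item
y = torch.tensor([1.0, -1.0, 1.0])   # +1: first should rank higher, -1: the opposite

loss = nn.MarginRankingLoss(margin=0.5)(y1, y2, y)
print(loss)   # only pairs that violate the margin contribute
```
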
> [!TIP]
> Let's say we are trying to use this for `classification`;
> we can ***cheat*** a bit to make the `model` more ***robust***

@@ -368,6 +378,9 @@ Optimizing here means

***latter*** being the most important thing to do (as it's
the only *negative* term in the equation)

> [!NOTE]
>
> This is how reverse image search used to work at Google

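Assuming the loss discussed here is `Pytorch`'s `nn.TripletMarginLoss`
(an assumption based on the "only negative term" remark), a minimal sketch
with made-up embeddings:

```python
import torch
import torch.nn as nn

anchor = torch.randn(8, 128)     # embeddings of the query images
positive = torch.randn(8, 128)   # embeddings of matching images
negative = torch.randn(8, 128)   # embeddings of non-matching images

triplet = nn.TripletMarginLoss(margin=1.0, p=2)
print(triplet(anchor, positive, negative))  # pulls positives in, pushes negatives away
```
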
## SoftMarginLoss

$$
l_n = \ln{\left( 1 + e^{- y_n \cdot \bar{y}_n} \right)}
$$

@@ -412,7 +425,7 @@ l_n =

Essentially, it works as the [HingeLoss](#hingeembeddingloss), but
with ***multiple target classes***, as in
[CrossEntropyLoss](#crossentropyloss)

## CosineEmbeddingLoss

@@ -438,11 +451,11 @@ With this loss we make these things:

- bring the angle to $0$ between $\bar{y}_{n,1}$
  and $\bar{y}_{n,2}$ when $y_n = 1$
- bring the angle to $\frac{\pi}{2}$ between $\bar{y}_{n,1}$
  and $\bar{y}_{n,2}$ when $y_n = -1$, making them
  ***orthogonal***

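A minimal `Pytorch` sketch of this behaviour (sizes and values are made up):
$y = 1$ pulls the two vectors together, $y = -1$ pushes their cosine below the
margin (which defaults to $0$).

```python
import torch
import torch.nn as nn

v1 = torch.randn(4, 32)                     # first batch of embeddings
v2 = torch.randn(4, 32)                     # second batch of embeddings
y = torch.tensor([1.0, 1.0, -1.0, -1.0])    # which pairs should be similar

print(nn.CosineEmbeddingLoss()(v1, v2, y))
```
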
[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)

[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11