From 2a96deaebf633a11a5bad8a663565965a49f98f0 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Mon, 17 Nov 2025 17:04:46 +0100
Subject: [PATCH] Revised chapter 4

---
 Chapters/4-Loss-Functions/INDEX.md | 61 ++++++++++++++++++------------
 1 file changed, 37 insertions(+), 24 deletions(-)

diff --git a/Chapters/4-Loss-Functions/INDEX.md b/Chapters/4-Loss-Functions/INDEX.md
index bd62e29..6ad837b 100644
--- a/Chapters/4-Loss-Functions/INDEX.md
+++ b/Chapters/4-Loss-Functions/INDEX.md
@@ -85,8 +85,15 @@ decrease towards $0$**
 
 ## NLLLoss[^NLLLoss]
 
+> [!CAUTION]
+>
+> Technically speaking, the `input` data should be
+> log-probabilities, e.g. coming from a
+> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
+> However, this is not enforced by `Pytorch`
+
 This is basically the ***distance*** towards
-real ***class tags***.
+real ***class tags***, optionally weighted by $\vec{w}$.
 
 $$
 NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
@@ -96,7 +103,7 @@
     l_n \\
 \end{bmatrix}^T;\\
 
-l_n = - w_n \cdot \bar{y}_{n, y_n}
+l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
 $$
 
 Even here there's the possibility to reduce the vector
@@ -116,7 +123,7 @@ $$
 Technically speaking, in `Pytorch` you have the
 possibility to ***exclude*** some `classes` during
 training. Moreover it's possible to pass
-`weights` for `classes`, **useful when dealing
+`weights`, $\vec{w}$, for `classes`, **useful when dealing
 with unbalanced training set**
 
 > [!TIP]
@@ -132,23 +139,24 @@ with unbalanced training set**
 > `c`**
 >
 > This is why we have
-> $l_n = - w_n \cdot \bar{y}_{n, y_n}$.
+> $l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}$.
 > In fact, we take the error over the
 > **actual `class tag` of that `point`**.
 >
 > To get a clear idea, check this website[^NLLLoss]
 
-
-
-> [!WARNING]
+> [!NOTE]
+> While weights can be used to give more importance to certain classes (e.g. a
+> higher weight to the less frequent ones), there's a better method.
 >
-> Technically speaking the `input` data should come
-> from a `LogLikelihood` like
-> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
-> However this is not enforced by `Pytorch`
+> We can use circular buffers to sample an equal amount from all classes and then
+> fine-tune at the end using the actual class frequencies.
 
 ## CrossEntropyLoss[^Anelli-CEL]
 
+Check [here](./../15-Appendix-A/INDEX.md#cross-entropy-loss-derivation) to
+see its formal derivation
+
 $$
 CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
     l_1 \\
@@ -182,8 +190,8 @@ $$
 
 > [!NOTE]
 >
-> This is basically a **good version** of
-> [NLLLoss](#nllloss)
+> This is basically [NLLLoss](#nllloss) without needing an explicit
+> [log softmax](./../3-Activation-Functions/INDEX.md#logsoftmax), as it is applied internally
 
 ## AdaptiveLogSoftmaxWithLoss
 
@@ -208,7 +216,8 @@ l_n = - w_n \cdot \left(
 \right)
 $$
 
-This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with just 2 `classes`
+This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with just 2 `classes`. Because of this we can use a single variable
+instead of 2 to represent the prediction ($\bar{y}_n$ and $1 - \bar{y}_n$), hence the *longer equation*.
 
 Even here we can reduce with either `mean` or `sum` modifiers
 
@@ -233,7 +242,7 @@ l_n = y_n \cdot \ln{
 \right)
 $$
 
-This is just the ***[Kullback Leibler Loss](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence)***.
+This is just the ***[Kullback Leibler Divergence](./../15-Appendix-A/INDEX.md#kullback-leibler-divergence)***.
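+
+As a minimal, illustrative sketch (the tensors below are made up; the only real
+requirement of `Pytorch`'s `KLDivLoss` / `kl_div` is that `input` holds
+**log**-probabilities while `target` holds plain probabilities):
+
+```python
+import torch
+import torch.nn.functional as F
+
+# Predicted distribution: kl_div expects log-probabilities as `input`,
+# e.g. a log_softmax over the model's raw scores
+logits = torch.randn(4, 10)                    # batch of 4, 10 classes
+log_pred = F.log_softmax(logits, dim=1)
+
+# Target distribution: plain probabilities that sum to 1 per row
+target = F.softmax(torch.randn(4, 10), dim=1)
+
+# 'batchmean' divides by the batch size, which matches the mathematical
+# definition of the KL divergence better than the default 'mean'
+loss = F.kl_div(log_pred, target, reduction="batchmean")
+print(loss)
+```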
 
 This is used because we are predicting the ***distribution***
 $\vec{y}$ by using $\vec{\bar{y}}$
@@ -264,7 +273,7 @@ $$
 This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
 with a [Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic)
 layer to deal with
-***numerical instabilities***
+***numerical instabilities and to constrain predictions to $[0, 1]$***
 
 ## HingeEmbeddingLoss
 
@@ -318,10 +327,8 @@ l_n = max\left(
 \right)
 $$
 
-$\vec{\bar{y}}_1$ and $\vec{\bar{y}}_2$ represent ***vectors***
-of predictions of that `point` being `class 1` or `class 2`
-(both ***positive***),
-while $\vec{y}$ is the ***vector*** of `labels`.
+Here we have 2 predictions of items. The objective is to rank positive items
+with high values and negative items with low values.
 
 As before, our goal is to ***minimize*** the `loss`,
 thus we always want ***negative values*** that are ***larger*** than
 $margin$:
 
 - $y_n = 1 \rightarrow \bar{y}_{1,n} >> \bar{y}_{2,n}$
 - $y_n = -1 \rightarrow \bar{y}_{2,n} >> \bar{y}_{1,n}$
 
+By having a margin, we ensure that the model doesn't cheat by simply making
+$\bar{y}_{1,n} = \bar{y}_{2,n}$
 
 > [!TIP]
 > Let's say we are trying to use this for `classification`,
 > we can ***cheat*** a bit to make the `model` more ***robust***
@@ -368,6 +378,9 @@ Optimizing here means
 ***latter*** being the most important thing to do
 (as it's the only *negative* term in the equation)
 
+> [!NOTE]
+> This is how Google's reverse image search used to work
+
 ## SoftMarginLoss
 
 $$
@@ -412,7 +425,7 @@ l_n =
 $$
 
 Essentially, it works as the [HingeLoss](#hingeembeddingloss), but
-with ***multiple classes*** as
+with ***multiple target classes*** as
 [CrossEntropyLoss](#crossentropyloss)
 
 ## CosineEmbeddingLoss
 
@@ -438,11 +451,11 @@
 With this loss we make these things:
 
 - bring the angle to $0$ between $\bar{y}_{n,1}$
   and $\bar{y}_{n,2}$ when $y_n = 1$
-- bring the angle to $\pi$ between $\bar{y}_{n,1}$
-  and $\bar{y}_{n,2}$ when $y_n = -1$, or $\frac{\pi}{2}$ if
-  only positive values of $\cos$ are allowed, making them
+- bring the angle to at least $\frac{\pi}{2}$ between $\bar{y}_{n,1}$
+  and $\bar{y}_{n,2}$ when $y_n = -1$, making them
   ***orthogonal***
+
 
 [^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)
 [^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11
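+
+To make the two $y_n$ cases of `CosineEmbeddingLoss` concrete, here is a
+minimal, illustrative sketch (the vectors and `margin=0.0` below are arbitrary
+choices, not something fixed by this chapter):
+
+```python
+import torch
+import torch.nn as nn
+
+# Two batches of (hypothetical) embeddings to compare, row by row
+y1 = torch.tensor([[1.0, 0.0, 0.0],
+                   [1.0, 0.0, 0.0]])
+y2 = torch.tensor([[0.9, 0.1, 0.0],   # almost parallel to the first row of y1
+                   [0.0, 1.0, 0.0]])  # orthogonal to the second row of y1
+
+# target = 1  -> pull the pair together (cosine pushed towards 1)
+# target = -1 -> push the pair apart (cosine pushed below the margin)
+target = torch.tensor([1.0, -1.0])
+
+loss_fn = nn.CosineEmbeddingLoss(margin=0.0)
+print(loss_fn(y1, y2, target))  # small value: both pairs already behave as asked
+```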