From 47eac8ff47b8f744d57dc6a9026a577080008204 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Tue, 15 Apr 2025 17:21:47 +0200
Subject: [PATCH] Added other Loss Functions

---
 Chapters/4-Loss-Functions/INDEX.md | 244 ++++++++++++++++++++++++++++-
 1 file changed, 242 insertions(+), 2 deletions(-)

diff --git a/Chapters/4-Loss-Functions/INDEX.md b/Chapters/4-Loss-Functions/INDEX.md
index 0693a2d..bd62e29 100644
--- a/Chapters/4-Loss-Functions/INDEX.md
+++ b/Chapters/4-Loss-Functions/INDEX.md
@@ -187,24 +187,264 @@ $$

## AdaptiveLogSoftmaxWithLoss

This is an ***approximate*** method to train models with ***large output spaces*** on `GPUs`.
It is usually used when we have ***many `classes`*** and ***[imbalances](DEALING-WITH-IMBALANCES.md)*** in
our `training set`.

## BCELoss | AKA Binary Cross Entropy Loss

$$
BCELoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \left(
    y_n \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1 - \bar{y}_n)}
\right)
$$

This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with just 2 `classes`.

Here too, we can reduce with either the `mean` or `sum` modifier.

## KLDivLoss | AKA Kullback-Leibler Divergence Loss

$$
KLDivLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = y_n \cdot \ln{
    \frac{
        y_n
    }{
        \bar{y}_n
    }
} = y_n \cdot \left(
    \ln{(y_n)} - \ln{(\bar{y}_n)}
\right)
$$

This is just the ***[Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence)***.

It is used because we are predicting the ***distribution*** $\vec{y}$ by means of $\vec{\bar{y}}$.

> [!CAUTION]
> This method assumes you have `probabilities`, but it does not enforce the use of
> [Softmax](./../3-Activation-Functions/INDEX.md#softmax) or [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax),
> which can lead to ***numerical instabilities***

## BCEWithLogitsLoss

$$
BCEWithLogitsLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \left(\,
    y_n \ln{\sigma(\bar{y}_n)} + (1 - y_n)
    \cdot
    \ln{(1 - \sigma(\bar{y}_n))}\,
\right)
$$

This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
with a built-in
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer to deal with
***numerical instabilities***.

## HingeEmbeddingLoss

$$
HingeEmbeddingLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    \bar{y}_n & y_n = 1 \\
    max(0, margin - \bar{y}_n) & y_n = -1
\end{cases}
$$

To understand this type of `Loss`, let's reason as an actual `model` would:
our ***objective*** is to reduce the `Loss`.

Observing the `Loss`, we see that if we
***predict high values*** for `positive classes` we get a
***high `loss`***.

At the same time, if we ***predict low values***
for `negative classes`, we also get a ***high `loss`***.

So, what we'll do is:

- ***predict low `outputs` for `positive classes`***
- ***predict high `outputs` for `negative classes`***

This makes the 2 `classes` ***more distant*** from each other,
and makes the `points` of each `class` ***closer*** together
(in practice, $\bar{y}_n$ is typically a ***distance*** between a pair of `points`,
with $y_n = 1$ meaning "the pair is similar").
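As a quick sanity check of the points above, here is a minimal `PyTorch` sketch
(the `torch.nn` class names match the section titles; the tensor shapes, margins and
random data are illustrative assumptions, not something stated in the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# BCELoss vs BCEWithLogitsLoss: the second fuses the Sigmoid for stability
logits  = torch.randn(4)                    # raw model outputs
targets = torch.tensor([1., 0., 1., 0.])    # binary labels

loss_bce        = nn.BCELoss()(torch.sigmoid(logits), targets)  # Sigmoid applied by hand
loss_bce_logits = nn.BCEWithLogitsLoss()(logits, targets)       # Sigmoid fused in
print(loss_bce.item(), loss_bce_logits.item())   # same value, up to float error

# KLDivLoss: pass log-probabilities as input to avoid the instabilities above
scores      = torch.randn(4, 5)                      # unnormalised class scores
target_dist = F.softmax(torch.randn(4, 5), dim=-1)   # a proper distribution
loss_kl = nn.KLDivLoss(reduction="batchmean")(F.log_softmax(scores, dim=-1), target_dist)
print(loss_kl.item())

# HingeEmbeddingLoss: inputs are distances, labels are +1 (similar) / -1 (dissimilar)
distances   = torch.tensor([0.1, 2.0, 0.3, 1.5])
pair_labels = torch.tensor([1., -1., 1., -1.])
print(nn.HingeEmbeddingLoss(margin=1.0)(distances, pair_labels).item())
```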
## MarginRankingLoss

$$
MarginRankingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = max\left(
    0,\, -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n}) + margin \,
\right)
$$

$\vec{\bar{y}}_1$ and $\vec{\bar{y}}_2$ represent ***vectors***
of predictions of a `point` being `class 1` or `class 2`
(both ***positive***),
while $\vec{y}$ is the ***vector*** of `labels`.

As before, our goal is to ***minimize*** the `loss`, so we
want $y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n})$ to be ***larger*** than
$margin$:

- $y_n = 1 \rightarrow \bar{y}_{1,n} \gg \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} \gg \bar{y}_{1,n}$

> [!TIP]
> Say we are using this for `classification`:
> we can ***cheat*** a bit to make the `model` more ***robust***
> by putting ***all the correct predictions*** in $\vec{\bar{y}}_1$
> and, in $\vec{\bar{y}}_2$, only the
> ***highest wrong prediction*** repeated $n$ times.

## TripletMarginLoss[^tripletmarginloss]

$$
TripletMarginLoss(\vec{a}, \vec{p}, \vec{n}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = max\left(
    0,\,
    d(a_n, p_n) - d(a_n, n_n) + margin \,
\right)
$$

Here we have:

- $\vec{a}$: the ***anchor point*** that represents a `class`
- $\vec{p}$: a ***positive example***, i.e. a `point` in the same
  `class` as $\vec{a}$
- $\vec{n}$: a ***negative example***, i.e. a `point` in a different
  `class` from $\vec{a}$

Optimizing here means
***keeping similar points close to each other*** and
***pushing dissimilar points further apart***, with the
***latter*** being the most important part (it is
the only *negative* term in the equation).

## SoftMarginLoss

$$
SoftMarginLoss(\vec{\bar{y}}, \vec{y}) =
\sum_i \frac{
    \ln \left( 1 + e^{-y_i \cdot \bar{y}_i} \right)
}{
    N
}
$$

This `loss` gives only ***positive*** results, so the
optimization consists in reducing $e^{-y_i \cdot \bar{y}_i}$.

Since $\vec{y}$ only takes the values $1$ or $-1$, our strategy
is to make:

- $y_i = 1 \rightarrow \bar{y}_i \gg 0$
- $y_i = -1 \rightarrow \bar{y}_i \ll 0$

## MultiClassHingeLoss | AKA MultiLabelMarginLoss

$$
MultiClassHingeLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n =
    \sum_{i,j} \frac{
        max(0, 1 - (\bar{y}_{n,y_{n,j}} - \bar{y}_{n,i}) \,)
    }{
        \text{num\_of\_classes}
    }
    \quad \forall \, i,j \text{ with } i \neq y_{n,j}
$$

Essentially, it works like the [HingeEmbeddingLoss](#hingeembeddingloss), but
with ***multiple classes***, as in
[CrossEntropyLoss](#crossentropyloss).

## CosineEmbeddingLoss

$$
CosineEmbeddingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    1 - \cos{(\bar{y}_{1,n}, \bar{y}_{2,n})} & y_n = 1 \\
    max(0, \cos{(\bar{y}_{1,n}, \bar{y}_{2,n})} - margin)
    & y_n = -1
\end{cases}
$$

With this loss we do two things:

- bring the angle between $\bar{y}_{1,n}$
  and $\bar{y}_{2,n}$ towards $0$ when $y_n = 1$
- push the angle between $\bar{y}_{1,n}$
  and $\bar{y}_{2,n}$ towards $\pi$ when $y_n = -1$; with $margin = 0$
  the loss is already zero at $\frac{\pi}{2}$, i.e. as soon as the two
  vectors are ***orthogonal***

[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)

[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11

[^tripletmarginloss]: [Official Paper](https://bmva-archive.org.uk/bmvc/2016/papers/paper119/paper119.pdf)
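To wrap up, here is a minimal `PyTorch` sketch of the ranking and embedding losses
described above (`MarginRankingLoss`, `TripletMarginLoss`, `CosineEmbeddingLoss`);
the shapes, margins and random inputs are illustrative assumptions only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# MarginRankingLoss: y = 1 means "the first score should rank above the second"
scores_1 = torch.randn(8)
scores_2 = torch.randn(8)
y = torch.randint(0, 2, (8,)).float() * 2 - 1      # random labels in {-1, +1}
print(nn.MarginRankingLoss(margin=0.5)(scores_1, scores_2, y).item())

# TripletMarginLoss: pull the anchor towards the positive, push it from the negative
anchor   = torch.randn(8, 16)
positive = torch.randn(8, 16)
negative = torch.randn(8, 16)
print(nn.TripletMarginLoss(margin=1.0, p=2)(anchor, positive, negative).item())

# CosineEmbeddingLoss: angle towards 0 for y = 1, past the margin for y = -1
emb_1 = torch.randn(8, 16)
emb_2 = torch.randn(8, 16)
print(nn.CosineEmbeddingLoss(margin=0.0)(emb_1, emb_2, y).item())
```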