Added 4th Chapter
This commit is contained in:
parent
73c11ebf9d
commit
f1f89417a9
210
Chapters/4-Loss-Functions/INDEX.md
Normal file
@ -0,0 +1,210 @@
# Loss Functions

## MSELoss | AKA L2

$$
MSE(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
(\bar{y}_1 - y_1)^2 \\
(\bar{y}_2 - y_2)^2 \\
... \\
(\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T
$$

Though, it can be reduced to a **scalar** by taking
either the `sum` of all the values or the `mean`.
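
A minimal `Pytorch` sketch of the three `reduction` modes (the tensor values are made up for illustration):

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([2.5, 0.0, 2.0, 8.0])  # predictions (illustrative)
y = torch.tensor([3.0, -0.5, 2.0, 7.0])     # targets (illustrative)

# reduction="none" keeps the per-element vector of squared errors
per_element = nn.MSELoss(reduction="none")(y_hat, y)

# "sum" and "mean" collapse the vector to a scalar
total = nn.MSELoss(reduction="sum")(y_hat, y)
average = nn.MSELoss(reduction="mean")(y_hat, y)  # the default

print(per_element, total, average)
```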

## L1Loss

This measures the **M**ean **A**bsolute **E**rror

$$
L1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
|\bar{y}_1 - y_1| \\
|\bar{y}_2 - y_2| \\
... \\
|\bar{y}_n - y_n| \\
\end{bmatrix}^T
$$

This is more **robust against outliers** as their
value is not **squared**.

However, this is not ***differentiable*** around
**small values** (at $0$), thus the existence of
[SmoothL1Loss](#smoothl1loss--aka-huber-loss)

As [MSELoss](#mseloss--aka-l2), it can be reduced to
a **scalar**
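
A small sketch contrasting the two (the outlier value is made up): the squared error explodes for the outlier, while the absolute error grows only linearly.

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([1.0, 2.0, 100.0])  # the last prediction is an outlier
y = torch.tensor([1.5, 2.5, 3.0])

print(nn.L1Loss(reduction="none")(y_hat, y))   # tensor([ 0.5,  0.5, 97.0])
print(nn.MSELoss(reduction="none")(y_hat, y))  # tensor([2.50e-01, 2.50e-01, 9.409e+03])
```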

## SmoothL1Loss | AKA Huber Loss

> [!NOTE]
> Called `Elastic Network` when used as an
> **objective function**

$$
SmoothL1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
\frac{0.5 \cdot (\bar{y}_n - y_n)^2}{\beta}
&\text{ if }
|\bar{y}_n - y_n| < \beta \\
|\bar{y}_n - y_n| - 0.5 \cdot \beta
&\text{ if }
|\bar{y}_n - y_n| \geq \beta
\end{cases}
$$

This behaves like [MSELoss](#mseloss--aka-l2) for
values **under a threshold** $\beta$ and like [L1Loss](#l1loss)
**otherwise**.

It has the **advantage** of being **differentiable**
and is **very useful for `computer vision`**

As [MSELoss](#mseloss--aka-l2), it can be reduced to
a **scalar**
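
A minimal sketch of the `Pytorch` module, where the threshold $\beta$ is exposed as the `beta` argument (the input values are made up):

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([0.1, 5.0])  # one small error, one large error
y = torch.tensor([0.0, 0.0])

loss = nn.SmoothL1Loss(beta=1.0, reduction="none")
# small error -> quadratic: 0.5 * 0.1**2 / 1 = 0.005
# large error -> linear:    5.0 - 0.5 * 1   = 4.5
print(loss(y_hat, y))
```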

## L1 vs L2 For Image Classification

Usually with `L2` losses, we get a **blurrier** image as
opposed to the `L1` loss. This comes from the fact that
`L2` averages all values and does not respect
`distances`.

Moreover, since `L1` takes the absolute difference, its
gradient is constant over **all values** and **does not
decrease towards $0$**
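
A tiny sketch of that last point (the scalar errors are chosen arbitrarily): the gradient of $|x|$ stays at $1$ no matter how small the error gets, while the gradient of $x^2$ shrinks along with it.

```python
import torch

for error in (4.0, 0.1):
    x1 = torch.tensor(error, requires_grad=True)
    x1.abs().backward()      # d|x|/dx = 1 for x > 0, constant
    x2 = torch.tensor(error, requires_grad=True)
    (x2 ** 2).backward()     # d(x^2)/dx = 2x, shrinks towards 0
    print(error, x1.grad.item(), x2.grad.item())
```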

## NLLLoss[^NLLLoss]

This is basically the ***distance*** towards the
real ***class tags***.

$$
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
$$

Even here there's the possibility to reduce the vector
to a **scalar**:

$$
NLLLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\frac{
\sum^N_{n=1} l_n
}{
\sum^N_{n=1} w_{y_n}
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

Technically speaking, in `Pytorch` you have the
possibility to ***exclude*** some `classes` during
training (via `ignore_index`). Moreover it's possible to
pass `weights` for the `classes`, **useful when dealing
with an unbalanced training set**
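
A minimal sketch of both options (the class count, weights and ignored class are made up):

```python
import torch
import torch.nn as nn

# 4 points, 3 classes: the input rows are log-probabilities (e.g. from LogSoftmax)
log_probs = torch.log_softmax(torch.randn(4, 3), dim=1)
targets = torch.tensor([0, 2, 1, 2])  # real class tag of each point

loss = nn.NLLLoss(
    weight=torch.tensor([1.0, 2.0, 0.5]),  # per-class weights for unbalanced data
    ignore_index=2,                        # points tagged with class 2 are skipped
)
print(loss(log_probs, targets))
```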

> [!TIP]
>
> So, what's $\vec{\bar{y}}$?
>
> It's the `tensor` containing the probability of
> each `point` belonging to each of the `classes`.
>
> For example, let's say we have 10 `points` and 3
> `classes`, then $\vec{\bar{y}}_{p,c}$ is the
> **`probability` of `point` `p` belonging to `class`
> `c`**
>
> This is why we have
> $l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]

<!-- Comment to suppress linter -->

> [!WARNING]
>
> Technically speaking, the `input` data should come
> from a `LogLikelihood` like
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However this is not enforced by `Pytorch`

## CrossEntropyLoss[^Anelli-CEL]

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln\left(
\frac{
e^{\bar{y}_{n, y_n}}
}{
\sum_c e^{\bar{y}_{n, c}}
}
\right)
$$

Even here there's the possibility to reduce the vector
to a **scalar**:

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\frac{
\sum^N_{n=1} l_n
}{
\sum^N_{n=1} w_{y_n}
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) with
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax)
> applied to the `input` first, so it takes raw `logits`
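
A minimal sketch of that equivalence (random logits, the shapes are illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(5, 3)           # 5 points, 3 classes, raw scores
targets = torch.randint(0, 3, (5,))  # real class tag of each point

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
print(torch.allclose(ce, nll))  # True
```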

## AdaptiveLogSoftmaxWithLoss

## BCELoss | AKA Binary Cross Entropy Loss

## KLDivLoss | AKA Kullback-Leibler Divergence Loss

## BCEWithLogitsLoss

## HingeEmbeddingLoss

## MarginRankingLoss

## TripletMarginLoss

## SoftMarginLoss

## MultiLabelMarginLoss

## CosineEmbeddingLoss

[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)

[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11