From f1f89417a9c2fe204020281782c95e938aabe7f5 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Tue, 15 Apr 2025 14:11:19 +0200
Subject: [PATCH] Added 4th Chapter

---
 Chapters/4-Loss-Functions/INDEX.md | 210 +++++++++++++++++++++++++++++
 1 file changed, 210 insertions(+)
 create mode 100644 Chapters/4-Loss-Functions/INDEX.md

diff --git a/Chapters/4-Loss-Functions/INDEX.md b/Chapters/4-Loss-Functions/INDEX.md
new file mode 100644
index 0000000..0693a2d
--- /dev/null
+++ b/Chapters/4-Loss-Functions/INDEX.md
@@ -0,0 +1,210 @@
+# Loss Functions
+
+## MSELoss | AKA L2
+
+$$
+MSE(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
+    (\bar{y}_1 - y_1)^2 \\
+    (\bar{y}_2 - y_2)^2 \\
+    ... \\
+    (\bar{y}_n - y_n)^2 \\
+\end{bmatrix}^T
+$$
+
+It can be reduced to a **scalar** by taking either
+the `sum` of all the values or their `mean`.
+
+## L1Loss
+
+This measures the **M**ean **A**bsolute **E**rror.
+
+$$
+L1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
+    |\bar{y}_1 - y_1| \\
+    |\bar{y}_2 - y_2| \\
+    ... \\
+    |\bar{y}_n - y_n| \\
+\end{bmatrix}^T
+$$
+
+This is more **robust against outliers**, as their
+contribution is not **squared**.
+
+However, it is not ***differentiable*** at $0$ and its
+gradient does not shrink for **small errors**; hence the
+existence of [SmoothL1Loss](#smoothl1loss--aka-huber-loss).
+
+Like [MSELoss](#mseloss--aka-l2), it can be reduced to
+a **scalar**.
+
+## SmoothL1Loss | AKA Huber Loss
+
+> [!NOTE]
+> Called `Elastic Network` when used as an
+> **objective function**
+
+$$
+SmoothL1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
+    l_1 \\
+    l_2 \\
+    ... \\
+    l_n \\
+\end{bmatrix}^T;\\
+l_n = \begin{cases}
+    \frac{0.5 \cdot (\bar{y}_n - y_n)^2}{\beta}
+    &\text{ if }
+    |\bar{y}_n - y_n| < \beta \\
+    |\bar{y}_n - y_n| - 0.5 \cdot \beta
+    &\text{ if }
+    |\bar{y}_n - y_n| \geq \beta
+\end{cases}
+$$
+
+This behaves like [MSELoss](#mseloss--aka-l2) for
+errors **under the threshold** $\beta$ and like
+[L1Loss](#l1loss) **otherwise**.
+
+It has the **advantage** of being **differentiable**
+everywhere and is **very useful in `computer vision`**
+(e.g. for bounding-box regression).
+
+Like [MSELoss](#mseloss--aka-l2), it can be reduced to
+a **scalar**.
+
+## L1 vs L2 For Image Reconstruction
+
+With an `L2` loss we usually get a **blurrier** output
+image than with an `L1` loss. This comes from the fact
+that `L2` pushes the prediction towards the **average**
+of the plausible outputs, which blurs out fine detail.
+
+Moreover, since `L1` takes the absolute difference, its
+gradient has **constant magnitude** and **does not
+decrease towards $0$** as the error shrinks.
+
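+A minimal `PyTorch` sketch of the losses above (assuming `torch`
+is installed; the tensor values are arbitrary): `reduction='none'`
+returns the per-element vector from the formulas, while `'mean'`
+and `'sum'` reduce it to a scalar.
+
+```python
+import torch
+from torch import nn
+
+y_hat = torch.tensor([2.5, 0.0, 2.0, 8.0])   # predictions (made-up values)
+y     = torch.tensor([3.0, -0.5, 2.0, 7.0])  # targets (made-up values)
+
+# reduction='none' keeps the per-element vector from the formulas above
+mse    = nn.MSELoss(reduction='none')(y_hat, y)                  # (y_hat_i - y_i)^2
+l1     = nn.L1Loss(reduction='none')(y_hat, y)                   # |y_hat_i - y_i|
+smooth = nn.SmoothL1Loss(reduction='none', beta=1.0)(y_hat, y)   # Huber with threshold beta
+print(mse, l1, smooth)
+
+# 'mean' (the default) and 'sum' reduce the vector to a scalar
+print(nn.MSELoss(reduction='mean')(y_hat, y),
+      nn.MSELoss(reduction='sum')(y_hat, y))
+```
+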
+## NLLLoss[^NLLLoss]
+
+This is basically the **negative log-likelihood** of the
+real ***class tags*** under the predicted distribution.
+
+$$
+NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
+    l_1 \\
+    l_2 \\
+    ... \\
+    l_n \\
+\end{bmatrix}^T;\\
+l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
+$$
+
+Here too, the vector can be reduced to a **scalar**:
+
+$$
+NLLLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
+    \frac{
+        \sum^N_{n=1} l_n
+    }{
+        \sum^N_{n=1} w_{y_n}
+    } & \text{ if mode = "mean"}\\
+    \sum^N_{n=1} l_n & \text{ if mode = "sum"}
+\end{cases}
+$$
+
+In `PyTorch` you also have the possibility to ***exclude***
+a `class` from training (via `ignore_index`). Moreover, it is
+possible to pass a `weight` per `class`, **useful when dealing
+with an unbalanced training set**.
+
+> [!TIP]
+>
+> So, what's $\vec{\bar{y}}$?
+>
+> It's the `tensor` containing, for each `point`, the
+> (log-)probability of belonging to each `class`.
+>
+> For example, let's say we have 10 `points` and 3
+> `classes`; then $\bar{y}_{p,c}$ is the
+> **(log-)`probability` of `point` `p` belonging to `class`
+> `c`**.
+>
+> This is why we have
+> $l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}$:
+> we take the error at the
+> **actual `class tag` of that `point`**.
+>
+> To get a clearer idea, check this website[^NLLLoss].
+
+> [!WARNING]
+>
+> Technically speaking, the `input` should be
+> **log-probabilities**, e.g. the output of
+> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
+> However, this is not enforced by `PyTorch`.
+
+## CrossEntropyLoss[^Anelli-CEL]
+
+$$
+CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
+    l_1 \\
+    l_2 \\
+    ... \\
+    l_n \\
+\end{bmatrix}^T;\\
+l_n = - w_{y_n} \cdot \ln\left(
+    \frac{
+        e^{\bar{y}_{n, y_n}}
+    }{
+        \sum_c e^{\bar{y}_{n, c}}
+    }
+\right)
+$$
+
+Here too, the vector can be reduced to a **scalar**:
+
+$$
+CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
+    \frac{
+        \sum^N_{n=1} l_n
+    }{
+        \sum^N_{n=1} w_{y_n}
+    } & \text{ if mode = "mean"}\\
+    \sum^N_{n=1} l_n & \text{ if mode = "sum"}
+\end{cases}
+$$
+
+> [!NOTE]
+>
+> This is basically [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax)
+> and [NLLLoss](#nllloss) combined into a single, more
+> numerically stable step.
+
+## AdaptiveLogSoftmaxWithLoss
+
+## BCELoss | AKA Binary Cross Entropy Loss
+
+## KLDivLoss | AKA Kullback-Leibler Divergence Loss
+
+## BCEWithLogitsLoss
+
+## HingeEmbeddingLoss
+
+## MarginRankingLoss
+
+## TripletMarginLoss
+
+## SoftMarginLoss
+
+## MultiLabelMarginLoss
+
+## CosineEmbeddingLoss
+
+[^NLLLoss]: [Remy Lau | Towards Data Science | 4 April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)
+
+[^Anelli-CEL]: Anelli | Deep Learning, PDF 4, p. 11
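+
+As a closing check, the equivalence between `CrossEntropyLoss` and
+[LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax) followed by
+[NLLLoss](#nllloss) can be verified numerically with a minimal `PyTorch`
+sketch (random logits and targets, arbitrary class weights):
+
+```python
+import torch
+from torch import nn
+
+torch.manual_seed(0)
+logits  = torch.randn(10, 3)              # 10 points, 3 classes: raw scores
+targets = torch.randint(0, 3, (10,))      # true class tag of each point
+w       = torch.tensor([1.0, 2.0, 0.5])   # per-class weights (arbitrary)
+
+# NLLLoss expects log-probabilities, so apply LogSoftmax first
+log_probs = nn.LogSoftmax(dim=1)(logits)
+nll = nn.NLLLoss(weight=w, reduction='mean')(log_probs, targets)
+
+# CrossEntropyLoss works directly on the raw scores
+ce = nn.CrossEntropyLoss(weight=w, reduction='mean')(logits, targets)
+
+print(nll, ce, torch.allclose(nll, ce))   # same value, so the check prints True
+```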