# Loss Functions

## MSELoss | AKA L2

$$
MSE(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
(\bar{y}_1 - y_1)^2 \\
(\bar{y}_2 - y_2)^2 \\
... \\
(\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T
$$

It can, however, be reduced to a **scalar** by taking either the `sum` or the `mean` of all the values.

## L1Loss

This measures the **M**ean **A**bsolute **E**rror

$$
L1(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
|\bar{y}_1 - y_1| \\
|\bar{y}_2 - y_2| \\
... \\
|\bar{y}_n - y_n| \\
\end{bmatrix}^T
$$

This is more **robust against outliers**, as their error is not **squared**.
However, it is not ***differentiable*** for **small values** (the kink at $0$), hence the existence of [SmoothL1Loss](#smoothl1loss--aka-huber-loss).

As with [MSELoss](#mseloss--aka-l2), it can be reduced to a **scalar**.

## SmoothL1Loss | AKA Huber Loss

> [!NOTE]
> Called `Elastic Net` when used as an
> **objective function**

$$
SmoothL1(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\begin{cases}
\frac{0.5 \cdot (\bar{y}_n - y_n)^2}{\beta} &\text{ if } |\bar{y}_n - y_n| < \beta \\
|\bar{y}_n - y_n| - 0.5 \cdot \beta &\text{ if } |\bar{y}_n - y_n| \geq \beta
\end{cases}
$$

This behaves like [MSELoss](#mseloss--aka-l2) for values **under a threshold** $\beta$ and like [L1Loss](#l1loss) **otherwise**.
It has the **advantage** of being **differentiable** everywhere and is **very useful for `computer vision`**.

As with [MSELoss](#mseloss--aka-l2), it can be reduced to a **scalar**.

## L1 vs L2 For Image Classification

Usually with an `L2` loss we get a **blurrier** image than with an `L1` loss.
This comes from the fact that `L2` averages over all plausible values and does not respect `distances`.
Moreover, since `L1` takes the absolute difference, its gradient is constant over **all values** and **does not decrease towards $0$**.

## NLLLoss[^NLLLoss]

This is basically the ***distance*** of the predictions from the real ***class tags***.

$$
NLLLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
$$

Even here there's the possibility to reduce the vector to a **scalar**:

$$
NLLLoss(\vec{\bar{y}}, \vec{y}, mode) =
\begin{cases}
\frac{ \sum^N_{n=1} l_n }{ \sum^N_{n=1} w_{y_n} } & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

Technically speaking, in `PyTorch` you can ***exclude*** some `classes` from the loss computation (via `ignore_index`).
Moreover, it's possible to pass per-`class` `weights`, which is **useful when dealing with an unbalanced training set**.

> [!TIP]
>
> So, what's $\vec{\bar{y}}$?
>
> It's the `tensor` containing, for each `point`, the
> (log-)probability of belonging to each of the `classes`.
>
> For example, let's say we have 10 `points` and 3
> `classes`, then $\vec{\bar{y}}_{p,c}$ is the
> **`probability` of `point` `p` belonging to `class`
> `c`**
>
> This is why we have
> $l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]

> [!WARNING]
>
> Technically speaking, the `input` data should be
> `log-probabilities`, e.g. coming from
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `PyTorch`
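Below is a minimal `PyTorch` sketch, not part of the original notes, tying the formulas above to code: the three `reduction` modes shared by `MSELoss` / `L1Loss` / `SmoothL1Loss`, and an `NLLLoss` call with per-`class` `weights` and an excluded `class`. Shapes, seed and class `weights` are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# --- reduction modes shared by MSELoss / L1Loss / SmoothL1Loss ---
pred   = torch.randn(4)                       # hypothetical predictions \bar{y}
target = torch.randn(4)                       # hypothetical targets y

per_element = nn.MSELoss(reduction="none")(pred, target)  # vector of squared errors
as_mean     = nn.MSELoss(reduction="mean")(pred, target)  # scalar: mean of that vector
as_sum      = nn.MSELoss(reduction="sum")(pred, target)   # scalar: sum of that vector
smooth      = nn.SmoothL1Loss(beta=1.0)(pred, target)     # quadratic below beta, linear above

# --- NLLLoss with class weights and an excluded class ---
n_points, n_classes = 10, 3
scores   = torch.randn(n_points, n_classes)      # raw scores
log_prob = nn.LogSoftmax(dim=1)(scores)          # NLLLoss expects log-probabilities
labels   = torch.randint(0, n_classes, (n_points,))

class_weights = torch.tensor([1.0, 2.0, 0.5])    # made-up weights for an unbalanced set
criterion = nn.NLLLoss(weight=class_weights, ignore_index=2, reduction="mean")
loss = criterion(log_prob, labels)               # per point: -w_{y_n} * log_prob[n, y_n]

print(per_element, as_mean, as_sum, smooth, loss, sep="\n")
```

Note how `reduction="none"` returns the per-element vector from the formulas above, while `"mean"` and `"sum"` collapse it to a scalar.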
## CrossEntropyLoss[^Anelli-CEL]

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln\left(
    \frac{ e^{\bar{y}_{n, y_n}} }{ \sum_c e^{\bar{y}_{n, c}} }
\right)
$$

Even here there's the possibility to reduce the vector to a **scalar**:

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) =
\begin{cases}
\frac{ \sum^N_{n=1} l_n }{ \sum^N_{n=1} w_{y_n} } & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) with a built-in
> `LogSoftmax`, so it can be fed raw, unnormalized scores
> (`logits`) directly (see the sketch at the bottom of this page)

## AdaptiveLogSoftmaxWithLoss

## BCELoss | AKA Binary Cross Entropy Loss

## KLDivLoss | AKA Kullback-Leibler Divergence Loss

## BCEWithLogitsLoss

## HingeEmbeddingLoss

## MarginRankingLoss

## TripletMarginLoss

## SoftMarginLoss

## MultiLabelMarginLoss

## CosineEmbeddingLoss

[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)

[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11
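To back up the `CrossEntropyLoss` note above, here is a minimal `PyTorch` sketch (arbitrary shapes and seed, not from the original notes) checking numerically that `CrossEntropyLoss` on raw `logits` matches `LogSoftmax` followed by `NLLLoss`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(10, 3)              # raw scores for 10 points and 3 classes
labels = torch.randint(0, 3, (10,))      # one class tag per point

ce  = nn.CrossEntropyLoss()(logits, labels)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), labels)

print(torch.allclose(ce, nll))           # True: CrossEntropyLoss = LogSoftmax + NLLLoss
```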