# Loss Functions

## MSELoss | AKA L2

$$
MSE(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
(\bar{y}_1 - y_1)^2 \\
(\bar{y}_2 - y_2)^2 \\
... \\
(\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T
$$

It can, however, be reduced to a **scalar** by taking either the `sum` or the `mean` of all the values.

## L1Loss

This measures the **M**ean **A**bsolute **E**rror

$$
L1(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
|\bar{y}_1 - y_1| \\
|\bar{y}_2 - y_2| \\
... \\
|\bar{y}_n - y_n| \\
\end{bmatrix}^T
$$

This is more **robust against outliers**, as their error is not **squared**.
However, it is not ***differentiable*** for **small values** (the kink at $0$), hence the existence of [SmoothL1Loss](#smoothl1loss--aka-huber-loss).

As with [MSELoss](#mseloss--aka-l2), it can be reduced to a **scalar**.

## SmoothL1Loss | AKA Huber Loss

> [!NOTE]
> Called `Elastic Net` when used as an
> **objective function**

$$
SmoothL1(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\begin{cases}
\frac{0.5 \cdot (\bar{y}_n - y_n)^2}{\beta} &\text{ if } |\bar{y}_n - y_n| < \beta \\
|\bar{y}_n - y_n| - 0.5 \cdot \beta &\text{ if } |\bar{y}_n - y_n| \geq \beta
\end{cases}
$$

This behaves like [MSELoss](#mseloss--aka-l2) for values **under a threshold** $\beta$ and like [L1Loss](#l1loss) **otherwise**.
It has the **advantage** of being **differentiable** everywhere and is **very useful for `computer vision`**.

As with [MSELoss](#mseloss--aka-l2), it can be reduced to a **scalar**.

## L1 vs L2 For Image Classification

Usually with an `L2` loss we get a **blurrier** image than with an `L1` loss.
This comes from the fact that `L2` averages over all plausible values and does not respect `distances`.
Moreover, since `L1` takes the absolute difference, its gradient is constant over **all values** and **does not decrease towards $0$**.

## NLLLoss[^NLLLoss]

This is basically the ***distance*** of the predictions from the real ***class tags***.

$$
NLLLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
$$

Even here there's the possibility to reduce the vector to a **scalar**:

$$
NLLLoss(\vec{\bar{y}}, \vec{y}, mode) =
\begin{cases}
\frac{ \sum^N_{n=1} l_n }{ \sum^N_{n=1} w_{y_n} } & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

Technically speaking, in `PyTorch` you can ***exclude*** some `classes` from the loss computation (via `ignore_index`).
Moreover, it's possible to pass per-`class` `weights`, which is **useful when dealing with an unbalanced training set**.

> [!TIP]
>
> So, what's $\vec{\bar{y}}$?
>
> It's the `tensor` containing, for each `point`, the
> (log-)probability of belonging to each of the `classes`.
>
> For example, let's say we have 10 `points` and 3
> `classes`, then $\vec{\bar{y}}_{p,c}$ is the
> **`probability` of `point` `p` belonging to `class`
> `c`**
>
> This is why we have
> $l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]

> [!WARNING]
>
> Technically speaking, the `input` data should be
> `log-probabilities`, e.g. coming from
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `PyTorch`
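Below is a minimal `PyTorch` sketch, not part of the original notes, tying the formulas above to code: the three `reduction` modes shared by `MSELoss` / `L1Loss` / `SmoothL1Loss`, and an `NLLLoss` call with per-`class` `weights` and an excluded `class`. Shapes, seed and class `weights` are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# --- reduction modes shared by MSELoss / L1Loss / SmoothL1Loss ---
pred   = torch.randn(4)                       # hypothetical predictions \bar{y}
target = torch.randn(4)                       # hypothetical targets y

per_element = nn.MSELoss(reduction="none")(pred, target)  # vector of squared errors
as_mean     = nn.MSELoss(reduction="mean")(pred, target)  # scalar: mean of that vector
as_sum      = nn.MSELoss(reduction="sum")(pred, target)   # scalar: sum of that vector
smooth      = nn.SmoothL1Loss(beta=1.0)(pred, target)     # quadratic below beta, linear above

# --- NLLLoss with class weights and an excluded class ---
n_points, n_classes = 10, 3
scores   = torch.randn(n_points, n_classes)      # raw scores
log_prob = nn.LogSoftmax(dim=1)(scores)          # NLLLoss expects log-probabilities
labels   = torch.randint(0, n_classes, (n_points,))

class_weights = torch.tensor([1.0, 2.0, 0.5])    # made-up weights for an unbalanced set
criterion = nn.NLLLoss(weight=class_weights, ignore_index=2, reduction="mean")
loss = criterion(log_prob, labels)               # per point: -w_{y_n} * log_prob[n, y_n]

print(per_element, as_mean, as_sum, smooth, loss, sep="\n")
```

Note how `reduction="none"` returns the per-element vector from the formulas above, while `"mean"` and `"sum"` collapse it to a scalar.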
## CrossEntropyLoss[^Anelli-CEL]

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln\left(
    \frac{ e^{\bar{y}_{n, y_n}} }{ \sum_c e^{\bar{y}_{n, c}} }
\right)
$$

Even here there's the possibility to reduce the vector to a **scalar**:

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) =
\begin{cases}
\frac{ \sum^N_{n=1} l_n }{ \sum^N_{n=1} w_{y_n} } & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) with a built-in
> `LogSoftmax`, so it can be fed raw, unnormalized scores
> (`logits`) directly (see the sketch at the bottom of this page)

## AdaptiveLogSoftmaxWithLoss

## BCELoss | AKA Binary Cross Entropy Loss

## KLDivLoss | AKA Kullback-Leibler Divergence Loss

## BCEWithLogitsLoss

## HingeEmbeddingLoss

## MarginRankingLoss

## TripletMarginLoss

## SoftMarginLoss

## MultiLabelMarginLoss

## CosineEmbeddingLoss

[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)

[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11
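To back up the `CrossEntropyLoss` note above, here is a minimal `PyTorch` sketch (arbitrary shapes and seed, not from the original notes) checking numerically that `CrossEntropyLoss` on raw `logits` matches `LogSoftmax` followed by `NLLLoss`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(10, 3)              # raw scores for 10 points and 3 classes
labels = torch.randint(0, 3, (10,))      # one class tag per point

ce  = nn.CrossEntropyLoss()(logits, labels)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), labels)

print(torch.allclose(ce, nll))           # True: CrossEntropyLoss = LogSoftmax + NLLLoss
```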