# Loss Functions

## MSELoss | AKA L2

$$
MSE(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
(\bar{y}_1 - y_1)^2 \\
(\bar{y}_2 - y_2)^2 \\
... \\
(\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T
$$

It can be reduced to a **scalar** by taking either the `sum` of all the values, or the `mean`.

## L1Loss

This measures the **M**ean **A**bsolute **E**rror

$$
L1(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
|\bar{y}_1 - y_1| \\
|\bar{y}_2 - y_2| \\
... \\
|\bar{y}_n - y_n| \\
\end{bmatrix}^T
$$

This is more **robust against outliers** as their value is not **squared**.
However, it is not ***differentiable*** around **small values** (the derivative is undefined at $0$), hence the existence of [SmoothL1Loss](#smoothl1loss--aka-huber-loss)

As with [MSELoss](#mseloss--aka-l2), it can be reduced to a **scalar**

## SmoothL1Loss | AKA Huber Loss

> [!NOTE]
> Called `Elastic Network` when used as an
> **objective function**

$$
SmoothL1Loss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\begin{cases}
\frac{0.5 \cdot (\bar{y}_n - y_n)^2}{\beta} & \text{ if } |\bar{y}_n - y_n| < \beta \\
|\bar{y}_n - y_n| - 0.5 \cdot \beta & \text{ if } |\bar{y}_n - y_n| \geq \beta
\end{cases}
$$

This behaves like [MSELoss](#mseloss--aka-l2) for values **below a threshold** $\beta$ and like [L1Loss](#l1loss) **otherwise**.
It has the **advantage** of being **differentiable** and is **very useful for `computer vision`**

As with [MSELoss](#mseloss--aka-l2), it can be reduced to a **scalar**

## L1 vs L2 For Image Classification

Usually, with the `L2` loss we get a **blurrier** image than with the `L1` loss.
This comes from the fact that `L2` averages all values and does not respect `distances`.
Moreover, since `L1` takes the absolute difference, its gradient is constant over **all values** and **does not decrease towards $0$**

## NLLLoss[^NLLLoss]

> [!CAUTION]
>
> Technically speaking, the `input` data should contain
> `log-probabilities`, e.g. coming from a
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `PyTorch`

This is basically the ***distance*** towards the real ***class tags***, optionally weighted by $\vec{w}$.

$$
NLLLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
$$

Even here there's the possibility to reduce the vector to a **scalar**:

$$
NLLLoss(\vec{\bar{y}}, \vec{y}, mode) =
\begin{cases}
\frac{ \sum^N_{n=1} l_n }{ \sum^N_{n=1} w_{y_n} } & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

In `PyTorch`, you also have the possibility to ***exclude*** some `classes` during training.
Moreover, it's possible to pass `weights`, $\vec{w}$, for the `classes`, **useful when dealing with an unbalanced training set**

> [!TIP]
>
> So, what's $\vec{\bar{y}}$?
>
> It's the `tensor` containing the probability of
> each `point` belonging to each of the `classes`.
>
> For example, let's say we have 10 `points` and 3
> `classes`, then $\vec{\bar{y}}_{p,c}$ is the
> **`probability` of `point` `p` belonging to `class`
> `c`**
>
> This is why we have
> $l_n = - w_{y_n}\cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]

> [!NOTE]
> Rather than using weights to give more importance to certain classes (e.g. a
> higher weight for less frequent classes), there's a better method.
>
> We can use circular buffers to sample an equal amount from all classes and then
> fine-tune at the end by using the actual class frequencies.
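
Here's a minimal `PyTorch` sketch of `NLLLoss` usage; the shapes, the weight values, and the ignored class are made up for illustration:

```python
import torch
import torch.nn as nn

# Made-up example: 10 points, 3 classes, with per-class weights.
log_probs = nn.LogSoftmax(dim=1)(torch.randn(10, 3))  # log-probabilities, as the CAUTION above requires
targets = torch.randint(0, 3, (10,))                  # class tags y_n
weights = torch.tensor([1.0, 2.0, 0.5])               # w, e.g. to boost rarer classes

loss = nn.NLLLoss(weight=weights, reduction="mean")   # "sum" and "none" are also available
print(loss(log_probs, targets))

# ignore_index excludes a class tag from training, as mentioned above
loss_skip = nn.NLLLoss(ignore_index=2)
print(loss_skip(log_probs, targets))
```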
## CrossEntropyLoss[^Anelli-CEL]

Check [here](./../15-Appendix-A/INDEX.md#cross-entropy-loss-derivation) to see its formal derivation

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln\left(
\frac{
e^{\bar{y}_{n, y_n}}
}{
\sum_c e^{\bar{y}_{n, c}}
}
\right)
$$

Even here there's the possibility to reduce the vector to a **scalar**:

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) =
\begin{cases}
\frac{ \sum^N_{n=1} l_n }{ \sum^N_{n=1} w_{y_n} } & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) with a built-in
> [log softmax](./../3-Activation-Functions/INDEX.md#logsoftmax), so you don't need to apply one yourself

## AdaptiveLogSoftmaxWithLoss

This is an ***approximate*** method to train models with ***large `outputs`*** on `GPUs`.
Usually used when we have ***many `classes`*** and ***[imbalances](DEALING-WITH-IMBALANCES.md)*** in our `training set`.

## BCELoss | AKA Binary Cross Entropy Loss

$$
BCELoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \left(
y_n \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1 - \bar{y}_n)}
\right)
$$

This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with just 2 `classes`.
Because of this, we use a trick to represent the loss with a single output $\bar{y}_n$ (the probability of the positive `class`) instead of 2, hence the *longer equation*.

Even here we can reduce with either the `mean` or `sum` modifiers

## KLDivLoss | AKA Kullback-Leibler Divergence Loss

$$
KLDivLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = y_n \cdot \ln{
\frac{
y_n
}{
\bar{y}_n
}
} = y_n \cdot \left(
\ln{(y_n)} - \ln{(\bar{y}_n)}
\right)
$$

This is just the ***[Kullback Leibler Divergence](./../15-Appendix-A/INDEX.md#kullback-leibler-divergence)***.
It is used because we are predicting the ***distribution*** $\vec{y}$ by using $\vec{\bar{y}}$

> [!CAUTION]
> This method assumes you have `probabilities`, but it does not enforce the use of
> [Softmax](./../3-Activation-Functions/INDEX.md#softmax) or [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax),
> leading to ***numerical instabilities***

## BCEWithLogitsLoss

$$
BCEWithLogitsLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \left(\,
y_n \ln{\sigma(\bar{y}_n)} + (1 - y_n) \cdot \ln{(1 - \sigma(\bar{y}_n))}\,
\right)
$$

This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss) with a built-in
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer, to deal with
***numerical instabilities*** and keep the numbers ***constrained to $[0, 1]$***
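
A quick sketch of this relationship, with made-up logits and labels: `BCEWithLogitsLoss` on raw logits gives the same value as `BCELoss` after a `Sigmoid`, but computes it in a more stable way.

```python
import torch
import torch.nn as nn

# Made-up logits (raw model outputs, not probabilities) and binary labels y_n.
logits = torch.tensor([2.5, -1.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])

stable = nn.BCEWithLogitsLoss()(logits, targets)        # works directly on logits
manual = nn.BCELoss()(torch.sigmoid(logits), targets)   # same value, but less stable for large |logits|
print(stable, manual)                                   # the two values match up to floating point error
```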
## HingeEmbeddingLoss

$$
HingeEmbeddingLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\begin{cases}
\bar{y}_n & y_n = 1 \\
max(0, margin - \bar{y}_n) & y_n = -1
\end{cases}
$$

In order to understand this type of `Loss`, let's reason as an actual `model`, whose ***objective*** is to reduce the `Loss`.

By observing the `Loss`, we see that if we ***predict high values*** for `positive classes` we get a ***high `loss`***.
At the same time, if we ***predict low values*** for `negative classes`, we get a ***high `loss`***.

Now, what we'll do is:

- ***predict low `outputs` for `positive classes`***
- ***predict high `outputs` for `negative classes`***

This makes these 2 `classes` ***more distant*** from each other, and makes the `points` of each `class` ***closer*** to each other

## MarginRankingLoss

$$
MarginRankingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = max\left(
0,\, -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n} ) + margin \,
\right)
$$

Here we have 2 sets of predictions over the items.
The objective is to rank positive items with high values, and vice-versa.

As before, our goal is to ***minimize*** the `loss`, thus we want $-y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n})$ to be ***negative*** and ***larger in magnitude*** than $margin$:

- $y_n = 1 \rightarrow \bar{y}_{1,n} \gg \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} \gg \bar{y}_{1,n}$

By having a margin, we ensure that the model doesn't cheat by making $\bar{y}_{1,n} = \bar{y}_{2,n}$

> [!TIP]
> If we are trying to use this for `classification`,
> we can ***cheat*** a bit to make the `model` more ***robust***
> by having ***all correct predictions*** on $\vec{\bar{y}}_1$
> and, on $\vec{\bar{y}}_2$, only the
> ***highest wrong prediction*** repeated $n$ times.

## TripletMarginLoss[^tripletmarginloss]

$$
TripletMarginLoss(\vec{a}, \vec{p}, \vec{n}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = max\left(
0,\, d(a_n, p_n) - d(a_n, n_n) + margin \,
\right)
$$

Here we have:

- $\vec{a}$: the ***anchor point*** that represents a `class`
- $\vec{p}$: a ***positive example***, i.e. a `point` in the same `class` as $\vec{a}$
- $\vec{n}$: a ***negative example***, i.e. a `point` in another `class` with respect to $\vec{a}$

Optimizing here means ***bringing similar points near to each other*** and ***pushing dissimilar points further from each other***, with the ***latter*** being the most important thing to do (as it's the only *negative* term in the equation)

> [!NOTE]
> This is how reverse image search used to work in Google

## SoftMarginLoss

$$
SoftMarginLoss(\vec{\bar{y}}, \vec{y}) =
\sum_n
\frac{
\ln \left(
1 + e^{-y_n \cdot \bar{y}_n}
\right)
}{
N
}
$$

This `loss` gives only ***positive*** results, thus the optimization consists in reducing $e^{-y_n \cdot \bar{y}_n}$.
Since $\vec{y}$ has only $1$ or $-1$ as values, our strategy is to make:

- $y_n = 1 \rightarrow \bar{y}_n \gg 0$
- $y_n = -1 \rightarrow \bar{y}_n \ll 0$

## MultiClassHingeLoss | AKA MultiLabelMarginLoss

$$
MultiClassHingeLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \sum_{i,j}
\frac{
max(0, 1 - (\bar{y}_{n,y_{n,j}} - \bar{y}_{n,i}) \,)
}{
\text{num\_of\_classes}
}
\; \;\forall i,j \text{ with } i \neq y_{n,j}
$$

Essentially, it works like the [HingeLoss](#hingeembeddingloss), but with ***multiple target classes***, as in [CrossEntropyLoss](#crossentropyloss)
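
A small `PyTorch` sketch of `MultiLabelMarginLoss`; the scores and targets are made up, and note the `-1` padding convention for the targets:

```python
import torch
import torch.nn as nn

# Made-up example: 1 sample with 4 class scores, belonging to classes 3 and 0.
scores = torch.tensor([[0.1, 0.2, 0.4, 0.8]])
# Targets are class indices padded with -1; only the entries before the first -1 count.
targets = torch.tensor([[3, 0, -1, -1]])

loss = nn.MultiLabelMarginLoss()
print(loss(scores, targets))  # averages the hinge terms over the 4 classes
```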
## CosineEmbeddingLoss

$$
CosineEmbeddingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\begin{cases}
1 - \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} & y_n = 1 \\
max(0, \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} - margin) & y_n = -1
\end{cases}
$$

With this loss we do two things:

- bring the angle between $\bar{y}_{n,1}$ and $\bar{y}_{n,2}$ to $0$ when $y_n = 1$
- bring the angle between $\bar{y}_{n,1}$ and $\bar{y}_{n,2}$ to $\frac{\pi}{2}$ when $y_n = -1$, making them ***orthogonal***

[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)
[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11
[^tripletmarginloss]: [Official Paper](https://bmva-archive.org.uk/bmvc/2016/papers/paper119/paper119.pdf)
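
As a closing illustration, here's a minimal `CosineEmbeddingLoss` sketch; the embeddings, their size, and the labels are made up:

```python
import torch
import torch.nn as nn

# Made-up example: 3 pairs of 64-dimensional embeddings,
# labelled 1 (same class: align them) or -1 (different class: make them orthogonal).
emb_1 = torch.randn(3, 64)
emb_2 = torch.randn(3, 64)
labels = torch.tensor([1, -1, 1])

loss = nn.CosineEmbeddingLoss(margin=0.0)  # a margin > 0 tolerates a slightly positive cosine when y_n = -1
print(loss(emb_1, emb_2, labels))
```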