# Loss Functions
## MSELoss | AKA L2
$$
MSE(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
(\bar{y}_1 - y_1)^2 \\
(\bar{y}_2 - y_2)^2 \\
... \\
(\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T
$$
It can be reduced to a **scalar** by taking either
the `sum` of all the values or their `mean`.
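A minimal `PyTorch` sketch of the reduction modes
(the tensors below are made-up examples):
```python
import torch
import torch.nn as nn

y_hat = torch.tensor([2.5, 0.0, 2.0])   # predictions
y     = torch.tensor([3.0, -0.5, 2.0])  # targets

# reduction='none' keeps the per-element vector,
# 'mean' and 'sum' reduce it to a scalar
print(nn.MSELoss(reduction='none')(y_hat, y))  # tensor([0.2500, 0.2500, 0.0000])
print(nn.MSELoss(reduction='mean')(y_hat, y))  # tensor(0.1667)
print(nn.MSELoss(reduction='sum')(y_hat, y))   # tensor(0.5000)
```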
## L1Loss
This measures the **M**ean **A**bsolute **E**rror
$$
L1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
|\bar{y}_1 - y_1| \\
|\bar{y}_2 - y_2| \\
... \\
|\bar{y}_n - y_n| \\
\end{bmatrix}^T
$$
This is more **robust against outliers** as their
value is not **squared**.
However, it is not ***differentiable*** at $0$,
hence the existence of
[SmoothL1Loss](#smoothl1loss--aka-huber-loss).
As with [MSELoss](#mseloss--aka-l2), it can be reduced to
a **scalar**.
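A small sketch (again with made-up numbers) showing how a single
**outlier** dominates `L2` but not `L1`:
```python
import torch
import torch.nn as nn

y_hat = torch.tensor([1.0, 2.0, 100.0])  # the last prediction is an outlier
y     = torch.tensor([1.5, 2.0, 3.0])

print(nn.L1Loss(reduction='none')(y_hat, y))   # tensor([ 0.5000,  0.0000, 97.0000])
print(nn.MSELoss(reduction='none')(y_hat, y))  # tensor([2.5000e-01, 0.0000e+00, 9.4090e+03])
```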
## SmoothL1Loss | AKA Huber Loss
> [!NOTE]
> Called `Elastic Network` when used as an
> **objective function**
$$
SmoothL1Loss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
\frac{0.5 \cdot (\bar{y}_n -y_n)^2}{\beta}
&\text{ if }
|\bar{y}_n -y_n| < \beta \\
|\bar{y}_n -y_n| - 0.5 \cdot \beta
&\text{ if }
|\bar{y}_n -y_n| \geq \beta
\end{cases}
$$
This behaves like [MSELoss](#mseloss--aka-l2) for
residuals **under the threshold** $\beta$ and like
[L1Loss](#l1loss) **otherwise**.
It has the **advantage** of being **differentiable**
everywhere and is **very useful for `computer vision`**.
As with [MSELoss](#mseloss--aka-l2), it can be reduced to
a **scalar**.
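A sketch of the two regimes around the threshold (here $\beta = 1$,
which is also the `PyTorch` default; the residuals are made up):
```python
import torch
import torch.nn as nn

y     = torch.zeros(3)
y_hat = torch.tensor([0.5, 1.0, 4.0])  # residuals below, at and above beta

print(nn.SmoothL1Loss(beta=1.0, reduction='none')(y_hat, y))
# tensor([0.1250, 0.5000, 3.5000])
#   0.125 = 0.5 * 0.5^2 / 1  (quadratic branch, |residual| <  beta)
#   0.5   = |1.0| - 0.5 * 1  (linear branch,    |residual| >= beta)
#   3.5   = |4.0| - 0.5 * 1  (linear branch,    |residual| >= beta)
```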
## L1 vs L2 For Image Classification
Usually with an `L2` loss we get a **blurrier** image than
with an `L1` loss. This comes from the fact that
`L2` averages over all plausible values and does not respect
`distances`.
Moreover, since `L1` takes the absolute difference, its
gradient is constant over **all values** and **does not
vanish towards $0$**.
## NLLLoss[^NLLLoss]
This is basically a ***distance*** from the
real ***class tags***.
$$
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \bar{y}_{n, y_n}
$$
Even here there's the possibility to reduce the vector
to a **scalar**:
$$
NLLLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\sum^N_{n=1} \frac{
l_n
}{
\sum^N_{n=1} w_n
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$
Technically speaking, in `PyTorch` you have the
possibility to ***exclude*** some `classes` during
training. Moreover, it's possible to pass
`weights` for the `classes`, **useful when dealing
with an unbalanced training set**.
> [!TIP]
>
> So, what's $\vec{\bar{y}}$?
>
> It's the `tensor` containing the probability of
> each `point` belonging to each of the `classes`.
>
> For example, let's say we have 10 `points` and 3
> `classes`, then $\vec{\bar{y}}_{p,c}$ is the
> **`probability` of `point` `p` belonging to `class`
> `c`**
>
> This is why we have
> $l_n = - w_n \cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]
<!-- Comment to suppress linter -->
> [!WARNING]
>
> Technically speaking, the `input` data should be
> `log-probabilities`, e.g. coming from
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `PyTorch`.
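A minimal sketch, assuming 10 `points` and 3 `classes` as in the tip
above; the raw scores go through `LogSoftmax` first, as the warning
suggests, and the (made-up) `weights` address class imbalance:
```python
import torch
import torch.nn as nn

scores    = torch.randn(10, 3)            # raw scores: 10 points, 3 classes
log_probs = nn.LogSoftmax(dim=1)(scores)  # log-probabilities per class
targets   = torch.randint(0, 3, (10,))    # real class tag of each point

weights = torch.tensor([1.0, 2.0, 0.5])   # optional per-class weights
loss = nn.NLLLoss(weight=weights, reduction='mean')
print(loss(log_probs, targets))
```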
## CrossEntropyLoss[^Anelli-CEL]
$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \ln\left(
\frac{
e^{\bar{y}_{n, y_n}}
}{
\sum_c e^{\bar{y}_{n, c}}
}
\right)
$$
Even here there's the possibility to reduce the vector
to a **scalar**:
$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\sum^N_{n=1} \frac{
l_n
}{
\sum^N_{n=1} w_n
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$
> [!NOTE]
>
> This is basically a **better-behaved version** of
> [NLLLoss](#nllloss): it applies
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax)
> to the raw `outputs` internally, so the `log-probabilities`
> are computed in a numerically stable way.
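A sketch showing the equivalence (up to floating point) with
`LogSoftmax` + `NLLLoss`, using random made-up scores:
```python
import torch
import torch.nn as nn

logits  = torch.randn(10, 3)          # raw, un-normalized scores
targets = torch.randint(0, 3, (10,))

ce  = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
print(torch.allclose(ce, nll))  # True
```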
## AdaptiveLogSoftmaxWithLoss
<!-- TODO: Read https://arxiv.org/abs/1609.04309 -->
This is an ***approximate*** method to train models with ***large `outputs`*** on `GPUs`.
It is usually used when we have ***many `classes`*** and ***[imbalances](DEALING-WITH-IMBALANCES.md)*** in
our `training set`.
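A minimal sketch of the `PyTorch` module; the sizes and `cutoffs`
below are arbitrary examples, with frequent `classes` kept in the
cheap "head" and rarer ones pushed into tail clusters:
```python
import torch
import torch.nn as nn

hidden  = torch.randn(32, 64)              # batch of 32 hidden vectors of size 64
targets = torch.randint(0, 10_000, (32,))  # 10 000 classes in total

adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=64, n_classes=10_000, cutoffs=[100, 1_000, 5_000]
)
out = adaptive(hidden, targets)  # named tuple: per-sample output and scalar loss
print(out.loss)                  # ready for .backward()
```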
## BCELoss | AKA Binary Cross Entropy Loss
$$
BCELoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \left(
y_n \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1 - \bar{y}_n)}
\right)
$$
This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with just 2 `classes`.
Even here it can be reduced to a **scalar** with either the `mean` or `sum` modifier.
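A sketch with made-up probabilities; note that the `input` must
already be in $[0, 1]$ (e.g. after a `Sigmoid`):
```python
import torch
import torch.nn as nn

probs   = torch.tensor([0.9, 0.2, 0.6])  # predicted probabilities of class "1"
targets = torch.tensor([1.0, 0.0, 1.0])  # binary labels as floats

print(nn.BCELoss(reduction='none')(probs, targets))  # tensor([0.1054, 0.2231, 0.5108])
print(nn.BCELoss(reduction='mean')(probs, targets))  # tensor(0.2798)
```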
## KLDivLoss | AKA Kullback-Leibler Divergence Loss
$$
KLDivLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = y_n \cdot \ln{
\frac{
y_n
}{
\bar{y}_n
}
} = y_n \cdot \left(
\ln{(y_n)} - \ln{(\bar{y}_n)}
\right)
$$
This is just the ***[Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)***.
It is used because we are predicting the ***distribution*** $\vec{y}$ by using $\vec{\bar{y}}$.
> [!CAUTION]
> This method assumes you have `probabilities`, but it does not enforce the use of
> [Softmax](./../3-Activation-Functions/INDEX.md#softmax) or [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax),
> which can lead to ***numerical instabilities***
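A sketch where both distributions are explicit; following the caution
above, the `input` is put in log-space with `log_softmax` (the tensors
are random examples):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pred_log = F.log_softmax(torch.randn(4, 5), dim=1)  # predicted log-distribution
target   = F.softmax(torch.randn(4, 5), dim=1)      # target distribution

# 'batchmean' divides by the batch size, matching the mathematical definition
print(nn.KLDivLoss(reduction='batchmean')(pred_log, target))
```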
## BCEWithLogitsLoss
<!-- TODO: Define Logits -->
$$
BCEWithLogitsLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \left(\,
y_n \ln{\sigma(\bar{y}_n)} + (1 - y_n)
\cdot
\ln{(1 - \sigma(\bar{y}_n))}\,
\right)
$$
This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
with a
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer to deal with
***numerical instabilities***
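A sketch checking that it matches `Sigmoid` +
[BCELoss](#bceloss--aka-binary-cross-entropy-loss) while being computed
more safely (the logits are random examples):
```python
import torch
import torch.nn as nn

logits  = torch.randn(8)                     # raw scores, any real value
targets = torch.randint(0, 2, (8,)).float()

fused  = nn.BCEWithLogitsLoss()(logits, targets)
manual = nn.BCELoss()(torch.sigmoid(logits), targets)
print(torch.allclose(fused, manual))  # True (as long as the logits don't saturate)
```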
## HingeEmbeddingLoss
$$
HingeEmbeddingLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
\bar{y}_n \; \; & y_n = 1 \\
max(0, margin - \bar{y}_n) & y_n = -1
\end{cases}
$$
In order to understand this type of `Loss` let's reason
as an actual `model`, thus our ***objective*** is to reduce
the `Loss`.
By observing the `Loss`, we get that if we
***predict high values*** for `positive classes` we get a
***high `loss`***.
At the same time, we observe that if we ***predict low values***
for `negative classes`, we get a ***high `loss`***.
Now, what we'll do is:
- ***predict low `outputs` for `positive classes`***
- ***predict high `outputs` for `negative classes`***
This makes the two `classes` ***more distant***
from each other and makes the `points` within each
`class` ***closer*** together.
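A sketch treating $\vec{\bar{y}}$ as pairwise `distances` (made-up
values), with the `PyTorch` default $margin = 1$:
```python
import torch
import torch.nn as nn

distances = torch.tensor([0.2, 1.5, 0.3, 2.0])  # predicted distance of each pair
labels    = torch.tensor([1., 1., -1., -1.])    # 1 = same class, -1 = different class

print(nn.HingeEmbeddingLoss(margin=1.0, reduction='none')(distances, labels))
# tensor([0.2000, 1.5000, 0.7000, 0.0000])
#   y =  1 -> the distance itself is the loss (small distances are rewarded)
#   y = -1 -> max(0, margin - distance)      (large distances are rewarded)
```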
## MarginRankingLoss
$$
MarginRankingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = max\left(
0,\, -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n} ) + margin \,
\right)
$$
$\vec{\bar{y}}_1$ and $\vec{\bar{y}}_2$ represent ***vectors***
of predictions of that `point` being `class 1` or `class 2`
(both ***positive***),
while $\vec{y}$ is the ***vector*** of `labels`.
As before, our goal is to ***minimize*** the `loss`, thus we
want $y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n})$ to be
***larger*** than $margin$, so that the term inside the $max$
becomes negative:
- $y_n = 1 \rightarrow \bar{y}_{1,n} \gg \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} \gg \bar{y}_{1,n}$
> [!TIP]
> If we are trying to use this for `classification`,
> we can ***cheat*** a bit to make the `model` more ***robust***
> by having ***all correct predictions*** on $\vec{\bar{y}}_1$
> and on $\vec{\bar{y}}_2$ only the
> ***highest wrong prediction*** repeated $n$ times.
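A sketch with made-up scores for the two candidates of each `point`;
$y_n = 1$ means the first score should win:
```python
import torch
import torch.nn as nn

score_1 = torch.tensor([0.8, 0.3, 0.6])
score_2 = torch.tensor([0.2, 0.7, 0.5])
y       = torch.tensor([1.0, 1.0, -1.0])  # which of the two should be ranked higher

print(nn.MarginRankingLoss(margin=0.5, reduction='none')(score_1, score_2, y))
# tensor([0.0000, 0.9000, 0.6000])
#   l_n = max(0, -y_n * (score_1 - score_2) + margin)
```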
## TripletMarginLoss[^tripletmarginloss]
$$
TripletMarginLoss(\vec{a}, \vec{p}, \vec{n}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = max\left(
0,\,
d(a_n, p_n) - d(a_n, n_n) + margin \,
\right)
$$
Here we have:
- $\vec{a}$: ***anchor point*** that represents a `class`
- $\vec{p}$: ***positive example*** that is a `point` in the same
`class` as $\vec{a}$
- $\vec{n}$: ***negative example*** that is a `point` in another
`class` with respect to $\vec{a}$.
Optimizing here means
***keeping similar points close to each other*** and
***pushing dissimilar points further from each other***, with the
***latter*** being the most important thing to do (as it's
the only *negative* term in the equation).
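A sketch with random embeddings; `anchor` and `positive` are assumed
to come from the same `class`, `negative` from another one:
```python
import torch
import torch.nn as nn

anchor   = torch.randn(16, 128)  # 16 anchor embeddings of size 128
positive = torch.randn(16, 128)  # same class as the anchors
negative = torch.randn(16, 128)  # different class

# d(., .) is the p-norm distance; margin is the gap we ask for between the two
print(nn.TripletMarginLoss(margin=1.0, p=2)(anchor, positive, negative))
```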
## SoftMarginLoss
$$
SoftMarginLoss(\vec{\bar{y}}, \vec{y}) =
\sum^N_{n=1} \frac{
\ln \left( 1 + e^{-y_n \cdot \bar{y}_n} \right)
}{
N
}
$$
This `loss` gives only ***positive*** results, thus the
optimization consists in reducing $e^{-y_n \cdot \bar{y}_n}$.
Since $\vec{y}$ has only $1$ or $-1$ as values, our strategy
is to make:
- $y_n = 1 \rightarrow \bar{y}_n \gg 0$
- $y_n = -1 \rightarrow \bar{y}_n \ll 0$
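A sketch with made-up scores; large scores of the correct sign drive
the per-element loss towards $0$:
```python
import torch
import torch.nn as nn

scores = torch.tensor([3.0, -2.5, 0.1])
labels = torch.tensor([1.0, -1.0, -1.0])

print(nn.SoftMarginLoss(reduction='none')(scores, labels))
# ln(1 + e^{-y_n * score_n}): small when the sign agrees and |score| is large
```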
## MultiClassHingeLoss | AKA MultiLabelMarginLoss
<!-- TODO: make examples to understand it better -->
$$
MultiClassHingeLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\sum_{i,j} \frac{
max(0, 1 - (\bar{y}_{n,y_{n,j}} - \bar{y}_{n,i}) \,)
}{
\text{num\_of\_classes}
}
\; \;\forall i,j \text{ with } i \neq y_{n,j}
$$
Essentially, it works like the [HingeEmbeddingLoss](#hingeembeddingloss), but
with ***multiple classes***, in the same spirit as
[CrossEntropyLoss](#crossentropyloss).
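A sketch assuming the `PyTorch` convention where each `target` row
lists the indices of the relevant `classes` and is padded with $-1$:
```python
import torch
import torch.nn as nn

scores = torch.tensor([[0.1, 0.2, 0.4, 0.8]])  # one point, four classes
target = torch.tensor([[3, 0, -1, -1]])        # the point has class tags 3 and 0

print(nn.MultiLabelMarginLoss()(scores, target))  # tensor(0.8500)
# every tagged class is asked to score at least 1 above every non-tagged class
```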
## CosineEmbeddingLoss
$$
CosineEmbeddingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
1 - \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} & y_n = 1 \\
max(0, \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} - margin)
& y_n = -1
\end{cases}
$$
With this loss we do two things:
- bring the angle to $0$ between $\bar{y}_{n,1}$
and $\bar{y}_{n,2}$ when $y_n = 1$
- bring the angle to $\pi$ between $\bar{y}_{n,1}$
and $\bar{y}_{n,2}$ when $y_n = -1$, or $\frac{\pi}{2}$ if
only positive values of $\cos$ are allowed, making them
***orthogonal***
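A sketch with random embedding pairs; $y_n = 1$ pulls the angle of a
pair towards $0$, $y_n = -1$ pushes its cosine below the $margin$:
```python
import torch
import torch.nn as nn

emb_1 = torch.randn(8, 32)
emb_2 = torch.randn(8, 32)
y     = torch.tensor([1., 1., 1., 1., -1., -1., -1., -1.])

print(nn.CosineEmbeddingLoss(margin=0.0)(emb_1, emb_2, y))
```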
[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)
[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11
[^tripletmarginloss]: [Official Paper](https://bmva-archive.org.uk/bmvc/2016/papers/paper119/paper119.pdf)