# Loss Functions

## MSELoss | AKA L2

$$
MSE(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
(\bar{y}_1 - y_1)^2 \\
(\bar{y}_2 - y_2)^2 \\
... \\
(\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T
$$

It can be reduced to a **scalar** by taking either
the `sum` or the `mean` of all the values.

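
As a minimal sketch of the formula and its reductions, here is a plain-Python version. The function names and the `reduction` argument are illustrative choices for this example, not a library API.

```python
def squared_errors(pred, target):
    """Element-wise squared errors, one value per point."""
    return [(p - t) ** 2 for p, t in zip(pred, target)]


def mse_loss(pred, target, reduction="mean"):
    """Reduce the vector of squared errors to a scalar (or keep it)."""
    errors = squared_errors(pred, target)
    if reduction == "sum":
        return sum(errors)
    if reduction == "mean":
        return sum(errors) / len(errors)
    return errors  # any other value: keep the unreduced vector


pred = [2.0, 0.0, 3.0]
target = [1.0, 0.0, 5.0]

print(mse_loss(pred, target, "none"))  # vector of squared errors
print(mse_loss(pred, target, "sum"))   # their sum
print(mse_loss(pred, target, "mean"))  # their mean
```
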
## L1Loss

This measures the **M**ean **A**bsolute **E**rror

$$
L1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
|\bar{y}_1 - y_1| \\
|\bar{y}_2 - y_2| \\
... \\
|\bar{y}_n - y_n| \\
\end{bmatrix}^T
$$

This is more **robust against outliers**, as their
value is not **squared**.

However, it is not ***differentiable*** at $0$
(when the prediction equals the target), hence the existence of
[SmoothL1Loss](#smoothl1loss--aka-huber-loss)

As with [MSELoss](#mseloss--aka-l2), it can be reduced to
a **scalar**

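
To make the robustness claim concrete, the sketch below compares how much a single outlier contributes to the total error under the absolute and the squared error. The data is a made-up toy example.

```python
def abs_errors(pred, target):
    """Element-wise absolute errors (L1)."""
    return [abs(p - t) for p, t in zip(pred, target)]


def squared_errors(pred, target):
    """Element-wise squared errors (L2)."""
    return [(p - t) ** 2 for p, t in zip(pred, target)]


target = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 3.0, 4.0, 14.0]  # the last prediction is an outlier (error of 10)

l1 = abs_errors(pred, target)
l2 = squared_errors(pred, target)

# Fraction of the total loss that is caused by the outlier alone
l1_share = l1[-1] / sum(l1)  # 10 / 13
l2_share = l2[-1] / sum(l2)  # 100 / 103
print(l1_share, l2_share)
```

With the squared error the outlier accounts for almost the whole loss, while with the absolute error the other points still carry a visible share.
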
## SmoothL1Loss | AKA Huber Loss

> [!NOTE]
> Called `Elastic Network` when used as an
> **objective function**

$$
SmoothL1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
\frac{0.5 \cdot (\bar{y}_n -y_n)^2}{\beta}
&\text{ if }
|\bar{y}_n -y_n| < \beta \\

|\bar{y}_n -y_n| - 0.5 \cdot \beta
&\text{ if }
|\bar{y}_n -y_n| \geq \beta
\end{cases}
$$

This behaves like [MSELoss](#mseloss--aka-l2) for
errors **under a threshold** $\beta$ and like [L1Loss](#l1loss)
**otherwise**.

It has the **advantage** of being **differentiable**
everywhere and is **very useful for `computer vision`**

As with [MSELoss](#mseloss--aka-l2), it can be reduced to
a **scalar**

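
The two branches of the formula can be sketched in plain Python as follows. The function name and the default `beta=1.0` are assumptions made for this example.

```python
def smooth_l1(pred, target, beta=1.0):
    """Element-wise loss: quadratic for small errors, linear for large ones."""
    losses = []
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            losses.append(0.5 * diff ** 2 / beta)  # MSE-like branch
        else:
            losses.append(diff - 0.5 * beta)       # L1-like branch
    return losses


# One small error (quadratic branch) and one large error (linear branch)
losses = smooth_l1([0.5, 3.0], [0.0, 0.0], beta=1.0)
print(losses)

# The two branches meet at |error| == beta, so there is no jump
at_threshold = smooth_l1([1.0], [0.0], beta=1.0)[0]
print(at_threshold)
```
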
## L1 vs L2 For Image Classification

Usually with `L2` losses, we get a **blurrier** image
than with the `L1` loss. This comes from the fact that
`L2` averages all values and does not respect
`distances`.

Moreover, since `L1` takes the absolute difference, its
gradient has a constant magnitude over **all values** and
**does not decrease as the error approaches $0$**

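
A toy illustration of the averaging effect, under the assumption that a single pixel has several plausible target values: the constant that minimizes the total `L2` error is their mean, while the one that minimizes the total `L1` error is their median. The values and the grid search below are made up for the example.

```python
def total_l2(candidate, values):
    """Sum of squared errors of one constant prediction."""
    return sum((candidate - v) ** 2 for v in values)


def total_l1(candidate, values):
    """Sum of absolute errors of one constant prediction."""
    return sum(abs(candidate - v) for v in values)


# Plausible values for one pixel: mostly dark, sometimes bright
values = [0.0, 0.0, 10.0]

# Brute-force search over 0.0, 0.1, ..., 10.0
candidates = [i / 10 for i in range(0, 101)]
best_l2 = min(candidates, key=lambda c: total_l2(c, values))
best_l1 = min(candidates, key=lambda c: total_l1(c, values))

print(best_l2)  # close to the mean: an in-between "blurry" value
print(best_l1)  # the median: one of the actual values
```
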
## NLLLoss[^NLLLoss]

This is basically the ***distance*** towards
real ***class tags***.

$$
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \bar{y}_{n, y_n}
$$

Even here there's the possibility to reduce the vector
to a **scalar**:

$$
NLLLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\sum^N_{n=1} \frac{
l_n
}{
\sum^N_{n=1} w_n
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

Technically speaking, in `PyTorch` you have the
possibility to ***exclude*** some `classes` during
training. Moreover, it's possible to pass
`weights` for `classes` (in the formulas above, $w_n$ is
the weight of the actual `class` $y_n$ of `point` $n$),
**useful when dealing with an unbalanced training set**

> [!TIP]
>
> So, what's $\vec{\bar{y}}$?
>
> It's the `tensor` containing the log-probability of
> a `point` to belong to those `classes`.
>
> For example, let's say we have 10 `points` and 3
> `classes`, then $\vec{\bar{y}}_{p,c}$ is the
> **log-`probability` of `point` `p` belonging to `class`
> `c`**
>
> This is why we have
> $l_n = - w_n \cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]

<!-- Comment to suppress linter -->

> [!WARNING]
>
> Technically speaking, the `input` data should be
> log-probabilities, i.e. come from a `LogLikelihood` like
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `PyTorch`

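
The formulas above can be sketched in plain Python as follows. This is an illustrative implementation, not the library code: the function name, the per-class `class_weights` list and the `reduction` argument are assumptions of this example.

```python
import math


def nll_loss(log_probs, targets, class_weights=None, reduction="mean"):
    """log_probs[n][c] is the log-probability of point n belonging to class c."""
    num_classes = len(log_probs[0])
    if class_weights is None:
        class_weights = [1.0] * num_classes
    losses, weights = [], []
    for row, y in zip(log_probs, targets):
        w = class_weights[y]        # weight of the actual class of this point
        losses.append(-w * row[y])  # pick the entry of the actual class
        weights.append(w)
    if reduction == "mean":
        return sum(losses) / sum(weights)
    if reduction == "sum":
        return sum(losses)
    return losses


# 2 points, 3 classes: probabilities turned into log-probabilities
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
log_probs = [[math.log(p) for p in row] for row in probs]
targets = [0, 1]

per_point = nll_loss(log_probs, targets, reduction="none")
mean_loss = nll_loss(log_probs, targets)
weighted = nll_loss(log_probs, targets, class_weights=[1.0, 2.0, 1.0])
print(per_point, mean_loss, weighted)
```
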
## CrossEntropyLoss[^Anelli-CEL]

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \ln\left(
\frac{
e^{\bar{y}_{n, y_n}}
}{
\sum_c e^{\bar{y}_{n, c}}
}
\right)
$$

Even here there's the possibility to reduce the vector
to a **scalar**:

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\sum^N_{n=1} \frac{
l_n
}{
\sum^N_{n=1} w_n
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
$$

> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) with the
> `LogSoftmax` built in: it takes raw scores as `input`
> and normalizes them itself

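
A minimal plain-Python sketch of the per-point formula, written as a log-softmax followed by picking the entry of the actual class. Weights are left out for brevity, and the function names are illustrative.

```python
import math


def log_softmax(scores):
    """Log of the softmax, shifted by the max score for numerical stability."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def cross_entropy(scores, targets, reduction="mean"):
    """scores[n][c] is the raw (unnormalized) score of point n for class c."""
    losses = [-log_softmax(row)[y] for row, y in zip(scores, targets)]
    if reduction == "mean":
        return sum(losses) / len(losses)
    if reduction == "sum":
        return sum(losses)
    return losses


scores = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]

per_point = cross_entropy(scores, targets, reduction="none")
print(per_point)  # confident and correct: small loss; uniform scores: ln(3)
```
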
## AdaptiveLogSoftmaxWithLoss

<!-- TODO: Read https://arxiv.org/abs/1609.04309 -->

This is an ***approximate*** method to train models with ***large `outputs`*** on `GPUs`.
It is usually used when we have ***many `classes`*** and ***[imbalances](DEALING-WITH-IMBALANCES.md)*** in
our `training set`.

## BCELoss | AKA Binary Cross Entropy Loss

$$
BCELoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \left(
y_n \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1 - \bar{y}_n)}
\right)
$$

This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with just 2 `classes`

Even here, the vector can be reduced to a **scalar** with either the `mean` or the `sum`

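
A short plain-Python sketch of the per-point formula, without weights. It assumes every prediction is a probability strictly between 0 and 1, otherwise the logarithm is undefined. The function name is illustrative.

```python
import math


def bce_loss(probs, targets, reduction="mean"):
    """probs[n] is the predicted probability that point n has label 1."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, targets)
    ]
    if reduction == "mean":
        return sum(losses) / len(losses)
    if reduction == "sum":
        return sum(losses)
    return losses


probs = [0.9, 0.2, 0.1]
targets = [1, 0, 1]  # the last prediction is confidently wrong

per_point = bce_loss(probs, targets, reduction="none")
print(per_point)  # the confidently wrong prediction has the largest loss
```
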
## KLDivLoss | AKA Kullback-Leibler Divergence Loss

$$
KLDivLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n = y_n \cdot \ln{
\frac{
y_n
}{
\bar{y}_n
}
} = y_n \cdot \left(
\ln{(y_n)} - \ln{(\bar{y}_n)}
\right)
$$

This is just the ***[Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence)***.

This is used because we are predicting the ***distribution*** $\vec{y}$ by using $\vec{\bar{y}}$

> [!CAUTION]
> This method assumes you have `probabilities`, but it does not enforce the use of
> [Softmax](./../3-Activation-Functions/INDEX.md#softmax) or [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax),
> which can lead to ***numerical instabilities***

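
The element-wise formula can be sketched in plain Python as below, working directly on probabilities as in the equation. The function name is illustrative, and terms with $y_n = 0$ are treated as contributing $0$, which is an assumption of this sketch.

```python
import math


def kl_div(pred_probs, target_probs):
    """Element-wise y * (ln(y) - ln(y_bar)); summing gives the divergence."""
    return [
        t * (math.log(t) - math.log(p)) if t > 0 else 0.0
        for p, t in zip(pred_probs, target_probs)
    ]


target = [0.5, 0.5]

same = sum(kl_div([0.5, 0.5], target))  # identical distributions
far = sum(kl_div([0.9, 0.1], target))   # prediction far from the target
print(same, far)
```

The divergence is $0$ only when the predicted distribution matches the target, and grows as they move apart.
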
## BCEWithLogitsLoss

<!-- TODO: Define Logits -->

$$
BCEWithLogitsLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \left(\,
y_n \ln{\sigma(\bar{y}_n)} + (1 - y_n)
\cdot
\ln{(1 - \sigma(\bar{y}_n))}\,
\right)
$$

This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
with a built-in
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer. Computing both
in a single step avoids the ***numerical instabilities*** of applying them separately

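
To see why fusing the two steps helps, the sketch below compares a naive "sigmoid, then log" computation with an algebraically equivalent rewriting of the same formula that never takes the logarithm of $0$. Both functions are illustrative and unweighted, and the rewriting is one common way to stabilize the expression, not necessarily what any library does internally.

```python
import math


def naive_bce(logit, label):
    """Sigmoid followed by the BCE formula, computed in two separate steps."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def bce_with_logits(logit, label):
    """Same quantity, rewritten as max(x, 0) - x*y + ln(1 + e^(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# For moderate values the two versions agree
moderate_naive = naive_bce(2.0, 1.0)
moderate_stable = bce_with_logits(2.0, 1.0)

# For a large logit the sigmoid rounds to exactly 1.0 and ln(1 - p) fails
try:
    naive_bce(40.0, 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True

large_stable = bce_with_logits(40.0, 0.0)
print(moderate_naive, moderate_stable, naive_failed, large_stable)
```
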
## HingeEmbeddingLoss

$$
HingeEmbeddingLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
\bar{y}_n \; \; & y_n = 1 \\
\max(0, margin - \bar{y}_n) & y_n = -1
\end{cases}
$$

To understand this type of `Loss`, let's reason
as an actual `model`: our ***objective*** is to reduce
the `Loss`. Here $\bar{y}_n$ is typically a distance
between a pair of `points`.

By observing the `Loss`, we see that if we
***predict high values*** for `positive classes` we get a
***high `loss`***.

At the same time, we observe that if we ***predict low values***
for `negative classes`, we get a ***high `loss`***.

So, what we'll do is:

- ***predict low `outputs` for `positive classes`***
- ***predict high `outputs` for `negative classes`***

This makes the 2 `classes` ***more distant***
from each other,
and makes the `points` of each `class` ***closer***

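
A minimal plain-Python sketch of the two cases, assuming the predictions are distances between pairs of points. The function name and the default `margin=1.0` are choices made for this example.

```python
def hinge_embedding(distances, labels, margin=1.0):
    """Element-wise loss for pairs labelled 1 (similar) or -1 (dissimilar)."""
    losses = []
    for d, y in zip(distances, labels):
        if y == 1:
            losses.append(d)                     # similar pair: pay the distance
        else:
            losses.append(max(0.0, margin - d))  # dissimilar: pay if too close
    return losses


distances = [0.25, 0.5, 1.5]
labels = [1, -1, -1]

losses = hinge_embedding(distances, labels)
print(losses)  # the dissimilar pair beyond the margin costs nothing
```
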
## MarginRankingLoss

$$
MarginRankingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n = \max\left(
0,\, -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n} ) + margin \,
\right)
$$

$\vec{\bar{y}}_1$ and $\vec{\bar{y}}_2$ represent ***vectors***
of predictions of that `point` being `class 1` or `class 2`
(both ***positive***),
while $\vec{y}$ is the ***vector*** of `labels`.

As before, our goal is to ***minimize*** the `loss`, thus we
want $-y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n})$ to be
***negative***, with a magnitude ***larger*** than
$margin$:

- $y_n = 1 \rightarrow \bar{y}_{1,n} \gg \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} \gg \bar{y}_{1,n}$

> [!TIP]
> Let's say we are trying to use this for `classification`:
> we can ***cheat*** a bit to make the `model` more ***robust***
> by having ***all correct predictions*** on $\vec{\bar{y}}_1$
> and on $\vec{\bar{y}}_2$ only the
> ***highest wrong prediction*** repeated $n$ times.

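
The ranking formula can be sketched in plain Python as follows. The function name and the example values are illustrative.

```python
def margin_ranking(pred_1, pred_2, labels, margin=0.0):
    """Element-wise max(0, -y * (y1 - y2) + margin)."""
    return [
        max(0.0, -y * (a - b) + margin)
        for a, b, y in zip(pred_1, pred_2, labels)
    ]


pred_1 = [2.0, 0.5, 0.5]
pred_2 = [1.0, 1.5, 1.5]
labels = [1, 1, -1]

losses = margin_ranking(pred_1, pred_2, labels, margin=0.5)
print(losses)
```

Only the second pair is penalized: its label asks for the first prediction to rank higher, but the second one does. The other two pairs are ranked as requested by more than the margin.
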
## TripletMarginLoss[^tripletmarginloss]

$$
TripletMarginLoss(\vec{a}, \vec{p}, \vec{n}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n = \max\left(
0,\,
d(a_n, p_n) - d(a_n, n_n) + margin \,
\right)
$$

Here we have:

- $\vec{a}$: ***anchor point*** that represents a `class`
- $\vec{p}$: ***positive example*** that is a `point` in the same
`class` as $\vec{a}$
- $\vec{n}$: ***negative example*** that is a `point` in another
`class` with respect to $\vec{a}$.

Optimizing here means
***having similar points near to each other*** and
***dissimilar points further from each other***, with the
***latter*** being the most important thing to do (as it's
the only *negative* term in the equation)

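
A small plain-Python sketch for a single triplet, assuming $d$ is the Euclidean distance (the formula above leaves the distance function open). The function names and the default `margin=1.0` are illustrative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) for one triplet."""
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # distance 1 from the anchor

far_negative = triplet_margin(anchor, positive, [3.0, 4.0])   # distance 5
near_negative = triplet_margin(anchor, positive, [0.0, 1.5])  # distance 1.5
print(far_negative, near_negative)
```

A negative that is already far enough gives no loss, while a negative that sits almost as close as the positive is penalized.
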
## SoftMarginLoss

$$
SoftMarginLoss(\vec{\bar{y}}, \vec{y}) =
\sum_i \frac{
\ln \left( 1 + e^{-y_i \cdot \bar{y}_i} \right)
}{
N
}
$$

This `loss` gives only ***positive*** results, thus the
optimization consists in reducing $e^{-y_i \cdot \bar{y}_i}$,
i.e. in making $y_i \cdot \bar{y}_i$ as large as possible.

Since $\vec{y}$ has only $1$ or $-1$ as values, our strategy
is to give each prediction the same sign as its label:

- $y_i = 1 \rightarrow \bar{y}_i \gg 0$
- $y_i = -1 \rightarrow \bar{y}_i \ll 0$

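
A quick plain-Python check of the formula shows which combinations of sign give a small loss. The function name and the values are illustrative.

```python
import math


def soft_margin(preds, labels):
    """Mean of ln(1 + exp(-y * y_bar)) over all points."""
    total = sum(math.log1p(math.exp(-y * p)) for p, y in zip(preds, labels))
    return total / len(preds)


agree = soft_margin([3.0, -3.0], [1, -1])     # prediction and label share sign
disagree = soft_margin([3.0, -3.0], [-1, 1])  # signs are opposite
undecided = soft_margin([0.0], [1])           # prediction exactly at 0
print(agree, disagree, undecided)
```

The loss is small when prediction and label have the same sign, equals $\ln 2$ at a prediction of $0$, and grows roughly linearly when the signs disagree.
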
## MultiClassHingeLoss | AKA MultiLabelMarginLoss

<!-- TODO: make examples to understand it better -->

$$
MultiClassHingeLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n =
\sum_{i,j} \frac{
\max(0, 1 - (\bar{y}_{n,y_{n,j}} - \bar{y}_{n,i}) \,)
}{
\text{num\_of\_classes}
}

\; \;\forall i,j \text{ with } i \neq y_{n,j}
$$

Essentially, it works like the [HingeLoss](#hingeembeddingloss), but
with ***multiple classes***, as in
[CrossEntropyLoss](#crossentropyloss)

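
A plain-Python sketch for a single point may help here: every class the point belongs to is compared against every class it does not belong to, and each pair whose scores are less than $1$ apart adds to the loss. The function name is illustrative, and passing the target classes as a plain list of indices is a simplification made for this example.

```python
def multilabel_margin(scores, target_classes):
    """scores[c] is the score of class c; target_classes are the true classes."""
    targets = set(target_classes)
    total = 0.0
    for j in target_classes:          # each class the point belongs to
        for i in range(len(scores)):  # against each class it does not
            if i in targets:
                continue
            total += max(0.0, 1.0 - (scores[j] - scores[i]))
    return total / len(scores)


scores = [0.1, 0.2, 0.4, 0.8]
target_classes = [3, 0]  # the point belongs to classes 3 and 0

loss = multilabel_margin(scores, target_classes)
print(loss)
```

Here class 0 is a true class with the lowest score, so its two comparisons (against classes 1 and 2) contribute most of the loss.
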
## CosineEmbeddingLoss

$$
CosineEmbeddingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
1 - \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} & y_n = 1 \\
\max(0, \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} - margin)
& y_n = -1
\end{cases}
$$

This loss does two things:

- bring the angle between $\bar{y}_{n,1}$
and $\bar{y}_{n,2}$ to $0$ when $y_n = 1$
- push the angle between $\bar{y}_{n,1}$
and $\bar{y}_{n,2}$ towards $\pi$ when $y_n = -1$. Since the
`loss` is already $0$ once $\cos \leq margin$, with
$margin = 0$ it is enough to reach $\frac{\pi}{2}$, making them
***orthogonal***

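
The two cases can be sketched in plain Python for a single pair of vectors. The function names and the default `margin=0.0` are choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_embedding(u, v, label, margin=0.0):
    """1 - cos for similar pairs, max(0, cos - margin) for dissimilar ones."""
    c = cosine(u, v)
    if label == 1:
        return 1.0 - c
    return max(0.0, c - margin)


same_dir = ([1.0, 0.0], [1.0, 0.0])    # angle 0
orthogonal = ([1.0, 0.0], [0.0, 1.0])  # angle pi / 2
opposite = ([1.0, 0.0], [-1.0, 0.0])   # angle pi

for u, v in (same_dir, orthogonal, opposite):
    print(cosine_embedding(u, v, 1), cosine_embedding(u, v, -1))
```

With `margin=0.0`, dissimilar pairs stop being penalized as soon as they are orthogonal, so pushing them further apart does not lower the loss.
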
[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)

[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11

[^tripletmarginloss]: [Official Paper](https://bmva-archive.org.uk/bmvc/2016/papers/paper119/paper119.pdf)