
Loss Functions

MSELoss | AKA L2


MSE(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    (\bar{y}_1 - y_1)^2 \\
    (\bar{y}_2 - y_2)^2 \\
    ...             \\
    (\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T

However, it can be reduced to a scalar by taking either the sum of all the values or their mean.
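A minimal PyTorch sketch of the three reduction modes (the tensor values are made up for illustration):

import torch
import torch.nn as nn

y_hat = torch.tensor([2.5, 0.0, 2.0, 8.0])   # predictions (made-up values)
y     = torch.tensor([3.0, -0.5, 2.0, 7.0])  # targets

per_element = nn.MSELoss(reduction='none')(y_hat, y)  # vector of squared errors
mean_loss   = nn.MSELoss(reduction='mean')(y_hat, y)  # scalar: mean of that vector
sum_loss    = nn.MSELoss(reduction='sum')(y_hat, y)   # scalar: sum of that vector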

L1Loss

This measures the Mean Absolute Error


L1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    |\bar{y}_1 - y_1| \\
    |\bar{y}_2 - y_2| \\
    ...               \\
    |\bar{y}_n - y_n| \\
\end{bmatrix}^T

This is more robust against outliers as their value is not squared.

However, it is not differentiable at 0, hence the existence of SmoothL1Loss.

As with MSELoss, it can be reduced to a scalar.

SmoothL1Loss | AKA Huber Loss

Note

Called Elastic Network when used as an objective function


SmoothL1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    \frac{0.5 \cdot (\bar{y}_n - y_n)^2}{\beta}
    &\text{ if }
    |\bar{y}_n - y_n| < \beta \\

    |\bar{y}_n - y_n| - 0.5 \cdot \beta
    &\text{ if }
    |\bar{y}_n - y_n| \geq \beta
\end{cases}

This behaves like MSELoss for errors below the threshold \beta and like L1Loss otherwise.

It has the advantage of being differentiable everywhere and is very useful in computer vision.

As with MSELoss, it can be reduced to a scalar.
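A small sketch of the two regimes with \beta = 1.0 (the error values are made up for illustration):

import torch
import torch.nn as nn

y_hat = torch.tensor([0.05, 3.0])
y     = torch.tensor([0.00, 0.0])

# beta is the threshold between the quadratic and the linear regime
loss = nn.SmoothL1Loss(beta=1.0, reduction='none')(y_hat, y)
# |error| = 0.05 < beta  -> quadratic, like MSELoss: 0.5 * 0.05**2 / 1.0
# |error| = 3.0  >= beta -> linear, like L1Loss:     3.0 - 0.5 * 1.0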

L1 vs L2 For Image Classification

With an L2 loss we usually get a blurrier image than with an L1 loss. This comes from the fact that minimizing L2 pushes the prediction towards the mean of all plausible values, which blurs out details.

Moreover, since L1 takes the absolute difference, its gradient is constant for every error value and does not shrink as the error approaches 0.

NLLLoss1

This is essentially the (negative) score assigned to the real class tag of each point.


NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \bar{y}_{n, y_n}

Even here there's the possibility to reduce the vector to a scalar:


NLLLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
    \frac{
        \sum^N_{n=1} l_n
    }{
        \sum^N_{n=1} w_n
    } & \text{ if mode = "mean"}\\
    \sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}

In PyTorch it is also possible to exclude some classes during training (via ignore_index) and to pass per-class weights, which is useful when dealing with an unbalanced training set.

Tip

So, what's \vec{\bar{y}}?

It's the tensor containing, for each point, the (log-)probabilities of belonging to each class.

For example, let's say we have 10 points and 3 classes; then \bar{y}_{p,c} is the (log-)probability of point p belonging to class c.

This is why we have l_n = - w_n \cdot \bar{y}_{n, y_n}: we take the score assigned to the actual class tag of that point.

To get a clear idea, check this website1

Warning

Technically speaking, the input should contain log-probabilities, e.g. the output of a LogSoftmax layer. However, this is not enforced by PyTorch.
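A minimal sketch of the intended usage, with a LogSoftmax in front of NLLLoss; the sizes, class weights and random inputs are made-up values for illustration:

import torch
import torch.nn as nn

logits  = torch.randn(10, 3)            # 10 points, 3 classes
targets = torch.randint(0, 3, (10,))    # real class tag of each point

log_probs = nn.LogSoftmax(dim=1)(logits)  # NLLLoss expects log-probabilities

class_weights = torch.tensor([1.0, 2.0, 0.5])  # per-class weights for an unbalanced set
criterion = nn.NLLLoss(weight=class_weights, reduction='mean')
# ignore_index=<class tag> would additionally exclude that class from the loss
loss = criterion(log_probs, targets)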

CrossEntropyLoss2


CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \ln\left(
    \frac{
        e^{\bar{y}_{n, y_n}}
    }{
        \sum_c e^{\bar{y}_{n, c}}
    }
\right)

Even here there's the possibility to reduce the vector to a scalar:


CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
    \frac{
        \sum^N_{n=1} l_n
    }{
        \sum^N_{n=1} w_n
    } & \text{ if mode = "mean"}\\
    \sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}

Note

This is essentially NLLLoss with LogSoftmax built in, which makes it numerically stabler and more convenient to use.
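A quick check of that equivalence in PyTorch (random inputs for illustration):

import torch
import torch.nn as nn

logits  = torch.randn(10, 3)
targets = torch.randint(0, 3, (10,))

ce  = nn.CrossEntropyLoss()(logits, targets)               # takes raw scores directly
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)  # explicit LogSoftmax + NLLLoss

assert torch.allclose(ce, nll)  # same value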

AdaptiveLogSoftmaxWithLoss

This is an approximate method to efficiently train models with very large output spaces on GPUs. It is usually used when we have many classes with an imbalanced label distribution.
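A rough sketch of how it is wired up; the feature size, number of classes and cutoffs below are made-up values:

import torch
import torch.nn as nn

# cutoffs split the classes into a frequent "head" and rarer "tail" clusters
adaptive = nn.AdaptiveLogSoftmaxWithLoss(in_features=64,
                                         n_classes=10000,
                                         cutoffs=[100, 1000, 5000])

hidden  = torch.randn(32, 64)             # batch of hidden representations
targets = torch.randint(0, 10000, (32,))  # class indices

output = adaptive(hidden, targets)        # named tuple with .output and .loss
loss = output.loss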

BCELoss | AKA Binary Cross Entropy Loss


BCELoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \left(
   y_n \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1  - \bar{y}_n)}
\right)

This is a special case of Cross Entropy Loss with just 2 classes

Even here we can reduce with either mean or sum modifiers
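A minimal sketch; note that BCELoss expects probabilities in [0, 1], so a Sigmoid is applied first (values are made up):

import torch
import torch.nn as nn

logits  = torch.randn(8)                     # raw scores
targets = torch.randint(0, 2, (8,)).float()  # binary labels, 0 or 1

probs = torch.sigmoid(logits)                # map scores into [0, 1]
loss  = nn.BCELoss(reduction='mean')(probs, targets)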

KLDivLoss | AKA Kullback-Leibler Divergence Loss


KLDivLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = y_n \cdot \ln{
    \frac{
        y_n
    }{
        \bar{y}_n
    }
} = y_n \cdot \left(
    \ln{(y_n)} - \ln{(\bar{y}_n)}
\right)

This is just the Kullback-Leibler divergence.

This is used because we are predicting the distribution \vec{y} by using \vec{\bar{y}}

Caution

This method assumes the inputs are probabilities, but it does not enforce the use of Softmax or LogSoftmax, which can lead to numerical instabilities.
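A small sketch that avoids the instability by explicitly passing log-probabilities as input and probabilities as target (random tensors for illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

pred_logits   = torch.randn(4, 5)
target_logits = torch.randn(4, 5)

log_q = F.log_softmax(pred_logits, dim=1)   # input: log-probabilities of \vec{\bar{y}}
p     = F.softmax(target_logits, dim=1)     # target: probabilities \vec{y}

# 'batchmean' divides the sum by the batch size, matching the mathematical definition
loss = nn.KLDivLoss(reduction='batchmean')(log_q, p)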

BCEWithLogitsLoss


BCEWithLogitsLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \left(\,
   y_n \ln{\sigma(\bar{y}_n)} + (1 - y_n)
   \cdot
   \ln{(1  - \sigma(\bar{y}_n))}\,
\right)

This is basically a BCELoss with a Sigmoid layer to deal with numerical instabilities
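A quick sketch showing that it matches Sigmoid + BCELoss while taking raw scores directly (random values for illustration):

import torch
import torch.nn as nn

logits  = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()

stable = nn.BCEWithLogitsLoss()(logits, targets)       # sigmoid fused in, numerically stable
naive  = nn.BCELoss()(torch.sigmoid(logits), targets)  # same value, but can overflow/underflow

assert torch.allclose(stable, naive, atol=1e-6)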

HingeEmbeddingLoss


HingeEmbeddingLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    \bar{y}_n \; \; & y_n = 1 \\
    max(0, margin - \bar{y}_n) & y_n = -1
\end{cases}

In order to understand this type of loss, let's reason as the model does: our objective is to reduce the loss.

By observing the loss, we see that predicting high values for positive classes gives a high loss.

At the same time, predicting low values for negative classes also gives a high loss.

So, what the model will do is:

  • predict low outputs for positive classes
  • predict high outputs for negative classes

This makes the two classes more distant from each other and brings points of the same class closer together.
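A small sketch where the inputs are (made-up) pairwise distances between embeddings and the labels say whether the pair belongs to the same class:

import torch
import torch.nn as nn

distances = torch.tensor([0.1, 0.9, 1.5, 0.3])    # made-up distances between pairs
labels    = torch.tensor([1.0, 1.0, -1.0, -1.0])  # 1 = same class, -1 = different class

loss = nn.HingeEmbeddingLoss(margin=1.0, reduction='none')(distances, labels)
# y_n = 1  -> loss is the distance itself:        0.1, 0.9
# y_n = -1 -> loss is max(0, margin - distance):  0.0, 0.7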

MarginRankingLoss


MarginRankingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = max\left(
    0,\, -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n}  ) + margin \,
\right)

\vec{\bar{y}}_1 and \vec{\bar{y}}_2 are vectors of predictions of each point being class 1 or class 2 respectively (both positive), while \vec{y} is the vector of labels.

As before, our goal is to minimize the loss, so for each point we want y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n}) to be at least as large as the margin:

  • y_n = 1 \rightarrow \bar{y}_{1,n} >> \bar{y}_{2,n}
  • y_n = -1 \rightarrow \bar{y}_{2,n} >> \bar{y}_{1,n}
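A minimal sketch with made-up scores; only the third pair violates the required ranking and produces a non-zero loss:

import torch
import torch.nn as nn

scores_1 = torch.tensor([0.8, 0.2, 0.3])
scores_2 = torch.tensor([0.3, 0.7, 0.4])
y        = torch.tensor([1.0, -1.0, 1.0])   # 1: scores_1 should rank higher, -1: scores_2 should

loss = nn.MarginRankingLoss(margin=0.1, reduction='none')(scores_1, scores_2, y)
# l_n = max(0, -y_n * (scores_1_n - scores_2_n) + margin) -> [0.0, 0.0, 0.2]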

Tip

Let's say we are using this loss for classification: we can cheat a bit to make the model more robust by putting the correct predictions in \vec{\bar{y}}_1 and, in \vec{\bar{y}}_2, only the highest wrong prediction repeated n times.

TripletMarginLoss3


TripletMarginLoss(\vec{a}, \vec{p}, \vec{n}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = max\left(
    0,\,
    d(a_n, p_n) - d(a_n, n_n) + margin \,
\right)

Here we have:

  • \vec{a}: anchor point that represents a class
  • \vec{p}: positive example that is a point in the same class as \vec{a}
  • \vec{n}: negative example that is a point in another class with respect to \vec{a}.

Optimizing here means bringing similar points closer to each other and pushing dissimilar points further apart, with the latter being the most important objective (as it's the only negative term in the equation).
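A minimal sketch; the batch size and embedding dimension are made-up values, and d(\cdot, \cdot) is the p-norm distance (p = 2 by default):

import torch
import torch.nn as nn

anchor   = torch.randn(16, 128)   # anchor embeddings
positive = torch.randn(16, 128)   # embeddings from the same class as the anchor
negative = torch.randn(16, 128)   # embeddings from a different class

loss = nn.TripletMarginLoss(margin=1.0, p=2)(anchor, positive, negative)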

SoftMarginLoss


SoftMarginLoss(\vec{\bar{y}}, \vec{y}) =
\sum_n \frac{
    \ln \left( 1 + e^{-y_n \cdot \bar{y}_n} \right)
}{
    N
}

This loss gives only positive results, thus the optimization consists in reducing e^{-y_n \cdot \bar{y}_n}.

Since \vec{y} has only 1 or -1 as values, our strategy is to make:

  • y_n = 1 \rightarrow \bar{y}_n >> 0
  • y_n = -1 \rightarrow \bar{y}_n << 0
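A tiny sketch with made-up scores; the confident, correct predictions get a loss close to 0:

import torch
import torch.nn as nn

scores = torch.tensor([2.0, -3.0, 0.5])
labels = torch.tensor([1.0, -1.0, -1.0])

loss = nn.SoftMarginLoss(reduction='none')(scores, labels)
# ln(1 + e^{-y_n * score_n}): small when y_n * score_n >> 0, large otherwise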

MultiClassHingeLoss | AKA MultiLabelMarginLoss


MultiClassHingeLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n =
    \sum_{i,j} \frac{
        max(0, 1 - (\bar{y}_{n,y_{n,j}} - \bar{y}_{n,i}) \,)
    }{
     \text{num\_of\_classes}
    }

    \; \;\forall i,j \text{ with } i \neq y_{n,j}

Essentially, it works like HingeLoss, but extended to multiple classes in the same way as CrossEntropyLoss.
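A small sketch based on the usual PyTorch example: the target row lists the correct class indices and is terminated by -1 (so here the correct classes are 3 and 0):

import torch
import torch.nn as nn

scores = torch.tensor([[0.1, 0.2, 0.4, 0.8]])   # one point, 4 classes (made-up scores)
target = torch.tensor([[3, 0, -1, 1]])          # correct classes 3 and 0, then a -1 terminator

loss = nn.MultiLabelMarginLoss()(scores, target)
# every correct-class score is compared against every incorrect-class score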

CosineEmbeddingLoss


CosineEmbeddingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    1 - \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} & y_n = 1 \\
    max(0, \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} - margin)
    & y_n = -1
\end{cases}

With this loss we do two things:

  • bring the angle between \bar{y}_{n,1} and \bar{y}_{n,2} to 0 when y_n = 1
  • bring the angle between \bar{y}_{n,1} and \bar{y}_{n,2} to \pi when y_n = -1, or to \frac{\pi}{2} (making them orthogonal) if only positive values of \cos are allowed
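A minimal sketch with made-up embedding batches; margin = 0 pushes dissimilar pairs towards orthogonality:

import torch
import torch.nn as nn

emb_1  = torch.randn(8, 64)                           # made-up embedding batches
emb_2  = torch.randn(8, 64)
labels = (torch.randint(0, 2, (8,)) * 2 - 1).float()  # random labels in {-1, 1}

loss = nn.CosineEmbeddingLoss(margin=0.0)(emb_1, emb_2, labels)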