Loss Functions

MSELoss | AKA L2


MSE(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    (\bar{y}_1 - y_1)^2 \\
    (\bar{y}_2 - y_2)^2 \\
    ...             \\
    (\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T

Though, it can be reduced to a scalar by taking either the sum or the mean of all the values.
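A minimal PyTorch sketch of the three behaviours via nn.MSELoss's reduction argument (the tensors and shapes here are made up for illustration):

```python
import torch
import torch.nn as nn

y_hat = torch.randn(4)          # predictions
y     = torch.randn(4)          # targets

# Element-wise squared errors, same shape as the inputs
per_element = nn.MSELoss(reduction='none')(y_hat, y)

# Reduced to a scalar, either by averaging or by summing
mean_loss = nn.MSELoss(reduction='mean')(y_hat, y)
sum_loss  = nn.MSELoss(reduction='sum')(y_hat, y)

assert torch.isclose(mean_loss, per_element.mean())
assert torch.isclose(sum_loss, per_element.sum())
```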

L1Loss

This measures the Mean Absolute Error


L1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    |\bar{y}_1 - y_1| \\
    |\bar{y}_2 - y_2| \\
    ...               \\
    |\bar{y}_n - y_n| \\
\end{bmatrix}^T

This is more robust against outliers as their value is not squared.

However, it is not differentiable at zero and its gradient does not shrink for small errors, thus the existence of SmoothL1Loss

Like MSELoss, it can be reduced to a scalar

SmoothL1Loss | AKA Huber Loss

Note

Called Elastic Network when used as an objective function


SmoothL1Loss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    \frac{0.5 \cdot (\bar{y}_n -y_n)^2}{\beta}
    &\text{ if }
    |\bar{y}_n -y_n| < \beta \\

    |\bar{y}_n -y_n| - 0.5 \cdot \beta
    &\text{ if }
    |\bar{y}_n -y_n| \geq \beta
\end{cases}

This behaves like MSELoss for errors below the threshold \beta and like L1Loss otherwise.

It has the advantage of being differentiable everywhere and is very useful in computer vision

Like MSELoss, it can be reduced to a scalar
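A small sketch of the threshold behaviour, assuming PyTorch's nn.SmoothL1Loss with its beta argument (the values are chosen arbitrarily):

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([0.1, 3.0])   # one small error, one large error
y     = torch.tensor([0.0, 0.0])

loss = nn.SmoothL1Loss(reduction='none', beta=1.0)(y_hat, y)

# |0.1| < beta  -> quadratic branch: 0.5 * 0.1**2 / 1.0 = 0.005
# |3.0| >= beta -> linear branch:    3.0 - 0.5 * 1.0    = 2.5
print(loss)  # tensor([0.0050, 2.5000])
```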

L1 vs L2 For Image Classification

Usually with an L2 loss we get a blurrier image than with an L1 loss. This comes from the fact that L2 pushes the prediction towards the average of all plausible values, blurring them together, rather than committing to one of them.

Moreover, since L1 takes the absolute difference, its gradient has constant magnitude over all values and does not decrease as the error goes towards $0$
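A quick sketch (hypothetical error values) showing that the L1 gradient keeps a constant magnitude while the MSE gradient vanishes as the error goes towards 0:

```python
import torch
import torch.nn as nn

for err in (2.0, 0.5, 0.01):
    y_hat = torch.tensor([err], requires_grad=True)
    y = torch.zeros(1)

    nn.L1Loss()(y_hat, y).backward()
    l1_grad = y_hat.grad.item()          # always +/- 1

    y_hat.grad = None
    nn.MSELoss()(y_hat, y).backward()
    l2_grad = y_hat.grad.item()          # 2 * err, vanishes near 0

    print(f"error={err}: dL1/dy_hat={l1_grad}, dMSE/dy_hat={l2_grad}")
```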

NLLLoss1

Caution

Technically speaking, the input should contain log-probabilities, e.g. the output of LogSoftmax. However, this is not enforced by PyTorch

This is basically the negative of the score assigned to the real class tag of each point, optionally weighted by \vec{w}.


NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}

Even here there's the possibility to reduce the vector to a scalar:


NLLLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
    \frac{
        \sum^N_{n=1} l_n
    }{
        \sum^N_{n=1} w_{y_n}
    } & \text{ if mode = "mean"}\\
    \sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}

Technically speaking, in PyTorch you have the possibility to exclude some classes from the loss during training (via ignore_index). Moreover, it's possible to pass per-class weights \vec{w}, which is useful when dealing with an unbalanced training set
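A minimal sketch of these two options with PyTorch's nn.NLLLoss (the class weights and the ignored class are arbitrary choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.randn(5, 3)                  # 5 points, 3 classes (raw scores)
log_probs = F.log_softmax(scores, dim=1)    # NLLLoss expects log-probabilities
targets = torch.tensor([0, 2, 1, 2, 0])

# Per-class weights (e.g. to up-weight a rare class) and a class to ignore
weights = torch.tensor([1.0, 2.0, 0.5])
criterion = nn.NLLLoss(weight=weights, ignore_index=2)

loss = criterion(log_probs, targets)        # samples with target 2 are skipped
```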

Tip

So, what's \vec{\bar{y}}?

It's the tensor containing, for each point, the (log-)probabilities of belonging to each class.

For example, let's say we have 10 points and 3 classes; then \vec{\bar{y}}_{p,c} is the (log-)probability of point p belonging to class c

This is why we have l_n = - w_{y_n}\cdot \bar{y}_{n, y_n}: for each point we only take the entry corresponding to its actual class tag.

To get a clear idea, check this website1

Note

Instead of using weights to give more importance to certain classes, or a higher weight to less frequent classes, there's a better method.

We can use circular buffers to sample an equal amount from all classes, and then fine-tune at the end using the actual class frequencies.

CrossEntropyLoss2

Check here to see its formal derivation


CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_{y_n} \cdot \ln\left(
    \frac{
        e^{\bar{y}_{n, y_n}}
    }{
        \sum_c e^{\bar{y}_{n, c}}
    }
\right)

Even here there's the possibility to reduce the vector to a scalar:


CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
    \frac{
        \sum^N_{n=1} l_n
    }{
        \sum^N_{n=1} w_{y_n}
    } & \text{ if mode = "mean"}\\
    \sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}

Note

This is basically NLLLoss applied to raw scores, without needing an explicit LogSoftmax layer in front
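A small sketch of this equivalence (random tensors, shapes made up):

```python
import torch
import torch.nn as nn

scores = torch.randn(4, 5)                 # 4 points, 5 classes, raw scores
targets = torch.randint(0, 5, (4,))

ce = nn.CrossEntropyLoss()(scores, targets)

# Same result obtained explicitly: LogSoftmax followed by NLLLoss
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(scores), targets)

assert torch.isclose(ce, nll)
```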

AdaptiveLogSoftmaxWithLoss

This is an approximate method to train models with very large output spaces (many classes) on GPUs. It is usually used when the class frequencies in the training set are highly imbalanced, e.g. word frequencies in language modelling.
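A minimal sketch assuming PyTorch's nn.AdaptiveLogSoftmaxWithLoss; the feature size, number of classes and cutoffs below are arbitrary. The cutoffs split the (frequency-sorted) classes into a frequent "head" and progressively rarer clusters.

```python
import torch
import torch.nn as nn

n_features, n_classes = 64, 10_000

# Classes are assumed sorted by decreasing frequency;
# the cutoffs partition them into a head and two rarer tail clusters.
asm = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=n_features,
    n_classes=n_classes,
    cutoffs=[100, 1000],
)

hidden = torch.randn(32, n_features)             # e.g. outputs of a model
targets = torch.randint(0, n_classes, (32,))

out = asm(hidden, targets)
print(out.output.shape)   # per-sample log-probability of the target class
print(out.loss)           # scalar negative log-likelihood
```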

BCELoss | AKA Binary Cross Entropy Loss


BCELoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \left(
   y_n \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1  - \bar{y}_n)}
\right)

This is a special case of Cross Entropy Loss with just 2 classes. Because of this, we can employ a trick and use a single value \bar{y}_n to represent both class probabilities (\bar{y}_n and 1 - \bar{y}_n), hence the longer equation.

Even here we can reduce with either mean or sum modifiers
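A minimal sketch, assuming the predictions have already been squashed to [0, 1] with a Sigmoid (the data is random):

```python
import torch
import torch.nn as nn

raw = torch.randn(8)                    # raw model outputs (logits)
probs = torch.sigmoid(raw)              # BCELoss expects values in [0, 1]
labels = torch.empty(8).random_(2)      # binary targets: 0.0 or 1.0

per_element = nn.BCELoss(reduction='none')(probs, labels)
mean_loss   = nn.BCELoss(reduction='mean')(probs, labels)
```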

KLDivLoss | AKA Kullback-Leibler Divergence Loss


KLDivLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = y_n \cdot \ln{
    \frac{
        y_n
    }{
        \bar{y}_n
    }
} = y_n \cdot \left(
    \ln{(y_n)} - \ln{(\bar{y}_n)}
\right)

This is just the Kullback Leibler Divergence.

This is used because we are predicting the distribution \vec{y} by using \vec{\bar{y}}

Caution

This method assumes you are working with probabilities, but it does not enforce the use of Softmax or LogSoftmax, which can lead to numerical instabilities
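A small sketch of how it is typically used in PyTorch, passing the prediction in log-space via log_softmax and the target as probabilities (the tensors are random):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pred_scores = torch.randn(4, 10)
target_scores = torch.randn(4, 10)

log_y_hat = F.log_softmax(pred_scores, dim=1)   # \bar{y} in log-space
y = F.softmax(target_scores, dim=1)             # target distribution \vec{y}

# 'batchmean' divides the summed divergence by the batch size
kl = nn.KLDivLoss(reduction='batchmean')(log_y_hat, y)
```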

BCEWithLogitsLoss


BCEWithLogitsLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_n \cdot \left(\,
   y_n \ln{\sigma(\bar{y}_n)} + (1 - y_n)
   \cdot
   \ln{(1  - \sigma(\bar{y}_n))}\,
\right)

This is basically a BCELoss with a built-in Sigmoid layer, which deals with numerical instabilities and constrains the values to [0, 1]
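A quick sketch of the equivalence with Sigmoid + BCELoss (random logits; the two results match up to numerical precision):

```python
import torch
import torch.nn as nn

logits = torch.randn(8)
labels = torch.empty(8).random_(2)

fused = nn.BCEWithLogitsLoss()(logits, labels)
manual = nn.BCELoss()(torch.sigmoid(logits), labels)

assert torch.isclose(fused, manual, atol=1e-6)
```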

HingeEmbeddingLoss


HingeEmbeddingLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    \bar{y}_n \; \; & y_n = 1 \\
    max(0, margin - \bar{y}_n) & y_n = -1
\end{cases}

In order to understand this type of loss, let's reason as an actual model would: our objective is to reduce the loss.

By observing the loss, we see that if we predict high values for positive classes (y_n = 1) we get a high loss.

At the same time, if we predict low values (below the margin) for negative classes (y_n = -1), we also get a high loss.

Now, what we'll do is:

  • predict low outputs for positive classes
  • predict high outputs for negative classes

This makes these 2 classes more distant from each other, and makes the points of each class closer together
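A minimal sketch assuming PyTorch's nn.HingeEmbeddingLoss, where \bar{y}_n is typically a distance between two embeddings (the embeddings here are random):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_a = torch.randn(6, 16)
emb_b = torch.randn(6, 16)

distances = F.pairwise_distance(emb_a, emb_b)            # \bar{y}_n >= 0
labels = torch.tensor([1., 1., -1., -1., 1., -1.])       # 1 = similar, -1 = dissimilar

# Low distances are rewarded for similar pairs,
# distances above the margin are rewarded for dissimilar pairs
loss = nn.HingeEmbeddingLoss(margin=1.0)(distances, labels)
```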

MarginRankingLoss


MarginRankingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = max\left(
    0,\, -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n}  ) + margin \,
\right)

Here we have 2 predictions for each item. The objective is to rank positive items with high values and vice-versa.

As before, our goal is to minimize the loss, thus we want y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n}) to be larger than the margin:

  • y_n = 1 \rightarrow \bar{y}_{1,n} >> \bar{y}_{2,n}
  • y_n = -1 \rightarrow \bar{y}_{2,n} >> \bar{y}_{1,n}

By having a margin we ensure that the model doesn't cheat by making \bar{y}_{1,n} = \bar{y}_{2,n}
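A minimal sketch assuming PyTorch's nn.MarginRankingLoss (random scores; y_n = 1 means the first input should be ranked higher):

```python
import torch
import torch.nn as nn

scores_1 = torch.randn(5)
scores_2 = torch.randn(5)

#  1 -> scores_1[n] should beat scores_2[n] by at least the margin
# -1 -> the other way around
y = torch.tensor([1., -1., 1., 1., -1.])

loss = nn.MarginRankingLoss(margin=0.5)(scores_1, scores_2, y)
```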

Tip

Let's say we are trying to use this for classification. We can cheat a bit to make the model more robust by putting all the correct-class predictions in \vec{\bar{y}}_1 and, in \vec{\bar{y}}_2, only the highest wrong prediction repeated n times.

TripletMarginLoss3


TripletMarginLoss(\vec{a}, \vec{p}, \vec{n}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = max\left(
    0,\,
    d(a_n, p_n) - d(a_n, n_n) + margin \,
\right)

Here we have:

  • \vec{a}: anchor point that represents a class
  • \vec{p}: positive example that is a point in the same class as \vec{a}
  • \vec{n}: negative example that is a point in another class with respect to \vec{a}.

Optimizing here means bringing similar points close to each other and pushing dissimilar points further apart, with the latter being the most important part (as it's the only negative term in the equation)
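A minimal sketch assuming PyTorch's nn.TripletMarginLoss on random embeddings (the embedding size and margin are arbitrary):

```python
import torch
import torch.nn as nn

anchor   = torch.randn(8, 128)   # \vec{a}: embeddings of the reference points
positive = torch.randn(8, 128)   # \vec{p}: embeddings from the same class
negative = torch.randn(8, 128)   # \vec{n}: embeddings from a different class

# d(.,.) defaults to the Euclidean distance (p=2)
loss = nn.TripletMarginLoss(margin=1.0, p=2)(anchor, positive, negative)
```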

Note

This is how reverse image search used to work in Google

SoftMarginLoss


SoftMarginLoss(\vec{\bar{y}}, \vec{y}) =
\sum_i \frac{
    \ln \left( 1 + e^{-y_i \cdot \bar{y}_i} \right)
}{
    N
}

This loss only gives positive values, thus the optimization consists in reducing e^{-y_i \cdot \bar{y}_i}, i.e. making y_i \cdot \bar{y}_i large and positive.

Since \vec{y} has only 1 or -1 as values, our strategy is to make (see the sketch below):

  • y_i = 1 \rightarrow \bar{y}_i >> 0
  • y_i = -1 \rightarrow \bar{y}_i << 0
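A minimal sketch assuming PyTorch's nn.SoftMarginLoss (random scores and ±1 labels):

```python
import torch
import torch.nn as nn

y_hat = torch.randn(6)                            # raw scores, unbounded
y = torch.tensor([1., -1., 1., -1., 1., -1.])     # targets in {1, -1}

# The loss is small when y * y_hat is large and positive,
# i.e. when the sign of the prediction matches the target
loss = nn.SoftMarginLoss()(y_hat, y)
```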

MultiClassHingeLoss | AKA MultiLabelMarginLoss


MultiClassHingeLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n =
    \sum_{i,j} \frac{
        max(0, 1 - (\bar{y}_{n,y_{n,j}} - \bar{y}_{n,i}) \,)
    }{
     \text{num\_of\_classes}
    }

    \; \;\forall i,j \text{ with } i \neq y_{n,j}

Essentially, it works like the Hinge Loss, but it supports multiple target classes per sample, similarly to CrossEntropyLoss
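A minimal sketch assuming PyTorch's nn.MultiLabelMarginLoss. Each target row lists the correct class indices and is padded with -1 (the values are made up):

```python
import torch
import torch.nn as nn

scores = torch.randn(2, 4)                 # 2 samples, 4 classes

# Sample 0 belongs to classes 0 and 3; sample 1 only to class 2.
# The remaining slots are padded with -1 and ignored.
targets = torch.tensor([[0, 3, -1, -1],
                        [2, -1, -1, -1]])

loss = nn.MultiLabelMarginLoss()(scores, targets)
```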

CosineEmbeddingLoss


CosineEmbeddingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    1 - \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} & y_n = 1 \\
    max(0, \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} - margin)
    & y_n = -1
\end{cases}

With this loss we do two things:

  • bring the angle to 0 between \bar{y}_{n,1} and \bar{y}_{n,2} when y_n = 1
  • push the angle to at least \frac{\pi}{2} between \bar{y}_{n,1} and \bar{y}_{n,2} when y_n = -1 (for margin = 0), making them orthogonal or further apart
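A minimal sketch assuming PyTorch's nn.CosineEmbeddingLoss on random embedding pairs:

```python
import torch
import torch.nn as nn

emb_1 = torch.randn(4, 32)
emb_2 = torch.randn(4, 32)

#  1 -> the pair should be aligned (cosine close to 1)
# -1 -> the pair should be at most `margin`-similar (pushed towards orthogonal)
y = torch.tensor([1., 1., -1., -1.])

loss = nn.CosineEmbeddingLoss(margin=0.0)(emb_1, emb_2, y)
```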