Loss Functions

MSELoss | AKA L2


MSE(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    (\bar{y}_1 - y_1)^2 \\
    (\bar{y}_2 - y_2)^2 \\
    ...             \\
    (\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T

However, it can be reduced to a scalar by taking either the sum or the mean of all the values.
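
A minimal PyTorch sketch (made-up values) of the three behaviours, switched with the `reduction` argument:

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([2.5, 0.0, 2.0, 8.0])   # predictions (made up)
y     = torch.tensor([3.0, -0.5, 2.0, 7.0])  # targets (made up)

per_element = nn.MSELoss(reduction="none")(y_hat, y)  # vector of squared errors
as_sum      = nn.MSELoss(reduction="sum")(y_hat, y)   # scalar: sum
as_mean     = nn.MSELoss(reduction="mean")(y_hat, y)  # scalar: mean (the default)

print(per_element)      # tensor([0.2500, 0.2500, 0.0000, 1.0000])
print(as_sum, as_mean)  # tensor(1.5000) tensor(0.3750)
```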

L1Loss

This measures the Mean Absolute Error (MAE).


L1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    |\bar{y}_1 - y_1| \\
    |\bar{y}_2 - y_2| \\
    ...               \\
    |\bar{y}_n - y_n| \\
\end{bmatrix}^T

This is more robust against outliers as their value is not squared.

However, it is not differentiable at $0$, hence the existence of SmoothL1Loss.

As with MSELoss, it can be reduced to a scalar.
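
A small sketch (made-up values) of the robustness argument: the outlier in the last position dominates the MSE because it gets squared, while L1 only sees its absolute value.

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([1.0, 2.0, 3.0, 100.0])  # last prediction is an outlier
y     = torch.tensor([1.0, 2.0, 3.0, 4.0])

print(nn.L1Loss()(y_hat, y))   # tensor(24.)   -> |96| / 4
print(nn.MSELoss()(y_hat, y))  # tensor(2304.) -> 96**2 / 4
```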

SmoothL1Loss | AKA Huber Loss

Note

Called Elastic Network when used as an objective function


SmoothL1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = \begin{cases}
    \frac{0.5 \cdot (\bar{y}_n -y_n)^2}{\beta}
    &\text{ if }
    |\bar{y}_n -y_n| < \beta \\

    |\bar{y}_n -y_n| - 0.5 \cdot \beta
    &\text{ if }
    |\bar{y}_n -y_n| \geq \beta
\end{cases}

This behaves like MSELoss for errors below a threshold ($\beta$) and like L1Loss otherwise.

It has the advantage of being differentiable everywhere and is very useful in computer vision.

As with MSELoss, it can be reduced to a scalar.
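
A sketch checking `nn.SmoothL1Loss` against the piecewise definition above ($\beta$ and the values are arbitrary):

```python
import torch
import torch.nn as nn

beta  = 1.0
y_hat = torch.tensor([0.2, 1.5, -3.0])
y     = torch.tensor([0.0, 0.0,  0.0])

loss = nn.SmoothL1Loss(beta=beta, reduction="none")(y_hat, y)

# Piecewise formula computed by hand
diff   = (y_hat - y).abs()
manual = torch.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)

print(loss)    # tensor([0.0200, 1.0000, 2.5000])
print(manual)  # same values
```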

L1 vs L2 For Image Classification

Usually, with an L2 loss we get a blurrier image than with an L1 loss. This comes from the fact that minimizing L2 pushes the prediction towards the average of the plausible values, which washes out the details.

Moreover, since L1 takes the absolute difference, its gradient is constant in magnitude and does not decrease as the error approaches $0$.
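
A quick sketch of this gradient behaviour on a single made-up error value:

```python
import torch
import torch.nn as nn

for err in (1.0, 0.1, 0.01):
    y_hat = torch.tensor([err], requires_grad=True)
    y     = torch.tensor([0.0])

    nn.L1Loss()(y_hat, y).backward()
    l1_grad = y_hat.grad.item()      # always +/- 1

    y_hat.grad = None
    nn.MSELoss()(y_hat, y).backward()
    l2_grad = y_hat.grad.item()      # 2 * err, shrinks towards 0

    print(f"error={err}: dL1={l1_grad:+.2f}  dL2={l2_grad:+.2f}")
# error=1.0: dL1=+1.00  dL2=+2.00
# error=0.1: dL1=+1.00  dL2=+0.20
# error=0.01: dL1=+1.00  dL2=+0.02
```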

NLLLoss[^1] | AKA Negative Log Likelihood Loss

This basically measures the distance of the predictions from the real class tags.


NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}

Here too, the vector can be reduced to a scalar:


NLLLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
    \frac{
        \sum^N_{n=1} l_n
    }{
        \sum^N_{n=1} w_{y_n}
    } & \text{ if mode = "mean"}\\
    \sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}

In PyTorch you also have the possibility to exclude some classes during training (via ignore_index). Moreover, it is possible to pass per-class weights (via weight), which is useful when dealing with an unbalanced training set.
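
A sketch of both options on made-up data (5 points, 3 classes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

log_probs = F.log_softmax(torch.randn(5, 3), dim=1)  # 5 points, 3 classes
targets   = torch.tensor([0, 2, 1, 2, 0])

weighted = nn.NLLLoss(weight=torch.tensor([1.0, 5.0, 1.0]))  # class 1 counts 5x
ignoring = nn.NLLLoss(ignore_index=2)                        # class 2 is excluded

print(weighted(log_probs, targets))
print(ignoring(log_probs, targets))
```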

Tip

So, what's \vec{\bar{y}}?

It's the tensor containing, for each point, the probability of belonging to each of the classes.

For example, let's say we have 10 points and 3 classes, then \vec{\bar{y}}_{p,c} is the probability of point p belonging to class c

This is why we have l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}: for each point, we take the entry corresponding to its actual class tag.

To get a clear idea, check this website[^1].
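
As a sketch of the 10-points / 3-classes example: for each point $p$, NLLLoss just picks the entry of $\vec{\bar{y}}$ at the true class tag and negates it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

y_bar = F.log_softmax(torch.randn(10, 3), dim=1)  # y_bar[p, c]: 10 points, 3 classes
y     = torch.randint(0, 3, (10,))                # true class tag of each point

per_point = nn.NLLLoss(reduction="none")(y_bar, y)
manual    = -y_bar[torch.arange(10), y]           # -y_bar[p, y_p]

print(torch.allclose(per_point, manual))  # True
```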

Warning

Technically speaking, the input should be log-probabilities, e.g. the output of a LogSoftmax layer. However, this is not enforced by PyTorch.

CrossEntropyLoss[^2]


CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
    l_1 \\
    l_2 \\
    ... \\
    l_n \\
\end{bmatrix}^T;\\

l_n = - w_{y_n} \cdot \ln\left(
    \frac{
        e^{\bar{y}_{n, y_n}}
    }{
        \sum_c e^{\bar{y}_{n, c}}
    }
\right)

Here too, the vector can be reduced to a scalar:


CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
    \frac{
        \sum^N_{n=1} l_n
    }{
        \sum^N_{n=1} w_{y_n}
    } & \text{ if mode = "mean"}\\
    \sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}

Note

This is basically NLLLoss with the LogSoftmax step built in: it takes raw scores instead of log-probabilities.
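
A sketch of this equivalence: feeding raw scores to CrossEntropyLoss gives the same result as LogSoftmax followed by NLLLoss.

```python
import torch
import torch.nn as nn

scores = torch.randn(8, 3)           # raw scores for 8 points, 3 classes
y      = torch.randint(0, 3, (8,))   # class tags

ce  = nn.CrossEntropyLoss()(scores, y)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(scores), y)

print(torch.allclose(ce, nll))  # True
```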

AdaptiveLogSoftmaxWithLoss

BCELoss | AKA Binary Cross Entropy Loss

KLDivLoss | AKA Kullback-Leibler Divergence Loss

BCEWithLogitsLoss

HingeEmbeddingLoss

MarginRankingLoss

TripletMarginLoss

SoftMarginLoss

MultiLabelMarginLoss

CosineEmbeddingLoss


[^1]: Remy Lau | Towards Data Science | 4th April 2025
[^2]: Anelli | Deep Learning PDF 4, pg. 11