Loss Functions
MSELoss | AKA L2
MSE(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
(\bar{y}_1 - y_1)^2 \\
(\bar{y}_2 - y_2)^2 \\
... \\
(\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T
It can be reduced to a scalar by taking either the sum of all the values or their mean.
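As a rough sketch of how this maps to PyTorch (the tensor values below are purely illustrative):

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([2.5, 0.0, 2.0, 8.0])  # predictions
y = torch.tensor([3.0, -0.5, 2.0, 7.0])     # targets

# reduction="none" keeps the per-element squared errors (the vector above)
per_element = nn.MSELoss(reduction="none")(y_hat, y)

# reduction="mean" / "sum" collapse that vector to a scalar
mean_loss = nn.MSELoss(reduction="mean")(y_hat, y)
sum_loss = nn.MSELoss(reduction="sum")(y_hat, y)

print(per_element)          # tensor([0.2500, 0.2500, 0.0000, 1.0000])
print(mean_loss, sum_loss)  # tensor(0.3750) tensor(1.5000)
```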
L1Loss
This measures the Mean Absolute Error
L1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
|\bar{y}_1 - y_1| \\
|\bar{y}_2 - y_2| \\
... \\
|\bar{y}_n - y_n| \\
\end{bmatrix}^T
This is more robust against outliers, as their error is not squared.
However, it is not differentiable at $0$, hence the existence of SmoothL1Loss.
As with MSELoss, it can be reduced to a scalar.
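A minimal sketch of that robustness, comparing the two losses on data containing an outlier (values are illustrative):

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([2.5, 0.0, 2.0, 100.0])  # last prediction is a large outlier
y = torch.tensor([3.0, -0.5, 2.0, 7.0])

l1 = nn.L1Loss(reduction="mean")(y_hat, y)   # |error| grows linearly with the outlier
l2 = nn.MSELoss(reduction="mean")(y_hat, y)  # error^2 lets the outlier dominate

print(l1)  # tensor(23.5000)
print(l2)  # tensor(2162.3750)
```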
SmoothL1Loss | AKA Huber Loss
Note
Called Elastic Network when used as an objective function.
SmoothL1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
\frac{0.5 \cdot (\bar{y}_n -y_n)^2}{\beta}
&\text{ if }
|\bar{y}_n -y_n| < \beta \\
|\bar{y}_n -y_n| - 0.5 \cdot \beta
&\text{ if }
|\bar{y}_n -y_n| \geq \beta
\end{cases}
This behaves like MSELoss for values under a threshold and like L1Loss otherwise.
It has the advantage of being differentiable everywhere and is very useful in computer vision (e.g. bounding-box regression).
As with MSELoss, it can be reduced to a scalar.
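A small sketch of the two regimes, assuming a PyTorch version where SmoothL1Loss exposes the beta argument:

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([0.3, 5.0])
y = torch.tensor([0.0, 0.0])

# beta is the threshold: quadratic below it, linear above it
loss = nn.SmoothL1Loss(reduction="none", beta=1.0)(y_hat, y)

# |0.3| < 1.0  -> 0.5 * 0.3^2 / 1.0 = 0.045
# |5.0| >= 1.0 -> 5.0 - 0.5 * 1.0   = 4.5
print(loss)  # tensor([0.0450, 4.5000])
```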
L1 vs L2 For Image Reconstruction
Usually with an L2 loss we get a blurrier image than with an L1 loss. This comes from the fact that L2 is minimized by the average of all plausible values, so it tends to average the possible outputs rather than commit to one.
Moreover, since L1 takes the absolute difference, its gradient is constant over all values and does not decrease towards $0$.
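A quick sketch of that gradient behaviour, using the functional losses with sum reduction (values are illustrative):

```python
import torch
import torch.nn.functional as F

y = torch.zeros(3)

for err in (4.0, 1.0, 0.1):
    y_hat = torch.full((3,), err, requires_grad=True)

    F.l1_loss(y_hat, y, reduction="sum").backward()
    grad_l1 = y_hat.grad.clone()

    y_hat.grad = None
    F.mse_loss(y_hat, y, reduction="sum").backward()
    grad_l2 = y_hat.grad

    # the L1 gradient stays at 1.0 regardless of the error size,
    # while the MSE gradient (2 * error) shrinks as the error approaches 0
    print(err, grad_l1[0].item(), grad_l2[0].item())
```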
NLLLoss [1]
This is the Negative Log Likelihood loss: essentially, the distance of the predictions from the real class tags.
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
Here too, the vector can be reduced to a scalar:
NLLLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\frac{
\sum^N_{n=1} l_n
}{
\sum^N_{n=1} w_{y_n}
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
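A quick sketch showing that, with reduction="none" and unit weights, NLLLoss just picks out minus the entry at the true class of each point (shapes are illustrative):

```python
import torch
import torch.nn as nn

log_probs = torch.log_softmax(torch.randn(10, 3), dim=1)  # 10 points, 3 classes
targets = torch.randint(0, 3, (10,))                      # true class tag per point

manual = -log_probs[torch.arange(10), targets]            # minus the entry at the true class
built_in = nn.NLLLoss(reduction="none")(log_probs, targets)

print(torch.allclose(manual, built_in))  # True
```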
Technically speaking, in PyTorch you also have the possibility to exclude some classes from the loss (the ignore_index argument) and to pass per-class weights (the weight argument), which is useful when dealing with an unbalanced training set.
Tip
So, what's $\vec{\bar{y}}$? It's the tensor containing the probability of each point belonging to each class. For example, say we have 10 points and 3 classes: then $\vec{\bar{y}}_{p,c}$ is the probability of point $p$ belonging to class $c$. This is why we have $l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}$: we take the error over the actual class tag of that point. To get a clear idea, check this website [1].
Warning
Technically speaking, the input data should come from a log-likelihood function like LogSoftmax. However, this is not enforced by PyTorch.
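A minimal sketch putting these pieces together: a LogSoftmax in front, per-class weights, and ignore_index to exclude one class (all values illustrative):

```python
import torch
import torch.nn as nn

raw_scores = torch.randn(5, 3)                 # 5 points, 3 classes (raw network outputs)
targets = torch.tensor([0, 2, 1, 2, 0])        # true class tags

log_probs = nn.LogSoftmax(dim=1)(raw_scores)   # NLLLoss expects log-probabilities

loss_fn = nn.NLLLoss(
    weight=torch.tensor([1.0, 2.0, 0.5]),  # per-class weights for unbalanced data
    ignore_index=2,                        # points whose tag is 2 are excluded
    reduction="mean",                      # weighted mean, as in the formula above
)
print(loss_fn(log_probs, targets))
```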
CrossEntropyLoss [2]
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln\left(
\frac{
e^{\bar{y}_{n, y_n}}
}{
\sum_c e^{\bar{y}_{n, c}}
}
\right)
Here too, the vector can be reduced to a scalar:
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\frac{
\sum^N_{n=1} l_n
}{
\sum^N_{n=1} w_{y_n}
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
Note
This is essentially NLLLoss with a LogSoftmax applied internally, so it can be fed raw scores (logits) directly.
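A quick sketch of that equivalence (shapes and values are illustrative):

```python
import torch
import torch.nn as nn

raw_scores = torch.randn(8, 4)           # 8 points, 4 classes (raw scores / logits)
targets = torch.randint(0, 4, (8,))

ce = nn.CrossEntropyLoss()(raw_scores, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(raw_scores), targets)

print(torch.allclose(ce, nll))  # True: CrossEntropyLoss = LogSoftmax + NLLLoss
```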
AdaptiveLogSoftmaxWithLoss
BCELoss | AKA Binary Cross Entropy Loss
KLDivLoss | AKA Kullback-Leibler Divergence Loss
BCEWithLogitsLoss
HingeEmbeddingLoss
MarginRankingLoss
TripletMarginLoss
SoftMarginLoss
MultiLabelMarginLoss
CosineEmbeddingLoss
- Anelli | Deep Learning PDF 4, pg. 11