Loss Functions
MSELoss | AKA L2
MSE(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
(\bar{y}_1 - y_1)^2 \\
(\bar{y}_2 - y_2)^2 \\
... \\
(\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T
Though, it can be reduced to a scalar by taking
either the sum of all the values or their mean.
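For instance, a quick PyTorch sketch (tensor values are made up) showing the three reduction modes:

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([2.5, 0.0, 2.0])   # predictions (made-up values)
y     = torch.tensor([3.0, -0.5, 2.0])  # targets

# reduction='none' keeps the element-wise vector from the formula above
print(nn.MSELoss(reduction='none')(y_hat, y))  # tensor([0.2500, 0.2500, 0.0000])
print(nn.MSELoss(reduction='mean')(y_hat, y))  # tensor(0.1667)
print(nn.MSELoss(reduction='sum')(y_hat, y))   # tensor(0.5000)
```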
L1Loss
This measures the Mean Absolute Error
L1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
|\bar{y}_1 - y_1| \\
|\bar{y}_2 - y_2| \\
... \\
|\bar{y}_n - y_n| \\
\end{bmatrix}^T
This is more robust against outliers as their value is not squared.
However, it is not differentiable at zero, hence the existence of SmoothL1Loss.
As with MSELoss, it can be reduced to a scalar.
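A quick sketch (values made up) of how a single outlier blows up MSE while affecting L1 much less:

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([1.0, 2.0, 100.0])  # last prediction is an outlier
y     = torch.tensor([1.5, 2.5, 3.0])

print(nn.L1Loss()(y_hat, y))   # mean |error|   -> ~32.67
print(nn.MSELoss()(y_hat, y))  # mean error^2  -> ~3136.5, dominated by the outlier
```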
SmoothL1Loss | AKA Huber Loss
Note
Called Elastic Network when used as an objective function.
SmoothL1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
\frac{0.5 \cdot (\bar{y}_n -y_n)^2}{\beta}
&\text{ if }
|\bar{y}_n -y_n| < \beta \\
|\bar{y}_n -y_n| - 0.5 \cdot \beta
&\text{ if }
|\bar{y}_n -y_n| \geq \beta
\end{cases}
This behaves like MSELoss for errors under a threshold (\beta) and like L1Loss otherwise.
It has the advantage of being differentiable everywhere
and is very useful for computer vision.
As with MSELoss, it can be reduced to a scalar.
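A minimal sketch (beta and values are made up) showing how beta controls the switch between the quadratic and linear regimes:

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([0.2, 3.0])
y     = torch.tensor([0.0, 0.0])

# |error| < beta -> quadratic (MSE-like), |error| >= beta -> linear (L1-like)
loss = nn.SmoothL1Loss(beta=1.0, reduction='none')
print(loss(y_hat, y))  # tensor([0.0200, 2.5000])
```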
L1 vs L2 For Image Classification
Usually with an L2 loss we get a blurrier image than
with an L1 loss. This comes from the fact that
minimizing L2 pushes the prediction towards the average
of all plausible values, which washes out details.
Moreover, since L1 takes the absolute difference, its
gradient is constant over all values and does not
decrease towards 0.
NLLLoss1
This basically measures the distance of the predictions from the real class tags: for each point, it picks the prediction for the true class, negates it, and weights it.
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \bar{y}_{n, y_n}
Even here there's the possibility to reduce the vector to a scalar:
NLLLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\sum^N_{n=1} \frac{
l_n
}{
\sum^N_{n=1} w_n
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
Technically speaking, in PyTorch you have the
possibility to exclude some classes during
training (via ignore_index). Moreover, it is possible to pass
per-class weights, which is useful when dealing
with an unbalanced training set.
Tip
So, what's \vec{\bar{y}}? It's the tensor containing the probability of a point belonging to those classes. For example, let's say we have 10 points and 3 classes; then \bar{y}_{p,c} is the probability of point p belonging to class c. This is why we have
l_n = - w_n \cdot \bar{y}_{n, y_n}: in fact, we take the error over the actual class tag of that point. To get a clear idea, check this website1
Warning
Technically speaking, the input data should come from a log-likelihood layer like LogSoftmax. However, this is not enforced by PyTorch.
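A minimal sketch of the typical usage (the class weights and sizes are made-up choices): the input is passed through LogSoftmax first, as the warning above suggests.

```python
import torch
import torch.nn as nn

logits  = torch.randn(10, 3)              # 10 points, 3 classes
log_p   = nn.LogSoftmax(dim=1)(logits)    # log-probabilities, as recommended above
targets = torch.randint(0, 3, (10,))      # true class tag of each point

# weight: per-class weights for unbalanced datasets (made-up values)
# ignore_index: points labelled with this class are excluded from the loss
criterion = nn.NLLLoss(weight=torch.tensor([1.0, 2.0, 0.5]), ignore_index=-100)
print(criterion(log_p, targets))
```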
CrossEntropyLoss2
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \ln\left(
\frac{
e^{\bar{y}_{n, y_n}}
}{
\sum_c e^{\bar{y}_{n, c}}
}
\right)
Even here there's the possibility to reduce the vector to a scalar:
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\sum^N_{n=1} \frac{
l_n
}{
\sum^N_{n=1} w_n
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
Note
This is basically NLLLoss with LogSoftmax already applied to the input for you, so it can be fed raw logits.
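A small sketch checking the note above: CrossEntropyLoss on raw logits matches LogSoftmax followed by NLLLoss.

```python
import torch
import torch.nn as nn

logits  = torch.randn(10, 3)
targets = torch.randint(0, 3, (10,))

ce  = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
print(torch.allclose(ce, nll))  # True
```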
AdaptiveLogSoftmaxWithLoss
This is an approximate method to train models with large output spaces on GPUs.
It is usually used when we have many classes with an imbalanced label distribution
in our training set (a few frequent classes and many rare ones).
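A rough sketch of the PyTorch module (the sizes and cutoffs are arbitrary assumptions); classes are assumed to be indexed by decreasing frequency, so the cutoffs split them into a frequent head and rarer tail clusters.

```python
import torch
import torch.nn as nn

# 64-dim hidden states, 10_000 classes split into a head and two tail clusters
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=64, n_classes=10_000,
                                    cutoffs=[100, 1_000])

hidden = torch.randn(32, 64)               # batch of hidden representations
target = torch.randint(0, 10_000, (32,))   # class indices

out = asm(hidden, target)  # named tuple (output, loss)
print(out.loss)            # mean negative log likelihood over the batch
```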
BCELoss | AKA Binary Cross Entropy Loss
BCELoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \left(
y_n \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1 - \bar{y}_n)}
\right)
This is a special case of Cross Entropy Loss with just 2 classes
Even here we can reduce with either mean or sum modifiers
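A minimal sketch (values made up); note that the predictions must already be probabilities, e.g. coming out of a Sigmoid.

```python
import torch
import torch.nn as nn

probs   = torch.sigmoid(torch.randn(8))      # predictions in (0, 1)
targets = torch.randint(0, 2, (8,)).float()  # binary labels

print(nn.BCELoss()(probs, targets))                 # mean reduction (default)
print(nn.BCELoss(reduction='sum')(probs, targets))  # sum reduction
```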
KLDivLoss | AKA Kullback-Leibler Divergence Loss
KLDivLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = y_n \cdot \ln{
\frac{
y_n
}{
\bar{y}_n
}
} = y_n \cdot \left(
\ln{(y_n)} - \ln{(\bar{y}_n)}
\right)
This is just the Kullback-Leibler divergence.
It is used because we are predicting the distribution \vec{y} by using \vec{\bar{y}}.
Caution
This method assumes you have probabilities, but it does not enforce the use of Softmax or LogSoftmax, which can lead to numerical instabilities.
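A minimal sketch of the usual safe pattern addressing the caution above: feed log-probabilities as input and probabilities as target (reduction='batchmean' divides the summed divergence by the batch size).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pred_logits   = torch.randn(4, 5)
target_logits = torch.randn(4, 5)

log_pred     = F.log_softmax(pred_logits, dim=1)  # \ln(\bar{y}): log-probabilities
target_probs = F.softmax(target_logits, dim=1)    # y: probabilities

print(nn.KLDivLoss(reduction='batchmean')(log_pred, target_probs))
```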
BCEWithLogitsLoss
BCEWithLogitsLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \left(\,
y_n \ln{\sigma(\bar{y}_n)} + (1 - y_n)
\cdot
\ln{(1 - \sigma(\bar{y}_n))}\,
\right)
This is basically BCELoss with a built-in Sigmoid layer, combined into a single step to deal with numerical instabilities.
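A quick sketch: the loss is applied directly to raw logits, and pos_weight (a made-up value here) can up-weight the positive class on unbalanced data.

```python
import torch
import torch.nn as nn

logits  = torch.randn(8)                     # raw scores, no Sigmoid applied
targets = torch.randint(0, 2, (8,)).float()

stable = nn.BCEWithLogitsLoss()(logits, targets)
naive  = nn.BCELoss()(torch.sigmoid(logits), targets)
print(torch.allclose(stable, naive))         # True (up to numerical precision)

# pos_weight > 1 penalises missed positives more (useful when positives are rare)
print(nn.BCEWithLogitsLoss(pos_weight=torch.tensor(3.0))(logits, targets))
```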
HingeEmbeddingLoss
HingeEmbeddingLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
\bar{y}_n \; \; & y_n = 1 \\
max(0, margin - \bar{y}_n) & y_n = -1
\end{cases}
In order to understand this type of loss, let's reason
as the model does: our objective is to reduce
the loss.
By observing the loss, we see that if we
predict high values for positive classes we get a
high loss.
At the same time, if we predict low values
for negative classes, we also get a high loss.
So, what the model will do is:
- predict low outputs for positive classes
- predict high outputs for negative classes
This makes the two classes more distant
from each other,
and brings the points of each class closer together.
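A minimal sketch (distances and labels are made up), where the inputs are e.g. pairwise distances and the targets are ±1:

```python
import torch
import torch.nn as nn

distances = torch.tensor([0.3, 1.8, 0.2, 0.9])    # e.g. distances between pairs
labels    = torch.tensor([1.0, -1.0, -1.0, 1.0])  # 1 = similar pair, -1 = dissimilar

loss = nn.HingeEmbeddingLoss(margin=1.0, reduction='none')
print(loss(distances, labels))  # tensor([0.3000, 0.0000, 0.8000, 0.9000])
```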
MarginRankingLoss
MarginRankingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = max\left(
0,\, -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n} ) + margin \,
\right)
\vec{\bar{y}}_1 and \vec{\bar{y}}_2 are the vectors
of predicted scores for each point belonging to class 1 or class 2
(both positive),
while \vec{y} is the vector of labels.
As before, our goal is to minimize the loss, thus we
want the term -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n}) to be negative and larger in magnitude than the
margin:
y_n = 1 \rightarrow \bar{y}_{1,n} >> \bar{y}_{2,n}
y_n = -1 \rightarrow \bar{y}_{2,n} >> \bar{y}_{1,n}
Tip
Let's say we are trying to use this for classification: we can cheat a bit to make the model more robust by putting all the correct predictions in \vec{\bar{y}}_1 and, in \vec{\bar{y}}_2, only the highest wrong prediction repeated n times.
TripletMarginLoss3
TripletMarginLoss(\vec{a}, \vec{p}, \vec{n}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = max\left(
0,\,
d(a_n, p_n) - d(a_n, n_n) + margin \,
\right)
Here we have:
- \vec{a}: the anchor point, which represents a class
- \vec{p}: a positive example, i.e. a point in the same class as \vec{a}
- \vec{n}: a negative example, i.e. a point in another class with respect to \vec{a}
Optimizing here means keeping similar points close to each other and pushing dissimilar points further from each other, with the latter being the most important thing to do (as it is the only negative term in the equation).
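A minimal sketch (embeddings are random, margin made up); each row is an embedding and d is the Euclidean distance by default (p=2).

```python
import torch
import torch.nn as nn

anchor   = torch.randn(16, 128)   # 16 anchor embeddings
positive = torch.randn(16, 128)   # same-class embeddings
negative = torch.randn(16, 128)   # other-class embeddings

loss = nn.TripletMarginLoss(margin=1.0, p=2)
print(loss(anchor, positive, negative))
```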
SoftMarginLoss
SoftMarginLoss(\vec{\bar{y}}, \vec{y}) =
\sum_i \frac{
\ln \left( 1 + e^{-y_i \cdot \bar{y}_i} \right)
}{
N
}
This loss gives only positive results, thus the
optimization consists in reducing e^{-y_i \cdot \bar{y}_i}.
Since \vec{y} has only 1 or -1 as values, our strategy
is to make:
y_i = 1 \rightarrow \bar{y}_i >> 0
y_i = -1 \rightarrow \bar{y}_i << 0
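A minimal sketch (scores made up); note how the confident, correctly-signed score contributes almost nothing.

```python
import torch
import torch.nn as nn

scores = torch.tensor([4.0, -0.5])
labels = torch.tensor([1.0, 1.0])

# ln(1 + e^{-y*score}): ~0.018 for the confident correct score, ~0.974 for the wrong one
print(nn.SoftMarginLoss(reduction='none')(scores, labels))
```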
MultiClassHingeLoss | AKA MultiLabelMarginLoss
MultiClassHingeLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\sum_{i,j} \frac{
max(0, 1 - (\bar{y}_{n,y_{n,j}} - \bar{y}_{n,i}) \,)
}{
\text{num\_of\_classes}
}
\; \;\forall i,j \text{ with } i \neq j
Essentially, it works like the hinge loss above, but over multiple classes, much like CrossEntropyLoss.
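A minimal sketch (scores made up): each target row lists the correct class indices first and is terminated by -1 (here classes 3 and 0 are correct).

```python
import torch
import torch.nn as nn

scores = torch.tensor([[0.1, 0.2, 0.4, 0.8]])
# correct classes are 3 and 0; -1 ends the list, later entries are ignored
target = torch.tensor([[3, 0, -1, 1]])

print(nn.MultiLabelMarginLoss()(scores, target))  # ~0.85
```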
CosineEmbeddingLoss
CosineEmbeddingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
1 - \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} & y_n = 1 \\
max(0, \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} - margin)
& y_n = -1
\end{cases}
With this loss we do two things:
- bring the angle between \bar{y}_{n,1} and \bar{y}_{n,2} to 0 when y_n = 1
- bring the angle between \bar{y}_{n,1} and \bar{y}_{n,2} to \pi when y_n = -1, or to \frac{\pi}{2} if only positive values of \cos are allowed, making them orthogonal
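A minimal sketch (vectors random, margin made up); each row is an embedding and the loss works on the cosine between paired rows.

```python
import torch
import torch.nn as nn

x1 = torch.randn(8, 64)
x2 = torch.randn(8, 64)
y  = torch.tensor([1., 1., -1., -1., 1., -1., 1., -1.])  # 1 = similar, -1 = dissimilar

loss = nn.CosineEmbeddingLoss(margin=0.0)
print(loss(x1, x2, y))
```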
- Anelli | Deep Learning PDF 4, pg. 11