Loss Functions
MSELoss | AKA L2
MSE(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
(\bar{y}_1 - y_1)^2 \\
(\bar{y}_2 - y_2)^2 \\
... \\
(\bar{y}_n - y_n)^2 \\
\end{bmatrix}^T
Though, it can be reduced to a scalar by taking
either the sum or the mean of all the values.
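A minimal PyTorch sketch of the three reduction modes (the numbers are toy values, just for illustration):

```python
import torch
from torch import nn

y_hat = torch.tensor([2.5, 0.0, 2.0, 8.0])   # toy predictions
y     = torch.tensor([3.0, -0.5, 2.0, 7.0])  # toy targets

# reduction='none' keeps the per-element vector from the formula above
print(nn.MSELoss(reduction='none')(y_hat, y))  # tensor([0.2500, 0.2500, 0.0000, 1.0000])
print(nn.MSELoss(reduction='sum')(y_hat, y))   # tensor(1.5000)
print(nn.MSELoss(reduction='mean')(y_hat, y))  # tensor(0.3750)
```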
L1Loss
This measures the Mean Absolute Error
L1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
|\bar{y}_1 - y_1| \\
|\bar{y}_2 - y_2| \\
... \\
|\bar{y}_n - y_n| \\
\end{bmatrix}^T
This is more robust against outliers, as their error is not squared.
However, it is not differentiable at $0$ and its gradient does not shrink for small errors, hence the existence of SmoothL1Loss.
As with MSELoss, it can be reduced to a scalar.
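A small sketch of the robustness claim, again with made-up numbers: a single outlier inflates the MSE far more than the L1 loss.

```python
import torch
from torch import nn

y_hat = torch.tensor([1.0, 1.0, 1.0, 1.0])
y     = torch.tensor([1.1, 0.9, 1.0, 11.0])  # the last target is an outlier

# the outlier contributes 10**2 = 100 to the MSE, but only 10 to the L1 loss
print(nn.MSELoss()(y_hat, y))  # tensor(25.0050)
print(nn.L1Loss()(y_hat, y))   # tensor(2.5500)
```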
SmoothL1Loss | AKA Huber Loss
Note
Called Elastic Network when used as an objective function
SmoothL1(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
\frac{0.5 \cdot (\bar{y}_n -y_n)^2}{\beta}
&\text{ if }
|\bar{y}_n -y_n| < \beta \\
|\bar{y}_n -y_n| - 0.5 \cdot \beta
&\text{ if }
|\bar{y}_n -y_n| \geq \beta
\end{cases}
This behaves like MSELoss for errors under the threshold \beta and like L1Loss otherwise.
It has the advantage of being differentiable everywhere
and is very useful for computer vision.
As with MSELoss, it can be reduced to a scalar.
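A quick sketch of the two branches, using a toy \beta = 1 (PyTorch exposes it as the beta argument):

```python
import torch
from torch import nn

y_hat = torch.tensor([0.2, 3.0])
y     = torch.tensor([0.0, 0.0])

loss = nn.SmoothL1Loss(beta=1.0, reduction='none')
# |error| = 0.2 < beta  -> quadratic branch: 0.5 * 0.2**2 / 1.0 = 0.02
# |error| = 3.0 >= beta -> linear branch:    3.0 - 0.5 * 1.0   = 2.5
print(loss(y_hat, y))  # tensor([0.0200, 2.5000])
```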
L1 vs L2 For Image Classification
Usually with an L2 loss we get a blurrier image than
with an L1 loss. This comes from the fact that
L2 averages over all plausible values and does not respect
distances.
Moreover, since L1 takes the absolute difference, its gradient is
constant for all error values and does not
decrease towards $0$.
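A tiny autograd sketch of that last point: as the error shrinks, the L2 gradient shrinks with it while the L1 gradient stays at 1.

```python
import torch
from torch import nn

for err in (1.0, 0.1, 0.01):
    y_hat = torch.tensor([err], requires_grad=True)
    y = torch.zeros(1)

    nn.MSELoss()(y_hat, y).backward()
    l2_grad = y_hat.grad.item()        # d/dx x^2 = 2x, shrinks with the error

    y_hat.grad = None                  # reset before the second backward pass
    nn.L1Loss()(y_hat, y).backward()
    l1_grad = y_hat.grad.item()        # d/dx |x| = sign(x), stays at 1

    print(f"error={err}: L2 grad={l2_grad:.2f}, L1 grad={l1_grad:.2f}")
```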
NLLLoss1
Caution
Technically speaking, the input data should come from a LogLikelihood, like LogSoftmax. However, this is not enforced by PyTorch
This is basically the negative of the prediction for the
real class tag of each point, optionally weighted by \vec{w}.
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
Even here there's the possibility to reduce the vector to a scalar:
NLLLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\frac{
\sum^N_{n=1} l_n
}{
\sum^N_{n=1} w_{y_n}
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
Technically speaking, in PyTorch you have the
possibility to exclude some classes during
training (via ignore_index). Moreover, it's possible to pass
per-class weights \vec{w}, which is useful when dealing
with an unbalanced training set
Tip
So, what's \vec{\bar{y}}? It's the tensor containing the probability of a point belonging to each class.
For example, let's say we have 10 points and 3 classes; then \vec{\bar{y}}_{p,c} is the probability of point p belonging to class c.
This is why we have l_n = - w_{y_n}\cdot \bar{y}_{n, y_n}: in fact, we take the error over the actual class tag of that point.
To get a clear idea, check this website1
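A minimal sketch with the same shapes as the example above (10 points, 3 classes); the weights and the data are made up:

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(10, 3)             # 10 points, 3 classes (made-up scores)
y_hat  = F.log_softmax(scores, dim=1)   # log-probabilities, as NLLLoss expects
y      = torch.randint(0, 3, (10,))     # the class tag of each point

w = torch.tensor([1.0, 2.0, 0.5])       # per-class weights, e.g. for an unbalanced set
loss = nn.NLLLoss(weight=w)             # 'mean' reduction divides by the sum of w[y_n]
print(loss(y_hat, y))

# classes can also be excluded from training, e.g. nn.NLLLoss(ignore_index=2)
```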
Note
Rather than using weights to give more importance to certain classes (for example a higher weight to less frequent classes), there's a better method.
We can use circular buffers to sample an equal amount from all classes and then fine-tune at the end using the actual class frequencies.
CrossEntropyLoss2
Check here to see its formal derivation
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln\left(
\frac{
e^{\bar{y}_{n, y_n}}
}{
\sum_c e^{\bar{y}_{n, c}}
}
\right)
Even here there's the possibility to reduce the vector to a scalar:
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}, mode) = \begin{cases}
\frac{
\sum^N_{n=1} l_n
}{
\sum^N_{n=1} w_{y_n}
} & \text{ if mode = "mean"}\\
\sum^N_{n=1} l_n & \text{ if mode = "sum"}
\end{cases}
Note
This is basically NLLLoss with the LogSoftmax built in, so you can feed it raw scores (logits) directly
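A quick sketch of that equivalence with made-up logits: CrossEntropyLoss on raw scores matches LogSoftmax followed by NLLLoss.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)           # raw scores, no softmax applied
y = torch.tensor([0, 3, 1, 4])

ce  = nn.CrossEntropyLoss()(logits, y)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), y)
print(torch.allclose(ce, nll))       # True: CrossEntropyLoss = LogSoftmax + NLLLoss
```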
AdaptiveLogSoftmaxWithLoss
This is an approximate method to train models with large output spaces on GPUs.
It is usually used when we have many classes whose frequencies are
very imbalanced in our training set.
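A minimal sketch; the sizes (64 features, 10 000 classes) and the cutoffs that bucket classes by frequency are made-up values:

```python
import torch
from torch import nn

torch.manual_seed(0)
# classes below the first cutoff live in the full head, rarer ones go into cheaper clusters
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=64, n_classes=10_000,
                                    cutoffs=[100, 1_000], div_value=4.0)

hidden = torch.randn(32, 64)               # a batch of 32 hidden states
target = torch.randint(0, 10_000, (32,))   # class indices

out = asm(hidden, target)                  # namedtuple with .output and .loss
print(out.loss)
```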
BCELoss | AKA Binary Cross Entropy Loss
BCELoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \left(
y_n \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1 - \bar{y}_n)}
\right)
This is a special case of cross entropy loss with just 2 classes. Because of this, a single prediction \bar{y}_n can encode both class probabilities (\bar{y}_n and 1 - \bar{y}_n), hence the longer equation.
Even here we can reduce with either the mean or sum modifiers
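A minimal sketch with made-up probabilities (BCELoss expects its input already in [0, 1], hence the sigmoid):

```python
import torch
from torch import nn

torch.manual_seed(0)
y_hat = torch.sigmoid(torch.randn(6))         # probabilities in (0, 1)
y = torch.tensor([1., 0., 1., 1., 0., 0.])    # binary targets

print(nn.BCELoss()(y_hat, y))                 # mean reduction (the default)
print(nn.BCELoss(reduction='none')(y_hat, y)) # the per-element vector from the formula
```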
KLDivLoss | AKA Kullback-Leibler Divergence Loss
KLDivLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = y_n \cdot \ln{
\frac{
y_n
}{
\bar{y}_n
}
} = y_n \cdot \left(
\ln{(y_n)} - \ln{(\bar{y}_n)}
\right)
This is just the Kullback-Leibler divergence.
It is used because we are predicting the target distribution \vec{y} by using \vec{\bar{y}}
Caution
This method assumes you have
probabilities, but it does not enforce the use of Softmax or LogSoftmax, which can lead to numerical instabilities
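A minimal sketch of how it is typically fed in PyTorch: the input as log-probabilities (LogSoftmax) and the target as probabilities (Softmax); the data here is made up.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
y_hat_log = F.log_softmax(torch.randn(4, 5), dim=1)  # predicted distribution, in log-space
y = F.softmax(torch.randn(4, 5), dim=1)              # target distribution

kl = nn.KLDivLoss(reduction='batchmean')             # 'batchmean' matches the math definition
print(kl(y_hat_log, y))
```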
BCEWithLogitsLoss
BCEWithLogitsLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_n \cdot \left(\,
y_n \ln{\sigma(\bar{y}_n)} + (1 - y_n)
\cdot
\ln{(1 - \sigma(\bar{y}_n))}\,
\right)
This is basically a BCELoss with a Sigmoid layer built in, which deals with the numerical instabilities and keeps the numbers constrained to [0, 1]
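A quick sketch of that relationship with made-up logits: the two losses agree, but the fused version is the numerically safer one.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(6)                      # raw scores, no sigmoid applied
y = torch.tensor([1., 0., 1., 1., 0., 0.])

a = nn.BCEWithLogitsLoss()(logits, y)
b = nn.BCELoss()(torch.sigmoid(logits), y)   # same value, computed less stably
print(torch.allclose(a, b))                  # True
```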
HingeEmbeddingLoss
HingeEmbeddingLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
\bar{y}_n & \text{ if } y_n = 1 \\
\max(0,\, margin - \bar{y}_n) & \text{ if } y_n = -1
\end{cases}
In order to understand this type of loss, let's reason
as the model does: our objective is to reduce
the loss.
By observing the Loss, we get that if we
predict high values for positive classes we get a
high loss.
At the same time, we observe that if we predict low values
for negative classes, we get a high loss.
Now, what we'll do is:
- predict low outputs for positive classes
- predict high outputs for negative classes
This pushes the two classes further apart
from each other
and brings points of the same class closer together
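A small sketch where \bar{y} plays the role of a distance between two embeddings (the values are made up):

```python
import torch
from torch import nn

distances = torch.tensor([0.1, 0.9, 1.5, 0.2])  # made-up pairwise distances
labels    = torch.tensor([1., 1., -1., -1.])    # 1 = same class, -1 = different class

loss = nn.HingeEmbeddingLoss(margin=1.0, reduction='none')
# y = 1  -> the loss is the distance itself:       0.1, 0.9
# y = -1 -> the loss is max(0, margin - distance): max(0, -0.5) = 0, max(0, 0.8) = 0.8
print(loss(distances, labels))                  # tensor([0.1000, 0.9000, 0.0000, 0.8000])
```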
MarginRankingLoss
MarginRankingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = max\left(
0,\, -y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n} ) + margin \,
\right)
Here we have 2 predictions per item. The objective is to rank positive items with high values and vice versa.
As before, our goal is to minimize the loss, thus we
want y_n \cdot (\bar{y}_{1,n} - \bar{y}_{2,n}) to be larger than the
margin, so that the term inside the \max is negative:
- y_n = 1 \rightarrow \bar{y}_{1,n} \gg \bar{y}_{2,n}
- y_n = -1 \rightarrow \bar{y}_{2,n} \gg \bar{y}_{1,n}
By having a margin we ensure that the model doesn't cheat by making
\bar{y}_{1,n} = \bar{y}_{2,n}
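A minimal sketch with made-up scores; note how the pair with equal scores is still penalised by the margin.

```python
import torch
from torch import nn

y_hat_1 = torch.tensor([0.8, 0.2, 0.5])
y_hat_2 = torch.tensor([0.3, 0.6, 0.5])
y = torch.tensor([1., 1., -1.])      # 1: y_hat_1 should rank higher, -1: y_hat_2 should

loss = nn.MarginRankingLoss(margin=0.2, reduction='none')
# l_n = max(0, -y_n * (y_hat_1 - y_hat_2) + margin)
#     = max(0, -0.5 + 0.2), max(0, 0.4 + 0.2), max(0, 0.0 + 0.2)
print(loss(y_hat_1, y_hat_2, y))     # tensor([0.0000, 0.6000, 0.2000])
```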
Tip
Let's say we are trying to use this for classification: we can cheat a bit to make the model more robust by putting all the correct predictions in \vec{\bar{y}}_1 and, in \vec{\bar{y}}_2, only the highest wrong prediction repeated n times.
TripletMarginLoss3
TripletMarginLoss(\vec{a}, \vec{p}, \vec{n}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = max\left(
0,\,
d(a_n, p_n) - d(a_n, n_n) + margin \,
\right)
Here we have:
- \vec{a}: the anchor point, which represents a class
- \vec{p}: a positive example, i.e. a point in the same class as \vec{a}
- \vec{n}: a negative example, i.e. a point in another class with respect to \vec{a}
Optimizing here means bringing similar points close to each other and pushing dissimilar points further from each other, with the latter being the most important part (as it's the only negative term in the equation)
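A minimal sketch with made-up 128-dimensional embeddings; the positives are built as small perturbations of the anchors so the loss stays low.

```python
import torch
from torch import nn

torch.manual_seed(0)
anchor   = torch.randn(8, 128)                   # 8 made-up anchor embeddings
positive = anchor + 0.05 * torch.randn(8, 128)   # same-class points, close to the anchors
negative = torch.randn(8, 128)                   # points from other classes

loss = nn.TripletMarginLoss(margin=1.0, p=2)     # d(.,.) is the Euclidean distance here
print(loss(anchor, positive, negative))
```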
Note
This is how reverse image search used to work in Google
SoftMarginLoss
SoftMarginLoss(\vec{\bar{y}}, \vec{y}) =
\sum_i \frac{
\ln \left( 1 + e^{-y_i \cdot \bar{y}_i} \right)
}{
N
}
This loss gives only positive results, thus the
optimization consists in reducing e^{-y_i \cdot \bar{y}_i}.
Since \vec{y} has only 1 or -1 as values, our strategy
is to make:
- y_i = 1 \rightarrow \bar{y}_i \gg 0
- y_i = -1 \rightarrow \bar{y}_i \ll 0
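A small sketch with made-up scores: pairs where the sign of \bar{y}_i matches y_i contribute little to the loss.

```python
import torch
from torch import nn

y_hat = torch.tensor([2.0, -3.0, 0.5, -0.5])
y     = torch.tensor([1., -1., -1., 1.])   # targets are +1 / -1

# per element: ln(1 + exp(-y_i * y_hat_i)); the first two terms are small,
# the last two (wrong sign) are larger
print(nn.SoftMarginLoss()(y_hat, y))       # mean over the elements
```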
MultiClassHingeLoss | AKA MultiLabelMarginLoss
MultiClassHingeLoss(\vec{\bar{y}}, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n =
\sum_{i,j} \frac{
max(0, 1 - (\bar{y}_{n,y_{n,j}} - \bar{y}_{n,i}) \,)
}{
\text{num\_of\_classes}
}
\; \;\forall i,j \text{ with } i \neq y_{n,j}
Essentially, it works like the hinge loss, but with multiple target classes per point, similarly to CrossEntropyLoss
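A minimal sketch on one point with 4 classes; the target lists the relevant class indices and is padded with -1, following PyTorch's convention for nn.MultiLabelMarginLoss:

```python
import torch
from torch import nn

y_hat = torch.tensor([[0.1, 0.2, 0.4, 0.8]])   # scores for 4 classes, one point
y = torch.tensor([[3, 0, -1, -1]])             # true classes are 3 and 0; -1 is padding

# each true-class score is compared against the scores of the non-target classes,
# and the result is divided by the number of classes
print(nn.MultiLabelMarginLoss()(y_hat, y))     # tensor(0.8500)
```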
CosineEmbeddingLoss
CosineEmbeddingLoss(\vec{\bar{y}}_1, \vec{\bar{y}}_2, \vec{y}) =
\begin{bmatrix}
l_1 \\
l_2 \\
... \\
l_n \\
\end{bmatrix}^T;\\
l_n = \begin{cases}
1 - \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} & y_n = 1 \\
max(0, \cos{(\bar{y}_{n,1}, \bar{y}_{n,2})} - margin)
& y_n = -1
\end{cases}
With this loss we do two things:
- bring the angle between \bar{y}_{n,1} and \bar{y}_{n,2} to 0 when y_n = 1
- bring the angle between \bar{y}_{n,1} and \bar{y}_{n,2} to \frac{\pi}{2} when y_n = -1, making them orthogonal
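A minimal sketch with made-up 32-dimensional embeddings:

```python
import torch
from torch import nn

torch.manual_seed(0)
y_hat_1 = torch.randn(4, 32)           # 4 pairs of made-up embeddings
y_hat_2 = torch.randn(4, 32)
y = torch.tensor([1., -1., 1., -1.])   # 1: should point the same way, -1: should be orthogonal

loss = nn.CosineEmbeddingLoss(margin=0.0)
print(loss(y_hat_1, y_hat_2, y))
```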
- Anelli | Deep Learning PDF 4, pg. 11 ↩︎