## NLLLoss[^NLLLoss]

> [!CAUTION]
>
> Technically speaking, the `input` data should come
> from a `LogLikelihood` like
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `Pytorch`

This is basically the ***distance*** from the
real ***class tags***, optionally weighted by $\vec{w}$.

$$
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
\vdots \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
$$

Even here there's the possibility to reduce the vector
with either `mean` or `sum` modifiers

Technically speaking, in `Pytorch` you have the
possibility to ***exclude*** some `classes` during
training. Moreover, it's possible to pass
`weights`, $\vec{w}$, for `classes`, **useful when dealing
with an unbalanced training set**
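
A minimal sketch of how these options look in `Pytorch` (the class `weights`,
the batch and the ignored `class tag` below are made-up values):

```python
import torch
import torch.nn as nn

# 3 classes; the rare class 2 gets a larger (made-up) weight
weights = torch.tensor([1.0, 1.0, 3.0])

# `weight` rescales each class, `ignore_index` excludes a class tag from the loss
loss_fn = nn.NLLLoss(weight=weights, ignore_index=-100)

log_probs = nn.LogSoftmax(dim=1)(torch.randn(4, 3))  # input must be log-probabilities
targets = torch.tensor([0, 2, 1, -100])              # the last point is ignored

print(loss_fn(log_probs, targets))
```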

> [!TIP]
>
> **Each `point` `n` has an actual `class tag`
> `c`**
>
> This is why we have
> $l_n = - w_{y_n}\cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]

<!-- Comment to suppress linter -->

> [!NOTE]
> While you can use weights to give more importance to certain classes, or to
> give a higher weight to less frequent classes, there's a better method.
>
> We can use circular buffers to sample an equal amount from all classes and then
> fine-tune at the end by using the actual class frequencies.
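
One possible reading of that idea, as a rough sketch only (the buffer size and
the helper functions below are invented for illustration):

```python
import random
from collections import defaultdict, deque

# One circular buffer (deque) per class; old points are overwritten once full
buffers = defaultdict(lambda: deque(maxlen=1024))  # maxlen is an arbitrary choice

def add_point(x, class_tag):
    buffers[class_tag].append(x)

def balanced_batch(per_class=8):
    # Sampling the same number of points from every buffer gives a class-balanced batch
    batch = []
    for tag, buf in buffers.items():
        k = min(per_class, len(buf))
        batch += [(x, tag) for x in random.sample(list(buf), k)]
    random.shuffle(batch)
    return batch
```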

## CrossEntropyLoss[^Anelli-CEL]

Check [here](./../15-Appendix-A/INDEX.md#cross-entropy-loss-derivation) to
see its formal derivation

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
\vdots \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln{\left(
    \frac{e^{\bar{y}_{n, y_n}}}{\sum_{c} e^{\bar{y}_{n, c}}}
\right)}
$$

> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) without needing a
> [log softmax](./../3-Activation-Functions/INDEX.md#logsoftmax)
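
A quick sanity check of this equivalence (random tensors, purely illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)           # raw scores, no softmax applied
targets = torch.tensor([0, 2, 1, 1])

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

assert torch.allclose(ce, nll)  # CrossEntropyLoss == LogSoftmax + NLLLoss
```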

## AdaptiveLogSoftmaxWithLoss

## BCELoss | aka Binary Cross Entropy Loss

$$
l_n = - w_n \cdot \left(
    y_n \cdot \ln{\bar{y}_n} + \left(1 - y_n\right) \cdot \ln{\left(1 - \bar{y}_n\right)}
\right)
$$

This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with
just 2 `classes`. Because of this we employ a trick to use a single variable
instead of 2 to represent the loss, thus the *longer equation*.
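
To see the single-variable trick numerically (random values, illustrative only):
when $y_n = 1$ only the $\ln{\bar{y}_n}$ term survives, when $y_n = 0$ only the
$\ln{\left(1 - \bar{y}_n\right)}$ term does.

```python
import torch
import torch.nn as nn

y_hat = torch.rand(5).clamp(1e-6, 1 - 1e-6)  # predicted probabilities
y = torch.randint(0, 2, (5,)).float()        # binary class tags in {0, 1}

# the two terms select the right log-probability depending on the class tag
manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()
assert torch.allclose(nn.BCELoss()(y_hat, y), manual)
```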

Even here we can reduce with either `mean` or `sum` modifiers

## KLDivLoss

$$
l_n = y_n \cdot \ln{\left(
    \frac{y_n}{\bar{y}_n}
\right)}
$$

This is just the ***[Kullback Leibler Divergence](./../15-Appendix-A/INDEX.md#kullback-leibler-divergence)***.

This is used because we are predicting the ***distribution*** $\vec{y}$ by using $\vec{\bar{y}}$

## BCEWithLogitsLoss

This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
with a
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer to deal with
***numerical instabilities and to constrain the numbers to $[0, 1]$***
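
A minimal check of this relationship with random logits (illustrative only):

```python
import torch
import torch.nn as nn

logits = torch.randn(5)
y = torch.randint(0, 2, (5,)).float()

with_logits = nn.BCEWithLogitsLoss()(logits, y)
sigmoid_then_bce = nn.BCELoss()(torch.sigmoid(logits), y)

# same value, but the first form is numerically safer
assert torch.allclose(with_logits, sigmoid_then_bce)
```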

## HingeEmbeddingLoss

$$
l_n = max\left(
    0, - y_n \cdot \left(\bar{y}_{1,n} - \bar{y}_{2,n}\right) + margin
\right)
$$

Here we have 2 ***vectors*** of predictions, $\vec{\bar{y}}_1$ and $\vec{\bar{y}}_2$,
while $\vec{y}$ is the ***vector*** of `labels`. The objective is to rank positive
items with high values and vice-versa.

As before, our goal is to ***minimize*** the `loss`, thus we
always want ***negative values*** that are ***larger*** (in absolute value) than the
$margin$:

- $y_n = 1 \rightarrow \bar{y}_{1,n} >> \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} >> \bar{y}_{1,n}$

By having a margin we ensure that the model doesn't cheat by making
$\bar{y}_{1,n} = \bar{y}_{2,n}$
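
A tiny numeric illustration of the per-`point` rule above (the scores are invented):

```python
margin = 1.0

def l_n(y1, y2, y):
    # zero loss only when the correct item wins by at least the margin
    return max(0.0, -y * (y1 - y2) + margin)

print(l_n(3.0, 0.5, +1))  # 0.0 -> correct ranking with a comfortable gap
print(l_n(2.0, 2.0, +1))  # 1.0 -> equal scores ("cheating") still pay the margin
print(l_n(0.5, 3.0, +1))  # 3.5 -> a wrong ranking is penalised even more
```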

> [!TIP]
> Let's say we are trying to use this for `classification`,
> we can ***cheat*** a bit to make the `model` more ***robust***

The ***latter*** is the most important thing to do (as it's
the only *negative* term in the equation)

> [!NOTE]
> This is how reverse image search used to work in Google

## SoftMarginLoss

$$
l_n = \ln{\left(
    1 + e^{- y_n \cdot \bar{y}_n}
\right)}
$$
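
A quick numerical check of this formula against `Pytorch`'s `SoftMarginLoss`
(random scores, labels in $\{-1, 1\}$; illustrative only):

```python
import torch
import torch.nn as nn

y_hat = torch.randn(6)                         # raw scores
y = torch.randint(0, 2, (6,)).float() * 2 - 1  # labels in {-1, +1}

manual = torch.log(1 + torch.exp(-y * y_hat)).mean()
assert torch.allclose(nn.SoftMarginLoss()(y_hat, y), manual)
```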

Essentially, it works as the [HingeLoss](#hingeembeddingloss), but
with ***multiple target classes*** as in
[CrossEntropyLoss](#crossentropyloss)

## CosineEmbeddingLoss

With this loss we do the following:

- bring the angle to $0$ between $\bar{y}_{n,1}$
  and $\bar{y}_{n,2}$ when $y_n = 1$
- bring the angle to $\frac{\pi}{2}$ between $\bar{y}_{n,1}$
  and $\bar{y}_{n,2}$ when $y_n = -1$, making them
  ***orthogonal***
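
A minimal usage sketch, assuming `Pytorch`'s `CosineEmbeddingLoss` (shapes and
`margin` are arbitrary): for $y_n = 1$ the per-`point` loss is $1 - \cos$, for
$y_n = -1$ it is $max(0, \cos - margin)$.

```python
import torch
import torch.nn as nn

a = torch.randn(4, 16)                    # first batch of vectors
b = torch.randn(4, 16)                    # second batch of vectors
y = torch.tensor([1.0, 1.0, -1.0, -1.0])  # 1 = should be similar, -1 = should not

print(nn.CosineEmbeddingLoss(margin=0.0)(a, b, y))
```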

[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)

[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11