Revised chapter 4

Christian Risi 2025-11-17 17:04:46 +01:00
parent 247daf4d56
commit 2a96deaebf

@@ -85,8 +85,15 @@ decrease towards $0$**
## NLLLoss[^NLLLoss]
> [!CAUTION]
>
> Technically speaking, the `input` data should contain
> log-probabilities, for example the output of
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `Pytorch`
This is basically the ***distance*** to the
real ***class tags***, optionally weighted by $\vec{w}$.
$$
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
@@ -96,7 +103,7 @@ NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
$$
Even here there's the possibility to reduce the vector
@@ -116,7 +123,7 @@ $$
Technically speaking, in `Pytorch` you have the
possibility to ***exclude*** some `classes` during
training. Moreover, it's possible to pass
`weights`, $\vec{w}$, for `classes`, **useful when dealing
with an unbalanced training set**
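A minimal sketch of the above (the tensors and the choice of excluded class are made up for illustration): in `nn.NLLLoss`, `weight` corresponds to the per-class weights $\vec{w}$ and `ignore_index` excludes a class from training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 3)                 # 4 points, 3 classes (made-up values)
log_probs = F.log_softmax(logits, dim=1)   # NLLLoss expects log-probabilities
targets = torch.tensor([0, 2, 1, 2])       # class tags y_n

weights = torch.tensor([1.0, 2.0, 1.0])    # \vec{w}, e.g. up-weight a rarer class
loss_fn = nn.NLLLoss(weight=weights, ignore_index=2, reduction="mean")

print(loss_fn(log_probs, targets).item())  # points with tag 2 are excluded
```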
> [!TIP]
@@ -132,23 +139,24 @@ with unbalanced training set**
> `c`**
>
> This is why we have
> $l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]
> [!NOTE]
>
> Rather than using weights to give more importance to certain classes
> (for example, a higher weight for less frequent classes), there's a
> better method.
>
> We can use circular buffers to sample an equal amount from all classes and
> then fine-tune at the end using the actual class frequencies.
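A minimal sketch of that balanced-sampling idea in plain Python: one small circular buffer per class, visited in round-robin so every class contributes equally to a batch. The buffer size, number of classes, and function names are made up for illustration.

```python
from collections import deque
from itertools import cycle

NUM_CLASSES, BUFFER_SIZE = 3, 64                       # hypothetical sizes
buffers = {c: deque(maxlen=BUFFER_SIZE) for c in range(NUM_CLASSES)}

def add_sample(x, y):
    """Push a (features, class tag) pair into the circular buffer of its class."""
    buffers[y].append((x, y))

def balanced_batch(batch_size):
    """Visit the non-empty class buffers in turn, cycling inside each buffer.

    Assumes at least one buffer already holds samples.
    """
    batch = []
    classes = cycle([c for c in buffers if buffers[c]])
    while len(batch) < batch_size:
        buf = buffers[next(classes)]
        buf.rotate(-1)                                 # advance the circular buffer
        batch.append(buf[0])
    return batch
```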
## CrossEntropyLoss[^Anelli-CEL]
Check [here](./../15-Appendix-A/INDEX.md#cross-entropy-loss-derivation) to
see its formal derivation
$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
@@ -182,8 +190,8 @@ $$
> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) without needing a separate
> [log softmax](./../3-Activation-Functions/INDEX.md#logsoftmax) layer,
> since it is applied internally
## AdaptiveLogSoftmaxWithLoss
@@ -208,7 +216,8 @@ l_n = - w_n \cdot \left(
\right)
$$
This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with
just 2 `classes`. Because of this we employ a trick and use a single variable
instead of 2 to represent the prediction, hence the *longer equation*.
Even here we can reduce with either `mean` or `sum` modifiers
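A minimal sketch, assuming this section refers to `nn.BCELoss`: a single probability per point stands for `class 1`, while $1 - \bar{y}_n$ implicitly stands for `class 0`; both reductions are shown (the numbers are made up).

```python
import torch
import torch.nn as nn

probs = torch.tensor([0.9, 0.2, 0.7])     # single predicted variable per point
targets = torch.tensor([1.0, 0.0, 1.0])   # binary class tags, as floats

print(nn.BCELoss(reduction="mean")(probs, targets).item())
print(nn.BCELoss(reduction="sum")(probs, targets).item())
```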
@@ -233,7 +242,7 @@ l_n = y_n \cdot \ln{
\right)
$$
This is just the ***[Kullback Leibler Divergence](./../15-Appendix-A/INDEX.md#kullback-leibler-divergence)***.
This is used because we are predicting the ***distribution*** $\vec{y}$ by using $\vec{\bar{y}}$
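A minimal sketch, assuming this section refers to `nn.KLDivLoss`: the prediction goes in as log-probabilities, the target $\vec{y}$ as a probability distribution (both tensors are made up).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
pred_log_probs = F.log_softmax(torch.randn(2, 4), dim=1)   # \bar{y}, in log space
target_probs = torch.tensor([[0.70, 0.10, 0.10, 0.10],
                             [0.25, 0.25, 0.25, 0.25]])    # y, a distribution

print(nn.KLDivLoss(reduction="batchmean")(pred_log_probs, target_probs).item())
```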
@@ -264,7 +273,7 @@ $$
This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
with a
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer to deal with
***numerical instabilities and to constrain values to [0, 1]***
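A small check of that claim (made-up logits): `nn.BCEWithLogitsLoss` on raw scores matches a `Sigmoid` followed by `nn.BCELoss`, while staying numerically stable for large logits.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.5, -1.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])

stable = nn.BCEWithLogitsLoss()(logits, targets)
naive = nn.BCELoss()(torch.sigmoid(logits), targets)
print(torch.allclose(stable, naive))  # True, up to floating-point error
```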
## HingeEmbeddingLoss
@@ -318,10 +327,8 @@ l_n = max\left(
\right)
$$
Here we have the predictions (scores) of 2 items. The objective is to rank
positive items with high values and negative items with low values.
As before, our goal is to ***minimize*** the `loss`, thus we
always want ***negative values*** that are ***larger*** than
@@ -330,6 +337,9 @@ $margin$:
- $y_n = 1 \rightarrow \bar{y}_{1,n} >> \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} >> \bar{y}_{1,n}$
By having a margin, we ensure that the model doesn't cheat by making
$\bar{y}_{1,n} = \bar{y}_{2,n}$.
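A minimal sketch, assuming this section refers to `nn.MarginRankingLoss` (scores and margin are made up): $y_n = 1$ asks the first score to beat the second by at least the margin, $y_n = -1$ asks for the opposite.

```python
import torch
import torch.nn as nn

scores_1 = torch.tensor([0.8, 0.1])   # \bar{y}_1
scores_2 = torch.tensor([0.2, 0.9])   # \bar{y}_2
y = torch.tensor([1.0, -1.0])         # which of the two should rank higher

loss = nn.MarginRankingLoss(margin=0.5)(scores_1, scores_2, y)
print(loss.item())  # 0.0 here: both pairs already satisfy the margin
```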
> [!TIP]
>
> Let's say we are trying to use this for `classification`:
> we can ***cheat*** a bit to make the `model` more ***robust***
@@ -368,6 +378,9 @@ Optimizing here means
***latter*** being the most important thing to do (as it's
the only *negative* term in the equation)
> [!NOTE]
>
> This is how reverse image search used to work at Google
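A minimal sketch of that retrieval setup, assuming the loss in question is `nn.TripletMarginLoss` (embedding sizes and tensors are made up): the anchor is pulled towards the positive embedding and pushed away from the negative one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
anchor = torch.randn(8, 128)     # embeddings of the query images
positive = torch.randn(8, 128)   # embeddings of matching images
negative = torch.randn(8, 128)   # embeddings of non-matching images

loss = nn.TripletMarginLoss(margin=1.0)(anchor, positive, negative)
print(loss.item())
```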
## SoftMarginLoss
$$
@@ -412,7 +425,7 @@ l_n =
$$
Essentially, it works like the [HingeLoss](#hingeembeddingloss), but
with ***multiple target classes*** as in
[CrossEntropyLoss](#crossentropyloss)
## CosineEmbeddingLoss
@@ -438,11 +451,11 @@ With this loss we make these things:
- bring the angle to $0$ between $\bar{y}_{n,1}$
  and $\bar{y}_{n,2}$ when $y_n = 1$
- bring the angle to $\frac{\pi}{2}$ between $\bar{y}_{n,1}$
  and $\bar{y}_{n,2}$ when $y_n = -1$, making them
  ***orthogonal***
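A minimal sketch, assuming this section refers to `nn.CosineEmbeddingLoss` (vectors are made up): $y_n = 1$ pulls the pair towards angle $0$, while $y_n = -1$ pushes their cosine similarity down to the margin ($0$ by default), i.e. towards orthogonality.

```python
import torch
import torch.nn as nn

a = torch.tensor([[1.0, 0.0], [1.0, 0.0]])   # \bar{y}_{n,1}
b = torch.tensor([[1.0, 0.1], [0.0, 1.0]])   # \bar{y}_{n,2}
y = torch.tensor([1.0, -1.0])                # similar / dissimilar labels

loss = nn.CosineEmbeddingLoss(margin=0.0)(a, b, y)
print(loss.item())  # small: pair 1 is nearly aligned, pair 2 is orthogonal
```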
[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)
[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11