Revised chapter 4

commit 2a96deaebf (parent 247daf4d56)

@@ -85,8 +85,15 @@ decrease towards $0$**

## NLLLoss[^NLLLoss]

> [!CAUTION]
>
> Technically speaking, the `input` data should come
> from a `LogLikelihood` like
> [LogSoftmax](./../3-Activation-Functions/INDEX.md#logsoftmax).
> However, this is not enforced by `Pytorch`

This is basically the ***distance*** towards the
real ***class tags***, optionally weighted by $\vec{w}$.

$$
NLLLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
l_2 \\
\vdots \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}
$$

Even here there's the possibility to reduce the vector
with the `mean` or `sum` modifiers

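To make the `LogSoftmax` + `NLLLoss` pairing and the reduction modes concrete,
here is a minimal `Pytorch` sketch (the sizes and values below are made up):

```python
import torch
import torch.nn as nn

log_softmax = nn.LogSoftmax(dim=1)
nll = nn.NLLLoss(reduction="mean")        # "sum" and "none" are also accepted

logits = torch.randn(4, 3)                # 4 points, 3 classes (raw scores)
targets = torch.tensor([0, 2, 1, 2])      # real class tags y_n

# NLLLoss expects log-probabilities, so the predictions go through LogSoftmax first
loss = nll(log_softmax(logits), targets)
print(loss)                               # a scalar, because reduction="mean"
```
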
@@ -116,7 +123,7 @@

Technically speaking, in `Pytorch` you have the
possibility to ***exclude*** some `classes` during
training. Moreover, it's possible to pass
`weights`, $\vec{w}$, for the `classes`, **useful when dealing
with an unbalanced training set**

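These two options map to the `ignore_index` and `weight` arguments of
`NLLLoss`; a quick sketch (the class count and numbers below are arbitrary):

```python
import torch
import torch.nn as nn

weights = torch.tensor([1.0, 5.0, 1.0])           # class 1 is rare, so it weighs more
nll = nn.NLLLoss(weight=weights, ignore_index=2)  # points tagged 2 are excluded

log_probs = nn.LogSoftmax(dim=1)(torch.randn(4, 3))
targets = torch.tensor([0, 1, 2, 1])              # the third point is ignored
print(nll(log_probs, targets))
```
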
> [!TIP]

@@ -132,23 +139,24 @@

> `c`**
>
> This is why we have
> $l_n = - w_{y_n} \cdot \bar{y}_{n, y_n}$.
> In fact, we take the error over the
> **actual `class tag` of that `point`**.
>
> To get a clear idea, check this website[^NLLLoss]

> [!NOTE]
>
> Using weights to give more importance to certain classes, or a higher
> weight to the less frequent ones, works, but there's a better method.
>
> We can use circular buffers to sample an equal amount from all classes and then
> fine-tune at the end using the actual class frequencies.

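A toy sketch of that circular-buffer idea (plain Python, not a `Pytorch` API;
the class names and sizes are invented):

```python
from collections import deque
from itertools import islice

def balanced_batches(samples_by_class, per_class, n_batches):
    # one circular buffer per class, cycled endlessly
    buffers = {c: deque(samples) for c, samples in samples_by_class.items()}
    for _ in range(n_batches):
        batch = []
        for buf in buffers.values():
            batch.extend(islice(buf, 0, per_class))  # take the next few samples
            buf.rotate(-per_class)                   # advance the circular buffer
        yield batch

data = {"cat": list(range(100)), "dog": list(range(100, 110))}  # very unbalanced
for batch in balanced_batches(data, per_class=4, n_batches=3):
    print(batch)  # every batch holds 4 cats and 4 dogs; dogs get recycled sooner
```
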
## CrossEntropyLoss[^Anelli-CEL]

Check [here](./../15-Appendix-A/INDEX.md#cross-entropy-loss-derivation) to
see its formal derivation

$$
CrossEntropyLoss(\vec{\bar{y}}, \vec{y}) = \begin{bmatrix}
l_1 \\
\vdots \\
l_n \\
\end{bmatrix}^T;\\
l_n = - w_{y_n} \cdot \ln{\left( \frac{e^{\bar{y}_{n, y_n}}}{\sum_{c} e^{\bar{y}_{n, c}}} \right)}
$$

@@ -182,8 +190,8 @@

> [!NOTE]
>
> This is basically [NLLLoss](#nllloss) without needing a
> [log softmax](./../3-Activation-Functions/INDEX.md#logsoftmax)

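A minimal sketch of that equivalence (sizes and values are made up):
`CrossEntropyLoss` on raw scores gives the same number as `LogSoftmax`
followed by `NLLLoss`.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # raw scores, no softmax applied
targets = torch.tensor([0, 2, 1, 2])

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
print(torch.allclose(ce, nll))        # True
```
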
## AdaptiveLogSoftmaxWithLoss

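A minimal usage sketch of `nn.AdaptiveLogSoftmaxWithLoss` (the feature size,
class count and cutoffs below are arbitrary): rarer classes are grouped into
clusters so the softmax over a huge output space stays cheap.

```python
import torch
import torch.nn as nn

asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=64, n_classes=10_000,
                                    cutoffs=[100, 1_000])  # frequent -> rare
features = torch.randn(8, 64)                 # 8 points, 64 features each
targets = torch.randint(0, 10_000, (8,))

out = asm(features, targets)
print(out.output.shape, out.loss)  # log-prob of each target class, plus the loss
```
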
@@ -208,7 +216,8 @@ l_n = - w_n \cdot \left(

$$
l_n = - w_n \cdot \left(
  y_n \cdot \ln{\bar{y}_n} + (1 - y_n) \cdot \ln{(1 - \bar{y}_n)}
\right)
$$

This is a ***special case of [Cross Entropy Loss](#crossentropyloss)*** with
just 2 `classes`. Because of this, we can use a single variable instead of 2
to represent the prediction, hence the *longer equation*.

Even here we can reduce with either the `mean` or `sum` modifiers

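A minimal sketch of that single-variable trick in `Pytorch` (the numbers are
made up): each point gets one probability, implicitly encoding both classes.

```python
import torch
import torch.nn as nn

probs = torch.tensor([0.9, 0.2, 0.7])     # P(class 1) for 3 points, already in [0, 1]
targets = torch.tensor([1.0, 0.0, 1.0])   # real tags, as floats

print(nn.BCELoss()(probs, targets))                   # mean reduction by default
print(nn.BCELoss(reduction="sum")(probs, targets))    # or "sum" / "none"
```
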
@@ -233,7 +242,7 @@ l_n = y_n \cdot \ln{

$$
l_n = y_n \cdot \ln{ \left( \frac{y_n}{\bar{y}_n} \right) }
$$

This is just the ***[Kullback Leibler Divergence](./../15-Appendix-A/INDEX.md#kullback-leibler-divergence)***.

This is used because we are predicting the ***distribution*** $\vec{y}$ by using $\vec{\bar{y}}$

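A minimal `Pytorch` sketch (the distributions are made up): `KLDivLoss` wants
the predictions as log-probabilities and the target as a proper distribution.

```python
import torch
import torch.nn as nn

pred_log = nn.LogSoftmax(dim=1)(torch.randn(2, 4))   # \bar{y}, as log-probs
target = torch.tensor([[0.10, 0.20, 0.30, 0.40],
                       [0.25, 0.25, 0.25, 0.25]])    # y, each row sums to 1

kl = nn.KLDivLoss(reduction="batchmean")             # the recommended reduction
print(kl(pred_log, target))
```
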
@@ -264,7 +273,7 @@

This is basically a [BCELoss](#bceloss--aka-binary-cross-entropy-loss)
with a
[Sigmoid](./../3-Activation-Functions/INDEX.md#sigmoid--aka-logistic) layer to deal with
***numerical instabilities*** and keep the numbers ***constrained to $[0, 1]$***

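A minimal sketch of why the built-in `Sigmoid` matters (values are made up):
feeding raw scores to `BCEWithLogitsLoss` matches `Sigmoid` + `BCELoss`, but is
computed in a numerically safer way.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.5, -1.0, 0.3])    # raw scores, not in [0, 1]
targets = torch.tensor([1.0, 0.0, 1.0])

a = nn.BCEWithLogitsLoss()(logits, targets)
b = nn.BCELoss()(torch.sigmoid(logits), targets)
print(torch.allclose(a, b))                # True (up to floating point error)
```
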
## HingeEmbeddingLoss

@@ -318,10 +327,8 @@ l_n = max\left(

$$
l_n = max\left(
  0, -y_n \cdot \left( \bar{y}_{1,n} - \bar{y}_{2,n} \right) + margin
\right)
$$

Here we have predictions for 2 items. The objective is to rank positive items
with high values and negative items with low values.

As before, our goal is to ***minimize*** the `loss`, thus we
always want ***negative values*** that are ***larger*** (in magnitude) than
$margin$:

- $y_n = 1 \rightarrow \bar{y}_{1,n} >> \bar{y}_{2,n}$
- $y_n = -1 \rightarrow \bar{y}_{2,n} >> \bar{y}_{1,n}$

By having a margin, we ensure that the model doesn't cheat by making
$\bar{y}_{1,n} = \bar{y}_{2,n}$

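The behaviour described above matches `Pytorch`'s `nn.MarginRankingLoss`
(treating that mapping as an assumption); a minimal sketch with made-up scores:

```python
import torch
import torch.nn as nn

y1 = torch.tensor([0.8, 0.3, 0.6])   # predictions for the first item of each pair
y2 = torch.tensor([0.2, 0.9, 0.5])   # predictions for the second item
y = torch.tensor([1.0, -1.0, 1.0])   # +1: first should rank higher, -1: the opposite

loss = nn.MarginRankingLoss(margin=0.5)(y1, y2, y)
print(loss)   # only pairs that violate the margin contribute
```
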
> [!TIP]
> Let's say we are trying to use this for `classification`;
> we can ***cheat*** a bit to make the `model` more ***robust***

@@ -368,6 +378,9 @@ Optimizing here means

***latter*** being the most important thing to do (as it's
the only *negative* term in the equation)

> [!NOTE]
>
> This is how reverse image search used to work at Google

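Assuming the loss discussed here is `Pytorch`'s `nn.TripletMarginLoss`
(an assumption based on the "only negative term" remark), a minimal sketch
with made-up embeddings:

```python
import torch
import torch.nn as nn

anchor = torch.randn(8, 128)     # embeddings of the query images
positive = torch.randn(8, 128)   # embeddings of matching images
negative = torch.randn(8, 128)   # embeddings of non-matching images

triplet = nn.TripletMarginLoss(margin=1.0, p=2)
print(triplet(anchor, positive, negative))  # pulls positives in, pushes negatives away
```
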
## SoftMarginLoss

$$
l_n = \ln{\left( 1 + e^{- y_n \cdot \bar{y}_n} \right)}
$$

@@ -412,7 +425,7 @@ l_n =

Essentially, it works as the [HingeLoss](#hingeembeddingloss), but
with ***multiple target classes***, as in
[CrossEntropyLoss](#crossentropyloss)

## CosineEmbeddingLoss

@@ -438,11 +451,11 @@ With this loss we make these things:

- bring the angle to $0$ between $\bar{y}_{n,1}$
  and $\bar{y}_{n,2}$ when $y_n = 1$
- bring the angle to $\frac{\pi}{2}$ between $\bar{y}_{n,1}$
  and $\bar{y}_{n,2}$ when $y_n = -1$, making them
  ***orthogonal***

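A minimal `Pytorch` sketch of this behaviour (sizes and values are made up):
$y = 1$ pulls the two vectors together, $y = -1$ pushes their cosine below the
margin (which defaults to $0$).

```python
import torch
import torch.nn as nn

v1 = torch.randn(4, 32)                     # first batch of embeddings
v2 = torch.randn(4, 32)                     # second batch of embeddings
y = torch.tensor([1.0, 1.0, -1.0, -1.0])    # which pairs should be similar

print(nn.CosineEmbeddingLoss()(v1, v2, y))
```
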
[^NLLLoss]: [Remy Lau | Towards Data Science | 4th April 2025](https://towardsdatascience.com/cross-entropy-negative-log-likelihood-and-all-that-jazz-47a95bd2e81/)

[^Anelli-CEL]: Anelli | Deep Learning PDF 4 pg. 11