Revised Chapter 3 and added definitions to appendix
@@ -1,5 +1,67 @@
# Appendix A

## Entropy[^wiki-entropy]

The entropy of a random variable gives us the *"surprise"* or *"informativeness"* of
learning its outcome.

You can think of it as asking: ***"How much can I learn from getting to know something
obvious?"***

As an example, you would be unsurprised to see that an apple left mid-air falls.
However, if it were to remain suspended, that would be mind-boggling!

Entropy captures this same sentiment from the actual probabilities: the higher its
value, the more surprising the outcomes are on average. Its formula is:

$$
H(\mathcal{X}) \coloneqq - \sum_{x \in \mathcal{X}} p(x) \log p(x)
$$

> [!NOTE]
> Technically speaking, another interpretation is the average number of bits needed to
> encode a random outcome, but in that case we use $\log_2$
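
To make this concrete, here is a minimal sketch (NumPy, with made-up coin
distributions; the `entropy` helper is only illustrative) showing that a fair coin is
maximally surprising while a heavily biased one carries almost no information:

```python
import numpy as np

def entropy(p: np.ndarray, base: float = 2.0) -> float:
    """Shannon entropy of a discrete distribution, skipping zero-probability outcomes."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p) / np.log(base)))

print(entropy(np.array([0.5, 0.5])))    # fair coin   -> 1.0 bit
print(entropy(np.array([0.99, 0.01])))  # biased coin -> ~0.08 bits
```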
## Kullback-Leibler Divergence

This value measures how much an estimated distribution $q$ diverges from the true
one $p$:

$$
D_{KL}(p || q) = \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
$$
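
As a small illustration (NumPy again, with made-up distributions; `kl_divergence` is a
hypothetical helper name), the divergence is zero when the estimate matches the true
distribution and grows as it drifts away:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(p || q) for discrete distributions; assumes q(x) > 0 wherever p(x) > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (made up)
q = np.array([0.5, 0.3, 0.2])   # estimated distribution (made up)
print(kl_divergence(p, p))      # 0.0
print(kl_divergence(p, q))      # ~0.085, growing as q drifts further from p
```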
## Cross-Entropy Loss Derivation[^wiki-cross-entropy]

The cross entropy measures the average *"surprise"* we get when outcomes drawn from the
true distribution $p$ are scored with the estimated distribution $q$. It is defined as
the entropy of $p$ plus the
[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$:

$$
\begin{aligned}
H(p, q) &= H(p) + D_{KL}(p || q) \\
&= - \sum_{x\in\mathcal{X}}p(x)\log p(x) +
\sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\
&= \sum_{x\in \mathcal{X}} p(x) \left(
\log \frac{p(x)}{q(x)} - \log p(x)
\right) \\
&= \sum_{x\in \mathcal{X}} p(x) \log \frac{1}{q(x)} \\
&= - \sum_{x\in \mathcal{X}} p(x) \log q(x)
\end{aligned}
$$
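
As a quick numerical sanity check of the derivation (a self-contained NumPy sketch with
the same made-up $p$ and $q$ as above, using natural logarithms), the two sides agree:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # true distribution (made up)
q = np.array([0.5, 0.3, 0.2])   # estimated distribution (made up)

h_p = -np.sum(p * np.log(p))          # H(p)
d_kl = np.sum(p * np.log(p / q))      # D_KL(p || q)
h_pq = -np.sum(p * np.log(q))         # -sum_x p(x) log q(x)

print(np.isclose(h_p + d_kl, h_pq))   # True: H(p, q) = H(p) + D_KL(p || q)
```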
In deep learning we usually don't work with full distributions but with a one-hot
target, so the sum collapses to the single term of the correct class and the
per-sample loss becomes:

$$
l_n = - \log \hat{y}_{n,c} \\
\hat{y}_{n,c} \coloneqq \text{predicted probability of the correct class } c \text{ for sample } n
$$

Usually $\hat{y}$ comes from applying a
[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since the loss uses a
logarithm and probabilities are at most 1, the closer the predicted probability of the
correct class is to 0, the higher the loss.
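
Putting the last two points together, here is a minimal sketch (NumPy, with made-up
logits; `cross_entropy_from_logits` is only an illustrative name) of the per-sample
loss as a softmax followed by the negative log-probability of the correct class:

```python
import numpy as np

def cross_entropy_from_logits(logits: np.ndarray, target: int) -> float:
    """l_n = -log softmax(logits)[target], with a max-shift for numerical stability."""
    shifted = logits - logits.max()                    # avoids overflow in exp
    probs = np.exp(shifted) / np.sum(np.exp(shifted))  # softmax
    return float(-np.log(probs[target]))

logits = np.array([2.0, 0.5, -1.0])                 # made-up scores for 3 classes
print(cross_entropy_from_logits(logits, target=0))  # ~0.24: the favoured class
print(cross_entropy_from_logits(logits, target=2))  # ~3.24: a low-probability class
```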
## Laplace Operator[^khan-1]

It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
@@ -20,3 +82,6 @@ It can also be used to compute the net flow of particles in that region of space
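
As a rough illustration of the definition $\nabla \cdot \nabla f$ (a NumPy
finite-difference sketch on a made-up 1-D grid), differencing the gradient once more
recovers the Laplacian:

```python
import numpy as np

# Made-up 1-D example: f(x) = x^2 sampled on a uniform grid; its Laplacian is the constant 2.
x = np.linspace(-1.0, 1.0, 201)
h = x[1] - x[0]
f = x ** 2

grad = np.gradient(f, h)          # approximate df/dx
laplacian = np.gradient(grad, h)  # divergence of the gradient, i.e. d^2f/dx^2
print(laplacian[100])             # ~2.0 in the interior of the grid
```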
[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)

[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)

[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))