# Appendix A
## Entropy[^wiki-entropy]
The entropy of a random variable gives us the *"surprise"* or *"informativeness"* of
knowing its outcome.

You can visualize it like this: ***"What can I learn from getting to know something
obvious?"***

As an example, you would be unsurprised to know that if you let go of an apple in
mid-air it falls. However, if it were to remain suspended, that would be
mind-boggling!

Entropy captures this same sentiment from the actual probability values: the lower
its value, the less surprising (i.e. the more predictable) the outcomes are on
average. Its formula is:

$$
H(\mathcal{X}) \coloneqq - \sum_{x \in \mathcal{X}} p(x) \log p(x)
$$

> [!NOTE]
> Technically speaking, another interpretation is the number of bits needed to
> encode the outcome of a random event, but in that case we use $\log_2$.
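
As a quick sanity check, here is a minimal Python sketch that evaluates the formula
above (using the natural logarithm, so the result is in nats) for a fair coin and a
heavily biased one; the biased coin is far more predictable, so its entropy is much
lower. The helper name `entropy` is just for illustration.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy H = -sum(p * log p), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


fair_coin = [0.5, 0.5]
biased_coin = [0.99, 0.01]  # almost always the same outcome, like the falling apple

print(f"fair coin:   {entropy(fair_coin):.4f} nats")    # ~0.69
print(f"biased coin: {entropy(biased_coin):.4f} nats")  # ~0.06
```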
## Kullback-Leibler Divergence
This value measures how much an estimated distribution $q$ differs from the real
one $p$:

$$
D_{KL}(p || q) = \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
$$
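
A minimal sketch of the formula in Python, comparing a true distribution $p$ against
a close estimate and a poor one; the poor estimate yields a much larger divergence.
The distributions and the helper name `kl_divergence` are made up for illustration.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(p || q) = sum(p * log(p / q)), in nats (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]          # "true" distribution
q_good = [0.6, 0.25, 0.15]   # close estimate
q_bad = [0.1, 0.1, 0.8]      # poor estimate

print(f"D_KL(p || q_good) = {kl_divergence(p, q_good):.4f}")  # small
print(f"D_KL(p || q_bad)  = {kl_divergence(p, q_bad):.4f}")   # large
```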
## Cross Entropy Loss Derivation[^wiki-cross-entropy]
The cross entropy measures the average *"surprise"* we get when events are drawn
from the true distribution $p$ but we evaluate them with the estimated distribution
$q$. It can be defined as the entropy of $p$ plus the
[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$:

$$
\begin{aligned}
H(p, q) &= H(p) + D_{KL}(p || q) = \\
&= - \sum_{x\in\mathcal{X}}p(x)\log p(x) +
\sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \\
&= \sum_{x\in \mathcal{X}} p(x) \left(
\log \frac{p(x)}{q(x)} - \log p(x)
\right) = \\
&= \sum_{x\in \mathcal{X}} p(x) \log \frac{1}{q(x)} = \\
&= - \sum_{x\in \mathcal{X}} p(x) \log q(x)
\end{aligned}
$$
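
To make the algebra concrete, here is a small numerical sketch checking that
$H(p, q) = H(p) + D_{KL}(p || q) = -\sum_x p(x) \log q(x)$ on an arbitrary pair of
distributions (the values are made up for illustration).

```python
import math

p = [0.7, 0.2, 0.1]     # true distribution
q = [0.6, 0.25, 0.15]   # estimated distribution

entropy_p = -sum(pi * math.log(pi) for pi in p)
kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# H(p, q) == H(p) + D_KL(p || q), up to floating-point error
print(cross_entropy, entropy_p + kl_pq)
assert math.isclose(cross_entropy, entropy_p + kl_pq)
```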
Since in deep learning we usually don't work with full distributions but with a
one-hot target (all the probability mass sits on the true class $c$) and the
predicted class probabilities, the sum collapses to a single term and the
per-sample loss becomes:

$$
l_n = - \log \hat{y}_{n,c} \\
\hat{y}_{n,c} \coloneqq \text{predicted probability of the true class } c \text{ for sample } n
$$

Usually $\hat{y}$ comes from using a
[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since the loss
uses a logarithm and probability values are at most $1$, the closer $\hat{y}_{n,c}$
is to $0$, the higher the loss.
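
Putting the pieces together, a minimal Python sketch of the per-sample loss: softmax
to obtain $\hat{y}$, then $-\log \hat{y}_{n,c}$ for the true class. The logits here
are made up; framework implementations such as PyTorch's `CrossEntropyLoss`
typically fuse these two steps and take raw logits directly.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(logits: list[float], target_class: int) -> float:
    """l_n = -log(y_hat[target_class]), with y_hat = softmax(logits)."""
    y_hat = softmax(logits)
    return -math.log(y_hat[target_class])


logits = [2.0, 0.5, -1.0]  # raw network outputs for 3 classes

# Confident and correct -> small loss; confident but wrong -> large loss.
print(cross_entropy_loss(logits, target_class=0))  # low
print(cross_entropy_loss(logits, target_class=2))  # high
```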
## Laplace Operator[^khan-1]
It is defined as $\nabla \cdot \nabla f \in \R$, i.e. the
**divergence of the gradient of the function**. Intuitively, it tells us
**how strongly a point behaves like a local maximum or minimum**.

Negative values mean that we are around a local maximum, positive values a local
minimum. The higher the magnitude, the more pronounced the maximum (or minimum).

Another way to see this is to view the gradient as a vector field and take its
divergence, which tells us whether the point acts as a point of attraction (a sink)
or of divergence (a source).

It can also be used to compute the net flow of particles out of that region of
space.
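
To see the sign intuition numerically, here is a small Python sketch that
approximates $\nabla \cdot \nabla f$ at a point with central finite differences
(just an illustrative approximation, not the matrix operator mentioned in the
caution below). A function with a local maximum at the origin gives a negative
value, one with a local minimum gives a positive value.

```python
def laplacian_2d(f, x: float, y: float, h: float = 1e-3) -> float:
    """Central finite-difference approximation of f_xx + f_yy at (x, y)."""
    d2f_dx2 = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    d2f_dy2 = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    return d2f_dx2 + d2f_dy2


def bump(x: float, y: float) -> float:
    """Has a local maximum at the origin."""
    return -(x**2 + y**2)


def bowl(x: float, y: float) -> float:
    """Has a local minimum at the origin."""
    return x**2 + y**2


print(laplacian_2d(bump, 0.0, 0.0))  # ~ -4 (negative: local maximum)
print(laplacian_2d(bowl, 0.0, 0.0))  # ~ +4 (positive: local minimum)
```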
> [!CAUTION]
> This is not the **discrete Laplace operator**, which is instead a **matrix** and
> has many other formulations.

[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)

[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)

[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))