Revised Chapter 3 and added definitions to appendix

This commit is contained in:
Christian Risi 2025-11-17 17:04:33 +01:00
parent e07a80649a
commit 247daf4d56
3 changed files with 84 additions and 11 deletions


@@ -1,5 +1,67 @@
# Appendix A
## Entropy[^wiki-entropy]
The entropy of a random variable gives us the *"surprise"* or *"informativeness"*
of knowing the result.
You can visualize it like this: ***"What can I learn from getting to know something
obvious?"***
As an example, you would be unsurprised to see that an apple left mid-air
falls. However, if it were to remain suspended, that would be mind-boggling!
Entropy captures this same sentiment from the actual values: the lower the
probability of an event, the more surprising it is, and its formula is:
$$
H(\mathcal{X}) \coloneqq - \sum_{x \in \mathcal{X}} p(x) \log p(x)
$$
> [!NOTE]
> Technically speaking, another interpretation is the number of bits needed to
> represent a random event happening, but in that case we use $\log_2$
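A minimal sketch of the formula in plain Python (the example distributions are
made up, and `entropy` is my own helper name):
```python
import math

def entropy(p: list[float], base: float = math.e) -> float:
    """Entropy of a discrete distribution; pass base=2 to count bits."""
    return -sum(px * math.log(px, base) for px in p if px > 0.0)

# A fair coin is maximally "surprising": 1 bit per toss
print(entropy([0.5, 0.5], base=2))    # 1.0
# A nearly-certain outcome carries almost no information
print(entropy([0.99, 0.01], base=2))  # ~0.08
```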
## Kullback-Leibler Divergence
This value measures how much an estimated distribution $q$ diverges from
the real one $p$:
$$
D_{KL}(p || q) = \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
$$
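Continuing the sketch above, a minimal implementation (it assumes
$q(x) > 0$ wherever $p(x) > 0$):
```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(p || q) for discrete distributions over the same support."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

# Identical distributions diverge by 0; a bad estimate grows the value
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ~0.51
```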
## Cross Entropy Loss derivation[^wiki-cross-entropy]
Cross entropy measures the *"surprise"* we get when events drawn from
distribution $p$ are scored with an estimated distribution $q$. It is defined
as the entropy of $p$ plus the
[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$
$$
\begin{aligned}
H(p, q) &= H(p) + D_{KL}(p || q) =\\
&= - \sum_{x\in\mathcal{X}}p(x)\log p(x) +
\sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \\
&= \sum_{x\in \mathcal{X}} p(x) \left(
\log \frac{p(x)}{q(x)} - \log p(x)
\right) = \\
&= \sum_{x\in \mathcal{X}} p(x) \log \frac{1}{q(x)} = \\
&= - \sum_{x\in \mathcal{X}} p(x) \log q(x)
\end{aligned}
$$
Since in deep learning we usually work with a one-hot target distribution
rather than a full one (the true class has probability $1$), the sum
collapses to a single term:
$$
l_n = - \log \hat{y}_{n,c} \\
\hat{y}_{n,c} \coloneqq \text{predicted probability of the true class } c
\text{ for sample } n
$$
Usually $\hat{y}$ comes from using a
[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since the loss
uses a logarithm and probabilities are at most $1$, the closer $\hat{y}_{n,c}$ is
to $0$, the higher the loss
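As a sketch of how this collapsed form is computed in practice (the logits and
target below are made-up values; the max-shift is the usual log-sum-exp
stabilization):
```python
import math

def cross_entropy_loss(logits: list[float], target: int) -> float:
    """-log(softmax(logits)[target]), with the usual max-shift for
    numerical stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -(logits[target] - log_sum_exp)

# Confident and correct -> low loss; confident and wrong -> high loss
print(cross_entropy_loss([4.0, 0.5, 0.1], target=0))  # ~0.05
print(cross_entropy_loss([4.0, 0.5, 0.1], target=2))  # ~3.95
```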
## Laplace Operator[^khan-1]
It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
@@ -20,3 +82,6 @@ It can also be used to compute the net flow of particles in that region of space
[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)
[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)
[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))


@@ -96,8 +96,9 @@ $$
RReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
\vec{a} \cdot x \text{ if } x < 0
\end{cases} \\
a_{i,j} \sim U(l, u): \;l < u \wedge l, u \in [0, 1[
$$
It is not ***differentiable***, but at $0$ we usually set the value to $\vec{a}$ or $1$, though any value between them is
@@ -108,19 +109,22 @@ $$
\frac{d\,RReLU(x)}{dx} &=
\begin{cases}
1 \text{ if } x \geq 0 \\
\vec{a} \text{ if } x < 0
\end{cases} \\
a_{i,j} \sim U(l, u)&: \;l < u \wedge l, u \in [0, 1[
\end{aligned}
$$
Here $\vec{a}$ is a **random** parameter that is
**always sampled** during **training** and **fixed** **always sampled** during **training** and **fixed**
during **tests and inference** to $\frac{l + u}{2}$ during **tests and inference** to $\frac{l + u}{2}$
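A rough single-value sketch of this sampling rule (the function name is mine,
and the bounds $l = \frac{1}{8}$, $u = \frac{1}{3}$ mirror PyTorch's defaults):
```python
import random

def rrelu(x: float, lower: float = 1 / 8, upper: float = 1 / 3,
          training: bool = True) -> float:
    """RReLU on a scalar: random negative slope while training,
    fixed to the mean (l + u) / 2 for tests and inference."""
    if x >= 0:
        return x
    a = random.uniform(lower, upper) if training else (lower + upper) / 2
    return a * x

print(rrelu(2.0))                   # 2.0: positive values pass through
print(rrelu(-2.0))                  # random, between -2/3 and -1/4
print(rrelu(-2.0, training=False))  # always -2 * (1/8 + 1/3) / 2 ~ -0.458
```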
### ELU
This function allows the outputs to average close to $0$, so training may
converge faster
$$
ELU(x) =
\begin{cases}
@@ -207,6 +211,9 @@ space of maneuver.
### Softplus
This is a smoothed version of [ReLU](#relu) and, as such, outputs only
positive values
$$
Softplus(x) =
\frac{1}{\beta} \cdot
@@ -224,7 +231,7 @@ to **constrain the output to positive values**.
The **larger $\beta$**, the **more similar to [ReLU](#relu)**
$$
\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta x}}{e^{\beta x} + 1}
$$
For **numerical stability**, when $x \cdot \beta > \text{threshold}$ the
implementation **reverts back to a linear function**
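A single-value sketch of this fallback ($\beta = 1$ and $threshold = 20$
mirror PyTorch's defaults):
```python
import math

def softplus(x: float, beta: float = 1.0, threshold: float = 20.0) -> float:
    """Softplus with the linear fallback described above."""
    if beta * x > threshold:
        # exp would overflow here, and (1/beta) * log(1 + e^(beta*x))
        # is numerically indistinguishable from x anyway
        return x
    return math.log1p(math.exp(beta * x)) / beta

print(softplus(1.0))    # ~1.31
print(softplus(100.0))  # 100.0, via the linear branch
```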
@@ -232,12 +239,16 @@
### GELU[^GELU]
This function saturates towards $0$ over negative values, behaving like a
smooth ramp.
$$
GELU(x) = x \cdot \Phi(x) \\
\Phi(x) = P(X \leq x), \quad X \sim \mathcal{N}(0, 1)
$$
This can be considered as a **smooth [ReLU](#relu)**,
however it's **not monotonic**
$$
\frac{d\,GELU(x)}{dx} = \Phi(x) + x \cdot \varphi(x)
$$
Here $\varphi$ is the probability density function of $\mathcal{N}(0, 1)$
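A sketch using the exact normal CDF via `math.erf` (some libraries use a tanh
approximation instead, which is omitted here):
```python
import math

def gelu(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi written in terms of erf."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

print(gelu(1.0))   # ~0.841
print(gelu(-1.0))  # ~-0.159: the dip below 0 is why it is not monotonic
print(gelu(-5.0))  # ~-0.000: saturates towards 0 over negative values
```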
@@ -388,7 +399,7 @@ Hardtanh(x) =
$$
It is not ***differentiable***, but
**works well with small values around $0$**.
$$
\frac{d\,Hardtanh(x)}{dx} =


@@ -1,3 +0,0 @@
# Dealing with imbalances
<!-- TODO: pdf 4 pg. 8-9 -->