Revised Chapter 3 and added definitions to appendix

parent e07a80649a
commit 247daf4d56

@@ -1,5 +1,67 @@

# Appendix A

## Entropy[^wiki-entropy]

The entropy of a random value gives us the *"surprise"* or *"informativeness"* of
knowing the result.

You can visualize it like this: ***"What can I learn from getting to know something
obvious?"***

As an example, you would be unsurprised to know that if you leave an apple mid-air
it falls. However, if it were to remain suspended, that would be mind-boggling!

Entropy captures this same sentiment from the actual probabilities: the less likely
an event, the more surprising it is, and the entropy is the expected surprise. Its
formula is:

$$
H(\mathcal{X}) \coloneqq - \sum_{x \in \mathcal{X}} p(x) \log p(x)
$$

> [!NOTE]
> Technically speaking, another interpretation is the amount of bits needed to
> represent a random event happening, but in that case we use $\log_2$
## Kullback-Leibler Divergence

This value gives us the difference in distribution between an estimation $q$
and the real one $p$:

$$
D_{KL}(p || q) = \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
$$
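
A minimal sketch of the formula (hypothetical helper; it assumes $q(x) > 0$
wherever $p(x) > 0$, otherwise the divergence is infinite):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(p || q) = sum p(x) log(p(x) / q(x)) over the support of p."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.368 nats: q poorly estimates p
```
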
## Cross Entropy Loss derivation[^wiki-cross-entropy]

A cross entropy is the measure of *"surprise"* we get when events are distributed
as $p$ but we score them with the estimated distribution $q$. It is defined as the
entropy of $p$ plus the
[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$

$$
\begin{aligned}
H(p, q) &= H(p) + D_{KL}(p || q) = \\
&= - \sum_{x\in\mathcal{X}}p(x)\log p(x) +
\sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \\
&= \sum_{x\in \mathcal{X}} p(x) \left(
\log \frac{p(x)}{q(x)} - \log p(x)
\right) = \\
&= \sum_{x\in \mathcal{X}} p(x) \log \frac{1}{q(x)} = \\
&= - \sum_{x\in \mathcal{X}} p(x) \log q(x)
\end{aligned}
$$
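
The identity can also be checked numerically; a small sketch with made-up
distributions (not from the notes):

```python
import math

p = [0.7, 0.2, 0.1]  # "real" distribution
q = [0.5, 0.3, 0.2]  # estimated distribution

h_pq = -sum(px * math.log(qx) for px, qx in zip(p, q))      # H(p, q)
h_p = -sum(px * math.log(px) for px in p)                   # H(p)
d_kl = sum(px * math.log(px / qx) for px, qx in zip(p, q))  # D_KL(p || q)

assert abs(h_pq - (h_p + d_kl)) < 1e-12  # H(p, q) == H(p) + D_KL(p || q)
```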

Since in deep learning we usually don't work with full distributions, but with a
one-hot target (all the probability mass on the correct class $c$), the sum
collapses to a single term and the loss for sample $n$ becomes:

$$
l_n = - \log \hat{y}_{n,c} \\
\hat{y}_{n,c} \coloneqq \text{predicted probability of the correct class}
$$

Usually $\hat{y}$ comes from using a
[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since it uses a
logarithm and probability values are at most 1, the closer the predicted
probability is to 0, the higher the loss
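
A minimal sketch of the per-sample loss computed straight from the logits
(the log-sum-exp shift is a standard stability trick; the names are illustrative,
not from the notes):

```python
import math

def cross_entropy_loss(logits: list[float], target: int) -> float:
    """l_n = -log softmax(logits)[target]; subtracting max(logits) avoids overflow."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]  # == -log(e^{z_c} / sum_j e^{z_j})

print(cross_entropy_loss([2.0, 0.5, -1.0], target=0))  # confident and right: ~0.24
print(cross_entropy_loss([2.0, 0.5, -1.0], target=2))  # confident and wrong: ~3.24
```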

## Laplace Operator[^khan-1]

It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
@@ -20,3 +82,6 @@ It can also be used to compute the net flow of particles in that region of space

[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)

[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)

[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))

@@ -96,8 +96,9 @@
RReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
\vec{a} \cdot x \text{ if } x < 0
\end{cases} \\
a_{i,j} \sim U (l, u): \;l < u \wedge l, u \in [0, 1[
$$

It is not ***differentiable*** at $0$, but there we usually put the value $\vec{a}$ or $1$, though any value between them is

@@ -108,19 +109,22 @@ $$
\begin{aligned}
\frac{d\,RReLU(x)}{dx} &=
\begin{cases}
1 \text{ if } x \geq 0 \\
\vec{a} \text{ if } x < 0
\end{cases} \\
a_{i,j} \sim U (l, u)&: \;l < u \wedge l, u \in [0, 1[
\end{aligned}
$$

Here $\vec{a}$ is a **random** parameter that is
**always sampled** during **training** and **fixed**
during **tests and inference** to $\frac{l + u}{2}$
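
A minimal sketch of this train/eval behaviour (element-wise on a list; the defaults
$l = 1/8$, $u = 1/3$ are an assumption borrowed from common library
implementations, not from the notes):

```python
import random

def rrelu(x: list[float], l: float = 1/8, u: float = 1/3,
          training: bool = True) -> list[float]:
    """Positive inputs pass through; negative ones are scaled by a ~ U(l, u)
    resampled during training, or by the fixed mean (l + u) / 2 at test time."""
    def slope() -> float:
        return random.uniform(l, u) if training else (l + u) / 2
    return [xi if xi >= 0 else slope() * xi for xi in x]

print(rrelu([-1.0, 2.0]))                  # negative slope resampled every call
print(rrelu([-1.0, 2.0], training=False))  # deterministic: a = (l + u) / 2
```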

### ELU

This function lets the outputs average around $0$, thus it may
converge faster

$$
ELU(x) =
\begin{cases}
@@ -207,6 +211,9 @@ space of maneuver.

### Softplus

This is a smoothed version of a [ReLU](#relu) and as such it outputs only positive
values

$$
Softplus(x) =
\frac{1}{\beta} \cdot
@@ -224,7 +231,7 @@ to **constrain the output to positive values**.

The **larger $\beta$**, the **more similar it is to [ReLU](#relu)**

$$
\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta x}}{e^{\beta x} + 1}
$$

For **numerical stability**, when $x \cdot \beta > \text{threshold}$ the
@@ -232,12 +239,16 @@ implementation **reverts back to a linear function**
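
A sketch of that fallback (the `threshold=20` default mirrors common library
implementations and is an assumption here, not from the notes):

```python
import math

def softplus(x: float, beta: float = 1.0, threshold: float = 20.0) -> float:
    """(1/beta) * log(1 + exp(beta * x)), switching to the identity once
    beta * x is large enough that exp() would overflow anyway."""
    if beta * x > threshold:
        return x  # log(1 + e^{beta x}) ~= beta x here, so the output is just x
    return math.log1p(math.exp(beta * x)) / beta

print(softplus(0.0))    # log(2) ~ 0.693: the smooth corner of the ReLU
print(softplus(100.0))  # 100.0, via the linear branch, no overflow
```
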
### GELU[^GELU]

This function behaves like a ramp that saturates over negative values.

$$
GELU(x) = x \cdot \Phi(x) \\
\Phi(x) = P(X \leq x), \; X \sim \mathcal{N}(0, 1)
$$
|
||||||
|
|
||||||
This can be considered as a **smooth [ReLU](#relu)**,
|
This can be considered as a **smooth [ReLU](#relu)**,
|
||||||
however it's **not monothonic**
|
however it's **not monothonic**
|

$$
\frac{d\,GELU(x)}{dx} = \Phi(x) + x\cdot \phi(x)
$$

where $\phi(x)$ is the probability density of $\mathcal{N}(0, 1)$; for a continuous
variable $P(X = x)$ is $0$, so the density is what appears in the derivative
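
Since $\Phi$ has a closed form through the error function, both GELU and its
derivative fit in a short sketch (illustrative, using only the standard library):

```python
import math

def std_normal_cdf(x: float) -> float:
    """Phi(x) for N(0, 1), expressed via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    return x * std_normal_cdf(x)

def gelu_grad(x: float) -> float:
    """Phi(x) + x * phi(x), where phi is the N(0, 1) density."""
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return std_normal_cdf(x) + x * pdf

print(gelu(-3.0))  # ~ -0.004: saturates towards 0 on the negative side
print(gelu(3.0))   # ~ 2.996: close to the identity on the positive side
```
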
@@ -388,7 +399,7 @@ Hardtanh(x) =
$$

It is not ***differentiable***, but
**works well with small values around $0$**.

$$
\frac{d\,Hardtanh(x)}{dx} =

@@ -1,3 +0,0 @@
# Dealing with imbalances

<!-- TODO: pdf 4 pg. 8 -9 -->