Revised Chapter 3 and added definitions to appendix

This commit is contained in:
Christian Risi 2025-11-17 17:04:33 +01:00
parent e07a80649a
commit 247daf4d56
3 changed files with 84 additions and 11 deletions


@@ -1,5 +1,67 @@
# Appendix A
## Entropy[^wiki-entropy]
The entropy of a random variable gives us the *"surprise"* or *"informativeness"*
of knowing the result.
You can visualize it like this: ***"What can I learn from getting to know something
obvious?"***
As an example, you would be unsurprised to see that an apple left mid-air
falls. However, if it were to remain suspended, that would be mind-boggling!
Entropy captures this same sentiment from the actual values: the lower the
probability of an event, the more surprising it is, and its formula is:
$$
H(\mathcal{X}) \coloneqq - \sum_{x \in \mathcal{X}} p(x) \log p(x)
$$
> [!NOTE]
> Technically speaking, another interpretation is the number of bits needed to
> represent a random event happening, but in that case we use $\log_2$
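A minimal sketch of the formula in plain Python (the example distributions are
made up, and `entropy` is my own helper name):
```python
import math

def entropy(p: list[float], base: float = math.e) -> float:
    """Entropy of a discrete distribution; pass base=2 to count bits."""
    return -sum(px * math.log(px, base) for px in p if px > 0.0)

# A fair coin is maximally "surprising": 1 bit per toss
print(entropy([0.5, 0.5], base=2))    # 1.0
# A nearly-certain outcome carries almost no information
print(entropy([0.99, 0.01], base=2))  # ~0.08
```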
## Kullback-Leibler Divergence
This value measures how much an estimated distribution $q$ diverges from
the real one $p$:
$$
D_{KL}(p || q) = \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
$$
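Continuing the sketch above, a minimal implementation (it assumes
$q(x) > 0$ wherever $p(x) > 0$):
```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(p || q) for discrete distributions over the same support."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

# Identical distributions diverge by 0; a bad estimate grows the value
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ~0.51
```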
## Cross Entropy Loss derivation[^wiki-cross-entropy]
Cross entropy measures the *"surprise"* we get when events drawn from
distribution $p$ are scored with an estimated distribution $q$. It is defined
as the entropy of $p$ plus the
[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$
$$
\begin{aligned}
H(p, q) &= H(p) + D_{KL}(p || q) =\\
&= - \sum_{x\in\mathcal{X}}p(x)\log p(x) +
\sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \\
&= \sum_{x\in \mathcal{X}} p(x) \left(
\log \frac{p(x)}{q(x)} - \log p(x)
\right) = \\
&= \sum_{x\in \mathcal{X}} p(x) \log \frac{1}{q(x)} = \\
&= - \sum_{x\in \mathcal{X}} p(x) \log q(x)
\end{aligned}
$$
Since in deep learning we usually work with a one-hot target distribution
rather than a full one (the true class has probability $1$), the sum
collapses to a single term:
$$
l_n = - \log \hat{y}_{n,c} \\
\hat{y}_{n,c} \coloneqq \text{predicted probability of the true class } c
\text{ for sample } n
$$
Usually $\hat{y}$ comes from using a
[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since the loss
uses a logarithm and probabilities are at most $1$, the closer $\hat{y}_{n,c}$ is
to $0$, the higher the loss
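As a sketch of how this collapsed form is computed in practice (the logits and
target below are made-up values; the max-shift is the usual log-sum-exp
stabilization):
```python
import math

def cross_entropy_loss(logits: list[float], target: int) -> float:
    """-log(softmax(logits)[target]), with the usual max-shift for
    numerical stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -(logits[target] - log_sum_exp)

# Confident and correct -> low loss; confident and wrong -> high loss
print(cross_entropy_loss([4.0, 0.5, 0.1], target=0))  # ~0.05
print(cross_entropy_loss([4.0, 0.5, 0.1], target=2))  # ~3.95
```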
## Laplace Operator[^khan-1]
It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
@@ -20,3 +82,6 @@ It can also be used to compute the net flow of particles in that region of space
[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)
[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)
[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))


@@ -96,8 +96,9 @@ $$
RReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
\vec{a} \cdot x \text{ if } x < 0
\end{cases} \\
a_{i,j} \sim U(l, u): \;l < u \wedge l, u \in [0, 1[
$$
It is not ***differentiable***, but at $0$ we usually set the value to $\vec{a}$ or $1$, though any value between them is
@@ -108,19 +109,22 @@ $$
\frac{d\,RReLU(x)}{dx} &=
\begin{cases}
1 \text{ if } x \geq 0 \\
\vec{a} \text{ if } x < 0
\end{cases} \\
a_{i,j} \sim U(l, u)&: \;l < u \wedge l, u \in [0, 1[
\end{aligned}
$$
Here $\vec{a}$ is a **random** parameter that is
**always sampled** during **training** and **fixed** **always sampled** during **training** and **fixed**
during **tests and inference** to $\frac{l + u}{2}$ during **tests and inference** to $\frac{l + u}{2}$
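A rough single-value sketch of this sampling rule (the function name is mine,
and the bounds $l = \frac{1}{8}$, $u = \frac{1}{3}$ mirror PyTorch's defaults):
```python
import random

def rrelu(x: float, lower: float = 1 / 8, upper: float = 1 / 3,
          training: bool = True) -> float:
    """RReLU on a scalar: random negative slope while training,
    fixed to the mean (l + u) / 2 for tests and inference."""
    if x >= 0:
        return x
    a = random.uniform(lower, upper) if training else (lower + upper) / 2
    return a * x

print(rrelu(2.0))                   # 2.0: positive values pass through
print(rrelu(-2.0))                  # random, between -2/3 and -1/4
print(rrelu(-2.0, training=False))  # always -2 * (1/8 + 1/3) / 2 ~ -0.458
```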
### ELU
This function allows the outputs to average close to $0$, so training may
converge faster
$$
ELU(x) =
\begin{cases}
@@ -207,6 +211,9 @@ space of maneuver.
### Softplus
This is a smoothed version of [ReLU](#relu) and, as such, outputs only
positive values
$$
Softplus(x) =
\frac{1}{\beta} \cdot
@@ -224,7 +231,7 @@ to **constrain the output to positive values**.
The **larger $\beta$**, the **more similar to [ReLU](#relu)**
$$
\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta x}}{e^{\beta x} + 1}
$$
For **numerical stability**, when $x \cdot \beta > \text{threshold}$ the
implementation **reverts back to a linear function**
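A single-value sketch of this fallback ($\beta = 1$ and $threshold = 20$
mirror PyTorch's defaults):
```python
import math

def softplus(x: float, beta: float = 1.0, threshold: float = 20.0) -> float:
    """Softplus with the linear fallback described above."""
    if beta * x > threshold:
        # exp would overflow here, and (1/beta) * log(1 + e^(beta*x))
        # is numerically indistinguishable from x anyway
        return x
    return math.log1p(math.exp(beta * x)) / beta

print(softplus(1.0))    # ~1.31
print(softplus(100.0))  # 100.0, via the linear branch
```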
@@ -232,12 +239,16 @@
### GELU[^GELU]
This function saturates towards $0$ over negative values, behaving like a
smooth ramp.
$$
GELU(x) = x \cdot \Phi(x) \\
\Phi(x) = P(X \leq x), \quad X \sim \mathcal{N}(0, 1)
$$
This can be considered as a **smooth [ReLU](#relu)**,
however it's **not monotonic**
$$
\frac{d\,GELU(x)}{dx} = \Phi(x) + x \cdot \varphi(x)
$$
Here $\varphi$ is the probability density function of $\mathcal{N}(0, 1)$
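A sketch using the exact normal CDF via `math.erf` (some libraries use a tanh
approximation instead, which is omitted here):
```python
import math

def gelu(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi written in terms of erf."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

print(gelu(1.0))   # ~0.841
print(gelu(-1.0))  # ~-0.159: the dip below 0 is why it is not monotonic
print(gelu(-5.0))  # ~-0.000: saturates towards 0 over negative values
```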
@@ -388,7 +399,7 @@ Hardtanh(x) =
$$
It is not ***differentiable***, but
**works well with small values around $0$**.
$$
\frac{d\,Hardtanh(x)}{dx} =


@@ -1,3 +0,0 @@
# Dealing with imbalances
<!-- TODO: pdf 4 pg. 8-9 -->