From 247daf4d56dbf2d5d504f7ed80c15ee983896cd8 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Mon, 17 Nov 2025 17:04:33 +0100
Subject: [PATCH] Revised Chapter 3 and added definitions to appendix

---
 Chapters/15-Appendix-A/INDEX.md          | 65 +++++++++++++++++++
 Chapters/3-Activation-Functions/INDEX.md | 27 +++++---
 .../DEALING-WITH-IMBALANCES.md           |  3 -
 3 files changed, 84 insertions(+), 11 deletions(-)
 delete mode 100644 Chapters/4-Loss-Functions/DEALING-WITH-IMBALANCES.md

diff --git a/Chapters/15-Appendix-A/INDEX.md b/Chapters/15-Appendix-A/INDEX.md
index 11ec8f8..d488d25 100644
--- a/Chapters/15-Appendix-A/INDEX.md
+++ b/Chapters/15-Appendix-A/INDEX.md
@@ -1,5 +1,67 @@
 # Appendix A
 
+## Entropy[^wiki-entropy]
+
+The entropy of a random variable gives us the *"surprise"* or *"informativeness"*
+of getting to know its result.
+
+You can visualize it like this: ***"What can I learn from getting to know something
+obvious?"***
+
+As an example, you would be unsurprised to see that if you leave an apple mid-air
+it falls. However, if it were to remain suspended, that would be mind-boggling!
+
+Entropy captures this same sentiment from the actual probability values:
+the higher its value, the more surprising, on average, the outcomes. Its formula is:
+
+$$
+H(\mathcal{X}) \coloneqq - \sum_{x \in \mathcal{X}} p(x) \log p(x)
+$$
+
+> [!NOTE]
+> Technically speaking, another interpretation is the number of bits needed to
+> represent a random event happening, but in that case we use $\log_2$
+
+## Kullback-Leibler Divergence
+
+This value gives us how much an estimated distribution $q$ differs from
+the real one $p$:
+
+$$
+D_{KL}(p || q) = \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
+$$
+
+## Cross Entropy Loss derivation[^wiki-cross-entropy]
+
+Cross entropy is the measure of *"surprise"* we get when results drawn from the
+real distribution $p$ are scored with an estimated distribution $q$. It is defined
+as the entropy of $p$ plus the
+[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$:
+
+$$
+\begin{aligned}
+    H(p, q) &= H(p) + D_{KL}(p || q) =\\
+    &= - \sum_{x\in\mathcal{X}}p(x)\log p(x) +
+        \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \\
+    &= \sum_{x\in \mathcal{X}} p(x) \left(
+            \log \frac{p(x)}{q(x)} - \log p(x)
+        \right) = \\
+    &= \sum_{x\in \mathcal{X}} p(x) \log \frac{1}{q(x)} = \\
+    &= - \sum_{x\in \mathcal{X}} p(x) \log q(x)
+\end{aligned}
+$$
+
+Since in deep learning the target is usually a one-hot label rather than a full
+distribution, only the true class $c$ contributes to the sum, and for sample $n$
+the loss becomes:
+
+$$
+l_n = - \log \hat{y}_{n,c} \\
+\hat{y}_{n,c} \coloneqq \text{predicted probability of the true class}
+$$
+
+Usually $\hat{y}$ comes from using a
+[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since the loss
+uses a logarithm and probability values are at most $1$, the closer
+$\hat{y}_{n,c}$ is to $0$, the higher the loss.
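+
+As a quick sanity check, here is a minimal sketch in plain Python (the logits and
+the class index are made up for illustration) that computes $\hat{y}$ with a
+softmax and then the loss $l_n = -\log \hat{y}_{n,c}$:
+
+```python
+import math
+
+
+def softmax(logits):
+    # Subtract the max for numerical stability before exponentiating
+    m = max(logits)
+    exps = [math.exp(z - m) for z in logits]
+    total = sum(exps)
+    return [e / total for e in exps]
+
+
+logits = [2.0, -1.0, 0.5]  # made-up scores for 3 classes
+true_class = 0             # index c of the correct class
+
+y_hat = softmax(logits)
+loss = -math.log(y_hat[true_class])
+
+print(y_hat)  # probabilities, they sum to 1
+print(loss)   # fairly small: the true class already has the highest probability
+```
+
+Picking `true_class = 1` instead makes the predicted probability of the true class
+much smaller and the loss correspondingly larger.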
+
 ## Laplace Operator[^khan-1]
 
 It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
@@ -20,3 +82,6 @@ It can also be used to compute the net flow of particles in that region of
 space
 
 [^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)
+[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)
+
+[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))
diff --git a/Chapters/3-Activation-Functions/INDEX.md b/Chapters/3-Activation-Functions/INDEX.md
index a646206..f6f0c06 100644
--- a/Chapters/3-Activation-Functions/INDEX.md
+++ b/Chapters/3-Activation-Functions/INDEX.md
@@ -96,8 +96,9 @@ $$
 RReLU(x) =
 \begin{cases}
     x \text{ if } x \geq 0 \\
-    a\cdot x \text{ if } x < 0
-\end{cases}
+    \vec{a} \cdot x \text{ if } x < 0
+\end{cases} \\
+a_{i,j} \sim U (l, u): \;l < u \wedge l, u \in [0, 1[
 $$
 
 It is not ***derivable***, but on $0$ we usually put the value as $\vec{a}$ or $1$, though any value between them is
@@ -108,19 +109,22 @@ $$
     \frac{d\,RReLU(x)}{dx} &=
     \begin{cases}
         1 \text{ if } x \geq 0 \\
-        a_{i,j} \cdot x_{i,j} \text{ if } x < 0
+        \vec{a} \text{ if } x < 0
     \end{cases} \\
-a_{i,j} \sim U (l, u)&: \;l < u \wedge l, u \in [0, 1[
+
 \end{aligned}
 $$
 
-Here $\vec{a}$ is a **random** paramter that is
+Here $\vec{a}$ is a **random** parameter that is
 **always sampled** during **training** and
 **fixed** during **tests and inference** to
 $\frac{l + u}{2}$
 
 ### ELU
 
+This function pushes the average of its outputs towards $0$, so the network may
+converge faster
+
 $$
 ELU(x) =
 \begin{cases}
@@ -207,6 +211,9 @@ space of manouver.
 
 ### Softplus
 
+This is a smoothed version of [ReLU](#relu) and, as such, it outputs only
+positive values
+
 $$
 Softplus(x) =
     \frac{1}{\beta} \cdot
@@ -224,7 +231,7 @@ to **constraint the output to positive values**.
 The **larger $\beta$**, the **similar to [ReLU](#relu)**
 
 $$
-\frac{d\,Softplus(x)}{dx} = \frac{e^{b*x}}{e^{b*x} + 1}
+\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta x}}{e^{\beta x} + 1}
 $$
 
 For **numerical-stability** when $\beta > tresh$, the
@@ -232,12 +239,16 @@ implementation **reverts back to a linear function**
 
 ### GELU[^GELU]
 
+This function saturates over negative values, much like a ramp does.
+
 $$
-GELU(x) = x \cdot \Phi(x)
+GELU(x) = x \cdot \Phi(x) \\
+\Phi(x) = P(X \leq x), \quad X \sim \mathcal{N}(0, 1)
 $$
 
 This can be considered as a **smooth [ReLU](#relu)**, however it's
 **not monothonic**
+
 $$
 \frac{d\,GELU(x)}{dx} = \Phi(x)+ x\cdot P(X = x)
 $$
@@ -388,7 +399,7 @@ Hardtanh(x) =
 $$
 
 It is not ***differentiable***, but
-**works well with values around $0$**.
+**works well with small values around $0$**.
 
 $$
 \frac{d\,Hardtanh(x)}{dx} =
diff --git a/Chapters/4-Loss-Functions/DEALING-WITH-IMBALANCES.md b/Chapters/4-Loss-Functions/DEALING-WITH-IMBALANCES.md
deleted file mode 100644
index c60e4aa..0000000
--- a/Chapters/4-Loss-Functions/DEALING-WITH-IMBALANCES.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Dealing with imbalances
-
-
\ No newline at end of file
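
The activation functions revised above lend themselves to a quick numerical check.
Below is a minimal sketch in plain Python (the sample inputs and helper names are
illustrative, not taken from the chapters): it shows Softplus approaching ReLU as
$\beta$ grows and GELU saturating towards $0$ for negative inputs.

```python
import math


def relu(x):
    return max(0.0, x)


def softplus(x, beta=1.0):
    # (1 / beta) * log(1 + e^(beta * x)); strictly positive for every x
    return math.log1p(math.exp(beta * x)) / beta


def gelu(x):
    # Phi(x) = P(X <= x) for X ~ N(0, 1), written via the error function
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"{x:5.1f}  relu={relu(x):7.4f}  "
          f"softplus(b=1)={softplus(x, 1.0):7.4f}  "
          f"softplus(b=10)={softplus(x, 10.0):7.4f}  "
          f"gelu={gelu(x):7.4f}")
```

With `beta=10.0` the Softplus column is already almost indistinguishable from the
ReLU one, while GELU stays close to $0$ for the negative inputs.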