From 247daf4d56dbf2d5d504f7ed80c15ee983896cd8 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Mon, 17 Nov 2025 17:04:33 +0100
Subject: [PATCH] Revised Chapter 3 and added definitions to appendix

---
 Chapters/15-Appendix-A/INDEX.md          | 65 +++++++++++++++++++
 Chapters/3-Activation-Functions/INDEX.md | 27 +++++---
 .../DEALING-WITH-IMBALANCES.md           |  3 -
 3 files changed, 84 insertions(+), 11 deletions(-)
 delete mode 100644 Chapters/4-Loss-Functions/DEALING-WITH-IMBALANCES.md

diff --git a/Chapters/15-Appendix-A/INDEX.md b/Chapters/15-Appendix-A/INDEX.md
index 11ec8f8..d488d25 100644
--- a/Chapters/15-Appendix-A/INDEX.md
+++ b/Chapters/15-Appendix-A/INDEX.md
@@ -1,5 +1,67 @@
 # Appendix A
 
+## Entropy[^wiki-entropy]
+
+The entropy of a random variable gives us the *"surprise"* or *"informativeness"*
+of getting to know its result.
+
+You can visualize it like this: ***"What can I learn from getting to know something
+obvious?"***
+
+As an example, you would be unsurprised to see that if you leave an apple mid-air
+it falls. However, if it were to remain suspended, that would be mind-boggling!
+
+Entropy captures this same sentiment from the actual probability values:
+the higher its value, the more surprising, on average, the outcomes. Its formula is:
+
+$$
+H(\mathcal{X}) \coloneqq - \sum_{x \in \mathcal{X}} p(x) \log p(x)
+$$
+
+> [!NOTE]
+> Technically speaking, another interpretation is the number of bits needed to
+> represent a random event happening, but in that case we use $\log_2$
+
+## Kullback-Leibler Divergence
+
+This value gives us how much an estimated distribution $q$ differs from
+the real one $p$:
+
+$$
+D_{KL}(p || q) = \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
+$$
+
+## Cross Entropy Loss derivation[^wiki-cross-entropy]
+
+Cross entropy is the measure of *"surprise"* we get when results drawn from the
+real distribution $p$ are scored with an estimated distribution $q$. It is defined
+as the entropy of $p$ plus the
+[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$:
+
+$$
+\begin{aligned}
+    H(p, q) &= H(p) + D_{KL}(p || q) =\\
+    &= - \sum_{x\in\mathcal{X}}p(x)\log p(x) +
+        \sum_{x\in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \\
+    &= \sum_{x\in \mathcal{X}} p(x) \left(
+            \log \frac{p(x)}{q(x)} - \log p(x)
+        \right) = \\
+    &= \sum_{x\in \mathcal{X}} p(x) \log \frac{1}{q(x)} = \\
+    &= - \sum_{x\in \mathcal{X}} p(x) \log q(x)
+\end{aligned}
+$$
+
+Since in deep learning the target is usually a one-hot label rather than a full
+distribution, only the true class $c$ contributes to the sum, and for sample $n$
+the loss becomes:
+
+$$
+l_n = - \log \hat{y}_{n,c} \\
+\hat{y}_{n,c} \coloneqq \text{predicted probability of the true class}
+$$
+
+Usually $\hat{y}$ comes from using a
+[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since the loss
+uses a logarithm and probability values are at most $1$, the closer
+$\hat{y}_{n,c}$ is to $0$, the higher the loss.
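+
+As a quick sanity check, here is a minimal sketch in plain Python (the logits and
+the class index are made up for illustration) that computes $\hat{y}$ with a
+softmax and then the loss $l_n = -\log \hat{y}_{n,c}$:
+
+```python
+import math
+
+
+def softmax(logits):
+    # Subtract the max for numerical stability before exponentiating
+    m = max(logits)
+    exps = [math.exp(z - m) for z in logits]
+    total = sum(exps)
+    return [e / total for e in exps]
+
+
+logits = [2.0, -1.0, 0.5]  # made-up scores for 3 classes
+true_class = 0             # index c of the correct class
+
+y_hat = softmax(logits)
+loss = -math.log(y_hat[true_class])
+
+print(y_hat)  # probabilities, they sum to 1
+print(loss)   # fairly small: the true class already has the highest probability
+```
+
+Picking `true_class = 1` instead makes the predicted probability of the true class
+much smaller and the loss correspondingly larger.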
+
 ## Laplace Operator[^khan-1]
 
 It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
@@ -20,3 +82,6 @@ It can also be used to compute the net flow of particles in that region of
 space
 
 [^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)
+[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)
+
+[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))
diff --git a/Chapters/3-Activation-Functions/INDEX.md b/Chapters/3-Activation-Functions/INDEX.md
index a646206..f6f0c06 100644
--- a/Chapters/3-Activation-Functions/INDEX.md
+++ b/Chapters/3-Activation-Functions/INDEX.md
@@ -96,8 +96,9 @@ $$
 RReLU(x) =
 \begin{cases}
     x \text{ if } x \geq 0 \\
-    a\cdot x \text{ if } x < 0
-\end{cases}
+    \vec{a} \cdot x \text{ if } x < 0
+\end{cases} \\
+a_{i,j} \sim U (l, u): \;l < u \wedge l, u \in [0, 1[
 $$
 
 It is not ***derivable***, but on $0$ we usually put the value as $\vec{a}$ or $1$, though any value between them is
@@ -108,19 +109,22 @@ $$
     \frac{d\,RReLU(x)}{dx} &=
     \begin{cases}
         1 \text{ if } x \geq 0 \\
-        a_{i,j} \cdot x_{i,j} \text{ if } x < 0
+        \vec{a} \text{ if } x < 0
     \end{cases} \\
-a_{i,j} \sim U (l, u)&: \;l < u \wedge l, u \in [0, 1[
+
 \end{aligned}
 $$
 
-Here $\vec{a}$ is a **random** paramter that is
+Here $\vec{a}$ is a **random** parameter that is
 **always sampled** during **training** and
 **fixed** during **tests and inference** to
 $\frac{l + u}{2}$
 
 ### ELU
 
+This function pushes the average of its outputs towards $0$, so the network may
+converge faster
+
 $$
 ELU(x) =
 \begin{cases}
@@ -207,6 +211,9 @@ space of manouver.
 
 ### Softplus
 
+This is a smoothed version of [ReLU](#relu) and, as such, it outputs only
+positive values
+
 $$
 Softplus(x) =
     \frac{1}{\beta} \cdot
@@ -224,7 +231,7 @@ to **constraint the output to positive values**.
 The **larger $\beta$**, the **similar to [ReLU](#relu)**
 
 $$
-\frac{d\,Softplus(x)}{dx} = \frac{e^{b*x}}{e^{b*x} + 1}
+\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta x}}{e^{\beta x} + 1}
 $$
 
 For **numerical-stability** when $\beta > tresh$, the
@@ -232,12 +239,16 @@ implementation **reverts back to a linear function**
 
 ### GELU[^GELU]
 
+This function saturates over negative values, much like a ramp does.
+
 $$
-GELU(x) = x \cdot \Phi(x)
+GELU(x) = x \cdot \Phi(x) \\
+\Phi(x) = P(X \leq x), \quad X \sim \mathcal{N}(0, 1)
 $$
 
 This can be considered as a **smooth [ReLU](#relu)**, however it's
 **not monothonic**
+
 $$
 \frac{d\,GELU(x)}{dx} = \Phi(x)+ x\cdot P(X = x)
 $$
@@ -388,7 +399,7 @@ Hardtanh(x) =
 $$
 
 It is not ***differentiable***, but
-**works well with values around $0$**.
+**works well with small values around $0$**.
 
 $$
 \frac{d\,Hardtanh(x)}{dx} =
diff --git a/Chapters/4-Loss-Functions/DEALING-WITH-IMBALANCES.md b/Chapters/4-Loss-Functions/DEALING-WITH-IMBALANCES.md
deleted file mode 100644
index c60e4aa..0000000
--- a/Chapters/4-Loss-Functions/DEALING-WITH-IMBALANCES.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Dealing with imbalances
-
-
\ No newline at end of file
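
The activation functions revised above lend themselves to a quick numerical check.
Below is a minimal sketch in plain Python (the sample inputs and helper names are
illustrative, not taken from the chapters): it shows Softplus approaching ReLU as
$\beta$ grows and GELU saturating towards $0$ for negative inputs.

```python
import math


def relu(x):
    return max(0.0, x)


def softplus(x, beta=1.0):
    # (1 / beta) * log(1 + e^(beta * x)); strictly positive for every x
    return math.log1p(math.exp(beta * x)) / beta


def gelu(x):
    # Phi(x) = P(X <= x) for X ~ N(0, 1), written via the error function
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"{x:5.1f}  relu={relu(x):7.4f}  "
          f"softplus(b=1)={softplus(x, 1.0):7.4f}  "
          f"softplus(b=10)={softplus(x, 10.0):7.4f}  "
          f"gelu={gelu(x):7.4f}")
```

With `beta=10.0` the Softplus column is already almost indistinguishable from the
ReLU one, while GELU stays close to $0$ for the negative inputs.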