From 73c11ebf9d16089931d079eff23c5f5c1f1d3704 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Tue, 15 Apr 2025 14:11:08 +0200
Subject: [PATCH] Added 3rd Chapter

---
 Chapters/3-Activation-Functions/INDEX.md | 498 +++++++++++++++++++++++
 1 file changed, 498 insertions(+)
 create mode 100644 Chapters/3-Activation-Functions/INDEX.md

diff --git a/Chapters/3-Activation-Functions/INDEX.md b/Chapters/3-Activation-Functions/INDEX.md
new file mode 100644
index 0000000..6d4404a
--- /dev/null
+++ b/Chapters/3-Activation-Functions/INDEX.md
@@ -0,0 +1,498 @@

# Activation Functions

## Vanishing Gradient

One problem of **Activation Functions** is that ***some*** of them have a ***minuscule derivative, often less than 1***.

If we have more than one `layer`, since `backpropagation` ***multiplies*** these derivatives together, the `gradient` tends towards $0$.

Usually these **functions** are said to be **saturating**, meaning that they have **horizontal asymptotes**, where their **derivative is near $0$**.

## List of Non-Saturating Activation Functions

### ReLU

$$
ReLU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  0 \text{ if } x < 0
\end{cases}
$$

It is not ***differentiable*** at $0$, but there we usually set the derivative to $0$ or $1$, though any value between them is acceptable.

$$
\frac{d\,ReLU(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq 0 \\
  0 \text{ if } x < 0
\end{cases}
$$

The problem is that this function **saturates** for **negative values**.

### Leaky ReLU

$$
LeakyReLU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  a \cdot x \text{ if } x < 0
\end{cases}
$$

It is not ***differentiable*** at $0$, but there we usually set the derivative to $a$ or $1$, though any value between them is acceptable.

$$
\frac{d\,LeakyReLU(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq 0 \\
  a \text{ if } x < 0
\end{cases}
$$

Here $a$ is a **fixed** parameter.

### Parametric ReLU | AKA PReLU[^PReLU]

$$
PReLU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  \vec{a} \cdot x \text{ if } x < 0
\end{cases}
$$

It is not ***differentiable*** at $0$, but there we usually set the derivative to $\vec{a}$ or $1$, though any value between them is acceptable.

$$
\frac{d\,PReLU(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq 0 \\
  \vec{a} \text{ if } x < 0
\end{cases}
$$

Differently from [Leaky ReLU](#leaky-relu), here $\vec{a}$ is a **learnable** parameter that can differ across **channels** (features).

### Randomized Leaky ReLU | AKA RReLU[^RReLU]

$$
RReLU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  a \cdot x \text{ if } x < 0
\end{cases}
$$

It is not ***differentiable*** at $0$, but there we usually set the derivative to $a$ or $1$, though any value between them is acceptable.

$$
\begin{aligned}
  \frac{d\,RReLU(x)}{dx} &=
  \begin{cases}
    1 \text{ if } x \geq 0 \\
    a_{i,j} \text{ if } x < 0
  \end{cases} \\
  a_{i,j} \sim U(l, u) &: \; l < u \wedge l, u \in [0, 1[
\end{aligned}
$$

Here $a$ is a **random** parameter that is **sampled anew for every element** during **training** and **fixed** during **tests and inference** to $\frac{l + u}{2}$.
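For reference, here is a minimal sketch of the ReLU family above, assuming PyTorch is the framework in use; the slope and range values below follow common defaults and are only illustrative:

```python
import torch
import torch.nn as nn

x = torch.linspace(-3.0, 3.0, steps=7)

relu  = nn.ReLU()                            # 0 for x < 0
leaky = nn.LeakyReLU(negative_slope=0.01)    # fixed slope a = 0.01 for x < 0
prelu = nn.PReLU(num_parameters=1)           # a is a learnable tensor (one per channel if > 1)
rrelu = nn.RReLU(lower=1 / 8, upper=1 / 3)   # a ~ U(l, u) while training

for name, fn in [("ReLU", relu), ("LeakyReLU", leaky), ("PReLU", prelu), ("RReLU", rrelu)]:
    print(f"{name:10s} {fn(x).detach()}")
```

Calling `rrelu.eval()` switches the module to the fixed $\frac{l + u}{2}$ slope used at test and inference time, as described above.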
### ELU

$$
ELU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  a \cdot (e^{x} - 1) \text{ if } x < 0
\end{cases}
$$

It is **differentiable** at every point, as long as $a = 1$, which is usually the case.

$$
\frac{d\,ELU(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq 0 \\
  a \cdot e^x \text{ if } x < 0
\end{cases}
$$

Like [ReLU](#relu) and all [Saturating Functions](#list-of-saturating-activation-functions), it still has the saturation problem for **large negative values**.

### CELU

This is a flavour of [ELU](#elu).

$$
CELU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  a \cdot (e^{\frac{x}{a}} - 1) \text{ if } x < 0
\end{cases}
$$

It is **differentiable** at every point, as long as $a > 0$, which is usually the case.

$$
\frac{d\,CELU(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq 0 \\
  e^{\frac{x}{a}} \text{ if } x < 0
\end{cases}
$$

It has the same **saturation** problems as [ELU](#elu).

### SELU[^SELU]

It aims to normalize both the **average** and the **standard deviation** of the activations across `layers`.

$$
SELU(x) =
\begin{cases}
  \lambda \cdot x \text{ if } x \geq 0 \\
  \lambda a \cdot (e^{x} - 1) \text{ if } x < 0
\end{cases}
$$

It is **differentiable** at every point only if $a = 1$, which is usually not the case, as its recommended values are:

- $a = 1.6733$
- $\lambda = 1.0507$

Apart from this, it is basically a **scaled [ELU](#elu)**.

$$
\frac{d\,SELU(x)}{dx} =
\begin{cases}
  \lambda \text{ if } x \geq 0 \\
  \lambda a \cdot e^{x} \text{ if } x < 0
\end{cases}
$$

It has the same problems as [ELU](#elu) for both **differentiability and saturation**, though for the latter the scaling gives it a bit more room for manoeuvre.

### Softplus

$$
Softplus(x) =
\frac{1}{\beta} \cdot
\left(
  \ln{
    \left(1 + e^{\beta \cdot x} \right)
  }
\right)
$$

It is **differentiable** at every point and is a **smooth approximation** of [ReLU](#relu), aimed at **constraining the output to positive values**.

The **larger $\beta$** is, the **more similar to [ReLU](#relu)** it becomes.

$$
\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta \cdot x}}{e^{\beta \cdot x} + 1}
$$

For **numerical stability**, when $\beta \cdot x$ exceeds a given threshold, the implementation **reverts to a linear function**.

### GELU[^GELU]

$$
GELU(x) = x \cdot \Phi(x)
$$

where $\Phi$ is the cumulative distribution function of the standard Gaussian.

This can be considered a **smooth [ReLU](#relu)**, however it is **not monotonic**.

$$
\frac{d\,GELU(x)}{dx} = \Phi(x) + x \cdot \phi(x)
$$

where $\phi$ is the probability density function of the standard Gaussian.

### Tanhshrink

$$
Tanhshrink(x) = x - \tanh(x)
$$

It is **differentiable** everywhere and its **derivative grows larger** the **farther $x$ is from $0$**.

### Softshrink

$$
Softshrink(x) =
\begin{cases}
  x - \lambda \text{ if } x \geq \lambda \\
  x + \lambda \text{ if } x \leq -\lambda \\
  0 \text{ otherwise}
\end{cases}
$$

$$
\frac{d\,Softshrink(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq \lambda \\
  1 \text{ if } x \leq -\lambda \\
  0 \text{ otherwise}
\end{cases}
$$

It can be seen as the **soft-thresholding step associated with an `L1` criterion**, and as a **hard approximation of [Tanhshrink](#tanhshrink)**.

It is also a step of the [ISTA](https://nikopj.github.io/blog/understanding-ista/) algorithm, but it is **not commonly used as an activation function**.
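To make SELU's self-normalizing behaviour concrete, here is a small sketch, assuming PyTorch and a toy stack of random matrices scaled by $1/\sqrt{\text{fan\_in}}$ (the LeCun-style initialization the SELU derivation assumes); the width of 512 and the depth of 5 are arbitrary choices:

```python
import torch

torch.manual_seed(0)

selu = torch.nn.SELU()
x = torch.randn(10_000, 512)                  # zero-mean, unit-variance input batch

for depth in range(1, 6):
    w = torch.randn(512, 512) / 512 ** 0.5    # LeCun-style scaling: std = 1/sqrt(fan_in)
    x = selu(x @ w)
    print(f"layer {depth}: mean = {x.mean().item():+.3f}, std = {x.std().item():.3f}")
```

The printed statistics stay close to mean $0$ and standard deviation $1$ at every depth; swapping `selu` for a saturating function such as `torch.sigmoid` and re-running the loop shows them drifting away within a couple of layers.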
### Hardshrink

$$
Hardshrink(x) =
\begin{cases}
  x \text{ if } x \geq \lambda \\
  x \text{ if } x \leq -\lambda \\
  0 \text{ otherwise}
\end{cases}
$$

$$
\frac{d\,Hardshrink(x)}{dx} =
\begin{cases}
  1 \text{ if } x > \lambda \\
  1 \text{ if } x < -\lambda \\
  0 \text{ otherwise}
\end{cases}
$$

This is even **harsher** than the [Softshrink](#softshrink) function, as **it is not continuous**, so it **MUST BE AVOIDED IN BACKPROPAGATION**.

## List of Saturating Activation Functions

### ReLU6

$$
ReLU6(x) =
\begin{cases}
  6 \text{ if } x \geq 6 \\
  x \text{ if } 0 \leq x < 6 \\
  0 \text{ if } x < 0
\end{cases}
$$

It is not ***differentiable*** at $0$ and $6$, but there we usually set the derivative to $0$ or $1$, though any value between them is acceptable.

$$
\frac{d\,ReLU6(x)}{dx} =
\begin{cases}
  0 \text{ if } x \geq 6 \\
  1 \text{ if } 0 \leq x < 6 \\
  0 \text{ if } x < 0
\end{cases}
$$

### Sigmoid | AKA Logistic

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

It is **differentiable**, is most sensitive around **small** input values and **bounds the result between $0$ and $1$**.

$$
\frac{d\,\sigma(x)}{dx} = \sigma(x) \cdot (1 - \sigma(x))
$$

It usually works well as a **switch** for portions of our `system` because it is **differentiable**; however, it **shouldn't be used across many layers** as it **saturates very quickly**.

### Tanh

$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

$$
\frac{d\,\tanh(x)}{dx} = 1 - \tanh^2(x)
$$

It is **differentiable**, is most sensitive around **small** input values and **bounds the result between $-1$ and $1$**, making **convergence faster** as it has a **$0$ mean**.

### Softsign

$$
Softsign(x) = \frac{x}{1 + |x|}
$$

$$
\frac{d\,Softsign(x)}{dx} = \frac{1}{(1 + |x|)^2}
$$

### Hardtanh

$$
Hardtanh(x) =
\begin{cases}
  M \text{ if } x \geq M \\
  x \text{ if } m \leq x < M \\
  m \text{ if } x < m
\end{cases}
$$

It is not ***differentiable*** at $m$ and $M$, but it **works well with values around $0$**.

$$
\frac{d\,Hardtanh(x)}{dx} =
\begin{cases}
  0 \text{ if } x \geq M \\
  1 \text{ if } m \leq x < M \\
  0 \text{ if } x < m
\end{cases}
$$

$M$ and $m$ are **usually $1$ and $-1$ respectively**, but they can be changed.

### Threshold | AKA Heaviside

$$
Threshold(x) =
\begin{cases}
  1 \text{ if } x \geq thresh \\
  0 \text{ if } x < thresh
\end{cases}
$$

We usually don't use this one, as **we can't propagate the gradient back** through it.

### LogSigmoid

$$
LogSigmoid(x) = \ln \left(
  \frac{
    1
  }{
    1 + e^{-x}
  }
\right)
$$

$$
\frac{d\,LogSigmoid(x)}{dx} = \frac{1}{1 + e^x}
$$

This was designed to **help with numerical instabilities**.

### Softmin

$$
Softmin(x)_j = \frac{
  e^{-x_j}
}{
  \sum_{i} e^{-x_i}
} \quad \forall j \in \{0, ..., N\}
$$

**IT IS AN OUTPUT FUNCTION** and transforms the `input` into a vector of **probabilities**, giving **higher values** to **small numbers**.

### Softmax

$$
Softmax(x)_j = \frac{
  e^{x_j}
}{
  \sum_{i} e^{x_i}
} \quad \forall j \in \{0, ..., N\}
$$

**IT IS AN OUTPUT FUNCTION** and transforms the `input` into a vector of **probabilities**, giving **higher values** to **high numbers**.

In a way, we could say that the **softmax** is just a [sigmoid](#sigmoid--aka-logistic) computed over all values, instead of just one.

In other words, a **softmax** is a **generalization** of the **[sigmoid](#sigmoid--aka-logistic)** function.
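A quick numerical check of this claim, assuming PyTorch: a two-class softmax over the logits $[x, 0]$ reproduces $\sigma(x)$.

```python
import torch

x = torch.linspace(-4.0, 4.0, steps=9)

# Pair every logit x with a fixed 0 logit, i.e. a two-class problem
logits = torch.stack([x, torch.zeros_like(x)], dim=-1)

# First column of the two-class softmax: e^x / (e^x + e^0) = sigmoid(x)
two_class = torch.softmax(logits, dim=-1)[..., 0]

print(torch.allclose(two_class, torch.sigmoid(x)))  # expected: True
```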
### LogSoftmax

$$
LogSoftmax(x)_j = \ln \left(
  \frac{
    e^{x_j}
  }{
    \sum_{i} e^{x_i}
  }
\right) \quad \forall j \in \{0, ..., N\}
$$

It is **mostly used within `loss functions`** and is **uncommon as an `activation function`**; it exists to **deal with numerical instabilities** and serves as a **building block for other losses**.

[^PReLU]: [Microsoft Paper | arXiv:1502.01852v1 [cs.CV] 6 Feb 2015](https://arxiv.org/pdf/1502.01852v1)

[^RReLU]: [Empirical Evaluation of Rectified Activations in Convolution Network](https://arxiv.org/pdf/1505.00853v2)

[^SELU]: [Self-Normalizing Neural Networks](https://arxiv.org/pdf/1706.02515v5)

[^GELU]: [Github Page | 2nd April 2025](https://alaaalatif.github.io/2019-04-11-gelu/)
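Relating to the LogSoftmax section above, a small sketch (assuming PyTorch) of why a fused `log_softmax` is preferred over composing `log` with a naively computed softmax:

```python
import torch

# Deliberately large logits: e^1000 overflows in float32
logits = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive composition: exp() overflows to inf, inf/inf gives nan, log(nan) stays nan
naive = torch.log(torch.exp(logits) / torch.exp(logits).sum())

# Fused, numerically stable version (the standard trick is to shift by the max logit)
stable = torch.log_softmax(logits, dim=0)

print(naive)   # tensor([nan, nan, nan])
print(stable)  # roughly tensor([-2.4076, -1.4076, -0.4076])
```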