From 73c11ebf9d16089931d079eff23c5f5c1f1d3704 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Tue, 15 Apr 2025 14:11:08 +0200
Subject: [PATCH] Added 3rd Chapter

---
 Chapters/3-Activation-Functions/INDEX.md | 498 +++++++++++++++++++++++
 1 file changed, 498 insertions(+)
 create mode 100644 Chapters/3-Activation-Functions/INDEX.md

diff --git a/Chapters/3-Activation-Functions/INDEX.md b/Chapters/3-Activation-Functions/INDEX.md
new file mode 100644
index 0000000..6d4404a
--- /dev/null
+++ b/Chapters/3-Activation-Functions/INDEX.md
@@ -0,0 +1,498 @@

# Activation Functions

## Vanishing Gradient

One problem of **Activation Functions** is that ***some*** of them have a ***minuscule derivative, often less than 1***.

If we have more than one `layer`, since `backpropagation` ***multiplies*** these derivatives together, the `gradient` tends towards $0$.

Usually these **functions** are said to be **saturating**, meaning that they have **horizontal asymptotes**, where their **derivative is near $0$**.

## List of Non-Saturating Activation Functions

### ReLU

$$
ReLU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  0 \text{ if } x < 0
\end{cases}
$$

It is not ***differentiable*** at $0$, but there we usually set the derivative to $0$ or $1$, though any value between them is acceptable.

$$
\frac{d\,ReLU(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq 0 \\
  0 \text{ if } x < 0
\end{cases}
$$

The problem is that this function **saturates** for **negative values**.

### Leaky ReLU

$$
LeakyReLU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  a \cdot x \text{ if } x < 0
\end{cases}
$$

It is not ***differentiable*** at $0$, but there we usually set the derivative to $a$ or $1$, though any value between them is acceptable.

$$
\frac{d\,LeakyReLU(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq 0 \\
  a \text{ if } x < 0
\end{cases}
$$

Here $a$ is a **fixed** parameter.

### Parametric ReLU | AKA PReLU[^PReLU]

$$
PReLU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  \vec{a} \cdot x \text{ if } x < 0
\end{cases}
$$

It is not ***differentiable*** at $0$, but there we usually set the derivative to $\vec{a}$ or $1$, though any value between them is acceptable.

$$
\frac{d\,PReLU(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq 0 \\
  \vec{a} \text{ if } x < 0
\end{cases}
$$

Differently from [Leaky ReLU](#leaky-relu), here $\vec{a}$ is a **learnable** parameter that can differ across **channels** (features).

### Randomized Leaky ReLU | AKA RReLU[^RReLU]

$$
RReLU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  a \cdot x \text{ if } x < 0
\end{cases}
$$

It is not ***differentiable*** at $0$, but there we usually set the derivative to $a$ or $1$, though any value between them is acceptable.

$$
\begin{aligned}
  \frac{d\,RReLU(x)}{dx} &=
  \begin{cases}
    1 \text{ if } x \geq 0 \\
    a_{i,j} \text{ if } x < 0
  \end{cases} \\
  a_{i,j} \sim U(l, u) &: \; l < u \wedge l, u \in [0, 1[
\end{aligned}
$$

Here $a$ is a **random** parameter that is **sampled anew for every element** during **training** and **fixed** during **tests and inference** to $\frac{l + u}{2}$.
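For reference, here is a minimal sketch of the ReLU family above, assuming PyTorch is the framework in use; the slope and range values below follow common defaults and are only illustrative:

```python
import torch
import torch.nn as nn

x = torch.linspace(-3.0, 3.0, steps=7)

relu  = nn.ReLU()                            # 0 for x < 0
leaky = nn.LeakyReLU(negative_slope=0.01)    # fixed slope a = 0.01 for x < 0
prelu = nn.PReLU(num_parameters=1)           # a is a learnable tensor (one per channel if > 1)
rrelu = nn.RReLU(lower=1 / 8, upper=1 / 3)   # a ~ U(l, u) while training

for name, fn in [("ReLU", relu), ("LeakyReLU", leaky), ("PReLU", prelu), ("RReLU", rrelu)]:
    print(f"{name:10s} {fn(x).detach()}")
```

Calling `rrelu.eval()` switches the module to the fixed $\frac{l + u}{2}$ slope used at test and inference time, as described above.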
### ELU

$$
ELU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  a \cdot (e^{x} - 1) \text{ if } x < 0
\end{cases}
$$

It is **differentiable** at every point, as long as $a = 1$, which is usually the case.

$$
\frac{d\,ELU(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq 0 \\
  a \cdot e^x \text{ if } x < 0
\end{cases}
$$

Like [ReLU](#relu) and all [Saturating Functions](#list-of-saturating-activation-functions), it still has the saturation problem for **large negative values**.

### CELU

This is a flavour of [ELU](#elu).

$$
CELU(x) =
\begin{cases}
  x \text{ if } x \geq 0 \\
  a \cdot (e^{\frac{x}{a}} - 1) \text{ if } x < 0
\end{cases}
$$

It is **differentiable** at every point, as long as $a > 0$, which is usually the case.

$$
\frac{d\,CELU(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq 0 \\
  e^{\frac{x}{a}} \text{ if } x < 0
\end{cases}
$$

It has the same **saturation** problems as [ELU](#elu).

### SELU[^SELU]

It aims to normalize both the **average** and the **standard deviation** of the activations across `layers`.

$$
SELU(x) =
\begin{cases}
  \lambda \cdot x \text{ if } x \geq 0 \\
  \lambda a \cdot (e^{x} - 1) \text{ if } x < 0
\end{cases}
$$

It is **differentiable** at every point only if $a = 1$, which is usually not the case, as its recommended values are:

- $a = 1.6733$
- $\lambda = 1.0507$

Apart from this, it is basically a **scaled [ELU](#elu)**.

$$
\frac{d\,SELU(x)}{dx} =
\begin{cases}
  \lambda \text{ if } x \geq 0 \\
  \lambda a \cdot e^{x} \text{ if } x < 0
\end{cases}
$$

It has the same problems as [ELU](#elu) for both **differentiability and saturation**, though for the latter the scaling gives it a bit more room for manoeuvre.

### Softplus

$$
Softplus(x) =
\frac{1}{\beta} \cdot
\left(
  \ln{
    \left(1 + e^{\beta \cdot x} \right)
  }
\right)
$$

It is **differentiable** at every point and is a **smooth approximation** of [ReLU](#relu), aimed at **constraining the output to positive values**.

The **larger $\beta$** is, the **more similar to [ReLU](#relu)** it becomes.

$$
\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta \cdot x}}{e^{\beta \cdot x} + 1}
$$

For **numerical stability**, when $\beta \cdot x$ exceeds a given threshold, the implementation **reverts to a linear function**.

### GELU[^GELU]

$$
GELU(x) = x \cdot \Phi(x)
$$

where $\Phi$ is the cumulative distribution function of the standard Gaussian.

This can be considered a **smooth [ReLU](#relu)**, however it is **not monotonic**.

$$
\frac{d\,GELU(x)}{dx} = \Phi(x) + x \cdot \phi(x)
$$

where $\phi$ is the probability density function of the standard Gaussian.

### Tanhshrink

$$
Tanhshrink(x) = x - \tanh(x)
$$

It is **differentiable** everywhere and its **derivative grows larger** the **farther $x$ is from $0$**.

### Softshrink

$$
Softshrink(x) =
\begin{cases}
  x - \lambda \text{ if } x \geq \lambda \\
  x + \lambda \text{ if } x \leq -\lambda \\
  0 \text{ otherwise}
\end{cases}
$$

$$
\frac{d\,Softshrink(x)}{dx} =
\begin{cases}
  1 \text{ if } x \geq \lambda \\
  1 \text{ if } x \leq -\lambda \\
  0 \text{ otherwise}
\end{cases}
$$

It can be seen as the **soft-thresholding step associated with an `L1` criterion**, and as a **hard approximation of [Tanhshrink](#tanhshrink)**.

It is also a step of the [ISTA](https://nikopj.github.io/blog/understanding-ista/) algorithm, but it is **not commonly used as an activation function**.
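To make SELU's self-normalizing behaviour concrete, here is a small sketch, assuming PyTorch and a toy stack of random matrices scaled by $1/\sqrt{\text{fan\_in}}$ (the LeCun-style initialization the SELU derivation assumes); the width of 512 and the depth of 5 are arbitrary choices:

```python
import torch

torch.manual_seed(0)

selu = torch.nn.SELU()
x = torch.randn(10_000, 512)                  # zero-mean, unit-variance input batch

for depth in range(1, 6):
    w = torch.randn(512, 512) / 512 ** 0.5    # LeCun-style scaling: std = 1/sqrt(fan_in)
    x = selu(x @ w)
    print(f"layer {depth}: mean = {x.mean().item():+.3f}, std = {x.std().item():.3f}")
```

The printed statistics stay close to mean $0$ and standard deviation $1$ at every depth; swapping `selu` for a saturating function such as `torch.sigmoid` and re-running the loop shows them drifting away within a couple of layers.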
### Hardshrink

$$
Hardshrink(x) =
\begin{cases}
  x \text{ if } x \geq \lambda \\
  x \text{ if } x \leq -\lambda \\
  0 \text{ otherwise}
\end{cases}
$$

$$
\frac{d\,Hardshrink(x)}{dx} =
\begin{cases}
  1 \text{ if } x > \lambda \\
  1 \text{ if } x < -\lambda \\
  0 \text{ otherwise}
\end{cases}
$$

This is even **harsher** than the [Softshrink](#softshrink) function, as **it is not continuous**, so it **MUST BE AVOIDED IN BACKPROPAGATION**.

## List of Saturating Activation Functions

### ReLU6

$$
ReLU6(x) =
\begin{cases}
  6 \text{ if } x \geq 6 \\
  x \text{ if } 0 \leq x < 6 \\
  0 \text{ if } x < 0
\end{cases}
$$

It is not ***differentiable*** at $0$ and $6$, but there we usually set the derivative to $0$ or $1$, though any value between them is acceptable.

$$
\frac{d\,ReLU6(x)}{dx} =
\begin{cases}
  0 \text{ if } x \geq 6 \\
  1 \text{ if } 0 \leq x < 6 \\
  0 \text{ if } x < 0
\end{cases}
$$

### Sigmoid | AKA Logistic

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

It is **differentiable**, is most sensitive around **small** input values and **bounds the result between $0$ and $1$**.

$$
\frac{d\,\sigma(x)}{dx} = \sigma(x) \cdot (1 - \sigma(x))
$$

It usually works well as a **switch** for portions of our `system` because it is **differentiable**; however, it **shouldn't be used across many layers** as it **saturates very quickly**.

### Tanh

$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

$$
\frac{d\,\tanh(x)}{dx} = 1 - \tanh^2(x)
$$

It is **differentiable**, is most sensitive around **small** input values and **bounds the result between $-1$ and $1$**, making **convergence faster** as it has a **$0$ mean**.

### Softsign

$$
Softsign(x) = \frac{x}{1 + |x|}
$$

$$
\frac{d\,Softsign(x)}{dx} = \frac{1}{(1 + |x|)^2}
$$

### Hardtanh

$$
Hardtanh(x) =
\begin{cases}
  M \text{ if } x \geq M \\
  x \text{ if } m \leq x < M \\
  m \text{ if } x < m
\end{cases}
$$

It is not ***differentiable*** at $m$ and $M$, but it **works well with values around $0$**.

$$
\frac{d\,Hardtanh(x)}{dx} =
\begin{cases}
  0 \text{ if } x \geq M \\
  1 \text{ if } m \leq x < M \\
  0 \text{ if } x < m
\end{cases}
$$

$M$ and $m$ are **usually $1$ and $-1$ respectively**, but they can be changed.

### Threshold | AKA Heaviside

$$
Threshold(x) =
\begin{cases}
  1 \text{ if } x \geq thresh \\
  0 \text{ if } x < thresh
\end{cases}
$$

We usually don't use this one, as **we can't propagate the gradient back** through it.

### LogSigmoid

$$
LogSigmoid(x) = \ln \left(
  \frac{
    1
  }{
    1 + e^{-x}
  }
\right)
$$

$$
\frac{d\,LogSigmoid(x)}{dx} = \frac{1}{1 + e^x}
$$

This was designed to **help with numerical instabilities**.

### Softmin

$$
Softmin(x)_j = \frac{
  e^{-x_j}
}{
  \sum_{i} e^{-x_i}
} \quad \forall j \in \{0, ..., N\}
$$

**IT IS AN OUTPUT FUNCTION** and transforms the `input` into a vector of **probabilities**, giving **higher values** to **small numbers**.

### Softmax

$$
Softmax(x)_j = \frac{
  e^{x_j}
}{
  \sum_{i} e^{x_i}
} \quad \forall j \in \{0, ..., N\}
$$

**IT IS AN OUTPUT FUNCTION** and transforms the `input` into a vector of **probabilities**, giving **higher values** to **high numbers**.

In a way, we could say that the **softmax** is just a [sigmoid](#sigmoid--aka-logistic) computed over all values, instead of just one.

In other words, a **softmax** is a **generalization** of the **[sigmoid](#sigmoid--aka-logistic)** function.
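A quick numerical check of this claim, assuming PyTorch: a two-class softmax over the logits $[x, 0]$ reproduces $\sigma(x)$.

```python
import torch

x = torch.linspace(-4.0, 4.0, steps=9)

# Pair every logit x with a fixed 0 logit, i.e. a two-class problem
logits = torch.stack([x, torch.zeros_like(x)], dim=-1)

# First column of the two-class softmax: e^x / (e^x + e^0) = sigmoid(x)
two_class = torch.softmax(logits, dim=-1)[..., 0]

print(torch.allclose(two_class, torch.sigmoid(x)))  # expected: True
```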
### LogSoftmax

$$
LogSoftmax(x)_j = \ln \left(
  \frac{
    e^{x_j}
  }{
    \sum_{i} e^{x_i}
  }
\right) \quad \forall j \in \{0, ..., N\}
$$

It is **mostly used within `loss functions`** and is **uncommon as an `activation function`**; it exists to **deal with numerical instabilities** and serves as a **building block for other losses**.

[^PReLU]: [Microsoft Paper | arXiv:1502.01852v1 [cs.CV] 6 Feb 2015](https://arxiv.org/pdf/1502.01852v1)

[^RReLU]: [Empirical Evaluation of Rectified Activations in Convolution Network](https://arxiv.org/pdf/1505.00853v2)

[^SELU]: [Self-Normalizing Neural Networks](https://arxiv.org/pdf/1706.02515v5)

[^GELU]: [Github Page | 2nd April 2025](https://alaaalatif.github.io/2019-04-11-gelu/)
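Relating to the LogSoftmax section above, a small sketch (assuming PyTorch) of why a fused `log_softmax` is preferred over composing `log` with a naively computed softmax:

```python
import torch

# Deliberately large logits: e^1000 overflows in float32
logits = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive composition: exp() overflows to inf, inf/inf gives nan, log(nan) stays nan
naive = torch.log(torch.exp(logits) / torch.exp(logits).sum())

# Fused, numerically stable version (the standard trick is to shift by the max logit)
stable = torch.log_softmax(logits, dim=0)

print(naive)   # tensor([nan, nan, nan])
print(stable)  # roughly tensor([-2.4076, -1.4076, -0.4076])
```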