# Activation Functions

## Vanishing Gradient

One problem of **Activation Functions** is that ***some*** of them have a ***minuscule derivative, often less than 1***. If we have more than 1 `layer`, since `backpropagation` is ***multiplicative***, the `gradient` then tends towards $0$.

These **functions** are usually said to be **saturating**, meaning that they have **horizontal asymptotes**, where their **derivative is near $0$**.

## List of Non-Saturating Activation Functions

### ReLU

$$
ReLU(x) = \begin{cases} x \text{ if } x \geq 0 \\ 0 \text{ if } x < 0 \end{cases}
$$

It is not ***differentiable*** at $0$; in practice we set the derivative there to $0$ or $1$, though any value between them is acceptable

$$
\frac{d\,ReLU(x)}{dx} = \begin{cases} 1 \text{ if } x \geq 0 \\ 0 \text{ if } x < 0 \end{cases}
$$

The problem is that this function **saturates** for **negative values**

### Leaky ReLU

$$
LeakyReLU(x) = \begin{cases} x \text{ if } x \geq 0 \\ a\cdot x \text{ if } x < 0 \end{cases}
$$

It is not ***differentiable*** at $0$; in practice we set the derivative there to $a$ or $1$, though any value between them is acceptable

$$
\frac{d\,LeakyReLU(x)}{dx} = \begin{cases} 1 \text{ if } x \geq 0 \\ a \text{ if } x < 0 \end{cases}
$$

Here $a$ is a **fixed** parameter

### Parametric ReLU | AKA PReLU[^PReLU]

$$
PReLU(x) = \begin{cases} x \text{ if } x \geq 0 \\ \vec{a} \cdot x \text{ if } x < 0 \end{cases}
$$

It is not ***differentiable*** at $0$; in practice we set the derivative there to $\vec{a}$ or $1$, though any value between them is acceptable.

$$
\frac{d\,PReLU(x)}{dx} = \begin{cases} 1 \text{ if } x \geq 0 \\ \vec{a} \text{ if } x < 0 \end{cases}
$$

Differently from [LeakyReLU](#leaky-relu), here $\vec{a}$ is a **learnable** parameter that can differ across **channels** (features)

### Randomized Leaky ReLU | AKA RReLU[^RReLU]

$$
RReLU(x) = \begin{cases} x \text{ if } x \geq 0 \\ \vec{a} \cdot x \text{ if } x < 0 \end{cases} \\
a_{i,j} \sim U(l, u): \; l < u \wedge l, u \in [0, 1[
$$

It is not ***differentiable*** at $0$; in practice we set the derivative there to $\vec{a}$ or $1$, though any value between them is acceptable

$$
\frac{d\,RReLU(x)}{dx} = \begin{cases} 1 \text{ if } x \geq 0 \\ \vec{a} \text{ if } x < 0 \end{cases}
$$

Here $\vec{a}$ is a **random** parameter that is **re-sampled at every training step** and **fixed to $\frac{l + u}{2}$ during tests and inference**

### ELU

This function pushes the average output towards $0$, so it may converge faster

$$
ELU(x) = \begin{cases} x \text{ if } x \geq 0 \\ a \cdot (e^{x} - 1) \text{ if } x < 0 \end{cases}
$$

It is **differentiable** at every point as long as $a = 1$, which is usually the case.

$$
\frac{d\,ELU(x)}{dx} = \begin{cases} 1 \text{ if } x \geq 0 \\ a \cdot e^x \text{ if } x < 0 \end{cases}
$$

Like [ReLU](#relu) and the [Saturating Functions](#list-of-saturating-activation-functions), it still saturates for **large negative numbers**

### CELU

This is a flavour of [ELU](#elu)

$$
CELU(x) = \begin{cases} x \text{ if } x \geq 0 \\ a \cdot (e^{\frac{x}{a}} - 1) \text{ if } x < 0 \end{cases}
$$

It is **differentiable** at every point as long as $a > 0$, which is usually the case.

$$
\frac{d\,CELU(x)}{dx} = \begin{cases} 1 \text{ if } x \geq 0 \\ e^{\frac{x}{a}} \text{ if } x < 0 \end{cases}
$$

It has the same **saturation** problems as [ELU](#elu)
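To make the piecewise definitions above concrete, here is a minimal NumPy sketch of the ReLU family listed so far. It simply transcribes the formulas; the function names, default parameters and test values are illustrative, not any library's API.

```python
import numpy as np

def relu(x):
    # ReLU(x) = x if x >= 0 else 0
    return np.where(x >= 0, x, 0.0)

def leaky_relu(x, a=0.01):
    # LeakyReLU(x) = x if x >= 0 else a * x, with a fixed
    return np.where(x >= 0, x, a * x)

def elu(x, a=1.0):
    # ELU(x) = x if x >= 0 else a * (e^x - 1)
    return np.where(x >= 0, x, a * (np.exp(x) - 1.0))

def celu(x, a=1.0):
    # CELU(x) = x if x >= 0 else a * (e^(x/a) - 1); equals ELU when a = 1
    return np.where(x >= 0, x, a * (np.exp(x / a) - 1.0))

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
for f in (relu, leaky_relu, elu, celu):
    print(f.__name__, f(x))
```

For negative inputs, `relu` and `leaky_relu` keep a constant (possibly zero) slope, while `elu` and `celu` saturate towards $-a$, which is the trade-off discussed above.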
### SELU[^SELU]

It aims to normalize both the **average** and the **standard deviation** across `layers`

$$
SELU(x) = \begin{cases} \lambda \cdot x \text{ if } x \geq 0 \\ \lambda a \cdot (e^{x} - 1) \text{ if } x < 0 \end{cases}
$$

It is **differentiable** at every point only if $a = 1$, which is usually not the case, since its recommended values are:

- $a = 1.6733$
- $\lambda = 1.0507$

Apart from this, it is basically a **scaled [ELU](#elu)**.

$$
\frac{d\,SELU(x)}{dx} = \begin{cases} \lambda \text{ if } x \geq 0 \\ \lambda a \cdot e^{x} \text{ if } x < 0 \end{cases}
$$

It has the same problems as [ELU](#elu) for both **differentiability and saturation**, though for the latter, since it **scales**, it has more room for manoeuvre.

### Softplus

This is a smoothed version of [ReLU](#relu) and as such outputs only positive values

$$
Softplus(x) = \frac{1}{\beta} \cdot \ln\left(1 + e^{\beta \cdot x} \right)
$$

It is **differentiable** at every point and is a **smooth approximation** of [ReLU](#relu) aimed at **constraining the output to positive values**. The **larger $\beta$**, the **more similar to [ReLU](#relu)** it becomes

$$
\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta \cdot x}}{e^{\beta \cdot x} + 1}
$$

For **numerical stability**, when $\beta \cdot x$ exceeds a threshold the implementation **reverts to a linear function**

### GELU[^GELU]

This function looks like a smooth ramp and saturates over negative values.

$$
GELU(x) = x \cdot \Phi(x) \\ \Phi(x) = P(X \leq x), \quad X \sim \mathcal{N}(0, 1)
$$

This can be considered a **smooth [ReLU](#relu)**, however it is **not monotonic**

$$
\frac{d\,GELU(x)}{dx} = \Phi(x) + x \cdot \phi(x)
$$

where $\phi$ is the density of the standard normal distribution

### Tanhshrink

$$
Tanhshrink(x) = x - \tanh(x)
$$

It is **differentiable** everywhere, and its **derivative is larger for values far from $0$**

### Softshrink

$$
Softshrink(x) = \begin{cases} x - \lambda \text{ if } x \geq \lambda \\ 0 \text{ otherwise} \\ x + \lambda \text{ if } x \leq -\lambda \end{cases}
$$

$$
\frac{d\,Softshrink(x)}{dx} = \begin{cases} 1 \text{ if } x \geq \lambda \\ 0 \text{ otherwise} \\ 1 \text{ if } x \leq -\lambda \end{cases}
$$

It can be considered a **step of the `L1` criterion**, and a **hard approximation of [Tanhshrink](#tanhshrink)**. It is also a step of the [ISTA](https://nikopj.github.io/blog/understanding-ista/) algorithm, but it is **not commonly used as an activation function**.
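As a rough illustration of this group, here is a small NumPy sketch of Softplus (with the linear fallback mentioned above), GELU via the exact normal CDF, Tanhshrink and Softshrink. The names and the default values for $\beta$, $\lambda$ and the threshold are assumptions for the example, not a specific library's API.

```python
import math
import numpy as np

def softplus(x, beta=1.0, threshold=20.0):
    # Softplus(x) = (1/beta) * ln(1 + e^(beta*x)); revert to the identity
    # when beta*x is large, to avoid overflowing the exponential
    bx = beta * x
    return np.where(bx > threshold, x,
                    np.log1p(np.exp(np.minimum(bx, threshold))) / beta)

def gelu(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF written via erf
    phi_cdf = 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))
    return x * phi_cdf

def tanhshrink(x):
    # Tanhshrink(x) = x - tanh(x)
    return x - np.tanh(x)

def softshrink(x, lam=0.5):
    # shrink towards 0 by lambda, zeroing everything in [-lambda, lambda]
    return np.where(x > lam, x - lam, np.where(x < -lam, x + lam, 0.0))

x = np.linspace(-3.0, 3.0, 7)
for f in (softplus, gelu, tanhshrink, softshrink):
    print(f.__name__, np.round(f(x), 3))
```

With $\beta = 1$, $Softplus(0) = \ln 2 \approx 0.69$ rather than $0$; increasing $\beta$ sharpens the curve towards [ReLU](#relu), as noted above.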
### Hardshrink

$$
Hardshrink(x) = \begin{cases} x \text{ if } x \geq \lambda \\ 0 \text{ otherwise} \\ x \text{ if } x \leq -\lambda \end{cases}
$$

$$
\frac{d\,Hardshrink(x)}{dx} = \begin{cases} 1 \text{ if } x > \lambda \\ 0 \text{ otherwise} \\ 1 \text{ if } x < -\lambda \end{cases}
$$

This is even **harsher** than [Softshrink](#softshrink), as **it is not continuous**, so it **MUST BE AVOIDED IN BACKPROPAGATION**

## List of Saturating Activation Functions

### ReLU6

$$
ReLU6(x) = \begin{cases} 6 \text{ if } x \geq 6 \\ x \text{ if } 0 \leq x < 6 \\ 0 \text{ if } x < 0 \end{cases}
$$

It is not ***differentiable*** at $0$ and $6$; in practice we set the derivative there to $0$ or $1$, though any value between them is acceptable

$$
\frac{d\,ReLU6(x)}{dx} = \begin{cases} 0 \text{ if } x \geq 6 \\ 1 \text{ if } 0 \leq x < 6 \\ 0 \text{ if } x < 0 \end{cases}
$$

### Sigmoid | AKA Logistic

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

It is **differentiable**, and gives large importance to **small** values while **bounding the result between $0$ and $1$**.

$$
\frac{d\,\sigma(x)}{dx} = \sigma(x)\cdot (1 - \sigma(x))
$$

It usually works well as a **switch** for portions of our `system` because it is **differentiable**, however it **shouldn't be used across many layers** as it **saturates very quickly**

### Tanh

$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

$$
\frac{d\,\tanh(x)}{dx} = 1 - \tanh^2(x)
$$

It is **differentiable**, and gives large importance to **small** values while **bounding the result between $-1$ and $1$**, making **convergence faster** as it has a **$0$ mean**

### Softsign

$$
Softsign(x) = \frac{x}{1 + |x|}
$$

$$
\frac{d\,Softsign(x)}{dx} = \frac{1}{x^2 + 2|x| + 1}
$$

### Hardtanh

$$
Hardtanh(x) = \begin{cases} M \text{ if } x \geq M \\ x \text{ if } m \leq x < M \\ m \text{ if } x < m \end{cases}
$$

It is not ***differentiable*** at $m$ and $M$, but it **works well with small values around $0$**.

$$
\frac{d\,Hardtanh(x)}{dx} = \begin{cases} 0 \text{ if } x \geq M \\ 1 \text{ if } m \leq x < M \\ 0 \text{ if } x < m \end{cases}
$$

$M$ and $m$ are **usually $1$ and $-1$ respectively**, but they can be changed.

### Threshold | AKA Heaviside

$$
Threshold(x) = \begin{cases} 1 \text{ if } x \geq \text{thresh} \\ 0 \text{ if } x < \text{thresh} \end{cases}
$$

We usually don't use this because **we can't propagate the gradient back** through it

### LogSigmoid

$$
LogSigmoid(x) = \ln \left( \frac{1}{1 + e^{-x}} \right)
$$

$$
\frac{d\,LogSigmoid(x)}{dx} = \frac{1}{1 + e^x}
$$

This was designed to **help with numerical instabilities**

### Softmin

$$
Softmin(x_j) = \frac{ e^{-x_j} }{ \sum_{i=0}^{N} e^{-x_i} }
$$

**IT IS AN OUTPUT FUNCTION**: it transforms the `input` into a vector of **probabilities**, giving **higher values** to **small numbers**

### Softmax

$$
Softmax(x_j) = \frac{ e^{x_j} }{ \sum_{i=0}^{N} e^{x_i} }
$$

**IT IS AN OUTPUT FUNCTION**: it transforms the `input` into a vector of **probabilities**, giving **higher values** to **high numbers**

In a way, we could say that the **softmax** is just a [sigmoid](#sigmoid--aka-logistic) computed over all values, instead of just one. In other words, a **softmax** is a **generalization** of the **[sigmoid](#sigmoid--aka-logistic)** function.
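Since Softmin and Softmax are output functions, a quick sketch may help show what they produce. This is a minimal NumPy example that transcribes the two formulas directly, ignores numerical-stability concerns, and uses purely illustrative values.

```python
import numpy as np

def softmax(x):
    # Softmax(x_j) = e^(x_j) / sum_i e^(x_i)
    e = np.exp(x)
    return e / e.sum()

def softmin(x):
    # Softmin(x_j) = e^(-x_j) / sum_i e^(-x_i), i.e. Softmax applied to -x
    return softmax(-x)

scores = np.array([1.0, 2.0, 3.0])
print(softmax(scores))        # most mass on the largest score
print(softmin(scores))        # most mass on the smallest score
print(softmax(scores).sum())  # a probability vector: sums to 1
```

Note that $Softmin(x) = Softmax(-x)$, which is one way to see the "higher values to small numbers" behaviour.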
### LogSoftmax

$$
LogSoftmax(x_j) = \ln \left( \frac{ e^{x_j} }{ \sum_{i=0}^{N} e^{x_i} } \right)
$$

It is **uncommon as an `activation-function`** and is **mostly used inside `loss-functions`**: it helps to **deal with numerical instabilities** and acts as a **component of other losses**

[^PReLU]: [Delving Deep into Rectifiers | arXiv:1502.01852v1 | 6 Feb 2015](https://arxiv.org/pdf/1502.01852v1)
[^RReLU]: [Empirical Evaluation of Rectified Activations in Convolutional Network | arXiv:1505.00853v2](https://arxiv.org/pdf/1505.00853v2)
[^SELU]: [Self-Normalizing Neural Networks | arXiv:1706.02515v5](https://arxiv.org/pdf/1706.02515v5)
[^GELU]: [GitHub Page | 2nd April 2025](https://alaaalatif.github.io/2019-04-11-gelu/)
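To illustrate why LogSoftmax is preferred over taking the log of a Softmax, here is a small PyTorch sketch (assuming `torch` is installed; the input values are deliberately extreme and purely illustrative).

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0, -1000.0]])  # deliberately extreme values
target = torch.tensor([0])

# naive: the small probabilities underflow to 0, so their log becomes -inf
naive = torch.log(F.softmax(logits, dim=1))

# LogSoftmax computes the same quantity in a numerically stable way
stable = F.log_softmax(logits, dim=1)

print(naive)   # contains -inf entries
print(stable)  # finite log-probabilities

# as a component of a loss: log-softmax followed by negative log-likelihood
loss = F.nll_loss(stable, target)
print(loss)    # same value as F.cross_entropy(logits, target)
```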