# Activation Functions
## Vanishing Gradient
One problem with **Activation Functions** is that ***some*** of them have a ***minuscule derivative, often less than 1***.
Since `backpropagation` is ***multiplicative***, with more than one `layer` the `gradient`
tends towards $0$.
<!--TODO: Insert Sigmoid -->
These **functions** are usually said to be **saturating**, meaning that they have **horizontal asymptotes**, where
their **derivative is near $0$**
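As a minimal sketch (illustration only, not from the source), chaining the derivative of a saturating function such as the sigmoid shows how quickly the gradient collapses:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # at most 0.25, reached at x = 0

# Backpropagation multiplies one such factor per layer, so even in the
# best case the gradient shrinks like 0.25**depth.
grad = 1.0
for _ in range(10):
    grad *= sigmoid_prime(0.0)
print(grad)  # ~9.5e-07 after only 10 layers
```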
## List of Non-Saturating Activation Functions
<!--TODO: Insert Graphs-->
### ReLU
$$
ReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
0 \text{ if } x < 0
\end{cases}
$$
It is not ***differentiable*** at $0$; there we usually set the derivative to $0$ or $1$, though any value between them is
acceptable
$$
\frac{d\,ReLU(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
0 \text{ if } x < 0
\end{cases}
$$
The problem is that this function **saturates** for
**negative values**
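A minimal NumPy sketch of the function and of the sub-gradient convention above (illustrative, not a reference implementation):
```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # convention from above: the derivative at 0 is taken from the x >= 0 branch
    return np.where(x >= 0.0, 1.0, 0.0)
```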
### Leaky ReLU
$$
LeakyReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
a\cdot x \text{ if } x < 0
\end{cases}
$$
It is not ***differentiable*** at $0$; there we usually set the derivative to $a$ or $1$, though any value between them is
acceptable
$$
\frac{d\,LeakyReLU(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
a \text{ if } x < 0
\end{cases}
$$
Here $a$ is a **fixed** parameter
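A minimal NumPy sketch; the slope $a = 0.01$ is only a common default, not a value prescribed by this section:
```python
import numpy as np

def leaky_relu(x, a=0.01):
    return np.where(x >= 0.0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    return np.where(x >= 0.0, 1.0, a)
```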
### Parametric ReLU | AKA PReLU[^PReLU]
$$
PReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
\vec{a} \cdot x \text{ if } x < 0
\end{cases}
$$
It is not ***differentiable*** at $0$; there we usually set the
derivative to $\vec{a}$ or $1$, though any value between them
is acceptable.
$$
\frac{d\,PReLU(x) }{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
\vec{a} \text{ if } x < 0
\end{cases}
$$
Unlike [LeakyReLU](#leaky-relu), here $\vec{a}$
is a **learnable** parameter that can differ across
**channels** (features)
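A minimal NumPy sketch of the forward pass and of the gradient with respect to the learnable slope; the per-channel layout and the initial values are illustrative assumptions:
```python
import numpy as np

def prelu(x, a):
    # x: (batch, channels), a: (channels,) learnable slope, one per channel
    return np.where(x >= 0.0, x, a * x)

def prelu_grad_a(x):
    # d PReLU / d a = x where x < 0, and 0 elsewhere; an optimizer would
    # use this to update a during training
    return np.where(x < 0.0, x, 0.0)

x = np.array([[1.5, -2.0],
              [-0.5, 3.0]])
a = np.array([0.25, 0.1])   # hypothetical initial slopes
y = prelu(x, a)
```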
### Randomized Leaky ReLU | AKA RReLU[^RReLU]
$$
RReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
\vec{a} \cdot x \text{ if } x < 0
\end{cases} \\
a_{i,j} \sim U (l, u): \;l < u \wedge l, u \in [0, 1[
$$
It is not ***differentiable*** at $0$; there we usually set the derivative to $\vec{a}$ or $1$, though any value between them is
acceptable
$$
\frac{d\,RReLU(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
\vec{a} \text{ if } x < 0
\end{cases}
$$
Here $\vec{a}$ is a **random** parameter that is
**re-sampled at every step** during **training** and **fixed**
during **evaluation and inference** to $\frac{l + u}{2}$
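A minimal NumPy sketch; the bounds $l = \frac{1}{8}$ and $u = \frac{1}{3}$ are a common choice, not values taken from this section:
```python
import numpy as np

rng = np.random.default_rng(0)

def rrelu(x, l=1/8, u=1/3, training=True):
    if training:
        a = rng.uniform(l, u, size=x.shape)  # re-sampled at every training step
    else:
        a = (l + u) / 2.0                    # fixed during evaluation/inference
    return np.where(x >= 0.0, x, a * x)
```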
### ELU
This function pushes the mean of the outputs towards $0$, so the network may
converge faster
$$
ELU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
a \cdot (e^{x} -1) \text{ if } x < 0
\end{cases}
$$
It is **differentiable** at every point, as long as $a = 1$,
which is usually the case.
$$
\frac{d\,ELU(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
a \cdot e^x \text{ if } x < 0
\end{cases}
$$
Like [ReLU](#relu) and the [Saturating Functions](#list-of-saturating-activation-functions), it still
**saturates for large negative values**
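A minimal NumPy sketch; the exponent is clipped at $0$ only to avoid overflow on the branch that is not used (an implementation detail, not part of the definition):
```python
import numpy as np

def elu(x, a=1.0):
    # np.where evaluates both branches, so clip the exponent for positive x
    return np.where(x >= 0.0, x, a * (np.exp(np.minimum(x, 0.0)) - 1.0))

def elu_grad(x, a=1.0):
    return np.where(x >= 0.0, 1.0, a * np.exp(np.minimum(x, 0.0)))
```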
### CELU
This is a flavour of [ELU](#elu)
$$
CELU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
a \cdot (e^{\frac{x}{a}} -1) \text{ if } x < 0
\end{cases}
$$
It is **differentiable** at every point, as long as $a > 0$,
which is usually the case.
$$
\frac{d\,CELU(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
e^{\frac{x}{a}} \text{ if } x < 0
\end{cases}
$$
It has the same **saturation** problem as [ELU](#elu)
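A minimal NumPy sketch under the same overflow-avoidance caveat as the [ELU](#elu) example:
```python
import numpy as np

def celu(x, a=1.0):
    return np.where(x >= 0.0, x, a * (np.exp(np.minimum(x, 0.0) / a) - 1.0))
```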
### SELU[^SELU]
It aims to keep both the
**mean** and the **standard deviation** of the activations normalized across
`layers`
$$
SELU(x) =
\begin{cases}
\lambda \cdot x \text{ if } x \geq 0 \\
\lambda a \cdot (e^{x} -1) \text{ if } x < 0
\end{cases}
$$
It is **differentiable** at every point, as long as $a = 1$,
though this is usually not the case, as its
recommended values are:
- $a = 1.6733$
- $\lambda = 1.0507$
Apart from this, it is basically a
**scaled [ELU](#elu)**.
$$
\frac{d\,SELU(x)}{dx} =
\begin{cases}
\lambda \text{ if } x \geq 0 \\
\lambda a \cdot e^{x} \text{ if } x < 0
\end{cases}
$$
It has the same problems as [ELU](#elu) for both
**differentiability and saturation**, though for the
latter, since it **scales** the output, it has more
room for manoeuvre.
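A minimal NumPy sketch using the recommended constants from this section:
```python
import numpy as np

ALPHA = 1.6733   # recommended a
LAMBDA = 1.0507  # recommended lambda

def selu(x):
    # exponent clipped at 0 only to avoid overflow on the unused branch
    return LAMBDA * np.where(x >= 0.0, x, ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))
```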
### Softplus
This is a smoothed version of [ReLU](#relu) and, as such, outputs only positive
values
$$
Softplus(x) =
\frac{1}{\beta} \cdot
\left(
\ln{
\left(1 + e^{\beta \cdot x} \right)
}
\right)
$$
It is **differentiable** at every point and is a
**smooth approximation** of [ReLU](#relu) aimed
to **constrain the output to positive values**.
The **larger $\beta$**, the **closer it is to [ReLU](#relu)**
$$
\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta \cdot x}}{e^{\beta \cdot x} + 1}
$$
For **numerical stability**, when $\beta \cdot x$ exceeds a threshold the
implementation **reverts to a linear function**
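A minimal NumPy sketch; the threshold value of $20$ mirrors a common implementation choice and is not taken from this section:
```python
import numpy as np

def softplus(x, beta=1.0, threshold=20.0):
    z = beta * x
    # revert to the linear function x when beta * x is large, for numerical stability
    return np.where(z > threshold, x, np.log1p(np.exp(np.minimum(z, threshold))) / beta)
```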
### GELU[^GELU]
This function saturates over negative values.
$$
GELU(x) = x \cdot \Phi(x) \\
\Phi(x) = P(X \leq x), \quad X \sim \mathcal{N}(0, 1)
$$
This can be considered a **smooth [ReLU](#relu)**;
however, it is **not monotonic**
$$
\frac{d\,GELU(x)}{dx} = \Phi(x) + x \cdot \phi(x)
$$
where $\phi(x)$ is the **probability density function** of $\mathcal{N}(0, 1)$
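A minimal sketch expressing $\Phi$ through the error function, an equivalent form rather than anything stated in this section:
```python
from math import erf, sqrt

def gelu(x):
    phi_cdf = 0.5 * (1.0 + erf(x / sqrt(2.0)))  # standard normal CDF
    return x * phi_cdf

print(gelu(1.0), gelu(-1.0))  # ~0.841 and ~-0.159: negative inputs are damped, not cut
```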
### Tanhshrink
$$
Tanhshrink(x) = x - \tanh(x)
$$
Its **derivative is larger** for **values far from $0$**,
while remaining **differentiable** everywhere
### Softshrink
$$
Softshrink(x) =
\begin{cases}
x - \lambda \text{ if } x \geq \lambda \\
x + \lambda \text{ if } x \leq -\lambda \\
0 \text{ otherwise}
\end{cases}
$$
$$
\frac{d\,Softshrink(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq \lambda \\
1 \text{ if } x \leq -\lambda \\
0 \text{ otherwise}
\end{cases}
$$
It can be considered a **proximal step of the `L1` penalty**
and a
**hard approximation of [Tanhshrink](#tanhshrink)**.
It is also the shrinkage step of the
[ISTA](https://nikopj.github.io/blog/understanding-ista/)
algorithm, but it is **not commonly used as an activation
function**.
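A minimal NumPy sketch of the soft-thresholding operation; $\lambda = 0.5$ is only a placeholder default:
```python
import numpy as np

def softshrink(x, lam=0.5):
    # shrink towards 0 by lam and zero out the band [-lam, lam]
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```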
### Hardshrink
$$
Hardshrink(x) =
\begin{cases}
x \text{ if } x \geq \lambda \\
x \text{ if } x \leq -\lambda \\
0 \text{ otherwise}
\end{cases}
$$
$$
\frac{d\,Hardshrink(x)}{dx} =
\begin{cases}
1 \text{ if } x > \lambda \\
1 \text{ if } x < -\lambda \\
0 \text{ otherwise}
\end{cases}
$$
This is even **harsher** than the
[Softshrink](#softshrink) function, as **it's not
continuous**, so it
**MUST BE AVOIDED IN BACKPROPAGATION**
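A minimal NumPy sketch, with $\lambda = 0.5$ as a placeholder default:
```python
import numpy as np

def hardshrink(x, lam=0.5):
    # pass the value through unchanged outside [-lam, lam], zero it inside
    return np.where(np.abs(x) > lam, x, 0.0)
```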
## List of Saturating Activation Functions
### ReLU6
$$
ReLU6(x) =
\begin{cases}
6 \text{ if } x \geq 6 \\
x \text{ if } 0 \leq x < 6 \\
0 \text{ if } x < 0
\end{cases}
$$
It is not ***differentiable*** at $0$ and $6$; there we usually
set the derivative to $0$ or $1$, though any value
between them is acceptable
$$
\frac{d\,ReLU6(x)}{dx} =
\begin{cases}
0 \text{ if } x \geq 6 \\
1 \text{ if } 0 \leq x < 6\\
0 \text{ if } x < 0
\end{cases}
$$
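A minimal NumPy sketch of the function and of the derivative convention above:
```python
import numpy as np

def relu6(x):
    return np.clip(x, 0.0, 6.0)

def relu6_grad(x):
    return np.where((x >= 0.0) & (x < 6.0), 1.0, 0.0)
```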
### Sigmoid | AKA Logistic
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
It is **differentiable**, is most sensitive to **small** values (around $0$), and **bounds the result between
$0$ and $1$**.
$$
\frac{d\,\sigma(x)}{dx} = \sigma(x)\cdot (1 - \sigma(x))
$$
It is usually good as a **switch** for portions of
our `system` because it is **differentiable**; however,
it **shouldn't be used across many layers** as it
**saturates very quickly**
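A minimal NumPy sketch showing how quickly the derivative vanishes away from $0$:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0), sigmoid_grad(5.0))  # 0.25 vs ~0.0066: saturation sets in fast
```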
### Tanh
$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$
$$
\frac{d\,\tanh(x)}{dx} = - \tanh^2(x) + 1
$$
It is **differentiable**, is most sensitive to
**small** values, and
**bounds the result between $-1$ and $1$**,
making **convergence faster** as its output has
**$0$ mean**
### Softsign
$$
Softsign(x) = \frac{x}{1 + |x|}
$$
$$
\frac{d\,Softsign(x)}{dx} = \frac{1}{x^2 + 2|x| + 1}
$$
### Hardtanh
$$
Hardtanh(x) =
\begin{cases}
M \text{ if } x \geq M \\
x \text{ if } m \leq x < M \\
m \text{ if } x < m
\end{cases}
$$
It is not ***differentiable*** at $m$ and $M$, but
**works well for small values around $0$**.
$$
\frac{d\,Hardtanh(x)}{dx} =
\begin{cases}
0 \text{ if } x \geq M \\
1 \text{ if } m \leq x < M\\
0 \text{ if } x < m
\end{cases}
$$
$M$ and $m$ are **usually $1$ and $-1$ respectively**,
but can be changed.
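A minimal NumPy sketch with the usual bounds $m = -1$ and $M = 1$:
```python
import numpy as np

def hardtanh(x, m=-1.0, M=1.0):
    return np.clip(x, m, M)
```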
### Threshold | AKA Heaviside
$$
Threshold(x) =
\begin{cases}
1 \text{ if } x \geq thresh \\
0 \text{ if } x < thresh
\end{cases}
$$
We usually don't use this as
**we can't propagate the gradient back**
### LogSigmoid
$$
LogSigmoid(x) = \ln \left(
\frac{
1
}{
1 + e^{-x}
}
\right)
$$
$$
\frac{d\,LogSigmoid(x)}{dx} = \frac{1}{1 + e^x}
$$
This was designed to
**help with numerical instabilities**
### Softmin
$$
Softmin(x_j) = \frac{
e^{-x_j}
}{
\sum_{i=0}^{N} e^{-x_i}
} \quad \forall j \in \{0, ..., N\}
$$
**IT IS AN OUTPUT FUNCTION** and transforms the `input`
into a vector of **probabilities**, giving
**higher values** to **small numbers**
### Softmax
$$
Softmax(x_j) = \frac{
e^{x_j}
}{
\sum_{i=0}^{N} e^{x_i}
} \quad \forall j \in \{0, ..., N\}
$$
**IT IS AN OUTPUT FUNCTION** and transforms the `input`
into a vector of **probabilities**, giving
**higher values** to **high numbers**
In a way, we could say that the **softmax** applies a
[sigmoid](#sigmoid--aka-logistic)-like squashing to all
values at once, instead of just one.
In other words, the **softmax** is a **generalization**
of the **[sigmoid](#sigmoid--aka-logistic)**
function.
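A minimal NumPy sketch; subtracting the maximum is a standard numerical-stability trick that does not change the result:
```python
import numpy as np

def softmax(x):
    z = x - np.max(x)   # shift for numerical stability; the output is unchanged
    e = np.exp(z)
    return e / np.sum(e)

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())  # higher inputs get higher probability; the vector sums to 1
```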
### LogSoftmax
$$
LogSoftmax(x_j) = \ln \left(
\frac{
e^{x_j}
}{
\sum_{i=0}^{N} e^{x_i}
}
\right ) \quad \forall j \in \{0, ..., N\}
$$
It is used **mostly inside `loss-functions`** and is **uncommon
as an `activation-function`**; it helps to
**deal with numerical instabilities** and serves
as a **component of other losses**
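A minimal NumPy sketch using the log-sum-exp trick, which is one common way (an assumption here, not stated above) to keep the computation stable:
```python
import numpy as np

def log_softmax(x):
    z = x - np.max(x)                     # shift for numerical stability
    return z - np.log(np.sum(np.exp(z)))  # log-sum-exp: no overflow in exp

print(log_softmax(np.array([1000.0, 1001.0, 1002.0])))  # stays finite
```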
[^PReLU]: [Microsoft Paper | arXiv:1502.01852v1 [cs.CV] 6 Feb 2015](https://arxiv.org/pdf/1502.01852v1)
[^RReLU]: [Empirical Evaluation of Rectified Activations in Convolution Network](https://arxiv.org/pdf/1505.00853v2)
[^SELU]: [Self-Normalizing Neural Networks](https://arxiv.org/pdf/1706.02515v5)
[^GELU]: [Github Page | 2nd April 2025](https://alaaalatif.github.io/2019-04-11-gelu/)