|
|
|
# Activation Functions
|
|
|
|
|
|
|
|
|
|
## Vanishing Gradient
|
|
|
|
|
|
|
|
|
|
One problem with **Activation Functions** is that ***some*** of them have a ***minuscule derivative, often less than $1$***.
|
|
|
|
|
|
|
|
|
|
The problem is that if we have more than one `layer`, since `backpropagation` is ***multiplicative***, the `gradient`
|
|
|
|
|
tends towards $0$.
|
|
|
|
|
|
|
|
|
|
<!--TODO: Insert Sigmoid -->
|
|
|
|
|
|
|
|
|
|
Usually these **functions** are said to be **saturating**, meaning that they have **horizontal asymptotes**, where
|
|
|
|
|
their **derivative is near $0$**
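
As a rough illustration (a minimal NumPy sketch, not tied to any specific framework), multiplying many sigmoid derivatives, each at most $0.25$, quickly drives the product towards $0$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Pre-activations of a hypothetical 20-layer network, all set to 0,
# where the sigmoid derivative is at its maximum (0.25).
pre_activations = np.zeros(20)

# Backpropagation multiplies the local derivatives layer after layer,
# so even in this best case the gradient factor shrinks geometrically.
gradient_factor = np.prod(sigmoid_derivative(pre_activations))
print(gradient_factor)  # 0.25**20 ~= 9.1e-13, i.e. a vanished gradient
```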
|
|
|
|
|
|
|
|
|
|
## List of Non-Saturating Activation Functions
|
|
|
|
|
|
|
|
|
<!--TODO: Insert Graphs-->
|
|
|
|
|
|
|
|
|
### ReLU
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
ReLU(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
x \text{ if } x \geq 0 \\
|
|
|
|
|
0 \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is not ***differentiable*** at $0$, but there we usually set the derivative to $0$ or $1$, though any value between them is
|
|
|
|
|
acceptable
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,ReLU(x)}{dx} =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
1 \text{ if } x \geq 0 \\
|
|
|
|
|
0 \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
The problem is that this function **saturates** for
|
|
|
|
|
**negative values**
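
A minimal NumPy sketch of the function and of the subgradient convention described above (using $0$ at $x = 0$, which is only one of the acceptable choices):

```python
import numpy as np

def relu(x):
    # max(x, 0), applied element-wise
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Subgradient convention: 0 at x == 0 (any value in [0, 1] would do)
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```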
|
|
|
|
|
|
|
|
|
|
### Leaky ReLU
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
LeakyReLU(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
x \text{ if } x \geq 0 \\
|
|
|
|
|
a\cdot x \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is not ***differentiable*** at $0$, but there we usually set the derivative to $a$ or $1$, though any value between them is
|
|
|
|
|
acceptable
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,LeakyReLU(x)}{dx} =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
1 \text{ if } x \geq 0 \\
|
|
|
|
|
a \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
Here $a$ is a **fixed** parameter
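
A minimal sketch, assuming a fixed slope $a = 0.01$ (a common default, not prescribed by the notes above):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # x for x >= 0, a * x for x < 0
    return np.where(x >= 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    # 1 for x >= 0, a for x < 0 (convention at 0: use 1)
    return np.where(x >= 0, 1.0, a)

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))       # [-0.03 -0.01  0.    2.  ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ]
```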
|
|
|
|
|
|
|
|
|
|
### Parametric ReLU | AKA PReLU[^PReLU]
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
PReLU(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
x \text{ if } x \geq 0 \\
|
|
|
|
|
\vec{a} \cdot x \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is not ***differentiable*** at $0$, but there we usually set the
|
|
|
|
|
derivative to $\vec{a}$ or $1$, though any value between them
|
|
|
|
|
is acceptable.
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,PReLU(x) }{dx} =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
1 \text{ if } x \geq 0 \\
|
|
|
|
|
\vec{a} \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
Unlike [LeakyReLU](#leaky-relu), here $\vec{a}$
|
|
|
|
|
is a **learnable** parameter that can differ across
|
|
|
|
|
**channels** (features)
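
A minimal NumPy sketch of the forward pass with one learnable slope per channel (the helper names are hypothetical; in a real framework the slopes would be updated by the optimizer like any other weight):

```python
import numpy as np

def prelu(x, a):
    # x: (batch, channels), a: (channels,) learnable slopes
    return np.where(x >= 0, x, a * x)

def prelu_grad_a(x):
    # Gradient of the output w.r.t. each slope a_c: x where x < 0, else 0
    return np.where(x < 0, x, 0.0)

x = np.array([[-1.0, 2.0], [3.0, -4.0]])  # batch of 2, 2 channels
a = np.array([0.25, 0.1])                 # per-channel slopes
print(prelu(x, a))      # [[-0.25  2.  ] [ 3.   -0.4 ]]
print(prelu_grad_a(x))  # [[-1.  0.] [ 0. -4.]]
```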
|
|
|
|
|
|
|
|
|
|
### Randomized Leaky ReLU | AKA RReLU[^RReLU]
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
RReLU(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
x \text{ if } x \geq 0 \\
|
|
|
|
\vec{a} \cdot x \text{ if } x < 0
|
|
|
|
|
\end{cases} \\
|
|
|
|
|
a_{i,j} \sim U (l, u): \;l < u \wedge l, u \in [0, 1[
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is not ***differentiable*** at $0$, but there we usually set the derivative to $\vec{a}$ or $1$, though any value between them is
|
|
|
|
|
acceptable
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\begin{aligned}
|
|
|
|
|
\frac{d\,RReLU(x)}{dx} &=
|
|
|
|
|
\begin{cases}
|
|
|
|
|
1 \text{ if } x \geq 0 \\
|
|
|
|
\vec{a} \text{ if } x < 0
|
|
|
|
\end{cases}
|
|
|
|
|
|
|
|
|
|
|
|
|
\end{aligned}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
Here $\vec{a}$ is a **random** parameter that is
|
|
|
|
**sampled anew at every pass** during **training** and **fixed**
|
|
|
|
|
during **testing and inference** to $\frac{l + u}{2}$
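
A minimal sketch of the training vs. inference behaviour described above (slopes drawn from $U(l, u)$ while training, fixed to $\frac{l + u}{2}$ otherwise); the bounds are just example values:

```python
import numpy as np

rng = np.random.default_rng(0)

def rrelu(x, l=1/8, u=1/3, training=True):
    if training:
        # A fresh random slope per element at every training pass
        a = rng.uniform(l, u, size=x.shape)
    else:
        # Deterministic slope at test/inference time
        a = (l + u) / 2
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, -1.0, 0.5])
print(rrelu(x, training=True))   # negative slopes vary run to run
print(rrelu(x, training=False))  # negative part scaled by (l + u) / 2
```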
|
|
|
|
|
|
|
|
|
|
### ELU
|
|
|
|
|
|
|
|
|
This function lets the activations average out close to $0$, thus it may
|
|
|
|
|
converge faster
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
ELU(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
x \text{ if } x \geq 0 \\
|
|
|
|
|
a \cdot (e^{x} -1) \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is **differentiable** at every point, as long as $a = 1$,
|
|
|
|
|
which is usually the case.
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,ELU(x)}{dx} =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
1 \text{ if } x \geq 0 \\
|
|
|
|
|
a \cdot e^x \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
Like [ReLU](#relu) and all [Saturating Functions](#list-of-saturating-activation-functions), it shares
|
|
|
|
|
the saturation problem for **large negative inputs**
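
A minimal NumPy sketch with the usual $a = 1$ (an assumption, matching the text above):

```python
import numpy as np

def elu(x, a=1.0):
    # x for x >= 0, a * (exp(x) - 1) for x < 0
    return np.where(x >= 0, x, a * (np.exp(x) - 1.0))

def elu_grad(x, a=1.0):
    # 1 for x >= 0, a * exp(x) for x < 0
    return np.where(x >= 0, 1.0, a * np.exp(x))

x = np.array([-10.0, -1.0, 0.0, 2.0])
print(elu(x))       # saturates towards -a for large negative inputs
print(elu_grad(x))  # gradient decays towards 0 on the negative side
```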
|
|
|
|
|
|
|
|
|
|
### CELU
|
|
|
|
|
|
|
|
|
|
This is a flavour of [ELU](#elu)
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
CELU(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
x \text{ if } x \geq 0 \\
|
|
|
|
|
a \cdot (e^{\frac{x}{a}} -1) \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is **differentiable** at every point, as long as $a > 0$,
|
|
|
|
|
which is usually the case.
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,CELU(x)}{dx} =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
1 \text{ if } x \geq 0 \\
|
|
|
|
|
e^{\frac{x}{a}} \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It has the same problems as [ELU](#elu) for
|
|
|
|
|
**saturation**
|
|
|
|
|
|
|
|
|
|
### SELU[^SELU]
|
|
|
|
|
|
|
|
|
|
It aims to normalize both the
|
|
|
|
|
**average** and the **standard deviation** of the activations across
|
|
|
|
|
`layers`
|
|
|
|
|
$$
|
|
|
|
|
SELU(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
\lambda \cdot x \text{ if } x \geq 0 \\
|
|
|
|
|
\lambda a \cdot (e^{x} -1) \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is **differentiable** at every point, as long as $a = 1$,
|
|
|
|
|
though this is usually not the case, as its
|
|
|
|
|
recommended values are:
|
|
|
|
|
|
|
|
|
|
- $a = 1.6733$
|
|
|
|
|
- $\lambda = 1.0507$
|
|
|
|
|
|
|
|
|
|
Apart from this, this is basically a
|
|
|
|
|
**scaled [ELU](#elu)**.
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,SELU(x)}{dx} =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
\lambda \text{ if } x \geq 0 \\
|
|
|
|
|
\lambda a \cdot e^{x} \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It has the same problems as [ELU](#elu) for both
|
|
|
|
|
**differentiability and saturation**, though for the
|
|
|
|
|
latter, since it **scales** the output, it has more
|
|
|
|
|
room to manoeuvre.
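
A minimal sketch using the recommended constants listed above:

```python
import numpy as np

ALPHA = 1.6733   # recommended a
LAMBDA = 1.0507  # recommended lambda (the scaling factor)

def selu(x):
    # lambda * x for x >= 0, lambda * a * (exp(x) - 1) for x < 0
    return np.where(x >= 0, LAMBDA * x, LAMBDA * ALPHA * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.0, 1.0])
print(selu(x))
# With these constants, deep stacks of SELU layers tend to keep the
# activations close to zero mean and unit variance.
```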
|
|
|
|
|
|
|
|
|
|
### Softplus
|
|
|
|
|
|
|
|
|
This is a smoothed version of a [ReLU](#relu) and as such outputs only positive
|
|
|
|
|
values
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
Softplus(x) =
|
|
|
|
|
\frac{1}{\beta} \cdot
|
|
|
|
|
\left(
|
|
|
|
|
\ln{
|
|
|
|
|
\left(1 + e^{\beta \cdot x} \right)
|
|
|
|
|
}
|
|
|
|
|
\right)
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is **derivable** for each point and it's a
|
|
|
|
|
**smooth approximation** of [ReLU](#relu) aimed
|
|
|
|
|
to **constraint the output to positive values**.
|
|
|
|
|
|
|
|
|
|
The **larger $\beta$**, the **more similar it is to [ReLU](#relu)**
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta \cdot x}}{e^{\beta \cdot x} + 1}
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
For **numerical stability**, when $\beta \cdot x > \text{threshold}$, the
|
|
|
|
|
implementation **reverts to a linear function**
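
A minimal sketch of this behaviour (the threshold value of $20$ is just an illustrative assumption):

```python
import numpy as np

def softplus(x, beta=1.0, threshold=20.0):
    scaled = beta * x
    # Where beta * x is large, exp would overflow and softplus(x) ~= x,
    # so revert to the identity for those entries.
    safe = np.minimum(scaled, threshold)
    smooth = np.log1p(np.exp(safe)) / beta
    return np.where(scaled > threshold, x, smooth)

x = np.array([-5.0, 0.0, 5.0, 100.0])
print(softplus(x))             # last entry handled by the linear branch
print(softplus(x, beta=10.0))  # larger beta hugs ReLU more closely
```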
|
|
|
|
|
|
|
|
|
|
### GELU[^GELU]
|
|
|
|
|
|
|
|
|
This function saturates towards $0$ for large negative values.
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
GELU(x) = x \cdot \Phi(x) \\
|
|
|
|
|
\Phi(x) = P(X \leq x) \,\, X \sim \mathcal{N}(0, 1)
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
This can be considered as a **smooth [ReLU](#relu)**,
|
|
|
|
|
however it's **not monotonic**
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,GELU(x)}{dx} = \Phi(x) + x \cdot \varphi(x), \quad \varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}
|
|
|
|
|
$$
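
A minimal sketch using the exact CDF via the error function (the $\tanh$ approximation found in many libraries is left out here):

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    phi = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in x])
    return x * phi

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(gelu(x))
# Note the small negative dip for moderately negative x: this is why
# GELU is not monotonic, unlike ReLU.
```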
|
|
|
|
|
|
|
|
|
|
### Tanhshrink
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
Tanhshrink(x) = x - \tanh(x)
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
Its **derivative** takes **bigger values**
|
|
|
|
|
for **values far from $0$**, while being **differentiable** everywhere
|
|
|
|
|
|
|
|
|
|
### Softshrink
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
Softshrink(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
x - \lambda \text{ if } x \geq \lambda \\
|
|
|
|
|
0 \text{ otherwise} \\
|
|
|
|
|
x + \lambda \text{ if } x \leq -\lambda \\
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,Softshrink(x)}{dx} =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
1 \text{ if } x \geq \lambda \\
|
|
|
|
|
0 \text{ otherwise} \\
|
|
|
|
|
1 \text{ if } x \leq -\lambda \\
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It can be considered as a **soft-thresholding step for the `L1` criterion**,
|
|
|
|
|
and as a
|
|
|
|
|
**hard approximation of [Tanhshrink](#tanhshrink)**.
|
|
|
|
|
|
|
|
|
|
It's also a step of
|
|
|
|
|
[ISTA](https://nikopj.github.io/blog/understanding-ista/)
|
|
|
|
|
Algorithm, but **not commonly used as an activation
|
|
|
|
|
function**.
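
A minimal sketch of the soft-thresholding operation, with $\lambda = 0.5$ chosen purely for illustration:

```python
import numpy as np

def softshrink(x, lam=0.5):
    # Shrink every entry towards 0 by lam; anything within [-lam, lam] becomes 0
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
print(softshrink(x))  # [-1.5 -0.   0.   0.   1. ]
```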
|
|
|
|
|
|
|
|
|
|
### Hardshrink
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
Hardshrink(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
x \text{ if } x \geq \lambda \\
|
|
|
|
|
0 \text{ otherwise} \\
|
|
|
|
|
x \text{ if } x \leq -\lambda \\
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,Hardshrink(x)}{dx} =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
1 \text{ if } x > \lambda \\
|
|
|
|
|
0 \text{ otherwise} \\
|
|
|
|
|
1 \text{ if } x < -\lambda \\
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
This is even **harsher** than the
|
|
|
|
|
[Softshrink](#softshrink) function, as **it's not
|
|
|
|
|
continuous**, so it
|
|
|
|
|
**MUST BE AVOIDED IN BACKPROPAGATION**
|
|
|
|
|
|
|
|
|
|
## List of Saturating Activation Functions
|
|
|
|
|
|
|
|
|
|
### ReLU6
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
ReLU6(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
6 \text{ if } x \geq 6 \\
|
|
|
|
|
x \text{ if } 0 \leq x < 6 \\
|
|
|
|
|
0 \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is not ***differentiable*** at $0$ and $6$, but there we usually
|
|
|
|
|
set the derivative to $0$ or $1$, though any value
|
|
|
|
|
between them is acceptable
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,ReLU6(x)}{dx} =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
0 \text{ if } x \geq 6 \\
|
|
|
|
|
1 \text{ if } 0 \leq x < 6\\
|
|
|
|
|
0 \text{ if } x < 0
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
### Sigmoid | AKA Logistic
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\sigma(x) = \frac{1}{1 + e^{-x}}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is **differentiable**, gives large importance to **small** values, and **bounds the result between
|
|
|
|
|
$0$ and $1$**.
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,\sigma(x)}{dx} = \sigma(x)\cdot (1 - \sigma(x))
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is usually good as a **switch** for portions of
|
|
|
|
|
our `system` because it's **differentiable**; however
|
|
|
|
|
it **shouldn't be used for many layers** as it
|
|
|
|
|
**saturates very quickly**
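
A minimal sketch of why stacking many sigmoids is risky: the derivative never exceeds $0.25$ and collapses for inputs far from $0$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # squashed into (0, 1)
print(sigmoid_grad(x))  # peaks at 0.25 for x = 0, ~4.5e-05 at |x| = 10
```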
|
|
|
|
|
|
|
|
|
|
### Tanh
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,\tanh(x)}{dx} = 1 - \tanh^2(x)
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is **differentiable**, and gives large importance
|
|
|
|
|
to **small** values while
|
|
|
|
|
**bounding the result between $-1$ and $1$**,
|
|
|
|
|
making the **convergence faster** as it has a
|
|
|
|
|
**$0$ mean**
|
|
|
|
|
|
|
|
|
|
### Softsign
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
Softsign(x) = \frac{x}{1 + |x|}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,Softsign(x)}{dx} = \frac{1}{\left(1 + |x|\right)^2}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
### Hardtanh
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
Hardtanh(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
M \text{ if } x \geq M \\
|
|
|
|
|
x \text{ if } m \leq x < M \\
|
|
|
|
|
m \text{ if } x < m
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
It is not ***differentiable*** at $m$ and $M$, but
|
|
|
|
**works well for small values around $0$**.
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,Hardtanh(x)}{dx} =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
0 \text{ if } x \geq M \\
|
|
|
|
|
1 \text{ if } m \leq x < M\\
|
|
|
|
|
0 \text{ if } x < m
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
$M$ and $m$ are **usually $1$ and $-1$ respectively**,
|
|
|
|
|
but can be changed.
|
|
|
|
|
|
|
|
|
|
### Threshold | AKA Heaviside
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
Threshold(x) =
|
|
|
|
|
\begin{cases}
|
|
|
|
|
1 \text{ if } x \geq \text{thresh} \\
|
|
|
|
|
0 \text{ if } x < \text{thresh}
|
|
|
|
|
\end{cases}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
We usually don't use this as
|
|
|
|
|
**we can't propagate the gradient back** (its derivative is $0$ everywhere it is defined)
|
|
|
|
|
|
|
|
|
|
### LogSigmoid
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
LogSigmoid(x) = \ln \left(
|
|
|
|
|
\frac{
|
|
|
|
|
1
|
|
|
|
|
}{
|
|
|
|
|
1 + e^{-x}
|
|
|
|
|
}
|
|
|
|
|
\right)
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
\frac{d\,LogSigmoid(x)}{dx} = \frac{1}{1 + e^x}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
This was designed to
|
|
|
|
|
**help with numerical instabilities**
|
|
|
|
|
|
|
|
|
|
### Softmin
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
Softmin(x_j) = \frac{
|
|
|
|
|
e^{-x_j}
|
|
|
|
|
}{
|
|
|
|
|
\sum_{i} e^{-x_i}
|
|
|
|
|
} \quad \forall j \in \{0, \dots, N\}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
**IT IS AN OUTPUT FUNCTION** and transforms the `input`
|
|
|
|
|
into a vector of **probabilities**, giving
|
|
|
|
|
**higher values** to **small numbers**
|
|
|
|
|
|
|
|
|
|
### Softmax
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
Softmax(x_j) = \frac{
|
|
|
|
|
e^{x_j}
|
|
|
|
|
}{
|
|
|
|
|
\sum_{i} e^{x_i}
|
|
|
|
|
} \quad \forall j \in \{0, \dots, N\}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
**IT IS AN OUTPUT FUNCTION** and transforms the `input`
|
|
|
|
|
into a vector of **probabilities**, giving
|
|
|
|
|
**higher values** to **high numbers**
|
|
|
|
|
|
|
|
|
|
In a way, we could say that the **softmax** is
|
|
|
|
|
just a [sigmoid](#sigmoid--aka-logistic) made for
|
|
|
|
|
all values, instead of just one.
|
|
|
|
|
|
|
|
|
|
In other words, a **softmax** is a **generalization**
|
|
|
|
|
of a **[sigmoid](#sigmoid--aka-logistic)**
|
|
|
|
|
function.
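
A minimal numerically-stable sketch (subtracting the max before exponentiating is a standard trick, an implementation detail not mentioned above):

```python
import numpy as np

def softmax(x):
    # Subtract the max so exp never overflows; the result is unchanged
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

x = np.array([1.0, 2.0, 3.0])
print(softmax(x))          # [0.09 0.24 0.67] -> favours the largest entry
print(softmax(-x))         # Softmin(x) is just Softmax(-x)
print(np.sum(softmax(x)))  # 1.0, a valid probability vector
```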
|
|
|
|
|
|
|
|
|
|
### LogSoftmax
|
|
|
|
|
|
|
|
|
|
$$
|
|
|
|
|
LogSoftmax(x_j) = \ln \left(
|
|
|
|
|
\frac{
|
|
|
|
|
e^{x_j}
|
|
|
|
|
}{
|
|
|
|
|
\sum_{i} e^{x_i}
|
|
|
|
|
}
|
|
|
|
|
\right) \quad \forall j \in \{0, \dots, N\}
|
|
|
|
|
$$
|
|
|
|
|
|
|
|
|
|
Used **mostly within `loss-functions`** but **uncommon
|
|
|
|
|
as an `activation-function`**; it is used to
|
|
|
|
|
**deal with numerical instabilities** and
|
|
|
|
|
as a **component for other losses**
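
A minimal sketch of why the log form is numerically safer, using the log-sum-exp trick (an implementation assumption, consistent with common practice):

```python
import numpy as np

def log_softmax(x):
    # log(exp(x_j) / sum_i exp(x_i)) = x_j - logsumexp(x),
    # computed with the max shifted out to avoid overflow
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))

x = np.array([1000.0, 1001.0, 1002.0])
print(log_softmax(x))                       # finite values, no overflow
print(np.log(np.exp(x) / np.exp(x).sum()))  # naive version overflows (nan)
```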
|
|
|
|
|
|
|
|
|
|
[^PReLU]: [Microsoft Paper | arXiv:1502.01852v1 [cs.CV] 6 Feb 2015](https://arxiv.org/pdf/1502.01852v1)
|
|
|
|
|
|
|
|
|
|
[^RReLU]: [Empirical Evaluation of Rectified Activations in Convolution Network](https://arxiv.org/pdf/1505.00853v2)
|
|
|
|
|
|
|
|
|
|
[^SELU]: [Self-Normalizing Neural Networks](https://arxiv.org/pdf/1706.02515v5)
|
|
|
|
|
|
|
|
|
|
[^GELU]: [Github Page | 2nd April 2025](https://alaaalatif.github.io/2019-04-11-gelu/)
|