Activation Functions
Vanishing Gradient
One problem with activation functions is that some of them have a very small derivative, often less than 1.
If the network has more than one layer this becomes an issue: since backpropagation multiplies the local derivatives together, the gradient tends towards 0 as the depth grows.
Such functions are usually said to be saturating, meaning that they have horizontal asymptotes, near which their derivative is close to 0.
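As a minimal sketch of this effect (plain NumPy, with a toy chain where the weights are ignored and every layer sees the same pre-activation), the gradient reaching the first layer is a product of local derivatives, each at most 0.25 for the sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # never larger than 0.25

# Toy chain: ignore weights and assume every layer sees pre-activation x.
# The gradient reaching the first layer is the product of the local derivatives.
x = 2.0
for depth in (1, 5, 10, 20):
    grad = sigmoid_grad(x) ** depth
    print(f"depth={depth:2d}  gradient factor ~ {grad:.2e}")
# The factor collapses towards 0 as depth grows: the vanishing gradient.
```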
List of Non-Saturating Activation Functions
ReLU
ReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
0 \text{ if } x < 0
\end{cases}
It is not differentiable at 0, but there we usually set the derivative to 0 or 1, though any value between them is acceptable.
\frac{d\,ReLU(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
0 \text{ if } x < 0
\end{cases}
The problem is that this function saturates for negative values
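A small NumPy sketch of the definition and the subgradient convention above (the helper names are mine; here the value at 0 is taken as 1, matching the x \geq 0 branch):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Subgradient convention: 1 for x >= 0, 0 for x < 0
    # (any value in [0, 1] at exactly 0 would be acceptable).
    return (x >= 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 1. 1. 1.] -> zero gradient for negative inputs
```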
Leaky ReLU
LeakyReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
a\cdot x \text{ if } x < 0
\end{cases}
It is not differentiable at 0, but there we usually set the derivative to a or 1, though any value between them is acceptable.
\frac{d\,LeakyReLU(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
a \text{ if } x < 0
\end{cases}
Here a is a fixed (non-learned) hyperparameter.
Parametric ReLU | AKA PReLU
PReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
\vec{a} \cdot x \text{ if } x < 0
\end{cases}
It is not differentiable at 0, but there we usually set the derivative to \vec{a} or 1, though any value between them is acceptable.
\frac{d\,PReLU(x) }{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
\vec{a} \text{ if } x < 0
\end{cases}
Unlike in LeakyReLU, here \vec{a} is a learnable parameter that can differ across channels (features); see the sketch below.
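A sketch of the PReLU forward pass with one learnable slope per channel, written with plain NumPy broadcasting; the NCHW layout and the initial slope 0.25 are assumptions for illustration (in practice the slopes would be updated by backpropagation):

```python
import numpy as np

def prelu(x, a):
    # x: activations with shape (N, C, H, W); a: one learnable slope per channel, shape (C,).
    a = a.reshape(1, -1, 1, 1)          # broadcast the per-channel slope over batch and space
    return np.where(x >= 0, x, a * x)

x = np.random.default_rng(0).standard_normal((2, 3, 4, 4))
a = np.full(3, 0.25)                    # initial slopes; these are parameters, not fixed constants
print(prelu(x, a).shape)                # (2, 3, 4, 4)
```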
Randomized Leaky ReLU | AKA RReLU
RReLU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
\vec{a} \cdot x \text{ if } x < 0
\end{cases} \\
a_{i,j} \sim U (l, u): \;l < u \wedge l, u \in [0, 1[
It is not differentiable at 0, but there we usually set the derivative to \vec{a} or 1, though any value between them is acceptable.
\frac{d\,RReLU(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
\vec{a} \text{ if } x < 0
\end{cases}
Here \vec{a} is a random parameter: it is sampled anew during training and fixed to \frac{l + u}{2} during testing and inference (see the sketch below).
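A sketch of this train/test behaviour: during training each negative slope is drawn from U(l, u), while at test/inference time the fixed slope (l + u)/2 is used. The bounds l = 1/8 and u = 1/3 are assumptions chosen for illustration:

```python
import numpy as np

def rrelu(x, l=1/8, u=1/3, training=True, rng=np.random.default_rng()):
    if training:
        a = rng.uniform(l, u, size=x.shape)  # a fresh slope per element, every forward pass
    else:
        a = (l + u) / 2.0                    # fixed slope at test/inference time
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, -1.0, 0.5, 2.0])
print(rrelu(x, training=True))   # negative entries scaled by random slopes
print(rrelu(x, training=False))  # negative entries scaled by (l + u) / 2 ~ 0.229
```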
ELU
This function lets the outputs average out closer to 0, so the network may converge faster.
ELU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
a \cdot (e^{x} -1) \text{ if } x < 0
\end{cases}
It is differentiable everywhere, as long as a = 1,
which is usually the case.
\frac{d\,ELU(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
a \cdot e^x \text{ if } x < 0
\end{cases}
Like ReLU and all saturating functions, it still saturates for large negative values.
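A quick NumPy check of the zero-mean claim: for standard-normal inputs, ELU's output mean sits much closer to 0 than ReLU's (a = 1; the sample size is arbitrary):

```python
import numpy as np

def elu(x, a=1.0):
    return np.where(x >= 0, x, a * (np.exp(x) - 1.0))

def relu(x):
    return np.maximum(x, 0.0)

x = np.random.default_rng(0).standard_normal(100_000)
print("mean after ReLU:", relu(x).mean())  # ~ 0.40, pushed away from 0
print("mean after ELU :", elu(x).mean())   # ~ 0.16, noticeably closer to 0
```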
CELU
This is a flavour of ELU
CELU(x) =
\begin{cases}
x \text{ if } x \geq 0 \\
a \cdot (e^{\frac{x}{a}} -1) \text{ if } x < 0
\end{cases}
It is differentiable everywhere, as long as a > 0,
which is usually the case.
\frac{d\,CELU(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq 0 \\
e^{\frac{x}{a}} \text{ if } x < 0
\end{cases}
It has the same problems as ELU for saturation
SELU
It aims to normalize both the mean and the standard deviation of the activations across layers.
SELU(x) =
\begin{cases}
\lambda \cdot x \text{ if } x \geq 0 \\
\lambda a \cdot (e^{x} -1) \text{ if } x < 0
\end{cases}
It would be differentiable everywhere as long as a = 1; however, this is usually not the case, as its recommended values are:
a \approx 1.6733, \quad \lambda \approx 1.0507
Apart from this, it is basically a scaled ELU.
\frac{d\,SELU(x)}{dx} =
\begin{cases}
\lambda \text{ if } x \geq 0 \\
\lambda a \cdot e^{x} \text{ if } x < 0
\end{cases}
It has the same issues as ELU for both differentiability and saturation, though for the latter the scaling by \lambda gives it a bit more room to manoeuvre (see the sketch below).
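A sketch of the self-normalizing behaviour with the recommended constants: stacking random linear layers (variance-1/n initialization) followed by SELU keeps the activations' mean near 0 and standard deviation near 1. The width, depth, and initialization here are illustrative choices:

```python
import numpy as np

ALPHA, LAMBDA = 1.6733, 1.0507  # recommended SELU constants

def selu(x):
    return LAMBDA * np.where(x >= 0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
n = 512
x = rng.standard_normal((1024, n))
for layer in range(10):
    W = rng.standard_normal((n, n)) / np.sqrt(n)  # weights with variance 1/n
    x = selu(x @ W)
    print(f"layer {layer:2d}: mean = {x.mean():+.3f}, std = {x.std():.3f}")
# Mean stays near 0 and std near 1 instead of drifting layer after layer.
```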
Softplus
This is a smoothed version of a ReLU and as such outputs only positive values
Softplus(x) =
\frac{1}{\beta} \cdot
\left(
\ln{
\left(1 + e^{\beta \cdot x} \right)
}
\right)
It is differentiable everywhere and is a smooth approximation of ReLU that constrains the output to positive values.
The larger $\beta$, the more similar it is to ReLU.
\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta \cdot x}}{e^{\beta \cdot x} + 1}
For numerical stability, when \beta \cdot x exceeds a threshold the implementation reverts to a linear function (see the sketch below).
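A sketch of a numerically stable Softplus in NumPy, including the linear fallback described above: once \beta \cdot x exceeds a threshold, \ln(1 + e^{\beta x}) is essentially \beta x, so the function returns x directly and the exponential can no longer overflow. The threshold value 20 is an assumption for illustration:

```python
import numpy as np

def softplus(x, beta=1.0, threshold=20.0):
    z = beta * x
    # For large z, log(1 + e^z) ~= z, so Softplus(x) ~= x; clipping z also avoids overflow
    # inside np.exp, since np.where evaluates both branches.
    return np.where(z > threshold, x, np.log1p(np.exp(np.minimum(z, threshold))) / beta)

x = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
print(softplus(x))             # smooth ReLU: ~ [0, 0.31, 0.69, 1.31, 50]
print(softplus(x, beta=10.0))  # larger beta -> closer to a hard ReLU
```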
GELU
Like the ramp functions above, this function saturates for negative values.
GELU(x) = x \cdot \Phi(x) \\
\Phi(x) = P(X \leq x) \,\, X \sim \mathcal{N}(0, 1)
This can be considered a smooth ReLU; however, it is not monotonic.
\frac{d\,GELU(x)}{dx} = \Phi(x) + x \cdot \phi(x), \text{ where } \phi \text{ is the standard normal density}
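A sketch of GELU using the standard normal CDF written via the error function, \Phi(x) = \frac{1}{2}(1 + erf(x/\sqrt{2})), with the derivative from the formula above:

```python
import math

def std_normal_cdf(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * std_normal_cdf(x)

def gelu_grad(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x)
    return std_normal_cdf(x) + x * std_normal_pdf(x)

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x = {x:+.1f}  GELU = {gelu(x):+.4f}  GELU' = {gelu_grad(x):+.4f}")
# The derivative is slightly negative at x = -3: GELU is not monotonic,
# and it saturates towards 0 for large negative inputs.
```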
Tanhshrink
Tanhshrink(x) = x - \tanh(x)
Its derivative is larger for values far from $0$, and it is differentiable everywhere.
Softshrink
Softshrink(x) =
\begin{cases}
x - \lambda \text{ if } x \geq \lambda \\
x + \lambda \text{ if } x \leq -\lambda \\
0 \text{ otherwise}
\end{cases}
\frac{d\,Softshrink(x)}{dx} =
\begin{cases}
1 \text{ if } x \geq \lambda \\
1 \text{ if } x \leq -\lambda \\
0 \text{ otherwise}
\end{cases}
It can be considered a step of the L1 criterion and a hard approximation of Tanhshrink.
It is also a step of the ISTA algorithm (see the sketch below), though it is not commonly used as an activation function.
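A sketch of Softshrink as the soft-thresholding step inside ISTA for the L1-regularized least-squares problem \min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1; the data, \lambda, and step size below are arbitrary illustrations:

```python
import numpy as np

def softshrink(x, lam):
    # Soft-thresholding: shrink towards 0 by lam, zeroing everything in [-lam, lam].
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
x = np.zeros(10)
lam = 0.5
step = 1.0 / np.linalg.norm(A, 2) ** 2       # 1 / Lipschitz constant of the gradient

for _ in range(200):
    # One ISTA iteration: gradient step on the smooth part, then soft-threshold.
    x = softshrink(x - step * A.T @ (A @ x - b), step * lam)
print(np.round(x, 3))  # a lasso-style estimate; larger lam drives more entries to exactly 0
```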
Hardshrink
Hardshrink(x) =
\begin{cases}
x \text{ if } x \geq \lambda \\
x \text{ if } x \leq -\lambda \\
0 \text{ otherwise}
\end{cases}
\frac{d\,Hardshrink(x)}{dx} =
\begin{cases}
1 \text{ if } x > \lambda \\
1 \text{ if } x < -\lambda \\
0 \text{ otherwise}
\end{cases}
This is even harsher than the Softshrink function, as it's not continuous, so it MUST BE AVOIDED IN BACKPROPAGATION
List of Saturating Activation Functions
ReLU6
ReLU6(x) =
\begin{cases}
6 \text{ if } x \geq 6 \\
x \text{ if } 0 \leq x < 6 \\
0 \text{ if } x < 0
\end{cases}
It is not differentiable at 0 and 6, but at those points we usually set the derivative to 0 or 1, though any value between them is acceptable.
\frac{d\,ReLU6(x)}{dx} =
\begin{cases}
0 \text{ if } x \geq 6 \\
1 \text{ if } 0 \leq x < 6\\
0 \text{ if } x < 0
\end{cases}
Sigmoid | AKA Logistic
\sigma(x) = \frac{1}{1 + e^{-x}}
It is differentiable, and it is most sensitive to small values around 0 while bounding the result between $0$ and $1$.
\frac{d\,\sigma(x)}{dx} = \sigma(x)\cdot (1 - \sigma(x))
It is often useful as a switch (gate) for portions of the system because it is differentiable; however, it should not be stacked across many layers, as it saturates very quickly.
Tanh
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
\frac{d\,\tanh(x)}{dx} = 1 - \tanh^2(x)
It is differentiable, and it is most sensitive to small values around 0 while bounding the result between $-1$ and $1$; since its output has zero mean, convergence is faster.
Softsign
Softsign(x) = \frac{x}{1 + |x|}
\frac{d\,Softsign(x)}{dx} = \frac{1}{\left(1 + |x|\right)^2}
Hardtanh
Hardtanh(x) =
\begin{cases}
M \text{ if } x \geq M \\
x \text{ if } m \leq x < M \\
m \text{ if } x < m
\end{cases}
It is not differentiable at m and M, but works well for small values around $0$.
\frac{d\,Hardtanh(x)}{dx} =
\begin{cases}
0 \text{ if } x \geq M \\
1 \text{ if } m \leq x < M\\
0 \text{ if } x < m
\end{cases}
M and m are usually 1 and -1 respectively,
but can be changed.
Threshold | AKA Heaviside
Threshold(x) =
\begin{cases}
1 \text{ if } x \geq thresh \\
0 \text{ if } x < thresh
\end{cases}
We usually don't use this as we can't propagate the gradient back
LogSigmoid
LogSigmoid(x) = \ln \left(
\frac{
1
}{
1 + e^{-x}
}
\right)
\frac{d\,LogSigmoid(x)}{dx} = \frac{1}{1 + e^x}
This was designed to help with numerical instabilities
Softmin
Softmin(x_j) = \frac{e^{-x_j}}{\sum_{i=0}^{N} e^{-x_i}}
IT IS AN OUTPUT FUNCTION and transforms the input into a vector of probabilities, giving higher probabilities to smaller values.
Softmax
Softmax(x_j) = \frac{e^{x_j}}{\sum_{i=0}^{N} e^{x_i}}
IT IS AN OUTPUT FUNCTION and transforms the input into a vector of probabilities, giving higher probabilities to larger values.
In a way, the softmax is a sigmoid applied over all values instead of just one; in other words, the softmax is a generalization of the sigmoid (a quick check follows below).
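A quick numerical check of that statement: a two-class softmax over the logits [x, 0] reproduces \sigma(x) exactly for the first class (the max-subtraction is the usual numerical-stability shift):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-4.0, -1.0, 0.0, 2.0):
    p = softmax(np.array([x, 0.0]))[0]
    print(f"x = {x:+.1f}  softmax([x, 0])[0] = {p:.6f}  sigmoid(x) = {sigmoid(x):.6f}")
# The two columns match: softmax generalizes the sigmoid to many classes.
```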
LogSoftmax
LogSoftmax(x_j) = \ln \left( \frac{e^{x_j}}{\sum_{i=0}^{N} e^{x_i}} \right)
It is mostly used inside loss functions rather than as an activation function; it helps deal with numerical instabilities and serves as a component of other losses (see the sketch below).
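A sketch of why the log-softmax form is numerically safer: taking \ln(Softmax(z)) naively overflows for large logits, while the log-sum-exp form z - \max(z) - \ln \sum e^{z - \max(z)} stays finite:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                            # log-sum-exp shift
    return z - np.log(np.exp(z).sum())

z = np.array([1000.0, 1001.0, 1002.0])         # large logits
naive = np.log(np.exp(z) / np.exp(z).sum())    # exp overflows -> nan (with a runtime warning)
print(naive)                                   # [nan nan nan]
print(log_softmax(z))                          # ~ [-2.41 -1.41 -0.41], well behaved
```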