Activation Functions

Vanishing Gradient

A common problem with activation functions is that some of them have a very small derivative, often less than 1 in magnitude.

This becomes an issue when there is more than one layer: backpropagation multiplies these derivatives layer by layer, so the gradient tends towards 0 as depth grows.

Such functions are usually said to be saturating, meaning that they have horizontal asymptotes, near which their derivative is close to 0.
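A minimal numerical sketch of the effect, assuming sigmoid activations (defined later in these notes): the per-layer derivative is at most 0.25, so the gradient factor shrinks exponentially with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, reached at x = 0

# Even in the best case (all pre-activations at 0), the gradient factor
# contributed by n stacked sigmoid layers is 0.25 ** n.
for n_layers in (1, 5, 10, 20):
    print(n_layers, sigmoid_grad(0.0) ** n_layers)
```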

List of Non-Saturating Activation Functions

ReLU



ReLU(x) =
\begin{cases}
    x \text{ if } x \geq 0 \\
    0 \text{ if } x < 0
\end{cases}

It is not differentiable at 0; in practice the derivative there is set to 0 or 1, though any value in between is acceptable.


\frac{d\,ReLU(x)}{dx} =
\begin{cases}
    1 \text{ if } x \geq 0 \\
    0 \text{ if } x < 0
\end{cases}

The problem is that this function saturates for negative inputs, where the gradient is exactly 0.
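A minimal NumPy sketch of ReLU and the subgradient convention above (here the derivative at exactly 0 is taken as 1, matching the cases):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Subgradient convention: derivative at exactly 0 is taken as 1 here.
    return (x >= 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # negative inputs are clipped to 0
print(relu_grad(x))  # gradient is 0 on the whole negative half-line
```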

Leaky ReLU


LeakyReLU(x) =
\begin{cases}
    x \text{ if } x \geq 0 \\
    a\cdot x \text{ if } x < 0
\end{cases}

It is not differentiable at 0; in practice the derivative there is set to a or 1, though any value in between is acceptable.


\frac{d\,LeakyReLU(x)}{dx} =
\begin{cases}
    1 \text{ if } x \geq 0 \\
    a \text{ if } x < 0
\end{cases}

Here a is a fixed hyperparameter, typically a small value such as 0.01.

Parametric ReLU | AKA PReLU


PReLU(x) =
\begin{cases}
    x \text{ if } x \geq 0 \\
    \vec{a} \cdot x \text{ if } x < 0
\end{cases}

It is not differentiable at 0; in practice the derivative there is set to \vec{a} or 1, though any value in between is acceptable.


\frac{d\,PReLU(x) }{dx} =
\begin{cases}
    1 \text{ if } x \geq 0 \\
    \vec{a} \text{ if } x < 0
\end{cases}

Unlike LeakyReLU, here \vec{a} is a learnable parameter that can differ across channels (features).
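A small sketch of how the learnable slope is typically exposed in practice, assuming PyTorch is available (nn.PReLU with num_parameters gives one slope per channel; PyTorch initialises them to 0.25):

```python
import torch
import torch.nn as nn

# One learnable negative slope per channel (here 3 channels).
prelu = nn.PReLU(num_parameters=3)

x = torch.randn(4, 3, 8, 8)   # batch of 4 images with 3 channels
y = prelu(x)

print(prelu.weight.shape)     # torch.Size([3]) -- the vector a
print(y.shape)                # same shape as the input
```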

Randomized Leaky ReLU | AKA RReLU


RReLU(x) =
\begin{cases}
    x \text{ if } x \geq 0 \\
    \vec{a} \cdot x \text{ if } x < 0
\end{cases} \\
a_{i,j} \sim U (l, u): \;l < u \wedge l, u \in [0, 1[

It is not differentiable at 0; in practice the derivative there is set to \vec{a} or 1, though any value in between is acceptable.


\frac{d\,RReLU(x)}{dx} =
\begin{cases}
    1 \text{ if } x \geq 0 \\
    \vec{a} \text{ if } x < 0
\end{cases}

Here \vec{a} is a random parameter that is resampled at every training step and fixed to \frac{l + u}{2} during testing and inference.
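A small sketch of the train/eval behaviour described above, assuming PyTorch (l and u are passed explicitly here, they are illustrative choices):

```python
import torch
import torch.nn as nn

l, u = 0.1, 0.3
rrelu = nn.RReLU(lower=l, upper=u)
x = torch.full((1, 5), -1.0)   # negative inputs expose the slope

rrelu.train()
print(rrelu(x), rrelu(x))      # slopes are resampled at every forward pass

rrelu.eval()
print(rrelu(x))                # fixed slope (l + u) / 2 = 0.2
```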

ELU

This function pushes the average output towards 0, so the network may converge faster.


ELU(x) =
\begin{cases}
    x \text{ if } x \geq 0 \\
    a \cdot (e^{x} -1) \text{ if } x < 0
\end{cases}

It is differentiable everywhere as long as a = 1, which is usually the case.


\frac{d\,ELU(x)}{dx} =
\begin{cases}
    1 \text{ if } x \geq 0 \\
    a \cdot e^x \text{ if } x < 0
\end{cases}

Like ReLU and all saturating functions, it still has the saturation problem for large negative inputs.

CELU

This is a variant of ELU (Continuously differentiable ELU).


CELU(x) =
\begin{cases}
    x \text{ if } x \geq 0 \\
    a \cdot (e^{\frac{x}{a}} -1) \text{ if } x < 0
\end{cases}

It is differentiable everywhere as long as a > 0, which is usually the case.


\frac{d\,CELU(x)}{dx} =
\begin{cases}
    1 \text{ if } x \geq 0 \\
    e^{\frac{x}{a}} \text{ if } x < 0
\end{cases}

It has the same saturation problem as ELU.

SELU

It aims to keep both the mean and the standard deviation of the activations normalized across layers (self-normalization).


SELU(x) =
\begin{cases}
    \lambda \cdot x \text{ if } x \geq 0 \\
    \lambda a \cdot (e^{x} -1) \text{ if } x < 0
\end{cases}

It would be differentiable everywhere only if a = 1; however, this is usually not the case, as its recommended values are:

  • a = 1.6733
  • \lambda = 1.0507

Apart from this, it is basically a scaled ELU.


\frac{d\,SELU(x)}{dx} =
\begin{cases}
    \lambda \text{ if } x \geq 0 \\
    \lambda a \cdot e^{x} \text{ if } x < 0
\end{cases}

It has the same issues as ELU for both differentiability and saturation, though for the latter the scaling gives it a bit more room to manoeuvre.
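A quick numerical sketch of the self-normalising behaviour, assuming LeCun-style weight initialisation (variance 1/fan_in), which is what SELU was designed for: the mean and standard deviation of the activations are expected to hover near 0 and 1 as layers are stacked.

```python
import numpy as np

rng = np.random.default_rng(0)
a, lam = 1.6733, 1.0507

def selu(x):
    return lam * np.where(x >= 0, x, a * (np.exp(x) - 1.0))

n = 256
x = rng.standard_normal((512, n))                   # inputs ~ N(0, 1)
for layer in range(10):
    W = rng.standard_normal((n, n)) / np.sqrt(n)    # LeCun-style init
    x = selu(x @ W)
    print(layer, round(x.mean(), 3), round(x.std(), 3))  # should stay near 0 and 1
```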

Softplus

This is a smoothed version of ReLU and, as such, outputs only positive values.


Softplus(x) =
\frac{1}{\beta} \cdot
\left(
    \ln{
        \left(1 + e^{\beta \cdot x} \right)
    }
\right)

It is differentiable everywhere and is a smooth approximation of ReLU that constrains the output to positive values.

The larger $\beta$, the closer it gets to ReLU.


\frac{d\,Softplus(x)}{dx} = \frac{e^{\beta \cdot x}}{e^{\beta \cdot x} + 1}

For numerical stability, when \beta \cdot x exceeds a threshold the implementation reverts to the linear function.
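A minimal NumPy sketch of Softplus with the linear fallback described above (the threshold value here is chosen for illustration):

```python
import numpy as np

def softplus(x, beta=1.0, threshold=20.0):
    z = beta * x
    # Where beta * x is large, softplus(x) ~= x, so revert to the identity
    # to avoid overflowing exp(); clamp the other branch for the same reason.
    return np.where(z > threshold, x,
                    np.log1p(np.exp(np.minimum(z, threshold))) / beta)

x = np.array([-5.0, 0.0, 5.0, 50.0])
print(softplus(x))              # smooth near 0, ~x for large inputs
print(softplus(x, beta=10.0))   # larger beta: closer to ReLU
```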

GELU

This function saturates for negative values, much like the ramp-shaped functions above.


GELU(x) = x \cdot \Phi(x) \\
\Phi(x) = P(X \leq x) \,\, X \sim \mathcal{N}(0, 1)

This can be considered a smooth ReLU; however, it is not monotonic.


\frac{d\,GELU(x)}{dx} = \Phi(x) + x \cdot \phi(x)

where \phi is the density of the standard normal distribution.
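A small sketch of the exact form using the error function, since \Phi(x) = \frac{1}{2}(1 + erf(x / \sqrt{2})):

```python
import math

def gelu(x):
    # Phi(x): CDF of the standard normal
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    phi_cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi_pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return phi_cdf + x * phi_pdf

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, round(gelu(x), 4), round(gelu_grad(x), 4))
```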

Tanhshrink


Tanhshrink(x) = x - \tanh(x)

Its derivative, \tanh^2(x), is larger for values far from $0$, and the function is differentiable everywhere.

Softshrink


Softshrink(x) =
\begin{cases}
    x - \lambda \text{ if } x \geq \lambda \\
    x + \lambda \text{ if } x \leq -\lambda \\
    0 \text{ otherwise}
\end{cases}

\frac{d\,Softshrink(x)}{dx} =
\begin{cases}
    1 \text{ if } x \geq \lambda \\
    1 \text{ if } x \leq -\lambda \\
    0 \text{ otherwise}
\end{cases}

It can be considered as a step of an L1 criterion (soft-thresholding), and as a hard approximation of Tanhshrink.

It is also the shrinkage step of the ISTA algorithm, but it is not commonly used as an activation function.
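A minimal sketch of the soft-thresholding view mentioned above: one elementwise shrinkage step as used in ISTA (the value of \lambda here is an illustrative choice):

```python
import numpy as np

def softshrink(x, lam=0.5):
    # Proximal operator of lam * ||x||_1:
    # shrink towards 0 and zero-out entries smaller than lam in magnitude.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

w = np.array([-1.3, -0.2, 0.0, 0.4, 2.0])
print(softshrink(w))   # small coefficients are set exactly to 0
```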

Hardshrink


Hardshrink(x) =
\begin{cases}
    x \text{ if } x \geq \lambda \\
    x \text{ if } x \leq -\lambda \\
    0 \text{ otherwise}
\end{cases}

\frac{d\,Hardshrink(x)}{dx} =
\begin{cases}
    1 \text{ if } x > \lambda \\
    1 \text{ if } x < -\lambda \\
    0 \text{ otherwise}
\end{cases}

This is even harsher than the Softshrink function, as it's not continuous, so it MUST BE AVOIDED IN BACKPROPAGATION

List of Saturating Activation Functions

ReLU6



ReLU6(x) =
\begin{cases}
    6 \text{ if } x \geq 6 \\
    x \text{ if } 0 \leq x < 6 \\
    0 \text{ if } x < 0
\end{cases}

It is not differentiable at 0 and 6; in practice the derivative at those points is set to 0 or 1, though any value in between is acceptable.


\frac{d\,ReLU6(x)}{dx} =
\begin{cases}
    0 \text{ if } x \geq 6 \\
    1 \text{ if } 0 \leq x < 6\\
    0 \text{ if } x < 0
\end{cases}

Sigmoid | AKA Logistic


\sigma(x) = \frac{1}{1 + e^{-x}}

It is differentiable, is most sensitive to small values around 0, and bounds the result between 0 and $1$.


\frac{d\,\sigma(x)}{dx} = \sigma(x)\cdot (1 - \sigma(x))

It is usually good as a differentiable switch (gate) for portions of the network; however, it should not be stacked across many layers, as it saturates very quickly.
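A tiny sketch of the "switch" use: a sigmoid-controlled gate that scales another signal, as in gated architectures (the names and values here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

features = np.array([2.0, -1.0, 0.5, 3.0])
gate_logits = np.array([4.0, -4.0, 0.0, 10.0])   # pre-activations of the gate

gated = sigmoid(gate_logits) * features
print(gated)   # ~0 where the gate is closed, ~feature where it is open
```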

Tanh


\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

\frac{d\,\tanh(x)}{dx} = 1 - \tanh^2(x)

It is differentiable, is most sensitive to small values around 0, and bounds the result between -1 and $1$; since its output has 0 mean, convergence is faster.

Softsign


Softsign(x) = \frac{x}{1 + |x|}

\frac{d\,Softsign(x)}{dx} = \frac{1}{(1 + |x|)^2}

Hardtanh


Hardtanh(x) =
\begin{cases}
    M \text{ if } x \geq M \\
    x \text{ if } m \leq x < M \\
    m \text{ if } x < m
\end{cases}

It is not differentiable at m and M, but works well for small values around $0$.


\frac{d\,Hardtanh(x)}{dx} =
\begin{cases}
    0 \text{ if } x \geq M \\
    1 \text{ if } m \leq x < M\\
    0 \text{ if } x < m
\end{cases}

M and m are usually 1 and -1 respectively, but can be changed.

Threshold | AKA Heaviside


Threshold(x) =
\begin{cases}
    1 \text{ if } x \geq thresh \\
    0 \text{ if } x < thresh
\end{cases}

We usually do not use this, as the gradient cannot be propagated back (its derivative is 0 almost everywhere).

LogSigmoid


LogSigmoid(x) = \ln \left(
    \frac{
        1
    }{
        1 + e^{-x}
    }
\right)

\frac{d\,LogSigmoid(x)}{dx} = \frac{1}{1 + e^x}

This was designed to help with numerical instabilities

Softmin


Softmin(x_j) = \frac{
    e^{-x_j}
}{
    \sum_{i=0}^{N} e^{-x_i}
} \quad \forall j \in \{0, ..., N\}

IT IS AN OUTPUT FUNCTION and transforms the input into a vector of probabilities, giving higher probabilities to smaller inputs.

Softmax


Softmax(x_j) = \frac{
    e^{x_j}
}{
    \sum_{i=0}^{N} e^{x_i}
} \quad \forall j \in \{0, ..., N\}

IT IS AN OUTPUT FUNCTION and transforms the input into a vector of probabilities, giving higher probabilities to larger inputs.

In a way, the softmax is just a sigmoid applied across all values instead of a single one; in other words, the softmax is a generalization of the sigmoid function.
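A quick numerical sketch of this relation: the softmax over the pair [x, 0] recovers the sigmoid of x.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(v):
    e = np.exp(v - np.max(v))    # shift for numerical stability
    return e / e.sum()

for x in (-2.0, 0.0, 1.5):
    print(softmax(np.array([x, 0.0]))[0], sigmoid(x))   # the two values match
```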

LogSoftmax


LogSoftmax(x_j) = \ln \left(
    \frac{
    e^{x_j}
}{
    \sum_{i=0}^{N} e^{x_i}
}
\right ) \quad \forall j \in \{0, ..., N\}

It is rarely used as an activation function; it appears mostly as a component of loss functions, where it helps deal with numerical instabilities.
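A minimal sketch of how the numerical-stability benefit is usually obtained, via the log-sum-exp shift:

```python
import numpy as np

def log_softmax(v):
    # Subtracting the max before exponentiating avoids overflow;
    # the result is unchanged because softmax is shift-invariant.
    shifted = v - np.max(v)
    return shifted - np.log(np.sum(np.exp(shifted)))

v = np.array([1000.0, 1001.0, 1002.0])   # naive exp() would overflow here
print(log_softmax(v))
print(np.exp(log_softmax(v)).sum())      # probabilities still sum to 1
```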