ADADELTA
ADADELTA was inspired by AdaGrad and created to address some of its problems, such as its sensitivity to the initial parameters and the corresponding gradients[1].
First Formulation
To address these problems, ADADELTA accumulates the squared gradients over a window, using an exponentially decaying average:
E[g^2]_t = \alpha \cdot E[g^2]_{t-1} + (1 - \alpha) \cdot g^2_t
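Unrolling this recursion (assuming the usual initialization E[g^2]_0 = 0) makes the implicit window visible: each past squared gradient is weighted by a power of \alpha, so old gradients fade away instead of accumulating forever as in AdaGrad:
E[g^2]_t = (1 - \alpha) \sum_{k=1}^{t} \alpha^{t-k} \cdot g^2_k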
The update, which is very similar to the one in AdaGrad, becomes:
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_{t,i}
Defining RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}, the last equation can be rewritten more compactly as:
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \frac{\eta}{RMS[g]_t} \cdot g_{t,i}
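As a minimal sketch of this first formulation (the function name and the values chosen for \eta, \alpha and \epsilon are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

def first_formulation_step(w, grad, avg_sq_grad, lr=0.001, alpha=0.9, eps=1e-8):
    """One step of the first formulation (an RMSProp-style update)."""
    # Accumulate: E[g^2]_t = alpha * E[g^2]_{t-1} + (1 - alpha) * g^2_t
    avg_sq_grad = alpha * avg_sq_grad + (1 - alpha) * grad ** 2
    # Update: w_{t+1} = w_t - eta / sqrt(E[g^2]_t + eps) * g_t
    w = w - lr / np.sqrt(avg_sq_grad + eps) * grad
    return w, avg_sq_grad
```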
However, this first formulation is still not the actual ADADELTA update, as its units are inconsistent.
Second Formulation
Technically speaking, this update is dimensionless, while a parameter update should carry the units of the parameters. As noted by the authors of the paper[2], we can correct this by assuming the curvature is locally smooth and approximating the magnitude of the update at the current step with the RMS of the updates up to the previous one, making the full update equation:
\bar{w}_{t + 1, i} = \bar{w}_{t, i} - \frac{RMS[\Delta\bar{w}]_{t - 1}}{RMS[g]_t} \cdot g_{t,i}
where the squared updates \Delta\bar{w}^2_t are accumulated with the same exponentially decaying average used for the squared gradients.
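A minimal sketch of the full update (again, the function name and the values of \alpha and \epsilon are illustrative assumptions):

```python
import numpy as np

def adadelta_step(w, grad, avg_sq_grad, avg_sq_update, alpha=0.9, eps=1e-6):
    """One ADADELTA step; note that no learning rate is needed."""
    # E[g^2]_t = alpha * E[g^2]_{t-1} + (1 - alpha) * g^2_t
    avg_sq_grad = alpha * avg_sq_grad + (1 - alpha) * grad ** 2
    # Delta w_t = -(RMS[Delta w]_{t-1} / RMS[g]_t) * g_t
    delta_w = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    # E[Delta w^2]_t = alpha * E[Delta w^2]_{t-1} + (1 - alpha) * Delta w_t^2
    avg_sq_update = alpha * avg_sq_update + (1 - alpha) * delta_w ** 2
    return w + delta_w, avg_sq_grad, avg_sq_update
```

With both accumulators initialized to zero, the \epsilon inside the numerator's square root is what makes the very first update nonzero.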
As we can notice, the learning rate completely disappears from the equation, eliminating the need to set one.
Note
It can be noticed that RMSProp is basically the first update we derived for this method.