ADADELTA [1]

ADADELTA was inspired by AdaGrad and was created to address some of its problems, such as the continual shrinking of the effective learning rate caused by accumulating every past squared gradient, and the sensitivity to the initial parameters and their corresponding gradients [1].

First Formulation

To address these problems, ADADELTA accumulates the squared gradients over a window, using an exponentially decaying average instead of the full sum:


E[g^2]_t = \alpha \cdot E[g^2]_{t-1} +
    (1 - \alpha) \cdot g^2_t

The update, which is very similar to the one in AdaGrad, becomes:


 \bar{w}_{t+1, i} =
    \bar{w}_{t, i} - \frac{
        \eta
    }{
        \sqrt{E[g^2]_t + \epsilon}
    } \cdot g_{t,i}

Since the denominator \sqrt{E[g^2]_t + \epsilon} is the (epsilon-smoothed) root mean square of the gradients, written RMS[g]_t, the last equation can be rewritten as:


 \bar{w}_{t+1, i} =
    \bar{w}_{t, i} - \frac{
        \eta
    }{
        RMS[g]_t
    } \cdot g_{t,i}
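As a rough illustration, a minimal NumPy sketch of one step of this first formulation could look like the following; the names first_formulation_step and acc_sq_grad, and the default values for alpha, eta and eps, are placeholders of mine, not taken from the paper.

import numpy as np

def first_formulation_step(w, grad, acc_sq_grad, alpha=0.9, eta=0.1, eps=1e-6):
    """One step of the first formulation (an RMSProp-style update)."""
    # E[g^2]_t = alpha * E[g^2]_{t-1} + (1 - alpha) * g_t^2
    acc_sq_grad = alpha * acc_sq_grad + (1.0 - alpha) * grad ** 2
    # w_{t+1} = w_t - eta / sqrt(E[g^2]_t + eps) * g_t
    w = w - eta / np.sqrt(acc_sq_grad + eps) * grad
    return w, acc_sq_grad

# Toy usage: minimise f(w) = ||w||^2, whose gradient is 2w.
w, acc = np.ones(3), np.zeros(3)
for _ in range(200):
    w, acc = first_formulation_step(w, 2.0 * w, acc)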

However, this is still not the actual update rule: its units are inconsistent, since the step it produces does not carry the units of the weights it is added to.
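To make the mismatch concrete, here is a sketch of the units argument from the paper, under the assumption that the loss f is unitless and writing H for its Hessian: the RMS-normalised step is dimensionless, while a second-order (Newton) step carries exactly the units of the weights.

\frac{g_{t,i}}{RMS[g]_t} \propto \frac{\text{units of } g}{\text{units of } g} = 1

\Delta \bar{w} \propto H^{-1} g =
    \left( \frac{\partial^2 f}{\partial \bar{w}^2} \right)^{-1}
    \frac{\partial f}{\partial \bar{w}}
    \propto \text{units of } \bar{w}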

Second Formulation

As noted by the authors of the paper [2], the fix is to put an RMS of the weight updates in the numerator, so that the step recovers the right units. Since the update at the current step is not known yet, the curvature is assumed to be locally smooth and the numerator is approximated with its value at the previous step, RMS[\Delta \bar{w}]_{t - 1}, giving the full update equation:


 \bar{w}_{t + 1, i} =
    \bar{w}_{t, i} - \frac{
        RMS[\Delta \bar{w}_{i}]_{t - 1}
    }{
        RMS[g]_t
    } \cdot g_{t,i}

Here RMS[\Delta \bar{w}]_{t - 1} = \sqrt{E[\Delta \bar{w}^2]_{t - 1} + \epsilon}, where E[\Delta \bar{w}^2] is accumulated with the same exponentially decaying average used for E[g^2]. As a result, the learning rate \eta completely disappears from the equation, eliminating the need to choose one.
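A hedged sketch of how a full step could be implemented, with the two decaying accumulators made explicit (again, the function and variable names and the default alpha and eps are placeholders of mine):

import numpy as np

def adadelta_step(w, grad, acc_sq_grad, acc_sq_delta, alpha=0.9, eps=1e-6):
    """One ADADELTA step: no learning rate, only the decay alpha and eps."""
    # E[g^2]_t = alpha * E[g^2]_{t-1} + (1 - alpha) * g_t^2
    acc_sq_grad = alpha * acc_sq_grad + (1.0 - alpha) * grad ** 2
    # Delta w_t = - RMS[Delta w]_{t-1} / RMS[g]_t * g_t
    delta = -np.sqrt(acc_sq_delta + eps) / np.sqrt(acc_sq_grad + eps) * grad
    # E[Delta w^2]_t = alpha * E[Delta w^2]_{t-1} + (1 - alpha) * Delta w_t^2
    acc_sq_delta = alpha * acc_sq_delta + (1.0 - alpha) * delta ** 2
    return w + delta, acc_sq_grad, acc_sq_delta

The epsilon inside the numerator's square root is what gets the very first steps moving, since E[\Delta \bar{w}^2] starts at zero.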

Note

It can be noticed that RMSProp is essentially the first formulation derived above: the AdaGrad-style update with the exponentially decaying average E[g^2] in place of the full sum of squared gradients.
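For reference, both variants are available in PyTorch; a minimal usage sketch, with a throwaway linear model as placeholder:

import torch

model = torch.nn.Linear(10, 1)  # placeholder model
adadelta = torch.optim.Adadelta(model.parameters(), rho=0.9, eps=1e-6)
rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-2, alpha=0.99)

Note that PyTorch's Adadelta still exposes an lr argument (default 1.0) that simply rescales the computed step.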