ADADELTA
ADADELTA was inspired by AdaGrad and created to address some of its problems, such as its sensitivity to the initial parameters and the corresponding gradients[1].
First Formulation
To address these problems, ADADELTA accumulates the squared gradients over a window, using an exponentially decaying average:
E[g^2]_t = \alpha \cdot E[g^2]_{t-1} + (1 - \alpha) \cdot g^2_t
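Unrolling this recursion (assuming the usual initialization E[g^2]_0 = 0) makes the implicit window visible: each past squared gradient is weighted by a power of \alpha, so old gradients fade away instead of accumulating forever as in AdaGrad:
E[g^2]_t = (1 - \alpha) \sum_{k=1}^{t} \alpha^{t-k} \cdot g^2_k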
The update, which is very similar to the one in AdaGrad, becomes:
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_{t,i}
Defining RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}, the last equation can be rewritten more compactly as:
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \frac{\eta}{RMS[g]_t} \cdot g_{t,i}
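As a minimal sketch of this first formulation (the function name and the values chosen for \eta, \alpha and \epsilon are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

def first_formulation_step(w, grad, avg_sq_grad, lr=0.001, alpha=0.9, eps=1e-8):
    """One step of the first formulation (an RMSProp-style update)."""
    # Accumulate: E[g^2]_t = alpha * E[g^2]_{t-1} + (1 - alpha) * g^2_t
    avg_sq_grad = alpha * avg_sq_grad + (1 - alpha) * grad ** 2
    # Update: w_{t+1} = w_t - eta / sqrt(E[g^2]_t + eps) * g_t
    w = w - lr / np.sqrt(avg_sq_grad + eps) * grad
    return w, avg_sq_grad
```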
However, this first formulation is still not the actual ADADELTA update, as its units are inconsistent.
Second Formulation
Technically speaking, this update is dimensionless, while a parameter update should carry the units of the parameters. As noted by the authors of the paper[2], we can correct this by assuming the curvature is locally smooth and approximating the magnitude of the update at the current step with the RMS of the updates up to the previous one, making the full update equation:
\bar{w}_{t + 1, i} = \bar{w}_{t, i} - \frac{RMS[\Delta\bar{w}]_{t - 1}}{RMS[g]_t} \cdot g_{t,i}
where the squared updates \Delta\bar{w}^2_t are accumulated with the same exponentially decaying average used for the squared gradients.
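A minimal sketch of the full update (again, the function name and the values of \alpha and \epsilon are illustrative assumptions):

```python
import numpy as np

def adadelta_step(w, grad, avg_sq_grad, avg_sq_update, alpha=0.9, eps=1e-6):
    """One ADADELTA step; note that no learning rate is needed."""
    # E[g^2]_t = alpha * E[g^2]_{t-1} + (1 - alpha) * g^2_t
    avg_sq_grad = alpha * avg_sq_grad + (1 - alpha) * grad ** 2
    # Delta w_t = -(RMS[Delta w]_{t-1} / RMS[g]_t) * g_t
    delta_w = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    # E[Delta w^2]_t = alpha * E[Delta w^2]_{t-1} + (1 - alpha) * Delta w_t^2
    avg_sq_update = alpha * avg_sq_update + (1 - alpha) * delta_w ** 2
    return w + delta_w, avg_sq_grad, avg_sq_update
```

With both accumulators initialized to zero, the \epsilon inside the numerator's square root is what makes the very first update nonzero.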
As we can notice, the learning rate completely disappears from the equation, eliminating the need to set one.
Note
It can be noticed that RMSProp is basically the first update we derived for this method.