# ADADELTA[^adadelta-official-paper]

`ADADELTA` was inspired by [`AdaGrad`](./ADAGRAD.md) and was created to address some of its problems, such as its ***sensitivity to the initial `parameters` and the corresponding gradients***[^adadelta-official-paper].

## First Formulation

To address these problems, `ADADELTA` accumulates ***gradients over a `window`***, but as an ***exponentially decaying average***:

$$
E[g^2]_t = \alpha \cdot E[g^2]_{t-1} + (1 - \alpha) \cdot g^2_t
$$

The update, which is very similar to the one in [AdaGrad](./ADAGRAD.md#the-algorithm), becomes:

$$
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \frac{ \eta }{ \sqrt{E[g^2]_t + \epsilon} } \cdot g_{t,i}
$$

Since the denominator is the root mean square of the gradients, the last equation can be rewritten as:

$$
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \frac{ \eta }{ RMS[g]_t } \cdot g_{t,i}
$$

However, this is ***still not the actual update rule***, because its `units` ***do not match those of the parameters***.

## Second Formulation

The update above is ***dimensionless***: dividing the gradient by $RMS[g]_t$ cancels its units, so the step no longer carries the units of the `parameters`. As noted by the authors of the paper[^adadelta-units], this can be corrected by placing the $RMS$ of the previous ***parameter updates*** in the numerator: ***assuming the curvature is locally smooth***, the unknown value at the current step is approximated with the one from the previous step, making the full update equation:

$$
\bar{w}_{t + 1, i} = \bar{w}_{t, i} - \frac{ RMS[\Delta \bar{w}_{i}]_{t - 1} }{ RMS[g]_t } \cdot g_{t,i}
$$

where $RMS[\Delta \bar{w}]$ is computed from an exponentially decaying average of the squared parameter updates, just like $RMS[g]$.

As we can notice, the ***`learning rate` completely disappears from the equation, eliminating the need to set one***.

> [!NOTE]
>
> [`RMSProp`](./../INDEX.md#rmsprop-in-detail) is essentially the
> [first update](#first-formulation) we derived for this method.

[^adadelta-official-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)

[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
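## Minimal Sketch

A minimal NumPy sketch of the [first formulation](#first-formulation), to make the accumulator concrete. The function name `first_formulation_step` and the default values for `lr`, `alpha`, and `eps` are illustrative choices, not taken from the paper:

```python
import numpy as np

def first_formulation_step(w, grad, E_g2, lr=1.0, alpha=0.95, eps=1e-6):
    """RMSProp-like step: the gradient is scaled by the decaying RMS of past gradients."""
    # Exponentially decaying average of the squared gradients, E[g^2]_t.
    E_g2 = alpha * E_g2 + (1.0 - alpha) * grad ** 2
    # eta / RMS[g]_t scaling; the learning rate eta still has to be chosen by hand.
    w = w - lr / np.sqrt(E_g2 + eps) * grad
    return w, E_g2
```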
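And a sketch of the full update from the [second formulation](#second-formulation), again with illustrative names and defaults. The difference is a second accumulator of squared parameter updates, whose RMS replaces the `learning rate`:

```python
import numpy as np

def adadelta_step(w, grad, state, alpha=0.95, eps=1e-6):
    """One ADADELTA update; `state` holds the two decaying accumulators."""
    E_g2, E_dw2 = state

    # Decaying average of squared gradients (same accumulator as the first formulation).
    E_g2 = alpha * E_g2 + (1.0 - alpha) * grad ** 2

    # RMS of the previous updates replaces the learning rate eta.
    delta_w = -np.sqrt(E_dw2 + eps) / np.sqrt(E_g2 + eps) * grad

    # Decaying average of squared updates, used by the next step.
    E_dw2 = alpha * E_dw2 + (1.0 - alpha) * delta_w ** 2

    return w + delta_w, (E_g2, E_dw2)


# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
state = (np.zeros_like(w), np.zeros_like(w))
for _ in range(200):
    w, state = adadelta_step(w, grad=w, state=state)
print(w)  # entries move towards the minimum at 0
```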