# ADADELTA[^adadelta-official-paper]

`ADADELTA` was inspired by [`AdaGrad`](./ADAGRAD.md) and was created to address some of its problems, such as its ***sensitivity to the initial `parameters` and the corresponding gradients***[^adadelta-official-paper].

## First Formulation

To address these problems, `ADADELTA` accumulates ***gradients over a `window`***, but as an ***exponentially decaying average***:

$$
E[g^2]_t = \alpha \cdot E[g^2]_{t-1} + (1 - \alpha) \cdot g^2_t
$$

The update, which is very similar to the one in [AdaGrad](./ADAGRAD.md#the-algorithm), becomes:

$$
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \frac{ \eta }{ \sqrt{E[g^2]_t + \epsilon} } \cdot g_{t,i}
$$

Since the denominator is the root mean square of the gradients, the last equation can be rewritten as:

$$
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \frac{ \eta }{ RMS[g]_t } \cdot g_{t,i}
$$

However, this is ***still not the actual update rule***, because its `units` ***do not match those of the parameters***.

## Second Formulation

The update above is ***dimensionless***: dividing the gradient by $RMS[g]_t$ cancels its units, so the step no longer carries the units of the `parameters`. As noted by the authors of the paper[^adadelta-units], this can be corrected by placing the $RMS$ of the previous ***parameter updates*** in the numerator: ***assuming the curvature is locally smooth***, the unknown value at the current step is approximated with the one from the previous step, making the full update equation:

$$
\bar{w}_{t + 1, i} = \bar{w}_{t, i} - \frac{ RMS[\Delta \bar{w}_{i}]_{t - 1} }{ RMS[g]_t } \cdot g_{t,i}
$$

where $RMS[\Delta \bar{w}]$ is computed from an exponentially decaying average of the squared parameter updates, just like $RMS[g]$.

As we can notice, the ***`learning rate` completely disappears from the equation, eliminating the need to set one***.

> [!NOTE]
>
> [`RMSProp`](./../INDEX.md#rmsprop-in-detail) is essentially the
> [first update](#first-formulation) we derived for this method.

[^adadelta-official-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)

[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
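## Minimal Sketch

A minimal NumPy sketch of the [first formulation](#first-formulation), to make the accumulator concrete. The function name `first_formulation_step` and the default values for `lr`, `alpha`, and `eps` are illustrative choices, not taken from the paper:

```python
import numpy as np

def first_formulation_step(w, grad, E_g2, lr=1.0, alpha=0.95, eps=1e-6):
    """RMSProp-like step: the gradient is scaled by the decaying RMS of past gradients."""
    # Exponentially decaying average of the squared gradients, E[g^2]_t.
    E_g2 = alpha * E_g2 + (1.0 - alpha) * grad ** 2
    # eta / RMS[g]_t scaling; the learning rate eta still has to be chosen by hand.
    w = w - lr / np.sqrt(E_g2 + eps) * grad
    return w, E_g2
```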
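And a sketch of the full update from the [second formulation](#second-formulation), again with illustrative names and defaults. The difference is a second accumulator of squared parameter updates, whose RMS replaces the `learning rate`:

```python
import numpy as np

def adadelta_step(w, grad, state, alpha=0.95, eps=1e-6):
    """One ADADELTA update; `state` holds the two decaying accumulators."""
    E_g2, E_dw2 = state

    # Decaying average of squared gradients (same accumulator as the first formulation).
    E_g2 = alpha * E_g2 + (1.0 - alpha) * grad ** 2

    # RMS of the previous updates replaces the learning rate eta.
    delta_w = -np.sqrt(E_dw2 + eps) / np.sqrt(E_g2 + eps) * grad

    # Decaying average of squared updates, used by the next step.
    E_dw2 = alpha * E_dw2 + (1.0 - alpha) * delta_w ** 2

    return w + delta_w, (E_g2, E_dw2)


# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
state = (np.zeros_like(w), np.zeros_like(w))
for _ in range(200):
    w, state = adadelta_step(w, grad=w, state=state)
print(w)  # entries move towards the minimum at 0
```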