# ADADELTA[^adadelta-official-paper]
`ADADELTA` was inspired by [`AdaGrad`](./ADAGRAD.md) and
was created to address some of its problems, like the
***sensitivity to the initial `parameters` and their
corresponding gradients***[^adadelta-official-paper].
## First Formulation
To address all these problems, `ADADELTA` accumulates
***squared gradients over a `window`***, using an
***exponentially decaying average***:
$$
E[g^2]_t = \alpha \cdot E[g^2]_{t-1} +
(1 - \alpha) \cdot g^2_t
$$
The update, which is very similar to the one in
[AdaGrad](./ADAGRAD.md#the-algorithm), becomes:
$$
\bar{w}_{t+1, i} =
\bar{w}_{t, i} - \frac{
\eta
}{
\sqrt{E[g^2]_t + \epsilon}
} \cdot g_{t,i}
$$
Defining $RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}$, the last
equation can be rewritten as:
$$
\bar{w}_{t+1, i} =
\bar{w}_{t, i} - \frac{
\eta
}{
RMS[g]_t
} \cdot g_{t,i}
$$
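To make this first formulation concrete, here is a minimal
`NumPy` sketch of a single update step; the function name,
hyperparameter names and default values are illustrative
assumptions, not taken from the paper:
```python
import numpy as np

def first_formulation_step(w, grad, eg2, lr=1e-3, alpha=0.9, eps=1e-6):
    """One step of the first formulation (illustrative names/defaults).

    w    -- parameter vector
    grad -- gradient of the loss w.r.t. w at the current step
    eg2  -- running average E[g^2] carried over from the previous step
    """
    # Exponentially decaying average of the squared gradients
    eg2 = alpha * eg2 + (1.0 - alpha) * grad ** 2
    # Scale the learning rate by RMS[g]_t = sqrt(E[g^2]_t + eps)
    w = w - lr / np.sqrt(eg2 + eps) * grad
    return w, eg2
```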
However, this is ***still not the final update equation***,
as its `units` ***do not match those of the parameters***.
## Second Formulation
This update is ***dimensionless***, while it should carry
the same `units` as the `parameters`. As noted by the
authors of the paper[^adadelta-units], we can correct this
by ***considering the curvature locally smooth*** and
approximating the `parameter` update at the current step
with the one accumulated up to the previous step, making
the full update equation:
$$
\bar{w}_{t + 1, i} =
\bar{w}_{t, i} - \frac{
RMS[\Delta\bar{w}_{i}]_{t - 1}
}{
RMS[g]_t
} \cdot g_{t,i}
$$
Here $RMS[\Delta\bar{w}_{i}]_{t-1}$ is computed from an
exponentially decaying average of the squared `parameter`
updates, accumulated just like the squared gradients.
As we can see, the ***`learning rate` completely
disappears from the equation, eliminating the need to
set one***.
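A minimal `NumPy` sketch of the full `ADADELTA` step, under
the same illustrative assumptions as the previous snippet,
shows that no `learning rate` argument is needed:
```python
import numpy as np

def adadelta_step(w, grad, eg2, edw2, alpha=0.9, eps=1e-6):
    """One ADADELTA step (illustrative names/defaults; no learning rate).

    eg2  -- running average E[g^2] of the squared gradients
    edw2 -- running average E[dw^2] of the squared parameter updates
    """
    # Exponentially decaying average of the squared gradients
    eg2 = alpha * eg2 + (1.0 - alpha) * grad ** 2
    # RMS of previous updates over RMS of gradients replaces the learning rate
    delta_w = -np.sqrt(edw2 + eps) / np.sqrt(eg2 + eps) * grad
    # Accumulate the squared updates for the next step
    edw2 = alpha * edw2 + (1.0 - alpha) * delta_w ** 2
    return w + delta_w, eg2, edw2

# Hypothetical usage on a small parameter vector
w, eg2, edw2 = np.zeros(3), np.zeros(3), np.zeros(3)
w, eg2, edw2 = adadelta_step(w, np.array([0.1, -0.2, 0.3]), eg2, edw2)
```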
> [!NOTE]
>
> It can be noticed that [`RMSProp`](./../INDEX.md#rmsprop-in-detail)
> is essentially the [first formulation](#first-formulation) we derived for this method.
<!-- Footnotes -->
[^adadelta-official-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)