# ADADELTA[^adadelta-official-paper]
`ADADELTA` was inspired by [`AdaGrad`](./ADAGRAD.md) and
was created to address some of its problems, like the
***sensitivity to the initial `parameters` and their
corresponding gradients***[^adadelta-official-paper].
## First Formulation
To address all these problems, `ADADELTA` accumulates
***squared gradients over a `window`***, using an
***exponentially decaying average***:
$$
E[g^2]_t = \alpha \cdot E[g^2]_{t-1} +
(1 - \alpha) \cdot g^2_t
$$
The update, which is very similar to the one in
[AdaGrad](./ADAGRAD.md#the-algorithm), becomes:
$$
\bar{w}_{t+1, i} =
\bar{w}_{t, i} - \frac{
\eta
}{
\sqrt{E[g^2]_t + \epsilon}
} \cdot g_{t,i}
$$
Defining $RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}$, the last
equation can be rewritten as:
$$
\bar{w}_{t+1, i} =
\bar{w}_{t, i} - \frac{
\eta
}{
RMS[g]_t
} \cdot g_{t,i}
$$
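To make this first formulation concrete, here is a minimal
`NumPy` sketch of a single update step; the function name,
hyperparameter names and default values are illustrative
assumptions, not taken from the paper:
```python
import numpy as np

def first_formulation_step(w, grad, eg2, lr=1e-3, alpha=0.9, eps=1e-6):
    """One step of the first formulation (illustrative names/defaults).

    w    -- parameter vector
    grad -- gradient of the loss w.r.t. w at the current step
    eg2  -- running average E[g^2] carried over from the previous step
    """
    # Exponentially decaying average of the squared gradients
    eg2 = alpha * eg2 + (1.0 - alpha) * grad ** 2
    # Scale the learning rate by RMS[g]_t = sqrt(E[g^2]_t + eps)
    w = w - lr / np.sqrt(eg2 + eps) * grad
    return w, eg2
```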
However, this is ***still not the final update equation***,
as its `units` ***do not match those of the parameters***.
## Second Formulation
This update is ***dimensionless***, while it should carry
the same `units` as the `parameters`. As noted by the
authors of the paper[^adadelta-units], we can correct this
by ***considering the curvature locally smooth*** and
approximating the `parameter` update at the current step
with the one accumulated up to the previous step, making
the full update equation:
$$
\bar{w}_{t + 1, i} =
\bar{w}_{t, i} - \frac{
RMS[\Delta\bar{w}_{i}]_{t - 1}
}{
RMS[g]_t
} \cdot g_{t,i}
$$
Here $RMS[\Delta\bar{w}_{i}]_{t-1}$ is computed from an
exponentially decaying average of the squared `parameter`
updates, accumulated just like the squared gradients.
As we can see, the ***`learning rate` completely
disappears from the equation, eliminating the need to
set one***.
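A minimal `NumPy` sketch of the full `ADADELTA` step, under
the same illustrative assumptions as the previous snippet,
shows that no `learning rate` argument is needed:
```python
import numpy as np

def adadelta_step(w, grad, eg2, edw2, alpha=0.9, eps=1e-6):
    """One ADADELTA step (illustrative names/defaults; no learning rate).

    eg2  -- running average E[g^2] of the squared gradients
    edw2 -- running average E[dw^2] of the squared parameter updates
    """
    # Exponentially decaying average of the squared gradients
    eg2 = alpha * eg2 + (1.0 - alpha) * grad ** 2
    # RMS of previous updates over RMS of gradients replaces the learning rate
    delta_w = -np.sqrt(edw2 + eps) / np.sqrt(eg2 + eps) * grad
    # Accumulate the squared updates for the next step
    edw2 = alpha * edw2 + (1.0 - alpha) * delta_w ** 2
    return w + delta_w, eg2, edw2

# Hypothetical usage on a small parameter vector
w, eg2, edw2 = np.zeros(3), np.zeros(3), np.zeros(3)
w, eg2, edw2 = adadelta_step(w, np.array([0.1, -0.2, 0.3]), eg2, edw2)
```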
> [!NOTE]
>
> It can be noticed that [`RMSProp`](./../INDEX.md#rmsprop-in-detail)
> is essentially the [first formulation](#first-formulation) we derived for this method.
<!-- Footnotes -->
[^adadelta-official-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)