# ADADELTA[^adadelta-offcial-paper]

`ADADELTA` was inspired by [`AdaGrad`](./ADAGRAD.md) and
created to address some of its problems, such as its
***sensitivity to the initial `parameters` and their
corresponding gradients***[^adadelta-offcial-paper].

## First Formulation

To address these problems, `ADADELTA` accumulates
***gradients over a `window`***, albeit as an
***exponentially decaying average***:

$$
E[g^2]_t = \alpha \cdot E[g^2]_{t-1} + (1 - \alpha) \cdot g^2_t
$$
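
As a concrete illustration, the decaying average can be sketched in a few
lines of `NumPy` (the function name and the `alpha` value below are
illustrative, not from the paper):

```python
import numpy as np

def update_sq_grad_avg(avg, grad, alpha=0.9):
    # E[g^2]_t = alpha * E[g^2]_{t-1} + (1 - alpha) * g_t^2
    return alpha * avg + (1.0 - alpha) * grad ** 2

# Unlike AdaGrad's unbounded sum, the decaying average forgets old
# gradients: a constant gradient of 1 drives it toward 1, not infinity.
avg = np.zeros(3)
for _ in range(100):
    avg = update_sq_grad_avg(avg, np.ones(3))
```

Because old terms are discounted by $\alpha$ at every step, the
accumulator behaves like an average over a recent `window` without
having to store any past gradients.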

The update, which is very similar to the one in
[AdaGrad](./ADAGRAD.md#the-algorithm), becomes:

$$
\bar{w}_{t+1, i} =
\bar{w}_{t, i} - \frac{
    \eta
}{
    \sqrt{E[g^2]_t + \epsilon}
} \cdot g_{t,i}
$$
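
A minimal sketch of this first formulation as a single training step
(the function name and hyper-parameter values are illustrative):

```python
import numpy as np

def step_first_formulation(w, grad, sq_avg, eta=0.01, alpha=0.9, eps=1e-6):
    # Decaying average of squared gradients, then an AdaGrad-style step.
    sq_avg = alpha * sq_avg + (1.0 - alpha) * grad ** 2
    w = w - eta / np.sqrt(sq_avg + eps) * grad
    return w, sq_avg

# Minimize f(w) = w^2 (gradient 2w), starting from w = 1.
w, sq_avg = np.array([1.0]), np.zeros(1)
for _ in range(500):
    w, sq_avg = step_first_formulation(w, 2.0 * w, sq_avg)
```

Note that a `learning rate` $\eta$ is still required here; removing it
is exactly what the second formulation achieves.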

Defining $RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}$, the last
equation can be rewritten as:

$$
\bar{w}_{t+1, i} =
\bar{w}_{t, i} - \frac{
    \eta
}{
    RMS[g]_t
} \cdot g_{t,i}
$$

However, this is ***still not the actual equation***, as
its `units` are ***all over the place***: the update ends
up dimensionless instead of carrying the units of the
`parameters`.

## Second Formulation

As seen above, this update is ***dimensionless***, while a
`parameter` update should have the units of the
`parameters` themselves. As noted by the authors of the
paper[^adadelta-units], we can correct this by
***assuming the curvature is locally smooth*** and
approximating the unknown update at the current step with
the `RMS` of the `parameter` updates up to the previous
step, making the full update equation:

$$
\bar{w}_{t + 1, i} =
\bar{w}_{t, i} - \frac{
    RMS[\Delta \bar{w}_{i}]_{t - 1}
}{
    RMS[g]_t
} \cdot g_{t,i}
$$
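
The full method can be sketched by tracking a second decaying average,
this one over the squared `parameter` updates (again, the names and the
`alpha`/`eps` values below are illustrative):

```python
import numpy as np

def adadelta_step(w, grad, sq_grad_avg, sq_delta_avg, alpha=0.95, eps=1e-6):
    # Decaying average of squared gradients: E[g^2]_t.
    sq_grad_avg = alpha * sq_grad_avg + (1.0 - alpha) * grad ** 2
    # The RMS of past updates replaces eta: no learning rate is needed.
    delta = -np.sqrt(sq_delta_avg + eps) / np.sqrt(sq_grad_avg + eps) * grad
    # Decaying average of squared updates for the next step.
    sq_delta_avg = alpha * sq_delta_avg + (1.0 - alpha) * delta ** 2
    return w + delta, sq_grad_avg, sq_delta_avg

# Minimize f(w) = w^2 (gradient 2w) without choosing a learning rate.
w = np.array([1.0])
g_avg, d_avg = np.zeros(1), np.zeros(1)
for _ in range(100):
    w, g_avg, d_avg = adadelta_step(w, 2.0 * w, g_avg, d_avg)
# w has moved from 1.0 toward the minimum at 0.
```

The `eps` term both avoids division by zero and gives the very first
step a non-zero size, since the average of squared updates starts
at zero.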

As we can notice, the ***`learning rate` $\eta$ completely
disappears from the equation, eliminating the need to
set one***.

> [!NOTE]
>
> It can be noticed that [`RMSProp`](./../INDEX.md#rmsprop-in-detail)
> is basically the [first update](#first-formulation) we derived for this method

<!-- Footnotes -->
[^adadelta-offcial-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)