AdamW [1]
This algorithm exists because the authors of the original paper [1] noticed that with Adam, L2 regularization is less effective than it is with SGD.
A further motivation is that many libraries implemented weight decay by rewriting the update in a way that makes L2 regularization and weight decay identical; that equivalence holds only for SGD and not for Adam [2].
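To make the distinction concrete, here is a short sketch (not taken verbatim from the cited references) of why the two coincide for SGD but not for Adam. With L2 regularization the gradient of the loss becomes \vec{g}_t + \lambda \vec{w}_{t-1}, so plain SGD gives

\vec{w}_t = \vec{w}_{t-1} - \eta \left( \vec{g}_t + \lambda \vec{w}_{t-1} \right) = \vec{w}_{t-1} - \eta \vec{g}_t - \eta \lambda \vec{w}_{t-1}

which is exactly SGD with weight decay. In Adam, however, the extra \lambda \vec{w}_{t-1} term enters the moment estimates and ends up divided by the adaptive denominator, so weights with large historical gradients are decayed less than intended.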
Algorithm
See Adam for the \hat{\mu}_t and \hat{\sigma}_t equations.
\vec{w}_t = \vec{w}_{t-1} - \eta \left( \frac{\hat{\mu}_t}{\sqrt{\hat{\sigma}_t + \epsilon}} + \lambda \vec{w}_{t-1} \right)
As we can see, the weight-decay term is applied here directly to the weights instead of being added to the gradient, so it is not scaled by the standard-deviation denominator.
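A minimal sketch of a single AdamW step in NumPy, assuming the \hat{\mu}_t and \hat{\sigma}_t moment estimates from the Adam note and illustrative default hyperparameters (this is a sketch, not the reference implementation):

```python
import numpy as np

def adamw_step(w, grad, mu, sigma, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=1e-2):
    """One AdamW update (sketch). mu/sigma are the first/second moment
    estimates, lam is the decoupled weight-decay coefficient."""
    # Moment estimates and bias correction (see the Adam note)
    mu = beta1 * mu + (1 - beta1) * grad
    sigma = beta2 * sigma + (1 - beta2) * grad ** 2
    mu_hat = mu / (1 - beta1 ** t)
    sigma_hat = sigma / (1 - beta2 ** t)
    # Decoupled weight decay: lam * w is added outside the adaptive scaling,
    # so it is NOT divided by sqrt(sigma_hat + eps)
    w = w - eta * (mu_hat / np.sqrt(sigma_hat + eps) + lam * w)
    return w, mu, sigma
```

Calling this in a loop with t = 1, 2, ... reproduces the update above; the point is that lam * w sits outside the division by np.sqrt(sigma_hat + eps), which is exactly the decoupling.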
1. AdamW official paper: "Decoupled Weight Decay Regularization", arXiv:1711.05101v3
2. Vito Walter Anelli, Deep Learning Material 2024/2025, PDF 5, pp. 60-61