
AdamW¹

This algorithm exists because the authors of the original paper¹ noticed that, with Adam, L2 regularization offered diminishing returns compared to SGD.

It also stems from the fact that many libraries implemented weight decay through a rewriting that makes L2 regularization and weight decay identical; however, this equivalence holds only for SGD, not for Adam².
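To see why the equivalence holds for SGD (a short sketch of the standard argument, using \lambda for the decay coefficient and L for the loss, not taken from the cited slides):

\vec{w}_t = \vec{w}_{t-1} - \eta \left( \nabla L(\vec{w}_{t-1}) + \lambda \vec{w}_{t-1} \right) = (1 - \eta \lambda)\, \vec{w}_{t-1} - \eta \, \nabla L(\vec{w}_{t-1})

Adding \lambda \vec{w}_{t-1} to the gradient (L2) is therefore exactly a multiplicative shrinkage of the weights (decay). With Adam, that same term would be divided by \sqrt{\hat{\sigma}_t + \epsilon}, so the effective decay would differ per parameter and the two are no longer the same.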

Algorithm

See Adam for the \hat{\mu}_t and \hat{\sigma}_t equations.


\vec{w}_t = \vec{w}_{t-1} - \eta \left( \frac{\hat{\mu}_t}{\sqrt{\hat{\sigma}_t + \epsilon}} + \lambda \vec{w}_{t-1} \right)

As we can see, applying the weight decay here, rather than adding it to the gradient, means the decay term is not rescaled by the standard-deviation estimate.
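A minimal NumPy sketch of one update step, assuming the usual Adam defaults; the function name adamw_step and its parameters are illustrative, not from the cited sources:

```python
import numpy as np

def adamw_step(w, grad, mu, sigma, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step; mu/sigma mirror the note's notation (first/second moments)."""
    # Exponential moving averages of the gradient and its square
    mu = beta1 * mu + (1 - beta1) * grad
    sigma = beta2 * sigma + (1 - beta2) * grad ** 2
    # Bias-corrected estimates (t starts at 1)
    mu_hat = mu / (1 - beta1 ** t)
    sigma_hat = sigma / (1 - beta2 ** t)
    # Decoupled weight decay: lambda * w is added outside the adaptive rescaling,
    # so it is NOT divided by sqrt(sigma_hat + eps)
    w = w - lr * (mu_hat / np.sqrt(sigma_hat + eps) + weight_decay * w)
    return w, mu, sigma
```

Here w, grad, mu and sigma are arrays of the same shape, with mu and sigma initialized to zeros and t counting steps from 1.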


  1. AdamW Official Paper | Decoupled Weight Decay Regularization | arXiv:1711.05101v3 ↩︎

  2. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 60-61 ↩︎