# AdamW[^official-paper]

This algorithm exists because the ***authors of the [original paper](https://arxiv.org/pdf/1711.05101v3)[^official-paper] noticed that with [Adam](./ADAM.md), `L2 regularization` is far less effective than it is with `SGD`***

This also stems from the fact that ***many libraries implemented `weight decay` through a rewriting of the update that makes `L2 regularization` and `weight decay` identical, but that equivalence holds only for `SGD`, not for `Adam`***[^anelli-adamw-1]

## Algorithm

See [Adam](./ADAM.md) for the $\hat{\mu}_t$ and $\hat{\sigma}_t$ equations

$$
\vec{w}_t = \vec{w}_{t-1} - \eta \left(\frac{ \hat{\mu}_t }{ \sqrt{ \hat{\sigma}_t + \epsilon} } + \lambda \vec{w}_{t-1} \right)
$$

As we can see, ***by applying the `weight decay` directly to the weights instead of adding it to the gradient, the decay term is not rescaled by the `std-dev`***

[^official-paper]: [AdamW Official Paper | DECOUPLED WEIGHT DECAY REGULARIZATION | arXiv:1711.05101v3](https://arxiv.org/pdf/1711.05101v3)
[^anelli-adamw-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 60-61
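
To make the decoupling concrete, here is a minimal sketch of a single AdamW update step in Python/NumPy. The function name `adamw_step` and the default hyperparameters are illustrative assumptions, not the paper's reference implementation; the $\epsilon$ placement follows the equation above.

```python
# Minimal sketch of one AdamW update step (illustrative, not the
# authors' reference implementation).
import numpy as np

def adamw_step(w, grad, mu, sigma, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=1e-2):
    """One decoupled-weight-decay update for the parameter vector `w`."""
    # Adam moment estimates (see the Adam notes for mu_t and sigma_t).
    # Note: the raw gradient is used here; with classic L2 regularization
    # we would instead feed `grad + lam * w` into the moments, and the
    # decay would then get divided by sqrt(sigma_hat) like everything else.
    mu = beta1 * mu + (1 - beta1) * grad
    sigma = beta2 * sigma + (1 - beta2) * grad**2

    # Bias-corrected estimates mu_hat and sigma_hat.
    mu_hat = mu / (1 - beta1**t)
    sigma_hat = sigma / (1 - beta2**t)

    # Decoupled weight decay: lam * w is added outside the adaptive
    # rescaling, so it is NOT scaled by the std-dev term.
    w = w - eta * (mu_hat / np.sqrt(sigma_hat + eps) + lam * w)
    return w, mu, sigma
```

A caller would keep `mu`, `sigma`, and the step counter `t` as persistent optimizer state, initializing the moments to zero vectors and incrementing `t` before each call.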