diff --git a/Chapters/5-Optimization/Fancy-Methods/ADAM-W.md b/Chapters/5-Optimization/Fancy-Methods/ADAM-W.md
new file mode 100644
index 0000000..b135558
--- /dev/null
+++ b/Chapters/5-Optimization/Fancy-Methods/ADAM-W.md
@@ -0,0 +1,66 @@
+# AdamW[^official-paper]
+
+This algorithm exists because the ***authors of the
+[original paper](https://arxiv.org/pdf/1711.05101v3)[^official-paper]
+noticed that, with [Adam](./ADAM.md), `L2 regularization`
+is far less effective than with `SGD`***.
+
+This also stems from the fact that ***many libraries
+implement `weight decay` by adding the decay term to the
+gradient, a rewriting under which `L2 regularization` and
+`weight decay` are identical, but this equivalence holds
+only for `SGD` and not for `Adam`***[^anelli-adamw-1].
+
+## Algorithm
+
+See [Adam](./ADAM.md) for the $\hat{\mu}_t$ and
+$\hat{\sigma}_t$ equations.
+
+$$
+\vec{w}_t = \vec{w}_{t-1} - \eta \left(\frac{
+  \hat{\mu}_t
+}{
+  \sqrt{ \hat{\sigma}_t + \epsilon}
+} + \lambda \vec{w}_{t-1}
+\right)
+$$
+
+As we can see, ***by applying the `weight decay` directly
+to the weights instead of adding it to the gradient, the
+decay term is not scaled by the `std-dev`*** (see the
+sketch below).
+
+[^official-paper]: [AdamW Official Paper | DECOUPLED WEIGHT DECAY REGULARIZATION | arXiv:1711.05101v3](https://arxiv.org/pdf/1711.05101v3)
+
+[^anelli-adamw-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 60-61
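+
+The following is a minimal, illustrative NumPy sketch of a
+single decoupled-weight-decay step, matching the formula
+above; it is not the reference implementation from the
+paper, and the names (`adamw_step`, `mu`, `sigma`, `lam`)
+and the default hyperparameter values are assumptions.
+
+```python
+import numpy as np
+
+def adamw_step(w, grad, mu, sigma, t,
+               eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=1e-2):
+    """One AdamW-style step; returns the updated (w, mu, sigma)."""
+    # Adam moment estimates (see ADAM.md). Note that `grad` is the plain
+    # loss gradient: no L2 term is folded into it.
+    mu = beta1 * mu + (1 - beta1) * grad
+    sigma = beta2 * sigma + (1 - beta2) * grad ** 2
+    mu_hat = mu / (1 - beta1 ** t)        # bias-corrected first moment
+    sigma_hat = sigma / (1 - beta2 ** t)  # bias-corrected second moment
+
+    # Decoupled weight decay: lam * w is added outside the division by
+    # sqrt(sigma_hat + eps), so the decay is not rescaled by the std-dev.
+    w = w - eta * (mu_hat / np.sqrt(sigma_hat + eps) + lam * w)
+    return w, mu, sigma
+```
+
+A caller would keep `mu` and `sigma` as running buffers
+(initialized to zeros with the same shape as the weights)
+and increment `t` on every step, exactly as in plain
+[Adam](./ADAM.md); only the placement of the decay term
+changes.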