Added AdamW
This commit is contained in:
parent
6cc1d8a4ed
commit
2ddb65b09b
36
Chapters/5-Optimization/Fancy-Methods/ADAM-W.md
Normal file
@@ -0,0 +1,36 @@
# AdamW[^official-paper]

This algorithm exists because the ***authors of the
[original paper](https://arxiv.org/pdf/1711.05101v3)[^official-paper]
noticed that with [Adam](./ADAM.md), `L2 regularization`
offered diminishing returns compared to `SGD`***

This also comes from the fact that ***many libraries
implemented `weight-decay` in a rewritten form
that made `L2` and `weight decay` identical, but this
works only for `SGD`
and not for `Adam`***[^anelli-adamw-1]
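
To make this concrete, here is a short derivation (added for
illustration, with $\nabla L_t$ denoting the loss gradient): for plain
`SGD`, the rewritten `L2` term is exactly a multiplicative shrinkage of
the weights,

$$
\vec{w}_t = \vec{w}_{t-1} - \eta \left( \nabla L_t + \lambda \vec{w}_{t-1} \right) = (1 - \eta \lambda)\, \vec{w}_{t-1} - \eta \nabla L_t
$$

whereas in `Adam` the same $\lambda \vec{w}_{t-1}$ term gets divided by
$\sqrt{\hat{\sigma}_t + \epsilon}$ like the rest of the gradient, so
weights with a large gradient history are decayed less than intended.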

## Algorithm

See [Adam](./ADAM.md) for the $\hat{\mu}_t$ and
$\hat{\sigma}_t$ equations
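
For quick reference (recalled here for convenience, assuming the usual
Adam definitions with $g_t$ the gradient at step $t$ and $\beta_1,
\beta_2$ the moment decay rates):

$$
\mu_t = \beta_1 \mu_{t-1} + (1 - \beta_1)\, g_t
\qquad
\sigma_t = \beta_2 \sigma_{t-1} + (1 - \beta_2)\, g_t^2
$$

$$
\hat{\mu}_t = \frac{\mu_t}{1 - \beta_1^t}
\qquad
\hat{\sigma}_t = \frac{\sigma_t}{1 - \beta_2^t}
$$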

$$
\vec{w}_t = \vec{w}_{t-1} - \eta \left( \frac{\hat{\mu}_t}{\sqrt{\hat{\sigma}_t + \epsilon}} + \lambda \vec{w}_{t-1} \right)
$$
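
As an illustration, here is a minimal NumPy sketch of this update step
(variable names and the toy loss are chosen here, not from the notes):

```python
import numpy as np

def adamw_step(w, g, mu, sigma, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step: an Adam update plus decoupled weight decay."""
    # Biased first and second moment estimates (as in Adam)
    mu = beta1 * mu + (1 - beta1) * g
    sigma = beta2 * sigma + (1 - beta2) * g ** 2
    # Bias correction
    mu_hat = mu / (1 - beta1 ** t)
    sigma_hat = sigma / (1 - beta2 ** t)
    # Decoupled weight decay: weight_decay * w is NOT divided by sqrt(sigma_hat)
    w = w - lr * (mu_hat / np.sqrt(sigma_hat + eps) + weight_decay * w)
    return w, mu, sigma

# Toy usage: minimise 0.5 * ||w - target||^2 on a single parameter vector
target = np.array([1.0, -2.0, 3.0, 0.5])
w = np.zeros_like(target)
mu, sigma = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 501):
    g = w - target  # gradient of the toy loss
    w, mu, sigma = adamw_step(w, g, mu, sigma, t)
```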

As we can see, ***by applying the `weight-decay` here,
directly in the update instead of inside the gradient,
it does not get scaled by the `std-dev`***
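
This is also why, for example, PyTorch exposes `torch.optim.AdamW` as a
separate optimizer from `Adam` with `weight_decay`; a short sketch (the
model and hyper-parameters below are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)

# Adam's weight_decay is added to the gradient (coupled L2 penalty),
# while AdamW applies decoupled weight decay directly to the weights.
coupled = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
decoupled = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```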

<!-- Footnotes -->

[^official-paper]: [AdamW Official Paper | DECOUPLED WEIGHT DECAY REGULARIZATION | arXiv:1711.05101v3](https://arxiv.org/pdf/1711.05101v3)

[^anelli-adamw-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 60-61