Added AdamW
This commit is contained in: parent 6cc1d8a4ed, commit 2ddb65b09b
36
Chapters/5-Optimization/Fancy-Methods/ADAM-W.md
Normal file
@@ -0,0 +1,36 @@
# AdamW[^official-paper]

The reason this algorithm exists is that the ***authors of the
[original paper](https://arxiv.org/pdf/1711.05101v3)[^official-paper] noticed
that, with [Adam](./ADAM.md), `L2 regularization`
offered diminishing returns compared to `SGD`***.

This also stems from the fact that ***many libraries
implemented `weight decay` through a rewriting
that makes `L2 regularization` and `weight decay` identical, but this
equivalence holds only for `SGD`
and not for `Adam`***[^anelli-adamw-1].
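
To see why the two coincide for plain `SGD` (a quick sketch, using the same
notation as the update rule below and writing $L$ for the loss): adding an
`L2` term $\frac{\lambda}{2}\lVert\vec{w}\rVert^2$ to the loss changes the
gradient by exactly $\lambda \vec{w}_{t-1}$, so

$$
\vec{w}_t = \vec{w}_{t-1} - \eta \left( \nabla L(\vec{w}_{t-1}) + \lambda \vec{w}_{t-1} \right)
= (1 - \eta \lambda)\, \vec{w}_{t-1} - \eta\, \nabla L(\vec{w}_{t-1})
$$

which is the usual multiplicative `weight decay`. With `Adam`, the
$\lambda \vec{w}_{t-1}$ term would instead be divided by
$\sqrt{\hat{\sigma}_t + \epsilon}$ together with the gradient, so the two
techniques are no longer equivalent.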
## Algorithm

See [Adam](./ADAM.md) for the $\hat{\mu}_t$ and
$\hat{\sigma}_t$ equations.
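
For reference, the bias-corrected moment estimates are recalled here
(assuming the standard exponential-moving-average definitions, which should
match the notation of the [Adam](./ADAM.md) chapter; the gradient square is
element-wise):

$$
\mu_t = \beta_1 \mu_{t-1} + (1 - \beta_1)\, \nabla L(\vec{w}_{t-1}), \qquad
\sigma_t = \beta_2 \sigma_{t-1} + (1 - \beta_2)\, \nabla L(\vec{w}_{t-1})^2
$$

$$
\hat{\mu}_t = \frac{\mu_t}{1 - \beta_1^t}, \qquad
\hat{\sigma}_t = \frac{\sigma_t}{1 - \beta_2^t}
$$

where $\beta_1$ and $\beta_2$ are the decay rates of the two moving averages.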

$$
\vec{w}_t = \vec{w}_{t-1} - \eta \left( \frac{\hat{\mu}_t}{\sqrt{\hat{\sigma}_t + \epsilon}} + \lambda \vec{w}_{t-1} \right)
$$

As we can see, ***applying the `weight decay` here,
rather than adding it to the gradient, means it is
not scaled by the `std-dev` term***.
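
As a concrete illustration, here is a minimal `NumPy` sketch of a single step
following the formula above (function and variable names are hypothetical;
the default hyper-parameters are common choices, not values prescribed by the
paper):

```python
import numpy as np

def adamw_step(w, grad, mu, sigma, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    """One AdamW update; `mu` / `sigma` are the running first / second moments,
    `t` is the 1-based step count used for bias correction."""
    # Exponential moving averages of the gradient and its element-wise square
    mu = beta1 * mu + (1 - beta1) * grad
    sigma = beta2 * sigma + (1 - beta2) * grad**2

    # Bias-corrected estimates (mu_hat, sigma_hat in the formula above)
    mu_hat = mu / (1 - beta1**t)
    sigma_hat = sigma / (1 - beta2**t)

    # Decoupled weight decay: wd * w is added OUTSIDE the adaptive scaling,
    # so it is not divided by sqrt(sigma_hat + eps)
    w = w - lr * (mu_hat / np.sqrt(sigma_hat + eps) + wd * w)
    return w, mu, sigma
```

Adding `wd * w` to `grad` before the moment updates would instead reproduce
the `L2` behaviour, with the decay divided by the same `std-dev` term: that is
exactly the coupling AdamW removes.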
<!-- Footnotes -->
[^official-paper]: [AdamW Official Paper | DECOUPLED WEIGHT DECAY REGULARIZATION | arXiv:1711.05101v3](https://arxiv.org/pdf/1711.05101v3)
[^anelli-adamw-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5, pp. 60-61