diff --git a/Chapters/5-Optimization/Fancy-Methods/ADAM.md b/Chapters/5-Optimization/Fancy-Methods/ADAM.md
new file mode 100644
index 0000000..a3fdf99
--- /dev/null
+++ b/Chapters/5-Optimization/Fancy-Methods/ADAM.md
@@ -0,0 +1,99 @@
+# Adam
+
+## Mean and Variance
+
+Here we estimate the `running-mean` and the
+`running-uncentered-variance` (the second raw
+moment) of the gradient:
+
+$$
+\begin{aligned}
+    \mu_t &= \beta_1 \mu_{t-1} + (1 - \beta_1)g_t \\
+    \sigma_t &= \beta_2 \sigma_{t-1}
+        + (1 - \beta_2)g_t^2
+\end{aligned}
+$$
+
+Then we bias-correct them, otherwise both
+estimates are pulled toward zero during the
+first steps[^anelli-adam-1]:
+
+$$
+\begin{aligned}
+    \hat{\mu}_t &= \frac{
+        \mu_t
+    }{
+        1 - \beta_1^t
+    } \\
+    \hat{\sigma}_t &= \frac{
+        \sigma_t
+    }{
+        1 - \beta_2^t
+    }
+\end{aligned}
+$$
+
+## Weight Update
+
+The `weight` update, which
+***still needs a `learning-rate`*** $\eta$, is:
+
+$$
+\vec{w}_t = \vec{w}_{t-1} - \eta \frac{
+    \hat{\mu}_t
+}{
+    \sqrt{\hat{\sigma}_t} + \epsilon
+}
+$$
+
+(A minimal code sketch of this update is given at
+the end of these notes.)
+
+Typical default values are:
+
+- $\eta = 0.001$
+- $\beta_1 = 0.9$
+- $\beta_2 = 0.999$
+- $\epsilon = 10^{-8}$
+
+## Pros and Cons
+
+### Pros
+
+- ***It adapts `learning-rates` locally***
+- ***It handles sparse gradients***
+- ***It is more robust to non-optimal
+  `hyperparameters`***
+
+### Cons[^anelli-adam-2]
+
+- ***It fails to converge in some cases***
+- ***It generalizes worse than `SGD`***, especially
+  for images
+- It needs ***3 times the `memory`*** of plain
+  `SGD`, because of the 3 `buffers` needed
+  (the weights plus the two moment estimates
+  $\mu_t$ and $\sigma_t$)
+- We need to
+  ***tune 2 momentum parameters instead of 1***
+
+## Notes[^anelli-adam-3]
+
+- Whenever the ***gradient*** is roughly constant,
+  the `local gain` is 1: since $\sigma_t$ is
+  ***not centered***,
+  $\hat{\sigma}_t \rightarrow \hat{\mu}_t^2$, so
+  the step reduces to $\pm\eta$
+- On the other hand, when the
+  ***gradient changes frequently***,
+  $\hat{\sigma}_t \gg \hat{\mu}_t^2$, thus the
+  ***steps will be smaller***
+- In a nutshell, this means that ***whenever we
+  are sure of following the right path, we take
+  big steps***, and ***when we are unsure, we
+  proceed more slowly***
+- As a consequence of the previous point, `Adam`
+  may stop in the ***closest*** `local minimum`,
+  which could also be a ***shallower*** one
+
+[^adam-official-paper]: [Official Paper | arXiv:1412.6980v9](https://arxiv.org/pdf/1412.6980)
+
+[^anelli-adam-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 55
+
+[^anelli-adam-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 54
+
+[^anelli-adam-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 57-58
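+
+## Minimal Implementation Sketch
+
+Below is a minimal NumPy sketch of the update
+described above, just to make the three steps
+(moment estimates, bias correction, weight update)
+concrete. The function name `adam_step` and the toy
+quadratic loss are illustrative choices for these
+notes, not taken from any library; the defaults are
+the usual values listed in the Weight Update section.
+
+```python
+import numpy as np
+
+def adam_step(w, g, mu, sigma, t,
+              eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
+    """One Adam step; names mirror the notation above (t starts at 1)."""
+    # Running mean and uncentered variance of the gradient
+    mu = beta1 * mu + (1.0 - beta1) * g
+    sigma = beta2 * sigma + (1.0 - beta2) * g ** 2
+    # Bias correction, so early estimates are not pulled toward zero
+    mu_hat = mu / (1.0 - beta1 ** t)
+    sigma_hat = sigma / (1.0 - beta2 ** t)
+    # Weight update, still driven by the learning rate eta
+    w = w - eta * mu_hat / (np.sqrt(sigma_hat) + eps)
+    return w, mu, sigma
+
+# Toy usage: minimize L(w) = 0.5 * ||w||^2, whose gradient is w itself
+w = np.array([1.0, -2.0])
+mu, sigma = np.zeros_like(w), np.zeros_like(w)
+for t in range(1, 5001):
+    g = w                              # gradient of the toy loss
+    w, mu, sigma = adam_step(w, g, mu, sigma, t)
+print(w)                               # ends up close to the minimum at [0, 0]
+```
+
+Note how, while the gradient keeps the same sign,
+the effective step stays close to $\eta$, which is
+exactly the "constant gradient" behaviour discussed
+in the Notes section.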