# Adam

## Mean and Variance

Here we estimate the `running-mean` and the `running-uncentered-variance` of the gradient:

$$
\begin{aligned}
\mu_t &= \beta_1 \mu_{t-1} + (1 - \beta_1)g_t \\
\sigma_t &= \beta_2 \sigma_{t-1} + (1 - \beta_2)g_t^2
\end{aligned}
$$

Then we bias-correct them: both estimates are initialized at zero, so during the first steps they would otherwise be biased toward zero[^anelli-adam-1]:

$$
\begin{aligned}
\hat{\mu}_t &= \frac{ \mu_t }{ 1 - \beta_1^t } \\
\hat{\sigma}_t &= \frac{ \sigma_t }{ 1 - \beta_2^t }
\end{aligned}
$$

## Weight Update

The `weight` update, which ***still needs a global `learning-rate`***, is[^adam-official-paper]:

$$
\vec{w}_t = \vec{w}_{t-1} - \eta \frac{ \hat{\mu}_t }{ \sqrt{ \hat{\sigma}_t } + \epsilon }
$$

The usual values are:

- $\eta = 0.001$
- $\beta_1 = 0.9$
- $\beta_2 = 0.999$
- $\epsilon = 10^{-8}$

A runnable sketch of this update appears at the end of this page.

## Pros and Cons

### Pros

- ***It adapts `learning-rates` locally***
- ***It handles sparse gradients***
- ***It is more robust to non-optimal `hyperparameters`***

### Cons[^anelli-adam-2]

- ***It fails to converge in some cases***
- ***It generalizes worse than `SGD`***, especially on image tasks
- It needs ***3 times the `memory`*** of `SGD`, because it keeps 3 `buffers` (weights, first moment, second moment) instead of 1
- We need to ***tune 2 momentum parameters instead of 1***

## Notes[^anelli-adam-3][^adamw-notes]

- Whenever the ***gradient*** is constant, the `local gain` is 1, as $\sqrt{\hat{\sigma}_t} \rightarrow |\hat{\mu}_t|$, because the variance estimate is ***not centered***
- On the other hand, when the ***gradient changes frequently***, $\sqrt{\hat{\sigma}_t} \gg |\hat{\mu}_t|$, thus the ***steps will be smaller***
- In a nutshell, this means that ***whenever we are sure of following the right path, we take big steps***, and ***when we are unsure, we proceed more slowly*** (checked numerically at the end of this page)
- Because of the previous point, `Adam` may stop in the ***closest*** `local minimum`, which could also be a ***shallower*** one

[^adam-official-paper]: [Official Paper | arXiv:1412.6980v9](https://arxiv.org/pdf/1412.6980)
[^anelli-adam-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 55
[^anelli-adam-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 54
[^anelli-adam-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 57-58
[^adamw-notes]: [AdamW Notes](./ADAM-W.md)
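
## Minimal Sketch

Below is a minimal NumPy sketch of the update above, using the usual hyperparameter values listed earlier. The function name `adam_step` and the toy objective $f(w) = w^2$ are illustrative, not taken from the cited material.

```python
import numpy as np

def adam_step(w, g, mu, sigma, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `t` is the 1-based step counter."""
    mu = beta1 * mu + (1 - beta1) * g              # running mean of gradients
    sigma = beta2 * sigma + (1 - beta2) * g ** 2   # running uncentered variance
    mu_hat = mu / (1 - beta1 ** t)                 # bias corrections
    sigma_hat = sigma / (1 - beta2 ** t)
    w = w - eta * mu_hat / (np.sqrt(sigma_hat) + eps)
    return w, mu, sigma

# Toy usage: minimize f(w) = w^2 starting from w = 5
w = np.array(5.0)
mu, sigma = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 501):
    g = 2 * w                                      # gradient of w^2
    w, mu, sigma = adam_step(w, g, mu, sigma, t, eta=0.1)
print(w)  # approaches 0
```

Note the two state `buffers` (`mu`, `sigma`) carried alongside the weights: together these are the 3 buffers mentioned in the cons above.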
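The notes above can also be checked numerically: the bias-corrected ratio $\hat{\mu}_t / (\sqrt{\hat{\sigma}_t} + \epsilon)$ is what multiplies $\eta$ in the update, so a constant gradient should drive it to $\pm 1$ while a sign-flipping gradient should shrink it toward 0. This is a hypothetical check, with illustrative names:

```python
def effective_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run the moment estimates over a gradient sequence and return the
    bias-corrected ratio that scales the learning rate in the update."""
    mu = sigma = 0.0
    for g in grads:
        mu = beta1 * mu + (1 - beta1) * g
        sigma = beta2 * sigma + (1 - beta2) * g ** 2
    t = len(grads)
    mu_hat = mu / (1 - beta1 ** t)
    sigma_hat = sigma / (1 - beta2 ** t)
    return mu_hat / (sigma_hat ** 0.5 + eps)

print(effective_ratio([1.0] * 500))                        # ~ 1.0   -> steps of size ~eta
print(effective_ratio([(-1.0) ** t for t in range(500)]))  # ~ ±0.05 -> much smaller steps
```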