# Adam[^adam-official-paper]
## Mean and Variance
Here we estimate the `running-mean` and the
`running-uncentered-variance` of the gradient:
$$
\begin{aligned}
\mu_t &= \beta_1 \mu_{t-1} + (1 - \beta_1)g_t \\
\sigma_t &= \beta_2 \sigma_{t -1} +
(1 - \beta_2)g_t^2
\end{aligned}
$$
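As a minimal sketch of these two updates (assuming a `NumPy` setting where `g` is the current gradient and `mu`, `sigma` are the running buffers, both initialized to zero; names are my own, not from the paper):
```python
import numpy as np

def update_moments(mu, sigma, g, beta1=0.9, beta2=0.999):
    """One step of the two exponential moving averages."""
    mu = beta1 * mu + (1 - beta1) * g            # running mean of g
    sigma = beta2 * sigma + (1 - beta2) * g**2   # running uncentered variance of g
    return mu, sigma
```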
Then we bias-correct them: both estimates start at
zero, so in the first steps they are biased towards
zero and would otherwise be scaled poorly[^anelli-adam-1]:
$$
\begin{aligned}
\hat{\mu}_t &= \frac{
\mu_t
}{
1 - \beta_1^t
} \\
\hat{\sigma}_t &= \frac{
\sigma_t
}{
1 - \beta_2^t
}
\end{aligned}
$$
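As a quick check of why this is needed: with $\mu_0 = 0$ and $\beta_1 = 0.9$, the raw first estimate is $\mu_1 = (1 - \beta_1)g_1 = 0.1\,g_1$, far too small; the correction recovers the gradient exactly:
$$
\hat{\mu}_1 = \frac{(1 - \beta_1)g_1}{1 - \beta_1^1} = g_1
$$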
## Weight Update
The `weight` update, which
***requires a `learning-rate`*** $\eta$, is:
$$
\vec{w}_t = \vec{w}_{t-1} - \eta \frac{
\hat{\mu}_t
}{
\sqrt{\hat{\sigma}_t} + \epsilon
}
$$
The usual default values are:
- $\eta = 0.001$
- $\beta_1 = 0.9$
- $\beta_2 = 0.999$
- $\epsilon = 10^{-8}$
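Putting the pieces together, a minimal `NumPy` sketch of one full `Adam` step with the defaults above (function and variable names are my own choices, not from the paper):
```python
import numpy as np

def adam_step(w, mu, sigma, g, t,
              eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `t` is the 1-based step counter."""
    mu = beta1 * mu + (1 - beta1) * g            # first moment
    sigma = beta2 * sigma + (1 - beta2) * g**2   # second (uncentered) moment
    mu_hat = mu / (1 - beta1**t)                 # bias correction
    sigma_hat = sigma / (1 - beta2**t)
    w = w - eta * mu_hat / (np.sqrt(sigma_hat) + eps)
    return w, mu, sigma

# Toy usage: minimize ||w - target||^2
target = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
mu, sigma = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 5001):
    g = 2 * (w - target)                         # gradient of the toy loss
    w, mu, sigma = adam_step(w, mu, sigma, g, t)
print(w)                                         # approaches `target`
```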
## Pros and Cons
### Pros
- ***It adapts `learning-rates` locally***
- ***It handles `sparse gradients` well***
- ***It is more robust to suboptimal `hyperparameters`***
### Cons[^anelli-adam-2]
- ***It fails to converge in some cases***
- ***It generalizes worse than `SGD`***, especially for
images
- It needs ***3 times the `memory`*** of `SGD` because
of the 3 `buffers` needed ($\vec{w}_t$, $\mu_t$ and $\sigma_t$)
- We need to
***tune 2 momentum parameters instead of 1***
## Notes[^anelli-adam-3][^adamw-notes]
- Whenever the ***gradient*** is constant, the `local
gain` is 1, as
$\sqrt{\hat{\sigma}_t} \rightarrow |\hat{\mu}_t|$, because
the second moment is ***not centered***
(see the numeric sketch after this list)
- On the other hand, when the
***gradient changes frequently***,
$\sqrt{\hat{\sigma}_t} \gg |\hat{\mu}_t|$, thus the
***steps will be smaller***
- In a nutshell, this means that ***whenever we
are sure of following the right path, we take
big steps*** and ***when we are unsure, we
proceed more slowly***.
- Because of the previous point, `Adam` may stop
in the ***closest*** `local minimum`, which could
also be a ***shallower*** one
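To see the first two notes numerically, here is a toy sketch (my own experiment, not from the slides) comparing the magnitude of the effective step $\hat{\mu}_t / (\sqrt{\hat{\sigma}_t} + \epsilon)$ for a constant gradient versus a sign-flipping one:
```python
import numpy as np

def effective_step(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """|mu_hat / (sqrt(sigma_hat) + eps)| after the last gradient in `grads`."""
    mu = sigma = 0.0
    for t, g in enumerate(grads, start=1):
        mu = beta1 * mu + (1 - beta1) * g
        sigma = beta2 * sigma + (1 - beta2) * g**2
        mu_hat = mu / (1 - beta1**t)
        sigma_hat = sigma / (1 - beta2**t)
    return abs(mu_hat / (np.sqrt(sigma_hat) + eps))

print(effective_step([0.5] * 200))        # constant gradient    -> ~1 (big steps)
print(effective_step([0.5, -0.5] * 100))  # oscillating gradient -> ~0.05 (small steps)
```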
<!-- Footnotes -->
[^adam-official-paper]: [Official Paper | arXiv:1412.6980v9](https://arxiv.org/pdf/1412.6980)
[^anelli-adam-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 55
[^anelli-adam-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 54
[^anelli-adam-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 57-58
[^adamw-notes]: [AdamW Notes](./ADAM-W.md)