# Adam[^adam-official-paper]
## Mean and Variance
Here we estimate the `running-mean` and the
`running-not-centered-variance`:

$$
\begin{aligned}
\mu_t &= \beta_1 \mu_{t-1} + (1 - \beta_1) g_t \\
\sigma_t &= \beta_2 \sigma_{t-1} + (1 - \beta_2) g_t^2
\end{aligned}
$$

Then we bias-correct them, otherwise they are poorly
scaled during the first steps[^anelli-adam-1]:

$$
\begin{aligned}
\hat{\mu}_t &= \frac{\mu_t}{1 - \beta_1^t} \\
\hat{\sigma}_t &= \frac{\sigma_t}{1 - \beta_2^t}
\end{aligned}
$$
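
Below is a minimal NumPy sketch of these two estimates and their bias
correction; the names (`update_moments`, `mu`, `sigma`, `grad`) are
illustrative, not from the source:

```python
import numpy as np

def update_moments(mu, sigma, grad, t, beta1=0.9, beta2=0.999):
    """Update the running mean and uncentered variance of the gradient,
    then return their bias-corrected versions (t is 1-based)."""
    # Exponential moving averages of g and g^2
    mu = beta1 * mu + (1.0 - beta1) * grad
    sigma = beta2 * sigma + (1.0 - beta2) * grad**2
    # Bias correction: both buffers start at 0, so early estimates
    # are scaled back up by 1 / (1 - beta^t)
    mu_hat = mu / (1.0 - beta1**t)
    sigma_hat = sigma / (1.0 - beta2**t)
    return mu, sigma, mu_hat, sigma_hat
```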
## Weight Update

The `weight` update, which ***requires a `learning-rate`*** $\eta$, is:

$$
\vec{w}_t = \vec{w}_{t-1} - \eta \frac{\hat{\mu}_t}{\sqrt{\hat{\sigma}_t} + \epsilon}
$$

The usual values are:

- $\eta = 0.001$
- $\beta_1 = 0.9$
- $\beta_2 = 0.999$
- $\epsilon = 10^{-8}$
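
Putting the pieces together, here is a minimal, illustrative Adam step
in NumPy, using the usual values above as defaults (the `adam_step`
name and the toy objective are mine, not from the source):

```python
import numpy as np

def adam_step(w, grad, mu, sigma, t,
              eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    # Running (uncentered) first and second moments of the gradient
    mu = beta1 * mu + (1.0 - beta1) * grad
    sigma = beta2 * sigma + (1.0 - beta2) * grad**2
    # Bias correction
    mu_hat = mu / (1.0 - beta1**t)
    sigma_hat = sigma / (1.0 - beta2**t)
    # Weight update
    w = w - eta * mu_hat / (np.sqrt(sigma_hat) + eps)
    return w, mu, sigma

# Usage: both moment buffers start at zero
w = np.array([1.0, -2.0])
mu, sigma = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):
    grad = 2.0 * w          # gradient of the toy objective f(w) = ||w||^2
    w, mu, sigma = adam_step(w, grad, mu, sigma, t)
```

Starting both moment `buffers` at zero is exactly why the bias
correction above is needed.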
## Pros and Cons

### Pros

- ***It adapts `learning-rates` locally***
- ***It handles sparse gradients***
- ***It is more robust to non-optimal `hyperparameters`***

### Cons[^anelli-adam-2]

- ***It fails to converge in some cases***
- ***It generalizes worse than `SGD`***, especially on images
- It needs ***3 times the `memory`*** of `SGD` because of the 3
  `buffers` it keeps
- We need to ***tune 2 momentum parameters instead of 1***
## Notes[^anelli-adam-3][^adamw-notes]

- Whenever the ***gradient*** is constant, the `local gain` is 1, as
  $\sqrt{\hat{\sigma}_t} \rightarrow |\hat{\mu}_t|$, because the estimate
  is ***not centered*** (see the sketch after this list)
- On the other hand, when the ***gradient changes sign frequently***,
  $\sqrt{\hat{\sigma}_t} \gg |\hat{\mu}_t|$, thus the
  ***steps will be smaller***
- In a nutshell, this means that ***whenever we are sure of following
  the right path, we take big steps*** and ***when we are unsure, we
  proceed more slowly***.
- Because of the previous point, `Adam` may stop in the `local minimum`
  that is the ***closest***, but could also be ***shallower***
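
As a quick numerical check of the first two points, the toy script
below (not from the source) feeds Adam a constant gradient and a
sign-flipping one and compares the effective step
$\hat{\mu}_t / (\sqrt{\hat{\sigma}_t} + \epsilon)$:

```python
def effective_step(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return mu_hat / (sqrt(sigma_hat) + eps) after feeding `grads`."""
    mu = sigma = 0.0
    for t, g in enumerate(grads, start=1):
        mu = beta1 * mu + (1 - beta1) * g
        sigma = beta2 * sigma + (1 - beta2) * g**2
    mu_hat = mu / (1 - beta1**t)
    sigma_hat = sigma / (1 - beta2**t)
    return mu_hat / (sigma_hat**0.5 + eps)

constant = [0.5] * 1000                              # always the same direction
oscillating = [0.5 * (-1)**t for t in range(1000)]   # sign flips every step

print(effective_step(constant))     # ~1.0: full-size step
print(effective_step(oscillating))  # near 0: the oscillation cancels in the mean
```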
<!-- Footnotes -->

[^adam-official-paper]: [Official Paper | arXiv:1412.6980v9](https://arxiv.org/pdf/1412.6980)

[^anelli-adam-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 55

[^anelli-adam-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 54

[^anelli-adam-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 57-58

[^adamw-notes]: [AdamW Notes](./ADAM-W.md)