# Adam[^adam-official-paper]

## Mean and Variance

Here we estimate the `running-mean` and the
`running-uncentered-variance` of the gradient (a raw
second moment, ***not centered***, rather than a true
standard deviation):

$$
\begin{aligned}
\mu_t &= \beta_1 \mu_{t-1} + (1 - \beta_1) g_t \\
\sigma_t &= \beta_2 \sigma_{t-1} + (1 - \beta_2) g_t^2
\end{aligned}
$$
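
A minimal sketch of this step (assuming `g` is the
current gradient, as a scalar or `NumPy` array, and
`mu`, `sigma` are state values initialized to zero; the
helper name is made up here):

```python
def update_moments(mu, sigma, g, beta1=0.9, beta2=0.999):
    """Exponential moving averages of the gradient and of its square."""
    mu = beta1 * mu + (1 - beta1) * g            # running mean of g
    sigma = beta2 * sigma + (1 - beta2) * g**2   # running uncentered variance of g
    return mu, sigma
```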

Then we apply a `bias-correction`: both running
estimates start at zero, so during the first steps they
are biased toward zero and would not be scaled
well[^anelli-adam-1]:

$$
\begin{aligned}
\hat{\mu}_t &= \frac{\mu_t}{1 - \beta_1^t} \\
\hat{\sigma}_t &= \frac{\sigma_t}{1 - \beta_2^t}
\end{aligned}
$$
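
Continuing the same sketch, the correction only needs
the step counter `t` (assumed to start from 1):

```python
def bias_correct(mu, sigma, t, beta1=0.9, beta2=0.999):
    """Undo the bias toward zero caused by the zero initialization."""
    mu_hat = mu / (1 - beta1 ** t)
    sigma_hat = sigma / (1 - beta2 ** t)
    return mu_hat, sigma_hat
```
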
## Weight Update

The `weight` update, which
***needs a `learning-rate`*** $\eta$, is:

$$
\vec{w}_t = \vec{w}_{t-1} - \eta
\frac{\hat{\mu}_t}{\sqrt{\hat{\sigma}_t + \epsilon}}
$$

Typical default values are:

- $\eta = 0.001$
- $\beta_1 = 0.9$
- $\beta_2 = 0.999$
- $\epsilon = 10^{-8}$
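
Putting the sketch together with these defaults, one
full `Adam` step could look like this (the helper names
are made up here; `w` is the weight vector, and
$\epsilon$ sits inside the square root as in the
formula above):

```python
import numpy as np

def apply_update(w, mu_hat, sigma_hat, lr=0.001, eps=1e-8):
    """Weight update using the bias-corrected moments."""
    return w - lr * mu_hat / np.sqrt(sigma_hat + eps)

# One Adam step inside a training loop, where `w`, `g`,
# `mu`, `sigma` and the step counter `t` (from 1) are
# maintained by the loop itself:
mu, sigma = update_moments(mu, sigma, g)
mu_hat, sigma_hat = bias_correct(mu, sigma, t)
w = apply_update(w, mu_hat, sigma_hat)
```
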
## Pros and Cons

### Pros

- ***It adapts `learning-rates` locally***
- ***It handles sparse gradients well***
- ***It is more robust to non-optimal `hyperparameters`***

### Cons[^anelli-adam-2]

- ***It fails to converge in some cases***
- ***It generalizes worse than `SGD`***, especially for
images
- It needs ***3 times the `memory`*** of plain `SGD`,
because of the 3 `buffers` needed (weights, first and
second moment)
- We need to
***tune 2 momentum parameters instead of 1***
($\beta_1$ and $\beta_2$)

## Notes[^anelli-adam-3]

- Whenever the ***gradient*** is constant, the `local
gain` is 1: since the estimate is ***not centered***,
$\hat{\sigma}_t \rightarrow \hat{\mu}_t^2$, so
$\hat{\mu}_t / \sqrt{\hat{\sigma}_t} \rightarrow \pm 1$
- On the other hand, when the
***gradient changes sign frequently***,
$\sqrt{\hat{\sigma}_t} \gg |\hat{\mu}_t|$, thus the
***steps will be smaller***
- In a nutshell, this means that ***whenever we
are sure of following the right path, we take
big steps*** and ***when we are unsure, we
proceed more slowly***.
- Because of the previous point, `Adam` may stop in the
`local minimum` that is the ***closest***, but that
could also be ***shallower***
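
A quick, purely illustrative check of the first two
notes with scalar gradients (same update rules as the
sketch above):

```python
import numpy as np

def adam_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected ratio mu_hat / sqrt(sigma_hat + eps) after feeding `grads`."""
    mu = sigma = 0.0
    for t, g in enumerate(grads, start=1):
        mu = beta1 * mu + (1 - beta1) * g
        sigma = beta2 * sigma + (1 - beta2) * g ** 2
    mu_hat = mu / (1 - beta1 ** t)
    sigma_hat = sigma / (1 - beta2 ** t)
    return mu_hat / np.sqrt(sigma_hat + eps)

print(adam_ratio([0.5] * 100))       # constant gradient: ratio close to 1
print(adam_ratio([0.5, -0.5] * 50))  # sign-flipping gradient: much smaller steps
```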

<!-- Footnotes -->

[^adam-official-paper]: [Official Paper | arXiv:1412.6980v9](https://arxiv.org/pdf/1412.6980)

[^anelli-adam-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 55

[^anelli-adam-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 54

[^anelli-adam-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 57-58