Adam
Mean and Variance
Here we estimate the running mean and the running uncentered variance of the gradient:
\begin{aligned}
\mu_t &= \beta_1 \mu_{t-1} + (1 - \beta_1)g_t \\
\sigma_t &= \beta_2 \sigma_{t -1} +
(1 - \beta_2)g_t^2
\end{aligned}
Then we apply a bias correction, otherwise the estimates are poorly scaled in the first steps (they start from zero):
\begin{aligned}
\hat{\mu}_t &= \frac{
\mu_t
}{
1 - \beta_1^t
} \\
\hat{\sigma}_t &= \frac{
\sigma_t
}{
1 - \beta_2^t
}
\end{aligned}
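As a quick sanity check of the bias correction (a minimal Python sketch of my own; the variable names simply mirror the symbols above), at t = 1 the raw estimate \mu_1 is pulled toward its zero initialization, and dividing by 1 - \beta_1^t recovers the true gradient:

```python
beta1 = 0.9
g = 0.5                        # pretend the true gradient is constant at 0.5
t = 1

mu = (1 - beta1) * g           # raw estimate: 0.05, biased toward the zero init
mu_hat = mu / (1 - beta1**t)   # bias-corrected: 0.5, back to the true gradient
print(mu, mu_hat)
```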
Weight Update
The weight update, which requires a learning rate \eta, is:
\vec{w}_t = \vec{w}_{t-1} - \eta \frac{
\hat{\mu}_t
}{
\sqrt{ \hat{ \sigma}_t + \epsilon}
}
The usual values are:
\eta = 0.001, \quad \beta_1 = 0.9, \quad \beta_2 = 0.999, \quad \epsilon = 10^{-8}
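Putting the pieces together, here is a minimal Python sketch of a single Adam update (the function name adam_step, the toy loss, and the NumPy usage are my own illustration, not from the original text); the defaults are the usual values above, and \epsilon is placed inside the square root exactly as in the formula:

```python
import numpy as np

def adam_step(w, g, mu, sigma, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g at (1-based) step t."""
    mu = beta1 * mu + (1 - beta1) * g            # running mean of the gradient
    sigma = beta2 * sigma + (1 - beta2) * g**2   # running uncentered variance
    mu_hat = mu / (1 - beta1**t)                 # bias-corrected estimates
    sigma_hat = sigma / (1 - beta2**t)
    w = w - eta * mu_hat / np.sqrt(sigma_hat + eps)  # eps inside the sqrt, as in the formula above
    return w, mu, sigma

# Usage: start with zero buffers and feed gradients step by step.
w = np.array([1.0, -2.0])
mu = np.zeros_like(w)
sigma = np.zeros_like(w)
for t in range(1, 4):
    g = 2 * w                    # gradient of the toy loss ||w||^2
    w, mu, sigma = adam_step(w, g, mu, sigma, t)
print(w)
```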
Pros and Cons
Pros
- It adapts learning rates locally
- It handles sparse gradients
- It is more robust to non-optimal hyperparameters
Cons
- It fails to converge in some cases
- It generalizes worse than SGD, especially for images
- It needs 3 times the memory of SGD because of the 3 buffers needed
- We need to tune 2 momentum parameters instead of 1
Notes
- Whenever the gradient is constant, the local gain is 1, since \sqrt{\hat{\sigma}_t} \rightarrow |\hat{\mu}_t| (remember that \sigma_t is not centered)
- On the other hand, when the gradient changes sign frequently, \sqrt{\hat{\sigma}_t} \gg |\hat{\mu}_t|, thus the steps will be smaller (see the sketch after this list)
- In a nutshell, this means that whenever we are sure of following the right path we take big steps, and when we are unsure we proceed more slowly
- Because of the previous point, Adam may stop at the closest local minimum, which could also be a shallower one
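To make the local-gain note concrete, here is a small numeric sketch (my own illustration, assuming scalar gradients and the usual hyperparameters): with a constant gradient the ratio |\hat{\mu}_t| / \sqrt{\hat{\sigma}_t + \epsilon} approaches 1, while with a gradient that keeps changing sign it stays well below 1, so the effective step shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_step(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return |mu_hat| / sqrt(sigma_hat + eps) after a sequence of scalar gradients."""
    mu = sigma = 0.0
    for t, g in enumerate(grads, start=1):
        mu = beta1 * mu + (1 - beta1) * g
        sigma = beta2 * sigma + (1 - beta2) * g**2
        mu_hat = mu / (1 - beta1**t)
        sigma_hat = sigma / (1 - beta2**t)
    return abs(mu_hat) / np.sqrt(sigma_hat + eps)

constant = [0.5] * 1000                    # gradient always points the same way
noisy = list(rng.normal(0.0, 0.5, 1000))   # gradient keeps changing sign
print(effective_step(constant))  # close to 1: full local gain, big steps
print(effective_step(noisy))     # well below 1: cautious, smaller steps
```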