Adam
Mean and Variance
Here we estimate the running mean and the running uncentered variance of the gradient:
\begin{aligned}
\mu_t &= \beta_1 \mu_{t-1} + (1 - \beta_1)g_t \\
\sigma_t &= \beta_2 \sigma_{t -1} +
(1 - \beta_2)g_t^2
\end{aligned}
Then we apply a bias correction, otherwise the estimates are poorly scaled in the first steps (they start from zero):
\begin{aligned}
\hat{\mu}_t &= \frac{
\mu_t
}{
1 - \beta_1^t
} \\
\hat{\sigma}_t &= \frac{
\sigma_t
}{
1 - \beta_2^t
}
\end{aligned}
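As a quick sanity check of the bias correction (a minimal Python sketch of my own; the variable names simply mirror the symbols above), at t = 1 the raw estimate \mu_1 is pulled toward its zero initialization, and dividing by 1 - \beta_1^t recovers the true gradient:

```python
beta1 = 0.9
g = 0.5                        # pretend the true gradient is constant at 0.5
t = 1

mu = (1 - beta1) * g           # raw estimate: 0.05, biased toward the zero init
mu_hat = mu / (1 - beta1**t)   # bias-corrected: 0.5, back to the true gradient
print(mu, mu_hat)
```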
Weight Update
The weight update, which requires a learning rate \eta, is:
\vec{w}_t = \vec{w}_{t-1} - \eta \frac{
\hat{\mu}_t
}{
\sqrt{ \hat{ \sigma}_t + \epsilon}
}
The usual values are:
\eta = 0.001, \quad \beta_1 = 0.9, \quad \beta_2 = 0.999, \quad \epsilon = 10^{-8}
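Putting the pieces together, here is a minimal Python sketch of a single Adam update (the function name adam_step, the toy loss, and the NumPy usage are my own illustration, not from the original text); the defaults are the usual values above, and \epsilon is placed inside the square root exactly as in the formula:

```python
import numpy as np

def adam_step(w, g, mu, sigma, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g at (1-based) step t."""
    mu = beta1 * mu + (1 - beta1) * g            # running mean of the gradient
    sigma = beta2 * sigma + (1 - beta2) * g**2   # running uncentered variance
    mu_hat = mu / (1 - beta1**t)                 # bias-corrected estimates
    sigma_hat = sigma / (1 - beta2**t)
    w = w - eta * mu_hat / np.sqrt(sigma_hat + eps)  # eps inside the sqrt, as in the formula above
    return w, mu, sigma

# Usage: start with zero buffers and feed gradients step by step.
w = np.array([1.0, -2.0])
mu = np.zeros_like(w)
sigma = np.zeros_like(w)
for t in range(1, 4):
    g = 2 * w                    # gradient of the toy loss ||w||^2
    w, mu, sigma = adam_step(w, g, mu, sigma, t)
print(w)
```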
Pros and Cons
Pros
- It adapts learning rates locally
- It handles sparse gradients
- It is more robust to non-optimal hyperparameters
Cons
- It fails to converge in some cases
- It generalizes worse than SGD, especially for images
- It needs 3 times the memory of SGD because of the 3 buffers needed
- We need to tune 2 momentum parameters instead of 1
Notes
- Whenever the gradient is constant, the local gain is 1, since \sqrt{\hat{\sigma}_t} \rightarrow |\hat{\mu}_t| (remember that \sigma_t is not centered)
- On the other hand, when the gradient changes sign frequently, \sqrt{\hat{\sigma}_t} \gg |\hat{\mu}_t|, thus the steps will be smaller (see the sketch after this list)
- In a nutshell, this means that whenever we are sure of following the right path we take big steps, and when we are unsure we proceed more slowly
- Because of the previous point, Adam may stop at the closest local minimum, which could also be a shallower one
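To make the local-gain note concrete, here is a small numeric sketch (my own illustration, assuming scalar gradients and the usual hyperparameters): with a constant gradient the ratio |\hat{\mu}_t| / \sqrt{\hat{\sigma}_t + \epsilon} approaches 1, while with a gradient that keeps changing sign it stays well below 1, so the effective step shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_step(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return |mu_hat| / sqrt(sigma_hat + eps) after a sequence of scalar gradients."""
    mu = sigma = 0.0
    for t, g in enumerate(grads, start=1):
        mu = beta1 * mu + (1 - beta1) * g
        sigma = beta2 * sigma + (1 - beta2) * g**2
        mu_hat = mu / (1 - beta1**t)
        sigma_hat = sigma / (1 - beta2**t)
    return abs(mu_hat) / np.sqrt(sigma_hat + eps)

constant = [0.5] * 1000                    # gradient always points the same way
noisy = list(rng.normal(0.0, 0.5, 1000))   # gradient keeps changing sign
print(effective_step(constant))  # close to 1: full local gain, big steps
print(effective_step(noisy))     # well below 1: cautious, smaller steps
```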