Adam

Mean and Variance

Here we estimate the running mean of the gradient and the running uncentered second moment (an exponential moving average of the squared gradient):


\begin{aligned}
    \mu_t &= \beta_1 \mu_{t-1} + (1 - \beta_1)g_t \\
    \sigma_t &= \beta_2 \sigma_{t-1} + (1 - \beta_2) g_t^2
\end{aligned}

Then we apply a bias correction: both estimates start at zero, so without it they would be scaled down too much during the early steps [1]:


\begin{aligned}
    \hat{\mu}_t &= \frac{
        \mu_t
    }{
        1 - \beta_1^t
    } \\
    \hat{\sigma}_t &= \frac{
        \sigma_t
    }{
        1 - \beta_2^t
    }
\end{aligned}
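
As a quick check of why this correction matters, take the first step with the typical \beta_1 = 0.9 and \beta_2 = 0.999 listed below (and \mu_0 = \sigma_0 = 0):


\begin{aligned}
    \mu_1 &= (1 - \beta_1) g_1 = 0.1\, g_1,
        &\quad \hat{\mu}_1 &= \frac{\mu_1}{1 - \beta_1^1} = g_1 \\
    \sigma_1 &= (1 - \beta_2) g_1^2 = 0.001\, g_1^2,
        &\quad \hat{\sigma}_1 &= \frac{\sigma_1}{1 - \beta_2^1} = g_1^2
\end{aligned}

Without the correction, the first estimates would be 10 and 1000 times too small, respectively.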

Weight Update

The weight update, which introduces a learning rate \eta, is:


\vec{w}_t = \vec{w}_{t-1} - \eta \frac{
    \hat{\mu}_t
}{
    \sqrt{\hat{\sigma}_t + \epsilon}
}

Typical values are:

  • \eta = 0.001
  • \beta_1 = 0.9
  • \beta_2 = 0.999
  • \epsilon = 10^{-8}
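
Putting the pieces together, here is a minimal sketch of one Adam step in NumPy, using the typical values above (the function and variable names are illustrative, not from the source):

```python
import numpy as np

def adam_step(w, g, mu, sigma, t,
              eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; the names mirror the formulas above (illustrative sketch)."""
    # Running mean of the gradient and of the squared gradient
    mu = beta1 * mu + (1 - beta1) * g
    sigma = beta2 * sigma + (1 - beta2) * g ** 2
    # Bias correction (t starts at 1)
    mu_hat = mu / (1 - beta1 ** t)
    sigma_hat = sigma / (1 - beta2 ** t)
    # Weight update, with epsilon inside the square root as in the formula above
    w = w - eta * mu_hat / np.sqrt(sigma_hat + eps)
    return w, mu, sigma

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
mu, sigma = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 5001):
    g = 2 * w
    w, mu, sigma = adam_step(w, g, mu, sigma, t)
print(w)  # both coordinates end up close to the minimum at [0, 0]
```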

Pros and Cons

Pros

  • It adapts the learning rate per parameter
  • It handles sparse gradients well
  • It is more robust to suboptimal hyperparameters

Cons [2]

  • It fails to converge in some cases
  • It tends to generalize worse than SGD, especially on image tasks
  • It needs about three times the memory of plain SGD, because it keeps three buffers per parameter (the weights, \mu_t and \sigma_t) instead of just the weights
  • We need to tune two decay rates (\beta_1 and \beta_2) instead of a single momentum parameter

Notes [3][4]

  • Whenever the gradient is roughly constant, the "local gain" is 1: because the second moment is not centered, \hat{\sigma}_t \rightarrow \hat{\mu}_t^2, so \hat{\mu}_t / \sqrt{\hat{\sigma}_t} \rightarrow \pm 1 and the step size approaches \eta
  • On the other hand, when the gradient changes sign frequently, \sqrt{\hat{\sigma}_t} \gg |\hat{\mu}_t|, so the steps become smaller
  • In a nutshell, when we are confident we are moving in the right direction we take big steps, and when we are unsure we proceed more slowly (a small numeric check follows this list)
  • As a consequence of the previous point, Adam may stop at the closest local minimum, which could also be a shallower one
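
A small numeric check of the first two points, using the typical hyperparameters above (the helper function is purely illustrative):

```python
import math

def effective_step_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return mu_hat / sqrt(sigma_hat + eps) after feeding in a 1-D gradient sequence."""
    mu = sigma = 0.0
    for t, g in enumerate(grads, start=1):
        mu = beta1 * mu + (1 - beta1) * g
        sigma = beta2 * sigma + (1 - beta2) * g ** 2
        mu_hat = mu / (1 - beta1 ** t)
        sigma_hat = sigma / (1 - beta2 ** t)
    return mu_hat / math.sqrt(sigma_hat + eps)

print(effective_step_ratio([1.0] * 200))                         # ~1.0: constant gradient, full-size steps
print(effective_step_ratio([(-1.0) ** t for t in range(200)]))   # magnitude ~0.05: alternating gradient, tiny steps
```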

  1. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 55 ↩︎

  2. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 54 ↩︎

  3. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 57-58 ↩︎

  4. AdamW Notes ↩︎