AdaGrad [1]

AdaGrad is an optimization method that aims to:

"find needles in the haystack in the form of very predictive yet rarely observed features" 2

Unlike standard SGD, which applies the same update to every coordinate regardless of the gradient's geometry, AdaGrad tries to incorporate the geometry observed in earlier iterations.

The Algorithm

To start, let's define the standard regret [3] for convex optimization:


R(T) = \sum_{t = 1}^T\left[
    f_t(\bar{w}_t) - f_t(w^*)
\right] \\
w^* \triangleq \text{optimal weights}

In the standard SGD case, we move opposite to the direction of the gradient [4]:


\bar{w}_{t+1, i} =
    \bar{w}_{t, i} - \eta g_{t, i}
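
As a reference point, here is a minimal NumPy sketch of this per-coordinate update; the function and variable names are illustrative, not taken from the cited material.

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    # Plain SGD: every coordinate is scaled by the same learning rate eta.
    return w - lr * g

w = np.array([1.0, -2.0, 0.5])   # current weights w_t
g = np.array([0.3, 0.0, -0.1])   # stochastic (sub)gradient g_t
w = sgd_step(w, g)               # w_{t+1}
```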

AdaGrad instead takes another approach [5][2]:


\begin{aligned}
    \bar{w}_{t+1, i} &=
    \bar{w}_{t, i} - \frac{
        \eta
    }{
        \sqrt{G_{t, ii} + \epsilon}
    } \cdot g_{t,i} \\
    G_{t} &= \sum_{\tau = 1}^{t} g_{\tau} g_{\tau}^T
\end{aligned}

Here G_t is the sum of the outer products of the gradients up to time t. In practice the full matrix G_t is not used, since it is impractical in high dimensions; instead we use diag(G_t), which can be computed in linear time [2].

The \epsilon term is used to avoid division by zero [5]; it is a small value, usually on the order of 10^{-8}.
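
Putting the diagonal version together, here is a minimal NumPy sketch of the update above, accumulating diag(G_t) and dividing by its square root plus \epsilon; the names and the toy loop are assumptions for illustration, not the reference implementation.

```python
import numpy as np

def adagrad_step(w, g, G_diag, lr=0.01, eps=1e-8):
    # Accumulate per-coordinate squared gradients: diag(G_t) = sum over tau of g_tau^2.
    G_diag = G_diag + g ** 2
    # Each coordinate gets its own effective step size eta / sqrt(G_{t,ii} + eps).
    w_new = w - lr * g / np.sqrt(G_diag + eps)
    return w_new, G_diag

w = np.zeros(3)                       # weights
G_diag = np.zeros_like(w)             # running sum of squared gradients
for _ in range(100):
    g = np.random.randn(*w.shape)     # stand-in for a stochastic gradient g_t
    w, G_diag = adagrad_step(w, g, G_diag)
```

Coordinates that are rarely active accumulate little in diag(G_t), so they keep a comparatively large step size, which is exactly the "needles in the haystack" effect quoted above.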

Motivating its effectiveness [6]

  • In high-dimensional problems, many features are irrelevant
  • Rarer features are often the more informative ones
  • It adapts \eta to the right metric space by projecting stochastic gradient updates according to the Mahalanobis norm, a measure of the distance of a point from a distribution

Pros and Cons

Pros

  • It eliminates the need to manually tune the learning rate, which is usually just set to a default of 0.01 (see the PyTorch usage sketch below)

Cons

  • The squared gradients accumulate over the iterations, so the effective learning rate keeps shrinking and can eventually become vanishingly small
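
For comparison with the PyTorch implementation cited above [1], a minimal usage sketch; the model, data, and loss here are placeholders, not part of the cited material.

```python
import torch

model = torch.nn.Linear(10, 1)                                 # placeholder model
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)   # 0.01 is PyTorch's default lr

x, y = torch.randn(32, 10), torch.randn(32, 1)                 # placeholder batch
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```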

  [1] Adagrad | Official PyTorch Documentation | 19th April 2025

  [2] Duchi, Hazan, Singer | Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

  [3] Definition of Regret | 19th April 2025

  [4] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5, pg. 42

  [5] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5, pg. 43

  [6] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5, pg. 44