AdaGrad[^1]
AdaGrad is an optimization method that aims to:
"find needles in the haystack in the form of very predictive yet rarely observed features"[^2]
Unlike standard SGD, which applies the same step regardless of the gradient geometry, AdaGrad tries to incorporate the geometry observed in earlier iterations.
The Algorithm
To start, let's define the standard regret[^3] for convex optimization:
R(T) = \sum_{t = 1}^{T}\left[
f_t(\bar{w}_t) - f_t(w^*)
\right] \\
w^* \triangleq \text{optimal weights}
In the standard SGD case, we move each coordinate opposite to the direction of its gradient[^4]:
\bar{w}_{t+1, i} =
\bar{w}_{t, i} - \eta\, g_{t, i}
AdaGrad instead takes a different approach[^5][^2]:
\begin{aligned}
\bar{w}_{t+1, i} &= \bar{w}_{t, i} - \frac{\eta}{\sqrt{G_{t, i, i} + \epsilon}} \cdot g_{t, i} \\
G_{t} &= \sum_{\tau = 1}^{t} g_{\tau} g_{\tau}^T
\end{aligned}
Here G_t is the sum of the outer products of the gradients up to time t. In practice the full G_t is not used, since it is impractical given the high number of dimensions; instead we use \mathrm{diag}(G_t), which can be computed in linear time.[^2]
The \epsilon term is used to avoid dividing by zero[^5] and has a small value, usually on the order of 10^{-8}.
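To make the update rule concrete, here is a minimal NumPy sketch of the diagonal version described above; the function name adagrad_update, the toy quadratic objective, and the chosen step size are illustrative assumptions, not taken from the sources.

```python
import numpy as np

def adagrad_update(w, g, G_diag, eta=0.01, eps=1e-8):
    """One diagonal-AdaGrad step: accumulate squared gradients per
    coordinate and scale the learning rate coordinate-wise."""
    G_diag += g ** 2                        # diag(G_t) accumulates g_{t,i}^2
    w -= eta / np.sqrt(G_diag + eps) * g    # per-coordinate adaptive step
    return w, G_diag

# Toy usage: minimize f(w) = ||w||^2 / 2, whose gradient is w itself.
w = np.array([1.0, -2.0])
G_diag = np.zeros_like(w)
for _ in range(100):
    g = w.copy()                            # gradient of the toy objective
    w, G_diag = adagrad_update(w, g, G_diag, eta=0.1)
print(w)                                    # both coordinates shrink towards 0
```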
Motivating its effectiveness[^6]
- When we have many dimensions, many features are irrelevant
- Rarer features are more relevant (see the sketch after this list)
- It adapts \eta to the right metric space by projecting the stochastic gradient updates with the Mahalanobis norm, a distance of a point from a probability distribution
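As a rough numerical illustration of the rare-feature point, the hypothetical sketch below compares the effective per-coordinate learning rate \eta / \sqrt{G_{t,ii} + \epsilon} for a feature whose gradient is non-zero at every step versus one that fires only every tenth step; the gradient values and frequencies are made up for the example.

```python
import numpy as np

# Two coordinates: a "frequent" feature (non-zero gradient every step)
# and a "rare" feature (non-zero gradient once every 10 steps).
eta, eps, T = 0.01, 1e-8, 100
G_diag = np.zeros(2)
for t in range(T):
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    G_diag += g ** 2

# Effective per-coordinate learning rate after T steps: the rare feature
# keeps a much larger step size, so its occasional gradients still matter.
print(eta / np.sqrt(G_diag + eps))
```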
Pros and Cons
Pros
- It eliminates the need to manually tune the learning rate, which is usually set to 0.01
Cons
- The squared gradients keep accumulating over the iterations, making the effective learning rate become smaller and smaller (a quick numerical sketch of this decay follows below)
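A quick hypothetical sketch of this decay, assuming a constant unit gradient on a single coordinate:

```python
import numpy as np

# With a constant unit gradient, the accumulated sum of squared gradients
# after t steps is exactly t, so the effective learning rate
# eta / sqrt(t + eps) shrinks like 1 / sqrt(t).
eta, eps = 0.01, 1e-8
for t in [1, 10, 100, 1000, 10000]:
    G_t = float(t)                       # sum of t squared unit gradients
    print(t, eta / np.sqrt(G_t + eps))   # 0.01, 0.0032, 0.001, 0.00032, 0.0001
```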
[^1]: Adagrad | Official PyTorch Documentation | 19th April 2025
[^2]: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
[^3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42
[^4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43
[^5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44