AdaGrad[^1]
AdaGrad is an optimization method that aims to:
"find needles in the haystack in the form of very predictive yet rarely observed features"[^2]
Unlike standard SGD, which applies the same step regardless of the gradient geometry, AdaGrad tries to incorporate the geometry observed in earlier iterations.
The Algorithm
To start, let's define the standard regret[^3] for convex optimization:
R(T) = \sum_{t = 1}^{T}\left[
f_t(\bar{w}_t) - f_t(w^*)
\right] \\
w^* \triangleq \text{optimal weights}
In the standard SGD case, we move each coordinate opposite to the direction of its gradient[^4]:
\bar{w}_{t+1, i} =
\bar{w}_{t, i} - \eta\, g_{t, i}
AdaGrad instead takes a different approach[^5][^2]:
\begin{aligned}
\bar{w}_{t+1, i} &= \bar{w}_{t, i} - \frac{\eta}{\sqrt{G_{t, i, i} + \epsilon}} \cdot g_{t, i} \\
G_{t} &= \sum_{\tau = 1}^{t} g_{\tau} g_{\tau}^T
\end{aligned}
Here G_t is the sum of the outer products of the gradients up to time t. In practice the full G_t is not used, since it is impractical given the high number of dimensions; instead we use \mathrm{diag}(G_t), which can be computed in linear time.[^2]
The \epsilon term is used to avoid dividing by zero[^5] and has a small value, usually on the order of 10^{-8}.
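To make the update rule concrete, here is a minimal NumPy sketch of the diagonal version described above; the function name adagrad_update, the toy quadratic objective, and the chosen step size are illustrative assumptions, not taken from the sources.

```python
import numpy as np

def adagrad_update(w, g, G_diag, eta=0.01, eps=1e-8):
    """One diagonal-AdaGrad step: accumulate squared gradients per
    coordinate and scale the learning rate coordinate-wise."""
    G_diag += g ** 2                        # diag(G_t) accumulates g_{t,i}^2
    w -= eta / np.sqrt(G_diag + eps) * g    # per-coordinate adaptive step
    return w, G_diag

# Toy usage: minimize f(w) = ||w||^2 / 2, whose gradient is w itself.
w = np.array([1.0, -2.0])
G_diag = np.zeros_like(w)
for _ in range(100):
    g = w.copy()                            # gradient of the toy objective
    w, G_diag = adagrad_update(w, g, G_diag, eta=0.1)
print(w)                                    # both coordinates shrink towards 0
```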
Motivating its effectiveness[^6]
- When we have many dimensions, many features are irrelevant
- Rarer features are more relevant (see the sketch after this list)
- It adapts \eta to the right metric space by projecting the stochastic gradient updates with the Mahalanobis norm, a distance of a point from a probability distribution
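As a rough numerical illustration of the rare-feature point, the hypothetical sketch below compares the effective per-coordinate learning rate \eta / \sqrt{G_{t,ii} + \epsilon} for a feature whose gradient is non-zero at every step versus one that fires only every tenth step; the gradient values and frequencies are made up for the example.

```python
import numpy as np

# Two coordinates: a "frequent" feature (non-zero gradient every step)
# and a "rare" feature (non-zero gradient once every 10 steps).
eta, eps, T = 0.01, 1e-8, 100
G_diag = np.zeros(2)
for t in range(T):
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    G_diag += g ** 2

# Effective per-coordinate learning rate after T steps: the rare feature
# keeps a much larger step size, so its occasional gradients still matter.
print(eta / np.sqrt(G_diag + eps))
```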
Pros and Cons
Pros
- It eliminates the need to manually tune the learning rate, which is usually set to 0.01
Cons
- The squared gradients keep accumulating over the iterations, making the effective learning rate become smaller and smaller (a quick numerical sketch of this decay follows below)
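A quick hypothetical sketch of this decay, assuming a constant unit gradient on a single coordinate:

```python
import numpy as np

# With a constant unit gradient, the accumulated sum of squared gradients
# after t steps is exactly t, so the effective learning rate
# eta / sqrt(t + eps) shrinks like 1 / sqrt(t).
eta, eps = 0.01, 1e-8
for t in [1, 10, 100, 1000, 10000]:
    G_t = float(t)                       # sum of t squared unit gradients
    print(t, eta / np.sqrt(G_t + eps))   # 0.01, 0.0032, 0.001, 0.00032, 0.0001
```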
[^1]: Adagrad | Official PyTorch Documentation | 19th April 2025
[^2]: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
[^3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42
[^4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43
[^5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44