# AdaGrad[^adagrad-torch]

`AdaGrad` is an ***optimization method*** that aims to:

<u>***"find needles in the haystack in the form of very
predictive yet rarely observed features"***[^adagrad-official-paper]</u>

`AdaGrad`, as opposed to standard `SGD`, which applies the
***same update for every gradient geometry***, tries to
***incorporate geometry from earlier iterations***.

## The Algorithm

To start, let's define the standard `Regret`[^regret-definition]
for `convex optimization`:

$$
R(T) = \sum_{t = 1}^{T}\left[
f_t(\bar{w}_t) - f_t(w^*)
\right],
\qquad w^* \triangleq \text{optimal weights}
$$
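
As a quick worked illustration (a minimal sketch, not from the source;
the function and argument names are hypothetical), the regret is just
the accumulated gap between the losses we actually paid and the losses
the optimal fixed weights $w^*$ would have paid on the same $f_t$:

```python
def regret(paid_losses: list[float], optimal_losses: list[float]) -> float:
    """Sum of f_t(w_t) - f_t(w*) over the rounds t = 1..T."""
    assert len(paid_losses) == len(optimal_losses)
    return sum(lt - lstar for lt, lstar in zip(paid_losses, optimal_losses))


# Three rounds: (0.9 - 0.5) + (0.6 - 0.5) + (0.4 - 0.3) = 0.6
print(regret([0.9, 0.6, 0.4], [0.5, 0.5, 0.3]))  # ~0.6
```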

In a standard case, we move opposite to the direction
of the ***gradient***[^anelli-adagrad-1]:

$$
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \eta g_{t, i}
$$

Instead, `AdaGrad` takes another
approach[^anelli-adagrad-2][^adagrad-official-paper]:

$$
\begin{aligned}
\bar{w}_{t+1, i} &= \bar{w}_{t, i} -
\frac{\eta}{\sqrt{G_{t, i, i} + \epsilon}} \cdot g_{t, i} \\
G_{t} &= \sum_{\tau = 1}^{t} g_{\tau} g_{\tau}^T
\end{aligned}
$$

Here $G_t$ is the ***sum of outer products*** of the
***gradients*** up to time $t$. In practice the full $G_t$ is
***usually not used***, since it is ***impractical because
of the high number of dimensions***; instead we use
$\mathrm{diag}(G_t)$, which can be
***computed in linear time***[^adagrad-official-paper].

The $\epsilon$ term here is used to
***avoid dividing by 0***[^anelli-adagrad-2] and has a
small value, usually in the order of $10^{-8}$.
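
To make the diagonal update concrete, here is a minimal NumPy sketch
(not from the source; the toy quadratic objective and all names are
assumptions) that keeps only $\mathrm{diag}(G_t)$ as a running sum of
squared gradients:

```python
import numpy as np


def adagrad_step(w, g, G_diag, lr=0.01, eps=1e-8):
    """One diagonal AdaGrad update: G_diag accumulates the squared
    gradients (diag(G_t)); each coordinate is then scaled by
    1 / sqrt(G_diag + eps)."""
    G_diag += g ** 2                        # diag(G_t) += g_t * g_t (element-wise)
    w -= lr * g / np.sqrt(G_diag + eps)     # per-coordinate adaptive step
    return w, G_diag


# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
G_diag = np.zeros_like(w)
for _ in range(100):
    g = w.copy()                            # gradient of the toy objective at w
    w, G_diag = adagrad_step(w, g, G_diag)  # plain SGD would be: w -= lr * g
print(w)                                    # both coordinates move toward 0
```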
## Motivating its effectiveness[^anelli-adagrad-3]

- When we have ***many dimensions, many features are
  irrelevant***
- ***Rarer features are more relevant***
- It adapts $\eta$ to the right metric space by projecting
  stochastic gradient updates with the
  [Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance),
  a distance of a point from a probability distribution
  (see the short note right after this list)

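A compact way to read that last point (a sketch of the proximal view
from the original paper; the notation $H_t$, a diagonal matrix with
entries $(H_t)_{i,i} = \sqrt{G_{t, i, i} + \epsilon}$, is introduced
here for convenience): the AdaGrad step minimizes the linearized loss
plus a Mahalanobis-norm penalty around the current iterate,

$$
\bar{w}_{t+1} = \arg\min_{w} \left\{
\eta \, \langle g_t, w \rangle +
\frac{1}{2} \left\| w - \bar{w}_t \right\|_{H_t}^2
\right\},
\qquad
\left\| x \right\|_{H_t}^2 \triangleq x^\top H_t x,
$$

whose closed-form solution is exactly
$\bar{w}_{t+1} = \bar{w}_t - \eta H_t^{-1} g_t$,
the per-coordinate update given above.
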
## Pros and Cons

### Pros

- It eliminates the need to manually tune the
  `learning rate`, which is usually set to $0.01$

### Cons

- The squared ***gradients*** are accumulated over the
  iterations, making the `learning rate` become
  ***smaller and smaller***
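
Since the PyTorch implementation is already referenced
above[^adagrad-torch], here is a minimal usage sketch (the model, data,
and loss are placeholder assumptions; only `torch.optim.Adagrad` and its
`lr` / `eps` arguments come from the library):

```python
import torch

# Placeholder model and data, just to make the snippet self-contained.
model = torch.nn.Linear(10, 1)
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

# lr defaults to 0.01, matching the value mentioned in the Pros above;
# eps plays the role of the epsilon term in the update rule.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01, eps=1e-8)
loss_fn = torch.nn.MSELoss()

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()     # compute gradients
    optimizer.step()    # AdaGrad update using the running sum of squared gradients
```
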
<!-- Footnotes -->

[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf)

[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)

[^regret-definition]: [Definition of Regret | 19th April 2025](https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/tradeoff-between-exploration-and-exploitation/exploration-and-exploitation/examination-review-exploration-and-exploitation/explain-the-concept-of-regret-in-reinforcement-learning-and-how-it-is-used-to-evaluate-the-performance-of-an-algorithm/#:~:text=Regret%20quantifies%20the%20difference%20in,and%20making%20decisions%20over%20time.)

[^anelli-adagrad-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42

[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43

[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44