# AdaGrad[^adagrad-torch]
`AdaGrad` is an ***optimization method*** that aims
to:
<u>***"find needles in haystacks in the form of
very predictive but rarely seen features"***
[^adagrad-official-paper]</u>
Unlike standard `SGD`, which applies the
***same update regardless of the gradient geometry***,
`AdaGrad` tries to ***incorporate the geometry
observed in earlier iterations***.
## The Algorithm
To start, let's define the standard
`Regret`[^regret-definition] for
`convex optimization`:
$$
\begin{aligned}
R(T) &= \sum_{t = 1}^{T}\left[
f_t(\bar{w}_t) - f_t(w^*)
\right] \\
w^* &\triangleq \text{optimal weights}
\end{aligned}
$$
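As a quick numeric illustration (made-up losses, and
assuming for simplicity that $f_t(w^*)$ is the same at
every step), the regret is just the accumulated gap
between the losses we actually incurred and the best
achievable ones:
```python
# Made-up per-step losses f_t(w_t) and an assumed constant optimal
# loss f_t(w*); the regret is the accumulated gap between the two.
losses_at_wt = [0.90, 0.55, 0.40, 0.33]   # f_t(w_t) for t = 1..4
loss_at_optimum = 0.30                    # f_t(w*), assumed constant here

regret = sum(loss - loss_at_optimum for loss in losses_at_wt)
print(regret)  # ≈ 0.98
```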
In the standard case, we move in the direction
opposite to the ***gradient***[^anelli-adagrad-1]:
$$
\bar{w}_{t+1, i} =
\bar{w}_{t, i} - \eta g_{t, i}
$$
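As a tiny sketch of this uniform step (Python with
NumPy, made-up values, not taken from the referenced
material):
```python
import numpy as np

# Plain SGD step: every coordinate is scaled by the same eta,
# with no adaptation to the geometry of past gradients.
eta = 0.01
w = np.array([0.5, -1.2, 3.0])    # current weights (made-up)
g = np.array([0.1, -0.4, 0.02])   # gradient at step t (made-up)

w = w - eta * g
print(w)
```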
`AdaGrad` instead takes a different
approach[^anelli-adagrad-2][^adagrad-official-paper]:
$$
\begin{aligned}
\bar{w}_{t+1, i} &=
\bar{w}_{t, i} - \frac{
\eta
}{
\sqrt{G_{t, i,i} + \epsilon}
} \cdot g_{t,i} \\
G_{t} &= \sum_{\tau = 1}^{t} g_{\tau} g_{\tau}^T
\end{aligned}
$$
Here $G_t$ is the ***sum of the outer products*** of
the ***gradients*** up to time $t$. In practice the
full $G_t$ is ***not used***, since it is
***impractical because of the high number of
dimensions***; we use $\operatorname{diag}(G_t)$
instead, which can be
***computed in linear time***[^adagrad-official-paper].
The $\epsilon$ term is there to
***avoid dividing by 0***[^anelli-adagrad-2] and has a
small value, usually on the order of $10^{-8}$.
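A minimal sketch of this diagonal variant in NumPy,
with made-up weights and gradients; the helper name
`adagrad_step` is just illustrative, not an API from
the paper or from PyTorch:
```python
import numpy as np

eta, eps = 0.01, 1e-8

def adagrad_step(w, g, G_diag):
    """One per-coordinate AdaGrad step using diag(G_t)."""
    G_diag = G_diag + g * g                  # accumulate squared gradients
    w = w - eta / np.sqrt(G_diag + eps) * g  # per-coordinate adaptive step
    return w, G_diag

w = np.array([0.5, -1.2, 3.0])    # current weights (made-up)
G_diag = np.zeros_like(w)         # running diagonal of G_t

# a single update with a made-up gradient
w, G_diag = adagrad_step(w, np.array([0.1, -0.4, 0.02]), G_diag)
print(w)
```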
## Motivating its effectiveness[^anelli-adagrad-3]
- When we have ***many dimensions, many features are
irrelevant***
- ***Rarer features are more relevant***
- It adapts $\eta$ to the right metric space
by projecting stochastic gradient updates with the
[Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance), the distance of a point from
a probability distribution.
## Pros and Cons
### Pros
- It eliminates the need to manually tune the
`learning rate`, which is usually just set to
$0.01$
### Cons
- The squared ***gradients*** keep accumulating
across iterations, so the effective `learning rate`
becomes ***smaller and smaller*** (see the short
sketch below)
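A toy sketch of this effect: with a gradient that stays
around $1$, $\operatorname{diag}(G_t)$ grows linearly
with $t$, so the effective step size keeps shrinking
(made-up constant gradient, illustrative only):
```python
import numpy as np

# With a constant gradient of 1, the accumulated squares grow
# linearly, so eta / sqrt(G + eps) shrinks at every iteration.
eta, eps = 0.01, 1e-8
G = 0.0
for t in range(1, 6):
    g = 1.0                           # pretend the gradient stays at 1
    G += g * g
    print(t, eta / np.sqrt(G + eps))  # 0.01, ~0.0071, ~0.0058, ~0.005, ~0.0045
```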
<!-- Footnotes -->
[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf)
[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)
[^regret-definition]: [Definition of Regret | 19th April 2025](https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/tradeoff-between-exploration-and-exploitation/exploration-and-exploitation/examination-review-exploration-and-exploitation/explain-the-concept-of-regret-in-reinforcement-learning-and-how-it-is-used-to-evaluate-the-performance-of-an-algorithm/#:~:text=Regret%20quantifies%20the%20difference%20in,and%20making%20decisions%20over%20time.)
[^anelli-adagrad-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42
[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43
[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44