# AdaGrad[^adagrad-torch]

`AdaGrad` is an ***optimization method*** that aims to: ***"find needles in the haystack in the form of very predictive yet rarely observed features"*** [^adagrad-official-paper]

`AdaGrad`, as opposed to standard `SGD`, which applies the ***same update regardless of the gradient geometry***, tries to ***incorporate the geometry observed in earlier iterations***.

## The Algorithm

To start, let's define the standard `Regret`[^regret-definition] for `convex optimization`:

$$
R(T) = \sum_{t = 1}^T\left[ f_t(\bar{w}_t) - f_t(w^*) \right] \\
w^* \triangleq \text{optimal weights}
$$

In the standard case, we move opposite to the direction of the ***gradient***[^anelli-adagrad-1]:

$$
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \eta g_{t, i}
$$

Instead, `AdaGrad` takes another approach[^anelli-adagrad-2][^adagrad-official-paper]:

$$
\begin{aligned}
\bar{w}_{t+1, i} &= \bar{w}_{t, i} - \frac{ \eta }{ \sqrt{G_{t, i, i} + \epsilon} } \cdot g_{t,i} \\
G_{t} &= \sum_{\tau = 1}^{t} g_{\tau} g_{\tau}^T
\end{aligned}
$$

Here $G_t$ is the ***sum of outer products*** of the ***gradients*** up to time $t$. The full $G_t$ is ***usually not used***, since it is ***impractical because of the high number of dimensions***, so we use $diag(G_t)$ instead, which can be ***computed in linear time***[^adagrad-official-paper]

The $\epsilon$ term is used to ***avoid dividing by 0***[^anelli-adagrad-2] and has a small value, usually on the order of $10^{-8}$. A minimal sketch of this diagonal update is given at the end of this note.

## Motivating its effectiveness[^anelli-adagrad-3]

- When we have ***many dimensions, many features are irrelevant***
- ***Rarer features are more relevant***
- It adapts $\eta$ to the right metric space by projecting stochastic gradient updates with the [Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance), a distance of a point from a probability distribution.

## Pros and Cons

### Pros

- It eliminates the need to manually tune the `learning rate`, which is usually set to $0.01$

### Cons

- The squared ***gradients*** are accumulated during the iterations, so the effective `learning rate` becomes ***smaller and smaller***

[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf)
[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)
[^regret-definition]: [Definition of Regret | 19th April 2025](https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/tradeoff-between-exploration-and-exploitation/exploration-and-exploitation/examination-review-exploration-and-exploitation/explain-the-concept-of-regret-in-reinforcement-learning-and-how-it-is-used-to-evaluate-the-performance-of-an-algorithm/#:~:text=Regret%20quantifies%20the%20difference%20in,and%20making%20decisions%20over%20time.)
[^anelli-adagrad-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42
[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43
[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44
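
## A Minimal Sketch of the Update

The sketch below illustrates the per-coordinate AdaGrad step with the diagonal approximation $diag(G_t)$ described above. The function name `adagrad_update`, the toy objective, and the hyperparameter values are illustrative assumptions, not a reference implementation (for that, see the official PyTorch `torch.optim.Adagrad` linked above).

```python
import numpy as np

def adagrad_update(w, g, G_diag, eta=0.01, eps=1e-8):
    """One AdaGrad step on the diagonal approximation diag(G_t).

    w      : current weights (1-D array)
    g      : gradient of f_t at w
    G_diag : running sum of squared per-coordinate gradients (diagonal of G_t)
    eta    : global learning rate (0.01 is the commonly used default)
    eps    : small constant to avoid dividing by 0
    """
    G_diag = G_diag + g ** 2                 # accumulate g_{t,i}^2
    w = w - eta / np.sqrt(G_diag + eps) * g  # per-coordinate scaled step
    return w, G_diag

# Toy usage (illustrative): minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([3.0, -2.0])
G_diag = np.zeros_like(w)
for _ in range(200):
    g = 2 * w
    w, G_diag = adagrad_update(w, g, G_diag, eta=0.5)
print(w)  # approaches the optimum w* = 0
```

Note how `G_diag` only grows, which is exactly the point listed under "Cons": the effective step size $\eta / \sqrt{G_{t,ii} + \epsilon}$ becomes smaller and smaller over the iterations.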