From d5e34ab54b88b343c5d4c23c3a7a36ed86877293 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Sat, 19 Apr 2025 17:30:28 +0200
Subject: [PATCH] Added ADAGRAD

---
 .../5-Optimization/Fancy-Methods/ADAGRAD.md | 98 +++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 Chapters/5-Optimization/Fancy-Methods/ADAGRAD.md

diff --git a/Chapters/5-Optimization/Fancy-Methods/ADAGRAD.md b/Chapters/5-Optimization/Fancy-Methods/ADAGRAD.md
new file mode 100644
index 0000000..452728f
--- /dev/null
+++ b/Chapters/5-Optimization/Fancy-Methods/ADAGRAD.md
@@ -0,0 +1,98 @@
+# AdaGrad[^adagrad-torch]
+
+`AdaGrad` is an ***optimization method*** that aims
+to:
+
+***"find needles in the haystack in the form of
+very predictive yet rarely observed features"***
+[^adagrad-official-paper]
+
+`AdaGrad`, as opposed to standard `SGD`, which applies
+the ***same update regardless of the gradient
+geometry***, tries to ***incorporate the geometry
+observed in earlier iterations***.
+
+## The Algorithm
+
+To start, let's define the standard
+`Regret`[^regret-definition] for
+`convex optimization`:
+
+$$
+\begin{aligned}
+R(T) &= \sum_{t = 1}^{T}\left[
+    f_t(\bar{w}_t) - f_t(w^*)
+\right] \\
+w^* &\triangleq \text{optimal weights}
+\end{aligned}
+$$
+
+In the standard case, we move in the direction opposite
+to the ***gradient***[^anelli-adagrad-1]:
+
+$$
+\bar{w}_{t+1, i} =
+    \bar{w}_{t, i} - \eta \, g_{t, i}
+$$
+
+`AdaGrad` instead takes a different
+approach[^anelli-adagrad-2][^adagrad-official-paper]:
+
+$$
+\begin{aligned}
+    \bar{w}_{t+1, i} &=
+    \bar{w}_{t, i} - \frac{
+        \eta
+    }{
+        \sqrt{G_{t, ii} + \epsilon}
+    } \cdot g_{t,i} \\
+    G_{t} &= \sum_{\tau = 1}^{t} g_{\tau} g_{\tau}^T
+\end{aligned}
+$$
+
+Here $G_t$ is the ***sum of the outer products*** of the
+***gradients*** up to time $t$. In practice the full
+$G_t$ is ***not used***, since it is ***impractical
+because of the high number of dimensions***; instead we
+use $\mathrm{diag}(G_t)$, which can be
+***computed in linear time***[^adagrad-official-paper]
+(a minimal code sketch of this diagonal update follows
+the next section).
+
+The $\epsilon$ term here is used to
+***avoid dividing by 0***[^anelli-adagrad-2] and has a
+small value, usually on the order of $10^{-8}$.
+
+## Motivating its effectiveness[^anelli-adagrad-3]
+
+- When we have ***many dimensions, many features are
+  irrelevant***
+- ***Rarer features are more relevant***
+- It adapts $\eta$ to the right metric space
+  by projecting stochastic gradient updates with the
+  [Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance), a distance of a point from
+  a probability distribution.
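+
+To make this concrete, the following is a minimal
+`NumPy` sketch of the ***diagonal*** update described
+above (illustrative code: the function name
+`adagrad_update` and the toy gradients are assumptions,
+not taken from the referenced material):
+
+```python
+import numpy as np
+
+
+def adagrad_update(w, g, G_diag, lr=0.01, eps=1e-8):
+    """One diagonal-AdaGrad step (illustrative sketch).
+
+    w      -- current weights
+    g      -- gradient of the loss at w
+    G_diag -- running sum of squared gradients, i.e. diag(G_t)
+    """
+    G_diag = G_diag + g * g                 # accumulate squared gradients
+    w = w - lr * g / np.sqrt(G_diag + eps)  # per-coordinate scaled step
+    return w, G_diag
+
+
+# Toy run: coordinate 0 is a frequent feature, coordinate 1 a rare one.
+w, G = np.zeros(2), np.zeros(2)
+for t in range(100):
+    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
+    w, G = adagrad_update(w, g, G)
+
+# The rare coordinate accumulates less, so it keeps a larger
+# effective step size than the frequent one.
+print(0.01 / np.sqrt(G + 1e-8))
+```
+
+The per-coordinate divisor $\sqrt{G_{t, ii} + \epsilon}$
+is what lets rarely observed features keep a large step
+size while frequently observed ones are damped.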
+
+## Pros and Cons
+
+### Pros
+
+- It eliminates the need to manually tune the
+  `learning rate`, which is usually set to
+  $0.01$ (see the `PyTorch` usage sketch at the
+  end of this note)
+
+### Cons
+
+- The squared ***gradients*** keep accumulating over
+  the iterations, making the effective `learning rate`
+  become ***smaller and smaller***, which can
+  eventually stall learning
+
+
+
+[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf)
+
+[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)
+
+[^regret-definition]: [Definition of Regret | 19th April 2025](https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/tradeoff-between-exploration-and-exploitation/exploration-and-exploitation/examination-review-exploration-and-exploitation/explain-the-concept-of-regret-in-reinforcement-learning-and-how-it-is-used-to-evaluate-the-performance-of-an-algorithm/#:~:text=Regret%20quantifies%20the%20difference%20in,and%20making%20decisions%20over%20time.)
+
+[^anelli-adagrad-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42
+
+[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43
+
+[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44
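+
+Finally, a usage sketch of the `torch.optim.Adagrad`
+optimizer cited above[^adagrad-torch]; the tiny linear
+model and random batch are illustrative assumptions,
+not from the referenced material, and `PyTorch`'s
+default learning rate is $0.01$:
+
+```python
+import torch
+
+model = torch.nn.Linear(10, 1)                       # toy model, illustration only
+optimizer = torch.optim.Adagrad(model.parameters())  # default lr=0.01
+
+for step in range(100):
+    x, y = torch.randn(32, 10), torch.randn(32, 1)   # random toy batch
+    optimizer.zero_grad()
+    loss = torch.nn.functional.mse_loss(model(x), y)
+    loss.backward()
+    optimizer.step()                                 # per-coordinate AdaGrad update
+```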