Added ADAGRAD
parent fbd1c0cccd | commit d5e34ab54b
Chapters/5-Optimization/Fancy-Methods/ADAGRAD.md (new file, +98 lines)

# AdaGrad[^adagrad-torch]

`AdaGrad` is an ***optimization method*** that aims to:

<u>***"find needles in the haystack in the form of very predictive yet rarely observed features"***[^adagrad-official-paper]</u>

`AdaGrad`, as opposed to standard `SGD`, which applies the ***same step to every gradient geometry***, tries to ***incorporate geometry from earlier iterations***.

## The Algorithm

To start, let's define the standard `Regret`[^regret-definition] for `convex optimization`:

$$
\begin{aligned}
R(T) &= \sum_{t = 1}^{T}\left[ f_t(\bar{w}_t) - f_t(w^*) \right] \\
w^* &\triangleq \text{optimal weights}
\end{aligned}
$$

In the standard case, we move opposite to the direction of the ***gradient***[^anelli-adagrad-1]:

$$
\bar{w}_{t+1, i} = \bar{w}_{t, i} - \eta g_{t, i}
$$
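
A minimal sketch of this plain `SGD` step, assuming a toy quadratic objective; `sgd_step` and the rest of the names below are illustrative, not from the cited material:

```python
import numpy as np

def sgd_step(w, g, lr=0.01):
    """One plain SGD step: move opposite to the gradient g."""
    return w - lr * g

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    g = 2 * w          # gradient of f at the current w
    w = sgd_step(w, g)
print(w)               # close to the optimum [0, 0]
```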

Instead, `AdaGrad` takes another approach[^anelli-adagrad-2][^adagrad-official-paper]:

$$
\begin{aligned}
\bar{w}_{t+1, i} &= \bar{w}_{t, i} - \frac{\eta}{\sqrt{G_{t, i, i} + \epsilon}} \cdot g_{t, i} \\
G_{t} &= \sum_{\tau = 1}^{t} g_{\tau} g_{\tau}^T
\end{aligned}
$$

Here $G_t$ is the ***sum of the outer products*** of the ***gradients*** up to time $t$. The full $G_t$ is ***usually not used***, since it is ***impractical because of the high number of dimensions***; instead we use $\mathrm{diag}(G_t)$, which can be ***computed in linear time***[^adagrad-official-paper].

The $\epsilon$ term here is used to ***avoid dividing by 0***[^anelli-adagrad-2] and has a small value, usually on the order of $10^{-8}$.
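
To make the diagonal update concrete, here is a minimal sketch of the $\mathrm{diag}(G_t)$ variant above in NumPy; the quadratic objective and all names are illustrative assumptions, not part of the cited sources:

```python
import numpy as np

def adagrad_step(w, g, G_diag, lr=0.1, eps=1e-8):
    """One diagonal-AdaGrad step; G_diag holds the running
    per-coordinate sum of squared gradients, i.e. diag(G_t)."""
    G_diag = G_diag + g ** 2                  # update diag(G_t) in linear time
    w = w - lr / np.sqrt(G_diag + eps) * g    # per-coordinate adaptive step
    return w, G_diag

# Toy usage on f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
G_diag = np.zeros_like(w)
for _ in range(1000):
    g = 2 * w
    w, G_diag = adagrad_step(w, g, G_diag)
print(w)  # moves toward the optimum [0, 0]; steps decay as G_diag grows
```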

## Motivating its effectiveness[^anelli-adagrad-3]

- When we have ***many dimensions, many features are irrelevant***
- ***Rarer features are more relevant***
- It adapts $\eta$ to the right metric space by projecting stochastic gradient updates with the [Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance), a distance of a point from a probability distribution (see the sketch after this list)
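
To make that last point concrete, the `AdaGrad` step can be read as a proximal update under a Mahalanobis-type norm (a sketch following the official paper[^adagrad-official-paper]; the shorthand $H_t \triangleq \mathrm{diag}(G_t)^{1/2}$ is a notation assumed here):

$$
\bar{w}_{t+1} = \arg\min_{w} \left[ \eta \langle g_t, w \rangle + \frac{1}{2} \langle w - \bar{w}_t, H_t (w - \bar{w}_t) \rangle \right]
$$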

## Pros and Cons

### Pros

- It eliminates the need to manually tune the `learning rate`, which is usually set to $0.01$
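
A minimal usage sketch with the PyTorch optimizer cited above[^adagrad-torch]; the tiny model and data are illustrative placeholders, and `lr=0.01` matches PyTorch's default:

```python
import torch

# Illustrative placeholder model and data.
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

# lr=0.01 is the PyTorch default, matching the value above.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)

for _ in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()  # per-coordinate adaptive update, as in the formula above
```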

### Cons

- The squared ***gradients*** are accumulated across iterations, making the effective `learning rate` ***smaller and smaller***
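
A small numeric sketch of this effect, under the illustrative assumption of a constant gradient: the effective step shrinks like $1/\sqrt{t}$ as the accumulator grows.

```python
import numpy as np

lr, eps = 0.01, 1e-8
g = 1.0          # assume a constant gradient, for illustration only
G_diag = 0.0
for t in range(1, 6):
    G_diag += g ** 2
    step = lr / np.sqrt(G_diag + eps) * g
    print(t, step)  # 0.01, 0.0071, 0.0058, 0.005, 0.0045: ~ lr / sqrt(t)
```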

<!-- Footnotes -->

[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf)

[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)

[^regret-definition]: [Definition of Regret | 19th April 2025](https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/tradeoff-between-exploration-and-exploitation/exploration-and-exploitation/examination-review-exploration-and-exploitation/explain-the-concept-of-regret-in-reinforcement-learning-and-how-it-is-used-to-evaluate-the-performance-of-an-algorithm/#:~:text=Regret%20quantifies%20the%20difference%20in,and%20making%20decisions%20over%20time.)

[^anelli-adagrad-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42

[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43

[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44