# Adam[^adam-official-paper]

## Mean and Variance

Here we estimate the `running-mean` and the
`running-uncentered-variance` of the gradient (a raw
second moment, ***not centered***, rather than a true
standard deviation):

$$
\begin{aligned}
\mu_t &= \beta_1 \mu_{t-1} + (1 - \beta_1) g_t \\
\sigma_t &= \beta_2 \sigma_{t-1} + (1 - \beta_2) g_t^2
\end{aligned}
$$
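
A minimal sketch of this step (assuming `g` is the
current gradient, as a scalar or `NumPy` array, and
`mu`, `sigma` are state values initialized to zero; the
helper name is made up here):

```python
def update_moments(mu, sigma, g, beta1=0.9, beta2=0.999):
    """Exponential moving averages of the gradient and of its square."""
    mu = beta1 * mu + (1 - beta1) * g            # running mean of g
    sigma = beta2 * sigma + (1 - beta2) * g**2   # running uncentered variance of g
    return mu, sigma
```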

Then we apply a `bias-correction`: both running
estimates start at zero, so during the first steps they
are biased toward zero and would not be scaled
well[^anelli-adam-1]:

$$
\begin{aligned}
\hat{\mu}_t &= \frac{\mu_t}{1 - \beta_1^t} \\
\hat{\sigma}_t &= \frac{\sigma_t}{1 - \beta_2^t}
\end{aligned}
$$
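
Continuing the same sketch, the correction only needs
the step counter `t` (assumed to start from 1):

```python
def bias_correct(mu, sigma, t, beta1=0.9, beta2=0.999):
    """Undo the bias toward zero caused by the zero initialization."""
    mu_hat = mu / (1 - beta1 ** t)
    sigma_hat = sigma / (1 - beta2 ** t)
    return mu_hat, sigma_hat
```
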
## Weight Update

The `weight` update, which
***needs a `learning-rate`*** $\eta$, is:

$$
\vec{w}_t = \vec{w}_{t-1} - \eta
\frac{\hat{\mu}_t}{\sqrt{\hat{\sigma}_t + \epsilon}}
$$

Typical default values are:

- $\eta = 0.001$
- $\beta_1 = 0.9$
- $\beta_2 = 0.999$
- $\epsilon = 10^{-8}$
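
Putting the sketch together with these defaults, one
full `Adam` step could look like this (the helper names
are made up here; `w` is the weight vector, and
$\epsilon$ sits inside the square root as in the
formula above):

```python
import numpy as np

def apply_update(w, mu_hat, sigma_hat, lr=0.001, eps=1e-8):
    """Weight update using the bias-corrected moments."""
    return w - lr * mu_hat / np.sqrt(sigma_hat + eps)

# One Adam step inside a training loop, where `w`, `g`,
# `mu`, `sigma` and the step counter `t` (from 1) are
# maintained by the loop itself:
mu, sigma = update_moments(mu, sigma, g)
mu_hat, sigma_hat = bias_correct(mu, sigma, t)
w = apply_update(w, mu_hat, sigma_hat)
```
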
## Pros and Cons

### Pros

- ***It adapts `learning-rates` locally***
- ***It handles sparse gradients well***
- ***It is more robust to non-optimal `hyperparameters`***

### Cons[^anelli-adam-2]

- ***It fails to converge in some cases***
- ***It generalizes worse than `SGD`***, especially for
images
- It needs ***3 times the `memory`*** of plain `SGD`,
because of the 3 `buffers` needed (weights, first and
second moment)
- We need to
***tune 2 momentum parameters instead of 1***
($\beta_1$ and $\beta_2$)

## Notes[^anelli-adam-3]

- Whenever the ***gradient*** is constant, the `local
gain` is 1: since the estimate is ***not centered***,
$\hat{\sigma}_t \rightarrow \hat{\mu}_t^2$, so
$\hat{\mu}_t / \sqrt{\hat{\sigma}_t} \rightarrow \pm 1$
- On the other hand, when the
***gradient changes sign frequently***,
$\sqrt{\hat{\sigma}_t} \gg |\hat{\mu}_t|$, thus the
***steps will be smaller***
- In a nutshell, this means that ***whenever we
are sure of following the right path, we take
big steps*** and ***when we are unsure, we
proceed more slowly***.
- Because of the previous point, `Adam` may stop in the
`local minimum` that is the ***closest***, but that
could also be ***shallower***
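
A quick, purely illustrative check of the first two
notes with scalar gradients (same update rules as the
sketch above):

```python
import numpy as np

def adam_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected ratio mu_hat / sqrt(sigma_hat + eps) after feeding `grads`."""
    mu = sigma = 0.0
    for t, g in enumerate(grads, start=1):
        mu = beta1 * mu + (1 - beta1) * g
        sigma = beta2 * sigma + (1 - beta2) * g ** 2
    mu_hat = mu / (1 - beta1 ** t)
    sigma_hat = sigma / (1 - beta2 ** t)
    return mu_hat / np.sqrt(sigma_hat + eps)

print(adam_ratio([0.5] * 100))       # constant gradient: ratio close to 1
print(adam_ratio([0.5, -0.5] * 50))  # sign-flipping gradient: much smaller steps
```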

<!-- Footnotes -->

[^adam-official-paper]: [Official Paper | arXiv:1412.6980v9](https://arxiv.org/pdf/1412.6980)

[^anelli-adam-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 55

[^anelli-adam-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 54

[^anelli-adam-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 57-58