Revised Optimization Notes

This commit is contained in:
Christian Risi 2025-11-20 18:47:36 +01:00
parent 2a96deaebf
commit 934c08d4c0
6 changed files with 1226 additions and 429 deletions

View File

@ -33,7 +33,8 @@ $$
## Cross Entropy Loss derivation
A cross entropy is the measure of *"surprise"* we get from distribution $p$ knowing
Cross entropy[^wiki-cross-entropy] is the measure of *"surprise"*
we get from distribution $p$ knowing
results from distribution $q$. It is defined as the entropy of $p$ plus the
[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$
@ -62,6 +63,23 @@ Usually $\hat{y}$ comes from using a
[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since it uses a
logarithm and probability values are at most 1, the closer $\hat{y}$ is to 0, the higher the loss
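A minimal numpy sketch of the formula above; the logits, the softmax and the one-hot target are made-up values, not taken from any particular model:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
y_hat = np.exp(logits) / np.sum(np.exp(logits))  # softmax probabilities
y = np.array([1.0, 0.0, 0.0])                    # one-hot target

# Cross entropy: only the probability of the true class contributes,
# and the closer it is to 0, the higher the loss
loss = -np.sum(y * np.log(y_hat))
print(loss)  # ~0.417
```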
## Computing PCA[^wiki-pca]
> [!CAUTION]
> $X$ here is the dataset matrix with **<ins>features over rows</ins>**
- $\Sigma = \frac{X \times X^T}{N} \coloneqq$ covariance matrix approximation (for mean-centered $X$)
- $\vec{\lambda} \coloneqq$ vector of eigenvalues of $\Sigma$
- $\Lambda \coloneqq$ eigenvector columnar matrix sorted by eigenvalues
- $\Lambda_{red} \coloneqq$ eigenvector matrix reduced to the $k$ eigenvectors with the
highest eigenvalues
- $Z = \Lambda_{red}^T \times X \coloneqq$ compressed representation
> [!NOTE]
> You may have studied PCA in terms of SVD, Singular Value Decomposition. The two
> are closely related and express the same concept through different
> mathematical formulations.
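A small numpy sketch of the steps above, keeping features over rows as in the caution note; the random data, the variable names and the choice $k = 2$ are only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))            # 5 features over rows, 100 samples
X = X - X.mean(axis=1, keepdims=True)    # center each feature

Sigma = (X @ X.T) / X.shape[1]           # covariance matrix approximation
eigvals, Lambda = np.linalg.eigh(Sigma)  # eigenvectors as columns
order = np.argsort(eigvals)[::-1]        # sort by eigenvalue, descending
Lambda_red = Lambda[:, order[:2]]        # keep the k = 2 largest components

Z = Lambda_red.T @ X                     # compressed representation (k x N)
print(Z.shape)                           # (2, 100)
```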
## Laplace Operator[^khan-1]
It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
@ -80,8 +98,32 @@ It can also be used to compute the net flow of particles in that region of space
> This is not the **discrete Laplace operator** (which is instead a **matrix**),
> as there are many other formulations.
## [Hessian Matrix](https://en.wikipedia.org/wiki/Hessian_matrix)
A Hessian Matrix represents the 2nd derivative of a function, thus it gives
us the curvature of that function.
It also tells us whether a stationary point is a local minimum (the Hessian is positive
definite), a local maximum (it is negative definite) or a saddle (it is neither positive nor
negative definite).
It is computed by taking the partial derivatives of the gradient along
all dimensions and then transposing the result.
$$
\nabla f = \begin{bmatrix}
\frac{\partial f}{\partial x} & \frac{\partial f}{\partial y}
\end{bmatrix} \\
H(f) = \begin{bmatrix}
\frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \, \partial y} \\
\frac{\partial^2 f}{\partial y \, \partial x} & \frac{\partial^2 f}{\partial y^2}
\end{bmatrix}
$$
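As a quick sanity check of the definition, here is a small numpy sketch that approximates the Hessian of a toy function with central finite differences; the function $f(x, y) = x^2 y$ and the step size are arbitrary choices:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 * y

def hessian(f, p, h=1e-4):
    """Approximate the matrix of second partial derivatives at point p."""
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            # central difference for d^2 f / (dx_i dx_j)
            H[i, j] = (f(p + e_i + e_j) - f(p + e_i - e_j)
                       - f(p - e_i + e_j) + f(p - e_i - e_j)) / (4 * h**2)
    return H

print(hessian(f, np.array([1.0, 2.0])))  # ~[[4, 2], [2, 0]]
```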
[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)
[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)
[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))
[^wiki-pca]: [Wikipedia | Principal Component Analysis | 18th November 2025](https://en.wikipedia.org/wiki/Principal_component_analysis#Computation_using_the_covariance_method)

View File

@ -0,0 +1,199 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "8c14ea22",
"metadata": {},
"source": [
"# Computing PCA\n",
"\n",
"Here I'll be taking data from [Geeks4Geeks](https://www.geeksforgeeks.org/machine-learning/mathematical-approach-to-pca/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b32eb5c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1.8 1.87777778]\n",
"[[ 0.7 0.52222222]\n",
" [-1.3 -1.17777778]\n",
" [ 0.4 1.02222222]\n",
" [ 1.3 1.12222222]\n",
" [ 0.5 0.82222222]\n",
" [ 0.2 -0.27777778]\n",
" [-0.8 -0.77777778]\n",
" [-0.3 -0.27777778]\n",
" [-0.7 -0.97777778]]\n",
"[[0.6925 0.68875 ]\n",
" [0.68875 0.79444444]]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"X : np.ndarray = np.array([\n",
" [2.5, 2.4],\n",
" [0.5, 0.7],\n",
" [2.2, 2.9],\n",
" [3.1, 3.0],\n",
" [2.3, 2.7],\n",
" [2.0, 1.6],\n",
" [1.0, 1.1],\n",
" [1.5, 1.6],\n",
" [1.1, 0.9]\n",
"])\n",
"\n",
"# Compute mean values for features\n",
"mu_X = np.mean(X, 0)\n",
"\n",
"print(mu_X)\n",
"# \"Normalize\" Features\n",
"X = X - mu_X\n",
"print(X)\n",
"\n",
"# Compute covariance matrix applying\n",
"# Bessel's correction (n-1) instead of n\n",
"Cov = (X.T @ X) / (X.shape[0] - 1)\n",
"\n",
"print(Cov)"
]
},
{
"cell_type": "markdown",
"id": "78e9429f",
"metadata": {},
"source": [
"As you can notice, we did $X^T \\times X$ instead of $X \\times X^T$. This is because our \n",
"dataset had datapoints over rows instead of features."
]
},
{
"cell_type": "code",
"execution_count": 84,
"id": "f93b7a92",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.05283865 1.43410579]\n",
"[[-0.73273632 -0.68051267]\n",
" [ 0.68051267 -0.73273632]]\n"
]
}
],
"source": [
"# Computing eigenvalues\n",
"eigen = np.linalg.eig(Cov)\n",
"eigen_values = eigen.eigenvalues\n",
"eigen_vectors = eigen.eigenvectors\n",
"\n",
"print(eigen_values)\n",
"print(eigen_vectors)"
]
},
{
"cell_type": "markdown",
"id": "bfbdd9c3",
"metadata": {},
"source": [
"Now we'll generate the new X matrix by only using the first eigen vector"
]
},
{
"cell_type": "code",
"execution_count": 85,
"id": "7ce6c540",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(9, 1)\n",
"Compressed\n",
"[[-0.85901005]\n",
" [ 1.74766702]\n",
" [-1.02122441]\n",
" [-1.70695945]\n",
" [-0.94272842]\n",
" [ 0.06743533]\n",
" [ 1.11431616]\n",
" [ 0.40769167]\n",
" [ 1.19281215]]\n",
"Reconstruction\n",
"[[ 0.58456722 0.62942786]\n",
" [-1.18930955 -1.28057909]\n",
" [ 0.69495615 0.74828821]\n",
" [ 1.16160753 1.25075117]\n",
" [ 0.64153863 0.69077135]\n",
" [-0.0458906 -0.04941232]\n",
" [-0.75830626 -0.81649992]\n",
" [-0.27743934 -0.29873049]\n",
" [-0.81172378 -0.87401678]]\n",
"Difference\n",
"[[0.11543278 0.10720564]\n",
" [0.11069045 0.10280131]\n",
" [0.29495615 0.27393401]\n",
" [0.13839247 0.12852895]\n",
" [0.14153863 0.13145088]\n",
" [0.2458906 0.22836546]\n",
" [0.04169374 0.03872214]\n",
" [0.02256066 0.02095271]\n",
" [0.11172378 0.10376099]]\n"
]
}
],
"source": [
"# Computing X coming from only 1st eigen vector\n",
"Z_pca = X @ eigen_vectors[:,1]\n",
"Z_pca = Z_pca.reshape([Z_pca.shape[0], 1])\n",
"\n",
"print(Z_pca.shape)\n",
"\n",
"\n",
"# X reconstructed\n",
"eigen_v = (eigen_vectors[:, 1].reshape([eigen_vectors[:, 1].shape[0], 1]))\n",
"X_rec = Z_pca @ eigen_v.T\n",
"\n",
"print(\"Compressed\")\n",
"print(Z_pca)\n",
"\n",
"print(\"Reconstruction\")\n",
"print(X_rec)\n",
"\n",
"print(\"Difference\")\n",
"print(abs(X - X_rec))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "deep_learning",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@ -0,0 +1,501 @@
# Optimization
We basically look at the error and minimize it by moving along the negative ***gradient***
## Types of Learning Algorithms
In `Deep Learning` it's not unusual to be facing ***highly redundant*** `datasets`.
Because of this, the ***gradient*** from some `samples` is usually the ***same*** as for some others.
So, often we train the `model` on a subset of samples.
### Online Learning
This is the ***extreme*** of our techniques to deal with ***redundancy*** of `data`.
On each `point` we get the ***gradient*** and then we update `weights`.
### Mini-Batch
In this approach, we divide our `dataset` in small batches called `mini-batches`.
These need to be ***class-balanced*** so that each batch is representative of the whole `dataset`.
This technique is the ***most used one***
## Tips and Tricks
### Learning Rate
This is the `hyperparameter` we use to tune our
***learning steps***.
Sometimes we have it too big and this causes
***overshootings***. So a quick solution may be to turn
it down.
However, we are ***trading speed for accuracy***, thus it's better to wait before tuning this `parameter`
### Weight initialization
We need to avoid `neurons` to have the same
***gradient***. This is easily achievable by using
***small random values***.
However, if we have a ***large `fan-in`***, then it's
***easy to overshoot***, so it's better to initialize
those `weights` ***inversely proportionally to***
$\sqrt{\text{fan-in}}$:
$$
w = \frac{
\text{rand}(N)
}{
\sqrt{N}
}
$$
#### Xavier-Glorot Initialization
<!-- TODO: Read Xavier-Glorot paper -->
Here too `weights` are scaled according to the `fan-in` (and `fan-out`), but we ***sample*** from a
`uniform distribution` with a `std-dev`
$$
\sigma = \text{gain} \cdot \sqrt{
\frac{
2
}{
\text{fan-in} + \text{fan-out}
}
}
$$
and bounded between $a$ and $-a$
$$
a = \text{gain} \cdot \sqrt{
\frac{
6
}{
\text{fan-in} + \text{fan-out}
}
}
$$
Alternatively, one can use a `normal-distribution`
$\mathcal{N}(0, \sigma^2)$.
Note that `gain` in the **original paper** is equal
to $1$
### Decorrelating input components
Since ***highly correlated features*** don't offer much
in terms of ***new information***, probably we need
to go in the ***latent space*** to find the
`latent-variables` governing those `features`.
#### PCA
> [!CAUTION]
> This topic won't be explained here as it's something
> usually learnt for `Machine Learning`, a
> ***prerequisite*** for approaching `Deep Learning`.
This is a method we can use to discard `features` that
will ***add little to no information***
## Common problems in MultiLayer Networks
### Hitting a Plateau
This happens when we have a ***big `learning-rate`***
which makes `weights` go high in ***absolute value***.
Because this happens ***too quickly***, we could
see a ***quick diminishing error*** and this is usually
***mistaken for a minimum point***, while instead
it's a ***plateau***.
## Speeding up Mini-Batch Learning
### Momentum[^momentum]
We use this method ***mainly when we use `SGD`*** as
a ***learning technique***
This method is better explained if we imagine
our error surface as an actual surface and we place a
ball over it.
***The ball will start rolling towards the steepest
descent*** (initially), but ***after gaining enough
velocity*** it will keep following the ***previous direction,
to some extent***.
So, now the ***gradient*** does modify the ***velocity***
rather than the ***position***, so the momentum will
***dampen small variations***.
Moreover, once the ***momentum builds up***, we will
easily ***pass over plateaus*** as the
***ball will continue to roll over*** until it is
stopped by an opposing ***gradient***
#### Momentum Equations
There are a couple of them, mainly.
One of them uses a term to evaluate the `momentum`, $p$,
called `SGD momentum` or `momentum term` or
`momentum parameter`:
$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$
The other one is ***logically equivalent*** to the
previous, but it updates the `weights` in ***one step***
and is called `Stochastic Heavy Ball Method`:
$$
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
+ \beta ( w_k - w_{k-1})
$$
> [!NOTE]
> This is how to choose $\beta$:
>
> $0 < \beta < 1$
>
> If $\beta = 0$, then we are doing
> ***gradient descent***, if $\beta > 1$ then we
> ***will have numerical instabilities***.
>
> The ***larger*** $\beta$, the
> ***higher the `momentum`***, so the trajectory will
> ***turn more slowly***
> [!TIP]
> Usual values are $\beta = 0.9$ or $\beta = 0.99$;
> we usually start from $0.5$ and raise it
> whenever we are stuck.
>
> When we increase $\beta$, then the `learning rate`
> ***must decrease accordingly***
> (e.g. from 0.9 to 0.99, `learning-rate` must be
> divided by a factor of 10)
#### Nesterov (1983) Sutskever (2012) Accelerated Momentum
Differently from the previous
[momentum](#momentum-equations),
we take an ***intermediate*** step where we
***update the `weights`*** according to the
***previous `momentum`*** and then we compute the
***new `momentum`*** in this new position, and then
we ***update again***
$$
\begin{aligned}
\hat{w}_k & = w_k - \beta p_k \\
p_{k+1} &= \beta p_{k} +
\eta \nabla L(X, y, \hat{w}_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$
#### Why Momentum Works
While it has been ***hypothesized*** that
***acceleration*** makes ***convergence faster***, this is
***only true for convex problems without much noise***,
though it may be ***part of the story***
The other half may be ***Noise Smoothing*** of
the optimization process; however,
according to these papers[^no-noise-smoothing][^no-noise-smoothing-2] this may not be the actual reason.
### Separate Adaptive Learning Rates
Since `weights` may ***greatly vary*** across `layers`,
having a ***single `learning-rate`*** might not be ideal.
So the idea is to set a `local learning-rate` to
control the `global` one as a ***multiplicative factor***
#### Local Learning rates
- Start with $1$ as the ***starting point*** for
`local learning-rates` which we'll call `gain` from
now on.
- If the `gradient` has the ***same sign as in the previous step, increase it additively***
- Otherwise, ***multiplicatively decrease it***
$$
\Delta w_{i,j} = - g_{i,j} \cdot \eta \frac{
d \, Out
}{
d \, w_{i,j}
}
\\
g_{i,j}(t) = \begin{cases}
g_{i,j}(t - 1) + \delta
& \left( \frac{
d \, Out
}{
d \, w_{i,j}
} (t)
\cdot
\frac{
d \, Out
}{
d \, w_{i,j}
} (t-1) \right) > 0 \\
g_{i,j}(t - 1) \cdot (1 - \delta)
& \left( \frac{
d \, Out
}{
d \, w_{i,j}
} (t)
\cdot
\frac{
d \, Out
}{
d \, w_{i,j}
} (t-1) \right) \leq 0
\end{cases}
$$
With this method, if there are oscillations, we will have
`gains` around $1$
> [!TIP]
>
> - Usually a value for $\delta$ is $0.05$
> - Limit `gains` around some values:
>
> - $[0.1, 10]$
> - $[0.01, 100]$
>
> - Use `full-batches` or `big mini-batches` so that
> the ***gradient*** doesn't oscillate because of
> sampling errors
> - Combine it with [Momentum](#momentum)
> - Remember that ***Adaptive `learning-rate`*** deals
> with ***axis-alignment***
### rmsprop | Root Mean Square Propagation
#### rprop | Resilient Propagation[^rprop-torch]
This is basically the same idea of [separating learning rates](#separate-adaptive-learning-rates),
but in this case we don't use the
[AIMD](#local-learning-rates) technique and
***we don't take into account*** the
***magnitude of the gradient*** but ***only the sign***
- If ***gradient*** has same sign:
- $step_{k+1} = step_{k} \cdot \eta_+$ where $\eta_+ > 1$
- else:
- $step_{k+1} = step_{k} \cdot \eta_-$
where $0 <\eta_- < 1$
> [!TIP]
>
> Limit the step size in a range where:
>
> - $\inf \approx 10^{-6}$ (one millionth)
> - $\sup \approx 50$
> [!CAUTION]
>
> rprop does ***not work*** with `mini-batches` as
> the ***sign of the gradient changes frequently***
#### rmsprop in detail[^rmsprop-torch]
The idea is that [rprop](#rprop--resilient-propagation)
is ***equivalent to using the gradient divided by its
magnitude*** (as you effectively multiply by $1$ or $-1$),
however it means that between `mini-batches` the
***divisor*** changes each time, oscillating.
The solution is to have a ***running average*** of
the ***magnitude of the squared gradient for
each `weight`***:
$$
MeanSquare(w, t) =
\alpha MeanSquare(w, t-1) +
(1 - \alpha)
\left(
\frac{d\, Out}{d\, w}
\right)^2
$$
We then divide the ***gradient by the `square root`***
of that value
#### Further Developments
- `rmsprop` with `momentum` does not work as it should
- `rmsprop` with `Nesterov momentum` works best
if used to divide the ***correction*** rather than
the ***jump***
- `rmsprop` with `adaptive learnings` needs more
investigation
### Fancy Methods
#### Adaptive Gradient
<!-- TODO: Expand over these -->
##### Convex Case
- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods
##### Non-Convex Case
Pay attention: here the `Hessian` may not be
`Positive Semi-Definite`, thus when the ***gradient*** is
$0$ we don't necessarily know whether we are at a minimum, a maximum or a saddle point.
- Natural Gradient Methods
- Curvature Adaptive
- [Adagrad](./Fancy-Methods/ADAGRAD.md)
- [AdaDelta](./Fancy-Methods/ADADELTA.md)
- [RMSprop](#rmsprop-in-detail)
- [ADAM](./Fancy-Methods/ADAM.md)
- l-BFGS
- [heavy ball gradient](#momentum)
- [momentum](#momentum)
- Noise Injection:
- Simulated Annealing
- Langevin Method
#### Adagrad
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAGRAD.md)
#### Adadelta
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADADELTA.md)
#### ADAM
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM.md)
#### AdamW
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM-W.md)
#### LION
> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)
### Hessian Free[^anelli-hessian-free]
How much can we `learn` from a given
`Loss` space?
The ***best way to move*** would be along the
***gradient***, assuming the surface keeps
the ***same curvature***
(e.g. it's quadratic and has a local minimum).
But ***usually this is not the case***, so we need
to move ***where the ratio of gradient and curvature is
high***
#### Newton's Method
This method takes into account the ***curvature***
of the `Loss`
With this method, the update would be:
$$
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
d \, E
}{
d \, \vec{w}
}
$$
***If this were feasible we would reach the minimum in
one step***, but it's not, as the
***computations***
needed to get (and invert) a `Hessian` ***grow rapidly with the number of parameters***.
The thing is that whenever we ***update `weights`*** with
the `Steepest Descent` method, each update *messes up*
another, while the ***curvature*** can help to ***scale
these updates*** so that they do not disturb each other.
#### Curvature Approximations
However, since the `Hessian` is
***too expensive to compute***, we can approximate it.
- We can take only the ***diagonal elements***
- ***Other algorithms*** (e.g. Hessian Free)
- ***Conjugate Gradient*** to minimize the
***approximation error***
#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]
> [!CAUTION]
>
> This is an oversimplification of the topic, so reading
> the footnotes material is greatly advised.
The basic idea is that, in order not to mess up previous
directions, we ***`optimize` along perpendicular directions***.
This method is ***mathematically guaranteed to succeed
after N steps***, $N$ being the dimension of the space; in practice
the error will be minimal.
This ***method works well for `non-quadratic errors`***
and the `Hessian Free` `optimizer` uses this method
on ***genuinely quadratic surfaces***, which are
***quadratic approximations of the real surface***
<!-- TODO: Add PDF 5 pg. 38 -->
<!-- Footnotes -->
[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)
[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4
[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1
[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)
[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)
[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81
[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)
[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76

View File

@ -1,501 +1,556 @@
# Optimization
We basically look at the error and minimize it by moving along the negative ***gradient***
## Beyond Full Batches
## Types of Learning Algorithms
Even though full batches give the best picture of the probability distribution
of data points, they are computationally expensive.
In `Deep Learning` it's not unusual to be facing ***highly redundant*** `datasets`.
Because of this, usually ***gradient*** from some `samples` is the ***same*** for some others.
Since data is usually **highly redundant**, we can think of using smaller,
class-balanced sets, **mini-batches**, to update weights.
While this doesn't give the same results as full batches, it is still reliable.
So, often we train the `model` on a subset of samples.
When we need to bring things to the extreme, we can even update over a single
data point, **online learning**; however, this is not as efficient as
mini-batches, as **it does not use matrix multiplications, which are GPU efficient**
### Online Learning
## Learning rate Scheduling
This is the ***extreme*** of our techniques to deal with ***redundancy*** of `data`.
## Xavier-Glorot Weight initialization
On each `point` we get the ***gradient*** and then we update `weights`.
> [!WARNING]
> Before Xavier-Glorot there was another initialization technique scaled by the
> square root of the fan-in:
>
> $$ W \propto \frac{rand(in, out)}{\sqrt{in}}$$
>
> Though, Xavier-Glorot is not the only available initialization as there are
> many others[^torch-init]
### Mini-Batch
Whenever we initialize weights, we need to be careful to **break symmetry**, as
**identical hidden nodes get the exact same results**, making us
lose representation power.
In this approach, we divide our `dataset` in small batches called `mini-batches`.
These need to be ***balanced*** in order not to have ***imbalances***.
Another problem with weight initialization is **overshooting**. This is
caused by **many small changes over weights** adding up. The idea to solve this is to
**initialize weights according to the fan-in (input) and fan-out (output)**
This technique is the ***most used one***
## Tips and Tricks
### Learning Rate
This is the `hyperparameter` we use to tune our
***learning steps***.
Sometimes we have it too big and this causes
***overshootings***. So a quick solution may be to turn
it down.
However, we are ***trading speed for accuracy***, thus it's better to wait before tuning this `parameter`
### Weight initialization
We need to avoid `neurons` to have the same
***gradient***. This is easily achievable by using
***small random values***.
However, if we have a ***large `fan-in`***, then it's
***easy to overshoot***, then it's better to initialize
those `weights` ***proportionally to***
$\sqrt{\text{fan-in}}$:
$$
w = \frac{
np.random(N)
}{
\sqrt{N}
}
$$
#### Xavier-Glorot Initialization
<!-- TODO: Read Xavier-Glorot paper -->
Here `weights` are ***proportional*** to $\sqrt{\text{fan-in}}$ as well, but we ***sample*** from a
`uniform distribution` with a `std-dev`
$$
\sigma^2 = \text{gain} \cdot \sqrt{
\frac{
2
}{
\text{fan-in} + \text{fan-out}
}
}
$$
and bounded between $a$ and $-a$
$$
a = \text{gain} \cdot \sqrt{
\frac{
6
}{
\text{fan-in} + \text{fan-out}
}
}
$$
Alternatively, one can use a `normal-distribution`
$\mathcal{N}(0, \sigma^2)$.
Note that `gain` is in the **original paper** is equal
to $1$
### Decorrelating input components
Since ***highly correlated features*** don't offer much
in terms of ***new information***, probably we need
to go in the ***latent space*** to find the
`latent-variables` governing those `features`.
#### PCA
> [!CAUTION]
> This topic won't be explained here as it's something
> usually learnt for `Machine Learning`, a
> ***prerequisite*** for approaching `Deep Learning`.
This is a method we can use to discard `features` that
will ***add little to no information***
## Common problems in MultiLayer Networks
### Hitting a Plateau
This happens when we have a ***big `learning-rate`***
which makes `weights` go high in ***absolute value***.
Because this happens ***too quickly***, we could
see a ***quick diminishing error*** and this is usually
***mistaken for a minimum point***, while instead
it's a ***plateau***.
## Speeding up Mini-Batch Learning
### Momentum[^momentum]
We use this method ***mainly when we use `SGD`*** as
a ***learning technique***
This method is better explained if we imagine
our error surface as an actual surface and we place a
ball over it.
***The ball will start rolling towards the steepest
descent*** (initially), but ***after gaining enough
velocity*** it will follow the ***previous direction
, in some measure***.
So, now the ***gradient*** does modify the ***velocity***
rather than the ***position***, so the momentum will
***dampen small variations***.
Moreover, once the ***momentum builds up***, we will
easily ***pass over plateaus*** as the
***ball will continue to roll over*** until it is
stopped by a negative ***gradient***
#### Momentum Equations
There are a couple of them, mainly.
One of them uses a term to evaluate the `momentum`, $p$,
called `SGD momentum` or `momentum term` or
`momentum parameter`:
A technique we use to initialize weights comes from Xavier and Glorot, called
Xavier-Glorot initialization:
$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
&W \propto \frac{rand(in, out)}{\sqrt{in + out}} \\
&rand = \mathcal{U}(-a, a) \rightarrow a = g \cdot \sqrt{\frac{6}{in + out}} \\
&\,\,\,\,\text{or} \\
&rand =\mathcal{N}(0, \sigma^2) \rightarrow \sigma = g \cdot
\sqrt{\frac{2}{in + out}}
\end{aligned}
$$
The other one is ***logically equivalent*** to the
previous, but it updates the `weights` in ***one step***
and is called `Stochastic Heavy Ball Method`:
In other words, Xavier-Glorot extracts weights from either a uniform distribution
or a normal one, scaled by a factor $g$ called gain.
$$
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
+ \beta ( w_k - w_{k-1})
$$
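A minimal numpy sketch of the Xavier-Glorot uniform variant described above; the function name and layer sizes are made up, and in practice `torch.nn.init.xavier_uniform_`[^torch-init] already implements this:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, gain=1.0, rng=np.random.default_rng()):
    # bound a = gain * sqrt(6 / (fan_in + fan_out)), as in the formula above
    a = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)
# empirical std matches sigma = gain * sqrt(2 / (fan_in + fan_out))
print(W.std(), np.sqrt(2.0 / (256 + 128)))
```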
[^torch-init]: [Pytorch Official Docs | `torch.nn.init` | 18th November 2025](https://docs.pytorch.org/docs/stable/nn.init.html)
> [!NOTE]
> This is how to choose $\beta$:
>
> $0 < \beta < 1$
>
> If $\beta = 0$, then we are doing
> ***gradient descent***, if $\beta > 1$ then we
> ***will have numerical instabilities***.
>
> The ***larger*** $\beta$ the
> ***higher the `momentum`***, so it will
> ***turn slower***
## Momentum
> [!TIP]
> usual values are $\beta = 0.9$ or $\beta = 0.99$
> and usually we start from 0.5 initially, to raise it
> whenever we are stuck.
>
> When we increase $\beta$, then the `learning rate`
> ***must decrease accordingly***
> (e.g. from 0.9 to 0.99, `learning-rate` must be
> divided by a factor of 10)
> For $\beta$ going from 0.9 to 0.99, the learning rate needs to be decreased by
> a factor of 10
#### Nesterov (1983) Sutskever (2012) Accelerated Momentum
It's a technique inspired by physics. Imagine a ball rolling over a plane. Once
it has enough speed, even if the plane changes inclination, the ball still has
energy to keep moving in the previous direction because of its momentum.
Differently from the previous
[momentum](#momentum-equations),
we take an ***intermediate*** step where we
***update the `weights`*** according to the
***previous `momentum`*** and then we compute the
***new `momentum`*** in this new position, and then
we ***update again***
Whenever gradient descent oscillates, **momentum dampens** the
movements that steer us away from the previous direction. Here the momentum at time $k$
is $p_k$:
$$
\begin{aligned}
\hat{w}_k & = w_k - \beta p_k \\
p_{k+1} &= \beta p_{k} +
\eta \nabla L(X, y, \hat{w}_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, Y, W_{k}) \\
W_{k+1} &= W_{k} - \gamma p_{k+1} \\
\beta &\in [0, 1]
\end{aligned}
$$
#### Why Momentum Works
While it has been ***hypothesized*** that
***acceleration*** made ***convergence faster***, this
is
***only true for convex problems without much noise***,
though this may be ***part of the story***
The other half may be ***Noise Smoothing*** by
smoothing the optimization process, however
according to these papers[^no-noise-smoothing][^no-noise-smoothing-2] this may not be the actual reason.
### Separate Adaptive Learning Rates
Since `weights` may ***greatly vary*** across `layers`,
having a ***single `learning-rate`*** might not be ideal.
So the idea is to set a `local learning-rate` to
control the `global` one as a ***multiplicative factor***
#### Local Learning rates
- Start with $1$ as the ***starting point*** for
`local learning-rates` which we'll call `gain` from
now on.
- If the `gradient` has the ***same sign, increase it***
- Otherwise, ***multiplicatively decrease it***
Or, in a more compact way, logically equivalent to the previous one:
$$
w_{i,j} = - g_{i,j} \cdot \eta \frac{
d \, Out
}{
d \, w_{i,j}
}
W_{k+1} = W_{k} - \gamma \nabla L(X, Y, W_{k}) + \beta(W_{k} - W_{k-1})
$$
\\
g_{i,j}(t) = \begin{cases}
The larger $\beta$, the slower it curves, accumulating more of the previous directions.
To play it safe, use smaller values at the beginning, where updates
are large, and slowly turn it up to values near 1.
g_{i,j}(t - 1) + \delta
& \left( \frac{
d \, Out
}{
d \, w_{i,j}
} (t)
\cdot
\frac{
d \, Out
}{
d \, w_{i,j}
} (t-1) \right) > 0 \\
> [!NOTE]
>
> - $\eta$: hyperparameter related to the gradient, usually equal to the learning
> rate
> - $\gamma$: Learning rate
> - $\beta$: hyperparameter of dampening factor
> - $\nabla L(X, Y, W_{k})$: gradient of the loss
>
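A toy numpy sketch of the momentum update above on a quadratic loss $L(W) = \frac{1}{2} W^T A W$; the matrix $A$, the starting point and the hyperparameter values are arbitrary:

```python
import numpy as np

A = np.diag([10.0, 1.0])          # elongated valley: different curvature per axis
W = np.array([1.0, 1.0])
p = np.zeros_like(W)
eta, gamma, beta = 1.0, 0.05, 0.9

for _ in range(100):
    grad = A @ W                  # gradient of the quadratic loss
    p = beta * p + eta * grad     # momentum accumulates past gradients
    W = W - gamma * p             # weights move against the smoothed gradient
print(W)                          # close to the minimum at (0, 0)
```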
## Nesterov Accelerated Gradient (aka NAG)
g_{i,j}(t - 1) \cdot (1 - \delta)
& \left( \frac{
d \, Out
}{
d \, w_{i,j}
} (t)
\cdot
\frac{
d \, Out
}{
d \, w_{i,j}
} (t-1) \right) \leq 0
This method takes inspiration from Nesterov's optimization for convex functions and
applies it to momentum. Its quirk is that it never computes the gradient at the weights it
actually lands on, but at a temporary look-ahead estimate of them before the actual update.
|Vanilla Momentum[^Akshay-medium-1] | Nesterov Momentum[^Akshay-medium-1] |
|--|--|
| ![momentum descent](./pngs/vanilla-momentum.gif) | ![nesterov momentum descent](./pngs/nesterov.gif) |
To better illustrate its quirk, here's the formulation:
$$
\begin{aligned}
\hat{W}_{k} &= W_{k} - \beta p_k \\
p_{k+1} &= \beta p_{k} + \eta\nabla L(X, Y, \hat{W}_k) \\
W_{k+1} &= W_{k} - \gamma p_{k+1}
\end{aligned}
$$
As can be seen, the loss is computed over $\hat{W}_{k}$ rather than $W_{k}$,
which will be our actual weights. The idea is to follow the previous momentum
blindly, see where it goes and then make the correction.
[^Akshay-medium-1]: [Akshay L Chandra | Learning Parameters, Part 2: Momentum-Based & Nesterov Accelerated Gradient Descent | 18th November 2025](https://medium.com/data-science/learning-parameters-part-2-a190bef2d12)
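A minimal variation of the same toy problem with the Nesterov look-ahead step; again $A$ and the hyperparameters are arbitrary:

```python
import numpy as np

A = np.diag([10.0, 1.0])
W, p = np.array([1.0, 1.0]), np.zeros(2)
eta, gamma, beta = 1.0, 0.05, 0.9

for _ in range(100):
    W_hat = W - beta * p          # provisional step along the previous momentum
    grad = A @ W_hat              # gradient at the look-ahead point, not at W
    p = beta * p + eta * grad
    W = W - gamma * p
print(W)                          # close to the minimum at (0, 0)
```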
## Justifying Faster Optimization for Momentum Based Methods
While many people attribute the speed of momentum based methods to acceleration,
this doesn't hold true, as acceleration is only guaranteed for convex functions.
Since most of the time we have no idea what our loss surface looks like,
we can't assume it is convex.
So, the most compelling explanation lies in the fact that momentum based
optimization is like computing a running average of the loss gradient, smoothing
the noise introduced by the smaller sampling size. In fact, with momentum it is not necessary to average steps as in SGD.
## Separate Adaptive Learning Rate
The idea is that each weight of each layer may need its own learning rate to avoid
overshooting and to smooth the magnitude of the received gradients, which are high over
the last layers and low over the first ones (architecture wise).
The trick is to have a global learning rate adjusted by a local gain, which is
increased each time the update for that weight keeps the same sign and decreased otherwise:
$$
\Delta w_{i,j} = - \eta \cdot g_{i,j} \frac{d \,Loss}{d \, w_{i,j}} \\
g_{i,j}(n +1 ) = \begin{cases}
g_{i,j}(n) + 0.05 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) > 0 \\
g_{i,j}(n) \cdot 0.95 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) < 0
\end{cases}
$$
With this method, if there are oscillations, we will have
`gains` around $1$
This method ensures that if the weight oscillates, the gain will dampen it.
Moreover, should it be totally random, it will hover near 1, keeping gradient
updates unchanged.
> [!NOTE]
> The way $g$ is updated is similar to AIMD in TCP congestion control
<!-- Comment for linter complains-->
> [!TIP]
>
> - Usually a value for $d$ is $0.05$
> - Limit `gains` around some values:
> - **Clip gains to some margins** - $[0.1, 10]$ or $[0.01, 100]$
> - **Use full batch or big mini-batches** - This ensures that the change in sign
> is not due to sampling errors
> - **Combine this with momentum**
> - **Use this to deal with axis-alignment problems**
>
> - $[0.1, 10]$
> - $[0.01, 100]$
>
> - Use `full-batches` or `big mini-batches` so that
> the ***gradient*** doesn't oscillate because of
> sampling errors
> - Combine it with [Momentum](#momentum)
> - Remember that ***Adaptive `learning-rate`*** deals
> with ***axis-alignment***
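A small sketch of the gain update described above; the additive/multiplicative constants, the clipping range and the toy gradients are illustrative:

```python
import numpy as np

def update_gains(gains, grad, prev_grad, delta=0.05, low=0.1, high=10.0):
    same_sign = grad * prev_grad > 0
    # grow additively while the sign is kept, shrink multiplicatively otherwise
    gains = np.where(same_sign, gains + delta, gains * (1 - delta))
    return np.clip(gains, low, high)    # clip gains to some margins

gains = np.ones(3)
prev_grad = np.array([0.2, -0.1, 0.3])
grad = np.array([0.1, 0.2, 0.3])        # second component changed sign
gains = update_gains(gains, grad, prev_grad)
print(gains)                            # [1.05, 0.95, 1.05]
# the actual update would then be: W -= eta * gains * grad
```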
### rmsprop | Root Mean Square Propagation
## Resilient Backpropagation (aka RProp)
#### rprop | Resilient Propagation[^rprop-torch]
This is basically the same idea of [separating learning rates](#separate-adaptive-learning-rates),
but in this case we don't use the
[AIMD](#local-learning-rates) technique and
***we don't take into account*** the
***magnitude of the gradient*** but ***only the sign***
- If ***gradient*** has same sign:
- $step_{k} = step_{k} \cdot \eta_+$ where $\eta_+ > 1$
- else:
- $step_{k} = step_{k} \cdot \eta_-$
where $0 <\eta_- < 1$
> [!TIP]
>
> Limit the step size in a range where:
>
> - $\inf < 50$
> - $\sup > 1 \text{M}$
> [!CAUTION]
>
> rprop does ***not work*** with `mini-batches` as
> the ***sign of the gradient changes frequently***
#### rmsprop in detail[^rmsprop-torch]
The idea is that [rprop](#rprop--resilient-propagation)
is ***equivalent to using the gradient divided by its
value*** (as you either multiply for $1$ or $-1$),
however it means that between `mini-batches` the
***divisor*** changes each time, oscillating.
The solution is to have a ***running average*** of
the ***magnitude of the squared gradient for
each `weight`***:
Instead of using the magnitude of the gradient, **RProp uses only its sign to derive
updates**, multiplied by a per-weight step value. Here's the formulation[^florian-1]:
$$
MeanSquare(w, t) =
\alpha MeanSquare(w, t-1) +
(1 - \alpha)
\left(
\frac{d\, Out}{d\, w}^2
\right)
w_{i,j}^{(n)} =w_{i,j}^{(n-1)} - s_{i,j}^{(n-1)} \cdot \text{sign}\left(
\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}
\right) \\
s_{i,j}^{(n)} = \begin{cases}
s_{i,j}^{(n - 1)} \cdot 1.2 &
\text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right)
\cdot
\text{sign}\left(\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}\right) > 0 \\
s_{i,j}^{(n - 1)} \cdot 0.5 &
\text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right)
\cdot
\text{sign}\left(\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}\right) < 0
\end{cases} \\
s_{i,j} \in [10^{-6}, 50]
$$
We then divide the ***gradient by the `square root`***
of that value
It is noticeable that, like
[separate adaptive learning rates](#separate-adaptive-learning-rate), it increases
or decreases the gain. However, since it uses multiplication to increase it, this makes
it unusable for anything but full-batches, because of its fast growth.
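A minimal sketch of one RProp step with made-up weights, gradients and step sizes; note that the exact ordering of the step-size update and the handling of sign flips (some variants also undo the last update) vary between formulations:

```python
import numpy as np

def rprop_step(W, step, grad, prev_grad, lo=1e-6, hi=50.0):
    same_sign = grad * prev_grad
    step = np.where(same_sign > 0, step * 1.2, step)   # grow on same sign
    step = np.where(same_sign < 0, step * 0.5, step)   # shrink on sign flip
    step = np.clip(step, lo, hi)
    return W - step * np.sign(grad), step              # only the sign is used

W, step = np.array([1.0, -2.0]), np.full(2, 0.1)
prev_grad = np.array([0.5, -0.3])
grad = np.array([0.4, 0.2])            # second gradient flipped sign
W, step = rprop_step(W, step, grad, prev_grad)
print(W, step)                          # steps become [0.12, 0.05]
```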
#### Further Developments
[^florian-1]: [Florian | RProp | 19th November 2025](https://florian.github.io/rprop/)
- `rmsprop` with `momentum` does not work as it should
- `rmsprop` with `Nesterov momentum` works best
if used to divide the ***correction*** rather than
the ***jump***
- `rmsprop` with `adaptive learnings` needs more
investigation
## Root Mean Square Propagation (aka RMSProp)
### Fancy Methods
As the name implies, it propagates a root mean square estimate of the gradient
across updates, a bit like momentum does with the gradient itself. Since
[RProp](#resilient-backpropagation-aka-rprop) uses only the sign of the gradient,
it's almost like dividing the gradient by its magnitude, which is bad in case of
mini-batches, as all divisors are different.
#### Adaptive Gradient
<!-- TODO: Expand over these -->
##### Convex Case
- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods
##### Non-Convex Case
Pay attention, here the `Hessian` may not be
`Positive Semi-Definite`, thus when the ***gradient*** is
$0$ we don't necessarily know where we are.
- Natural Gradient Methods
- Curvature Adaptive
- [Adagrad](./Fancy-Methods/ADAGRAD.md)
- [AdaDelta](./Fancy-Methods/ADADELTA.md)
- [RMSprop](#rmsprop-in-detail)
- [ADAM](./Fancy-Methods/ADAM.md)
- l-BFGS
- [heavy ball gradient](#momentum)
- [momentum](#momentum)
- Noise Injection:
- Simulated Annealing
- Langevin Method
#### Adagrad
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAGRAD.md)
#### Adadelta
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADADELTA.md)
#### ADAM
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM.md)
#### AdamW
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM-W.md)
#### LION
> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)
### Hessian Free[^anelli-hessian-free]
How much can we `learn` from a given
`Loss` space?
The ***best way to move*** would be along the
***gradient***, assuming the surface keeps
the ***same curvature***
(e.g. it's quadratic and has a local minimum).
But ***usually this is not the case***, so we need
to move ***where the ratio of gradient and curvature is
high***
#### Newton's Method
This method takes into account the ***curvature***
of the `Loss`
With this method, the update would be:
RMSProp solves this by keeping the gradient magnitude similar across mini-batches,
using a running average of the squared gradient:
$$
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
d \, E
}{
d \, \vec{w}
}
L^{(k)} = \beta L^{(k-1)} + (1 - \beta) \left(
\frac{d \, Loss}{d\, W^{(k -1)}}
\right)^2 \\
W^{(k)} = W^{(k-1)} - \eta \frac{1}{\sqrt{L^{(k)}}}\frac{d \, Loss}{d\, W^{(k -1)}}\\
\text{usually } \beta = 0.9
$$
***If this could be feasible we'll go on the minimum in
one step***, but it's not, as the
***computations***
needed to get a `Hessian` ***increase exponentially***.
What this method does is keep a running average of the mean squared gradient,
hence the name, and use it to normalize the gradient, keeping it similar across
mini-batches.
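A toy sketch of the update above; a small $\epsilon$ is added under the square root for numerical stability (as PyTorch's `RMSprop` also does), and the quadratic problem, $\eta$ and the iteration count are arbitrary:

```python
import numpy as np

A = np.diag([10.0, 1.0])
W = np.array([1.0, 1.0])
L = np.zeros_like(W)                    # running mean of the squared gradient
eta, beta, eps = 0.01, 0.9, 1e-8

for _ in range(500):
    grad = A @ W
    L = beta * L + (1 - beta) * grad**2
    W = W - eta * grad / (np.sqrt(L) + eps)
print(W)   # both coordinates shrink at a similar rate despite different curvature
```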
The thing is that whenever we ***update `weights`*** with
the `Steepest Descent` method, each update *messes up*
another, while the ***curvature*** can help to ***scale
these updates*** so that they do not disturb each other.
#### Curvature Approximations
However, since the `Hessian` is
***too expensive to compute***, we can approximate it.
- We can take only the ***diagonal elements***
- ***Other algorithms*** (e.g. Hessian Free)
- ***Conjugate Gradient*** to minimize the
***approximation error***
#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]
> [!CAUTION]
> [!NOTE]
> While it can be used with momentum, it doesn't seem to add as many benefits as
> using it standalone.
>
> With Nesterov, it works best if used to normalize the correction, rather than
> the jump. As for adaptive learning rates, the combination still requires further
> investigation to prove its efficacy.
>
> This is an oversimplification of the topic, so reading
> the footnotes material is greatly advised.
The basic idea is that, in order not to mess up previous
directions, we ***`optimize` along perpendicular directions***.
## Adaptive Gradient Methods
This method is ***guaranteed to mathematically succeed
after N steps, the dimension of the space***, in practice
the error will be minimal.
<!--
MARK: AdaGrad
-->
### AdaGrad[^adagrad-torch]
This ***method works well for `non-quadratic errors`***
and the `Hessian Free` `optimizer` uses this method
on ***genuinely quadratic surfaces***, which are
***quadratic approximations of the real surface***
`AdaGrad` is an ***optimization method*** aimed
to:
<ins>***"find needles in the haystack in the form of
very predictive yet rarely observed features"***
[^adagrad-official-paper]</ins>
<!-- TODO: Add PDF 5 pg. 38 -->
`AdaGrad`, as opposed to a standard `SGD` that is the
***same for each gradient geometry***, tries to
***incorporate geometry observed in earlier iterations***.
#### AdaGrad Algorithm
Instead `AdaGrad` takes another
approach[^anelli-adagrad-2][^adagrad-official-paper]:
$$
\begin{aligned}
g_{i}^{(k + 1)} &= \frac{d \, Loss}{d \, w_{i}^{(k)}} \\
G^{(k + 1)} &= \sum_{\tau = 1}^{k + 1} g^{(\tau)} g^{(\tau)T}\\
w_{i}^{(k + 1)} &=
w_{i}^{(k)} - \eta \cdot\frac{
1
}{
\sqrt{G_{i,i}^{(k +1)} + \epsilon}
} \cdot g_{i}^{(k+1)} \\
\end{aligned}
$$
Here $G^{(k)}$ is the ***sum of the outer products*** of the
***gradients*** up to time $k$, though ***the full*** $G^{(k)}$ ***is usually
not used***, as it is ***impractical because
of the high number of dimensions***; instead we use
$diag(G^{(k)})$, which can be
***computed in linear time***[^adagrad-official-paper]
The $\epsilon$ term here is used to
***avoid dividing by 0***[^anelli-adagrad-2] and has a
small value, usually in the order of $10^{-8}$
> [!NOTE]
>
> This example is tough to understand if we were to apply it to a matrix $W$
> instead of a vector. To make it easier to understand, in matrix notation:
>
> $$
> \begin{aligned}
> \nabla L^{(k + 1)} &= \frac{d \, Loss^{(k)}}{d \, W^{(k)}} \\
> G^{(k + 1)} &= G^{(k)} +(\nabla L^{(k+1)}) ^2 \\
> W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}{\sqrt{G^{(k+1)} + \epsilon}}
> \end{aligned}
> $$
>
> In other words, compute the gradient and scale it by the sum of its squares
> up to that point
#### AdaGrad effectiveness[^anelli-adagrad-3]
- When we have ***many dimensions, many features are
irrelevant***
- ***Rarer Features are more relevant***
- It adapts $\eta$ to the right metric space
by projecting stochastic gradient updates with the
[Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance), a distance of a point from
a probability distribution.
#### AdaGrad Considerations
- It eliminates the need to manually tune the
`learning rate`, which is usually set to
$0.01$
- The squared ***gradients*** are accumulated across
iterations, making the effective `learning-rate`
***smaller and smaller***, until it approaches 0 and the model stops training
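A small sketch of the diagonal AdaGrad update above on a toy quadratic; the hyperparameters are arbitrary, and the ever-growing $G$ is exactly what makes the effective step shrink over time:

```python
import numpy as np

A = np.diag([10.0, 1.0])
W = np.array([1.0, 1.0])
G = np.zeros_like(W)                     # accumulated squared gradients
eta, eps = 0.5, 1e-8

for _ in range(200):
    grad = A @ W
    G = G + grad**2                      # never decays: steps only get smaller
    W = W - eta * grad / np.sqrt(G + eps)
print(W, np.sqrt(G))  # the high-gradient coordinate ends up with the smallest step
```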
<!-- Footnotes -->
[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)
[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf)
[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4
[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)
[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1
[^regret-definition]: [Definition of Regret | 19th April 2025](https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/tradeoff-between-exploration-and-exploitation/exploration-and-exploitation/examination-review-exploration-and-exploitation/explain-the-concept-of-regret-in-reinforcement-learning-and-how-it-is-used-to-evaluate-the-performance-of-an-algorithm/#:~:text=Regret%20quantifies%20the%20difference%20in,and%20making%20decisions%20over%20time.)
[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)
[^anelli-adagrad-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42
[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)
[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43
[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81
[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44
[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)
### AdaDelta[^adadelta-offcial-paper]
`ADADELTA` was inspired by [`AdaGrad`](./ADAGRAD.md) and
created to address some of its problems, like the
***sensitivity to the initial `parameters` and the corresponding
gradient***[^adadelta-offcial-paper]
To address all these problems, `ADADELTA` accumulates
***gradients over a `window` as a running average***, rather than ***accumulating
them over all iterations***:
$$
G^{(k+1)} = \beta \cdot G^{(k)} +
(1 - \beta) \cdot \nabla L^{(k+1)}
$$
The update, which is very similar to the one in
[AdaGrad](./ADAGRAD.md#the-algorithm), becomes:
$$
\begin{aligned}
W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}{\sqrt{G^{(k+1)} + \epsilon}}
\end{aligned}
$$
Technically speaking, the last equation is basically equivalent to the
[RMSProp](#root-mean-square-propagation-aka-rmsprop) one, as $G$ is
equivalent to the running average of the squared gradient.
However, as the author pointed out[^adadelta-units], this equation does not
respect units of measure. We should correct this problem
by ***considering the curvature locally smooth*** and
taking an approximation of it at the next step, becoming:
$$
\begin{aligned}
\Delta W^{(k)} &= - \frac{\sqrt{S^{(k-1)}}}{\sqrt{G^{(k)}}}
\nabla L^{(k)}\\
S^{(k)} &= \beta S^{(k - 1)} + (1 - \beta) \left(\Delta W^{(k)}\right)^2 \\
W^{(k +1 )} &= W^{(k)} + \Delta W^{(k)}
\end{aligned}
$$
As we can notice, the ***`learning rate` completely
disappears from the equation, eliminating the need to
set one***
> [!WARNING]
> Here $\Delta W$ is already negative, that's why there's a $+$ in the last
> equation
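A toy sketch of the AdaDelta update above; a small $\epsilon$ is added inside both square roots (as in the paper's RMS terms), and the quadratic problem and $\beta$ are arbitrary:

```python
import numpy as np

A = np.diag([10.0, 1.0])
W = np.array([1.0, 1.0])
G = np.zeros_like(W)   # running average of squared gradients
S = np.zeros_like(W)   # running average of squared updates
beta, eps = 0.9, 1e-6

for _ in range(500):
    grad = A @ W
    G = beta * G + (1 - beta) * grad**2
    dW = -np.sqrt(S + eps) / np.sqrt(G + eps) * grad   # Delta W is already negative
    S = beta * S + (1 - beta) * dW**2
    W = W + dW
print(W)   # updates start tiny and grow as S builds up; no learning rate is needed
```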
<!-- Footnotes -->
[^adadelta-offcial-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
### Adaptive Moment Estimation (aka AdaM)
AdaM computes both the momentum and the squared gradients with running
averages, which are zero-initialized at time $k = 0$:
$$
\begin{aligned}
M^{(k+1)} &= \beta_1 M^{(k)} + (1 - \beta_1) \nabla L \\
V^{(k+1)} &= \beta_2 V^{(k)} + (1 - \beta_2) \nabla L^2 \\
\end{aligned}
$$
> [!WARNING]
> The squared gradient can be thought of as the variance; however, it's not centered
Then it corrects them to be used in the final formulation:
$$
\begin{aligned}
\hat{M}^{(k+1)} &= \frac{M^{(k+1)}}{1 - \beta_1^{k + 1}} \\
\hat{V}^{(k+1)} &= \frac{V^{(k+1)}}{1 - \beta_2^{k + 1}} \\
\end{aligned}
$$
> [!WARNING]
> $\beta_1$ and $\beta_2$ are raised to the power of $k + 1$, the timestep.
Then it computes the update in this way:
$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon}
$$
Even though Adam works, it doesn't generalize well and, particularly in image
problems, it performs worse than standard SGD. Moreover, we need to keep 3 buffers
instead of 1 as for SGD, 2 of which need parameter tuning.
> [!NOTE]
> Author proposed values are $\beta_1 = 0.9$, $\beta_2 = 0.999$ and
> $\epsilon = 10^{-8}$
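A minimal sketch of the update above with the author-proposed $\beta_1$, $\beta_2$ and $\epsilon$; the toy quadratic, $\eta$ and the number of steps are arbitrary:

```python
import numpy as np

A = np.diag([10.0, 1.0])
W = np.array([1.0, 1.0])
M = np.zeros_like(W)    # running average of the gradient (momentum)
V = np.zeros_like(W)    # running average of the squared gradient
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for k in range(1, 301):
    grad = A @ W
    M = beta1 * M + (1 - beta1) * grad
    V = beta2 * V + (1 - beta2) * grad**2
    M_hat = M / (1 - beta1**k)   # bias correction: both buffers start at 0
    V_hat = V / (1 - beta2**k)
    W = W - eta * M_hat / (np.sqrt(V_hat) + eps)
print(W)                         # approaches the minimum at (0, 0)
```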
### AdamW
AdamW tries to solve AdaM's problems by introducing weight decay. In all honesty,
AdaM already implements it; however, it is usually added to the momentum, getting
scaled by $\sqrt{\hat{V}}$:
$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)} + \alpha W^{(k)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon}
$$
The AdamW authors saw that this was inefficient as it was influenced by the uncentered
variance, and thus modified the formula to this:
$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)} }
{\sqrt{\hat{V}^{(k+1)}} + \epsilon} - \lambda W^{(k)}
$$
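A one-step sketch of the decoupled decay above; all values are made up, and in practice the decay term is usually also scaled by the learning-rate schedule:

```python
import numpy as np

def adamw_step(W, M_hat, V_hat, eta=1e-3, lam=1e-2, eps=1e-8):
    # the decay acts on the weights directly, outside the adaptive denominator
    return W - eta * M_hat / (np.sqrt(V_hat) + eps) - lam * W

W = np.array([1.0, -2.0])
M_hat = np.array([0.1, 0.1])
V_hat = np.array([0.01, 0.01])
print(adamw_step(W, M_hat, V_hat))
```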
### Lion (evoLved sIgn mOmeNtum)[^official-paper]
`Lion` is the result of a ***genetic search algorithm*** aimed at
finding the best `optimizer`.
It starts from a population of `AdamW` algorithms to
***speed up the search***. As opposed to
`Adam` and `AdamW`, it keeps track
***only of the momentum*** and the ***gradient sign***,
requiring ***less `memory`***.
Since ***uniform updates yield larger norms***,
`Lion` requires a ***smaller `learning-rate`***
and a ***larger decoupled `weight-decay`***
$\lambda$[^official-paper-1].
The ***advantages of `Lion` over `Adam` and `AdamW`
increase with the size of
the `mini-batch`***[^official-paper-1]
#### Symbolic Representation[^official-paper-2]
New ***trained algorithms*** are represented
`symbolically`, bringing these advantages:
- `Algorithms` must be ***implemented*** as `programs`
- It is ***easier to analyze, comprehend and transfer to
new tasks*** these `algorithms`, rather than other
`algorithms` such as `NeuralNetworks`
- We can **estimate the *complexity*** by looking
at the ***length of code***
#### Tournament[^official-paper-3]
The best code is found with a ***tournament style
evolution***. Each cycle it picks the ***best
`algorithm`*** which will be
***copied and mutated*** and the ***oldest is removed***
<!-- Footnotes -->
[^official-paper]: [Official Lion Paper | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)
[^official-paper-1]: [Official Lion Paper| Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)
[^official-paper-2]: [Official Lion Paper| Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)
[^official-paper-3]: [Official Lion Paper| Paragraph 2 pg. 4-5 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)
## Hessian Free Optimization
Since we are moving on a function whose gradient is not constant, by looking at
the curvature, the [Hessian Matrix](./../15-Appendix-A/INDEX.md#hessian-matrix),
we can see when it starts to change.
### Newton's Method
This method would technically give us the solution in one step on a quadratic
function, but it is unfeasible due to the memory and computational requirements:
$$
\Delta W = - \epsilon H(W)^{-1} \times \frac{d\, L}{d\, W}
$$
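A tiny numpy sketch of a single Newton step on a quadratic loss $L(W) = \frac{1}{2} W^T A W$, showing the one-step property (here $\epsilon = 1$, and the positive-definite $A$ and starting point are arbitrary):

```python
import numpy as np

A = np.array([[10.0, 2.0],
              [2.0, 1.0]])      # positive definite, so the Hessian of L is A
W = np.array([1.0, 1.0])

grad = A @ W                    # dL/dW
dW = -np.linalg.solve(A, grad)  # solve H dW = -grad instead of inverting H
print(W + dW)                   # lands exactly on the minimum at (0, 0)
```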
### Conjugate Gradient
The idea is to correct the weights so that we reduce the gradient to 0 across
perpendicular directions. This means that, for each update, we are not messing up
previous optimizations.
While it is usually used for quadratic error surfaces, there's a non linear variant
(non-linear conjugate gradient) that usually works well. However it is also
possible to approximate the true error function with a quadratic one, using the
standard method.
It gives a solution after $N$ steps over an $N$ dimensional quadratic surface,
however we need to penalize frequent changes in weights, especially for hidden
activities of [`RNNs`](./../8-Recurrent-Networks/INDEX.md)
## Optimization Tricks
### Input decorrelation
If you have a linear neuron (think of a Feed Forward layer, not a Convolution),
it's better to decorrelate the input components.
A way to achieve this is through a
[PCA](./../15-Appendix-A/INDEX.md#computing-pca),
transforming the error surface from an ellipse to a circle.
### Recognize Plateaus
If we start with big learning rates, since weights gain a big magnitude, the
derivative will be small and the error will not decrease significantly.
This may seem like a local minimum, but it is usually a plateau.
### Mini-Batch Speed up
To speed up mini batch training use these methods:
- [**Momentum**](#momentum)
- [**Separate adaptive learning rates for each parameter**](#separate-adaptive-learning-rate)
- [**rmsprop**](#root-mean-square-propagation-aka-rmsprop)
- [**Adaptive Gradients Methods**](#adaptive-gradient-methods)
### Mini-batches vs Full-Batches
The rule of thumb is to use **full-batches for small datasets or small redundancy**,
while **mini-batches for redundant datasets**
[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76

Binary file not shown.


Binary file not shown.
