# Revised Optimization Notes

Commit `934c08d4c0` (parent `2a96deaebf`)

## Cross Entropy Loss derivation
Cross entropy[^wiki-cross-entropy] is the measure of *"surprise"*
we get from distribution $p$ knowing
results from distribution $q$. It is defined as the entropy of $p$ plus the
[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$
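Spelling that definition out (a standard identity, added here for reference):

$$
H(p, q) = H(p) + D_{KL}(p \parallel q) = - \sum_{x} p(x) \log q(x)
$$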
Usually $\hat{y}$ comes from using a
[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since it uses a
logarithm and probability values are at most 1, the closer the predicted probability is to 0, the higher the loss
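As a quick numeric illustration of that behaviour (a minimal sketch; the probabilities are made-up values):

```python
import numpy as np

# The contribution of the true class to cross entropy is -log(p_hat):
# near-certain correct predictions cost little, near-zero ones explode.
for p_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p_hat={p_hat:<4} -> loss={-np.log(p_hat):.4f}")
```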

## Computing PCA[^wiki-pca]

> [!CAUTION]
> $X$ here is the dataset matrix with **<ins>features over rows</ins>**
- $\Sigma = \frac{X \times X^T}{N} \coloneqq$ Covariance matrix approximation (assuming $X$ is mean-centered)
- $\vec{\lambda} \coloneqq$ vector of eigenvalues of $\Sigma$
- $\Lambda \coloneqq$ matrix with the eigenvectors of $\Sigma$ as columns, sorted by eigenvalue
- $\Lambda_{red} \coloneqq$ eigenvector matrix reduced to the $k$ eigenvectors with the
  highest eigenvalues
- $Z = \Lambda_{red}^T \times X \coloneqq$ compressed representation
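A minimal NumPy sketch of those steps (my own toy data, features over rows as the caution above requires, already mean-centered):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))          # 5 features over rows, 100 data points
X = X - X.mean(axis=1, keepdims=True)  # center each feature

Sigma = (X @ X.T) / X.shape[1]         # covariance approximation
lam, Lam = np.linalg.eigh(Sigma)       # ascending eigenvalues, eigenvectors as columns
order = np.argsort(lam)[::-1]          # re-sort by eigenvalue, descending
Lam = Lam[:, order]

k = 2
Lam_red = Lam[:, :k]                   # keep the k eigenvectors with highest eigenvalues
Z = Lam_red.T @ X                      # compressed representation, shape (k, 100)
print(Z.shape)
```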

> [!NOTE]
> You may have studied PCA in terms of SVD, Singular Value Decomposition. The two
> are closely related and capture the same concept through different
> mathematical formulations.
## Laplace Operator[^khan-1]

It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
It can also be used to compute the net flow of particles in that region of space

> This is not a **discrete Laplace operator**, which would instead be a **matrix** here;
> there are many other formulations.
## [Hessian Matrix](https://en.wikipedia.org/wiki/Hessian_matrix)

A Hessian Matrix represents the 2nd derivative of a function, thus it gives
us the curvature of the function.

It is also used to tell us whether a critical point is a local minimum (the Hessian is
positive definite), a local maximum (it is negative definite) or a saddle (neither
positive nor negative definite).

It is computed by taking the partial derivatives of the gradient along
all dimensions (the Jacobian of the gradient).

$$
\nabla f = \begin{bmatrix}
\frac{d \, f}{d\,x} & \frac{d \, f}{d\,y}
\end{bmatrix} \\
H(f) = \begin{bmatrix}
\frac{d^2 \, f}{d\,x^2} & \frac{d^2 \, f}{d \, x\,d\,y} \\
\frac{d^2 \, f}{d\, y \, d\,x} & \frac{d^2 \, f}{d\,y^2}
\end{bmatrix}
$$
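A small sketch of that classification rule via the eigenvalues of $H$ (my own example, $f(x, y) = x^2 - y^2$, which has a saddle at the origin):

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 (constant everywhere)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eig = np.linalg.eigvalsh(H)  # eigenvalues of the symmetric Hessian
if np.all(eig > 0):
    print("positive definite -> local minimum")
elif np.all(eig < 0):
    print("negative definite -> local maximum")
else:
    print("indefinite -> saddle point")  # this branch fires for this H
```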
[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)

[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)

[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))

[^wiki-pca]: [Wikipedia | Principal Component Analysis | 18th November 2025](https://en.wikipedia.org/wiki/Principal_component_analysis#Computation_using_the_covariance_method)

**New file:** `Chapters/15-Appendix-A/python-experiments/pca.ipynb` (199 lines)

# Computing PCA

Here I'll be taking data from [Geeks4Geeks](https://www.geeksforgeeks.org/machine-learning/mathematical-approach-to-pca/)

```python
import numpy as np

X: np.ndarray = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [3.1, 3.0],
    [2.3, 2.7],
    [2.0, 1.6],
    [1.0, 1.1],
    [1.5, 1.6],
    [1.1, 0.9]
])

# Compute mean values for features
mu_X = np.mean(X, 0)

print(mu_X)
# "Normalize" features (center them around their mean)
X = X - mu_X
print(X)

# Compute covariance matrix applying
# Bessel's correction (n-1) instead of n
Cov = (X.T @ X) / (X.shape[0] - 1)

print(Cov)
```

Output:

```text
[1.8 1.87777778]
[[ 0.7 0.52222222]
 [-1.3 -1.17777778]
 [ 0.4 1.02222222]
 [ 1.3 1.12222222]
 [ 0.5 0.82222222]
 [ 0.2 -0.27777778]
 [-0.8 -0.77777778]
 [-0.3 -0.27777778]
 [-0.7 -0.97777778]]
[[0.6925 0.68875 ]
 [0.68875 0.79444444]]
```

As you can notice, we did $X^T \times X$ instead of $X \times X^T$. This is because our
dataset had datapoints over rows instead of features.

```python
# Computing eigenvalues and eigenvectors of the covariance matrix
eigen = np.linalg.eig(Cov)
eigen_values = eigen.eigenvalues
eigen_vectors = eigen.eigenvectors

print(eigen_values)
print(eigen_vectors)
```

Output:

```text
[0.05283865 1.43410579]
[[-0.73273632 -0.68051267]
 [ 0.68051267 -0.73273632]]
```

Now we'll generate the new $X$ matrix by only using the principal eigenvector
(the one with the largest eigenvalue)

```python
# Compress X using only the principal eigenvector;
# column 1 holds the one with the largest eigenvalue (1.434...) here
Z_pca = X @ eigen_vectors[:, 1]
Z_pca = Z_pca.reshape([Z_pca.shape[0], 1])

print(Z_pca.shape)


# X reconstructed from the compressed representation
eigen_v = eigen_vectors[:, 1].reshape([eigen_vectors[:, 1].shape[0], 1])
X_rec = Z_pca @ eigen_v.T

print("Compressed")
print(Z_pca)

print("Reconstruction")
print(X_rec)

print("Difference")
print(abs(X - X_rec))
```

Output:

```text
(9, 1)
Compressed
[[-0.85901005]
 [ 1.74766702]
 [-1.02122441]
 [-1.70695945]
 [-0.94272842]
 [ 0.06743533]
 [ 1.11431616]
 [ 0.40769167]
 [ 1.19281215]]
Reconstruction
[[ 0.58456722 0.62942786]
 [-1.18930955 -1.28057909]
 [ 0.69495615 0.74828821]
 [ 1.16160753 1.25075117]
 [ 0.64153863 0.69077135]
 [-0.0458906 -0.04941232]
 [-0.75830626 -0.81649992]
 [-0.27743934 -0.29873049]
 [-0.81172378 -0.87401678]]
Difference
[[0.11543278 0.10720564]
 [0.11069045 0.10280131]
 [0.29495615 0.27393401]
 [0.13839247 0.12852895]
 [0.14153863 0.13145088]
 [0.2458906 0.22836546]
 [0.04169374 0.03872214]
 [0.02256066 0.02095271]
 [0.11172378 0.10376099]]
```

**New file:** `Chapters/5-Optimization/INDEX-OLD.md` (501 lines)
# Optimization

We basically try to see the error and minimize it by moving against the ***gradient***

## Types of Learning Algorithms

In `Deep Learning` it's not unusual to be facing ***highly redundant*** `datasets`.
Because of this, the ***gradient*** from some `samples` is usually the ***same*** as for some others.

So, often we train the `model` on a subset of samples.

### Online Learning

This is the ***extreme*** of our techniques to deal with ***redundancy*** of `data`.

On each `point` we get the ***gradient*** and then we update the `weights`.

### Mini-Batch

In this approach, we divide our `dataset` in small batches called `mini-batches`.
These need to be ***balanced***, so that no class ends up over- or under-represented.

This technique is the ***most used one***

## Tips and Tricks

### Learning Rate

This is the `hyperparameter` we use to tune our
***learning steps***.

Sometimes we set it too big and this causes
***overshooting***. So a quick solution may be to turn
it down.

However, we are ***trading speed for accuracy***, thus it's better to wait before tuning this `parameter`

### Weight initialization

We need to avoid `neurons` having the same
***gradient***. This is easily achievable by using
***small random values***.

However, if we have a ***large `fan-in`***, then it's
***easy to overshoot***, so it's better to initialize
those `weights` ***proportionally to***
$\sqrt{\text{fan-in}}$:

$$
w = \frac{np.random(N)}{\sqrt{N}}
$$

#### Xavier-Glorot Initialization

<!-- TODO: Read Xavier-Glorot paper -->

Here `weights` are ***proportional*** to $\sqrt{\text{fan-in}}$ as well, but we ***sample*** from a
`uniform distribution` with a `std-dev`

$$
\sigma = \text{gain} \cdot \sqrt{
\frac{2}{\text{fan-in} + \text{fan-out}}
}
$$

and bounded between $a$ and $-a$

$$
a = \text{gain} \cdot \sqrt{
\frac{6}{\text{fan-in} + \text{fan-out}}
}
$$

Alternatively, one can use a `normal-distribution`
$\mathcal{N}(0, \sigma^2)$.

Note that `gain` in the **original paper** is equal
to $1$

### Decorrelating input components

Since ***highly correlated features*** don't offer much
in terms of ***new information***, we probably need
to go in the ***latent space*** to find the
`latent-variables` governing those `features`.

#### PCA

> [!CAUTION]
> This topic won't be explained here as it's something
> usually learnt for `Machine Learning`, a
> ***prerequisite*** for approaching `Deep Learning`.

This is a method we can use to discard `features` that
will ***add little to no information***
## Common problems in MultiLayer Networks

### Hitting a Plateau

This happens when we have a ***big `learning-rate`***
which makes `weights` go high in ***absolute value***.

Because this happens ***too quickly***, we could
see a ***quickly diminishing error*** and this is usually
***mistaken for a minimum point***, while instead
it's a ***plateau***.

## Speeding up Mini-Batch Learning

### Momentum[^momentum]

We use this method ***mainly when we use `SGD`*** as
a ***learning technique***

This method is better explained if we imagine
our error surface as an actual surface and we place a
ball over it.

***The ball will start rolling towards the steepest
descent*** (initially), but ***after gaining enough
velocity*** it will follow the ***previous direction,
in some measure***.

So, now the ***gradient*** modifies the ***velocity***
rather than the ***position***, so the momentum will
***dampen small variations***.

Moreover, once the ***momentum builds up***, we will
easily ***pass over plateaus*** as the
***ball will continue to roll*** until it is
stopped by an opposing ***gradient***

#### Momentum Equations

There are a couple of them, mainly.

One of them uses a term to evaluate the `momentum`, $p$,
called `SGD momentum` or `momentum term` or
`momentum parameter`:

$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$

The other one is ***logically equivalent*** to the
previous, but it updates the `weights` in ***one step***
and is called the `Stochastic Heavy Ball Method`:

$$
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
+ \beta ( w_k - w_{k-1})
$$

> [!NOTE]
> This is how to choose $\beta$:
>
> $0 < \beta < 1$
>
> If $\beta = 0$, then we are doing plain
> ***gradient descent***; if $\beta > 1$ then we
> ***will have numerical instabilities***.
>
> The ***larger*** $\beta$ the
> ***higher the `momentum`***, so it will
> ***turn slower***

> [!TIP]
> Usual values are $\beta = 0.9$ or $\beta = 0.99$,
> and usually we start from 0.5 initially, raising it
> whenever we are stuck.
>
> When we increase $\beta$, the `learning rate`
> ***must decrease accordingly***
> (e.g. from 0.9 to 0.99, the `learning-rate` must be
> divided by a factor of 10)

#### Nesterov (1983) Sutskever (2012) Accelerated Momentum

Differently from the previous
[momentum](#momentum-equations),
we take an ***intermediate*** step where we
***update the `weights`*** according to the
***previous `momentum`***, then we compute the
***new `momentum`*** in this new position, and then
we ***update again***

$$
\begin{aligned}
\hat{w}_k & = w_k - \beta p_k \\
p_{k+1} &= \beta p_{k} +
\eta \nabla L(X, y, \hat{w}_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$

#### Why Momentum Works

While it has been ***hypothesized*** that
***acceleration*** makes ***convergence faster***, this
is
***only true for convex problems without much noise***,
though this may be ***part of the story***

The other half may be ***Noise Smoothing***,
smoothing the optimization process; however,
according to these papers[^no-noise-smoothing][^no-noise-smoothing-2] this may not be the actual reason.
### Separate Adaptive Learning Rates

Since `weights` may ***greatly vary*** across `layers`,
having a ***single `learning-rate`*** might not be ideal.

So the idea is to set a `local learning-rate` to
control the `global` one as a ***multiplicative factor***

#### Local Learning rates

- Start with $1$ as the ***starting point*** for the
  `local learning-rates`, which we'll call `gains` from
  now on.
- If the `gradient` keeps the ***same sign, additively increase the gain***
- Otherwise, ***multiplicatively decrease it***

$$
\Delta w_{i,j} = - g_{i,j} \cdot \eta \frac{d \, Out}{d \, w_{i,j}}
\\
g_{i,j}(t) = \begin{cases}
g_{i,j}(t - 1) + \delta
& \left( \frac{d \, Out}{d \, w_{i,j}}(t)
\cdot
\frac{d \, Out}{d \, w_{i,j}}(t-1) \right) > 0 \\
g_{i,j}(t - 1) \cdot (1 - \delta)
& \left( \frac{d \, Out}{d \, w_{i,j}}(t)
\cdot
\frac{d \, Out}{d \, w_{i,j}}(t-1) \right) \leq 0
\end{cases}
$$

With this method, if there are oscillations, we will have
`gains` around $1$

> [!TIP]
>
> - Usually a value for $\delta$ is $0.05$
> - Limit `gains` to some range:
>
>   - $[0.1, 10]$
>   - $[0.01, 100]$
>
> - Use `full-batches` or `big mini-batches` so that
>   the ***gradient*** doesn't oscillate because of
>   sampling errors
> - Combine it with [Momentum](#momentum)
> - Remember that ***Adaptive `learning-rates`*** deal
>   with ***axis-alignment***
### rmsprop | Root Mean Square Propagation

#### rprop | Resilient Propagation[^rprop-torch]

This is basically the same idea of [separating learning rates](#separate-adaptive-learning-rates),
but in this case we don't use the
[AIMD](#local-learning-rates) technique and
***we don't take into account*** the
***magnitude of the gradient***, ***only the sign***

- If the ***gradient*** has the same sign:
  - $step_{k} = step_{k} \cdot \eta_+$ where $\eta_+ > 1$
- else:
  - $step_{k} = step_{k} \cdot \eta_-$
    where $0 < \eta_- < 1$

> [!TIP]
>
> Limit the step size to a range, e.g. $[10^{-6}, 50]$

> [!CAUTION]
>
> rprop does ***not work*** with `mini-batches` as
> the ***sign of the gradient changes frequently***

#### rmsprop in detail[^rmsprop-torch]

The idea is that [rprop](#rprop--resilient-propagation)
is ***equivalent to using the gradient divided by its
magnitude*** (as you only keep the sign),
however it means that between `mini-batches` the
***divisor*** changes each time, oscillating.

The solution is to keep a ***running average*** of
the ***squared gradient for
each `weight`***:

$$
MeanSquare(w, t) =
\alpha \, MeanSquare(w, t-1) +
(1 - \alpha)
\left(
\frac{d\, Out}{d\, w}
\right)^2
$$

We then divide the ***gradient by the `square root`***
of that value

#### Further Developments

- `rmsprop` with `momentum` does not work as it should
- `rmsprop` with `Nesterov momentum` works best
  if used to divide the ***correction*** rather than
  the ***jump***
- `rmsprop` with `adaptive learning rates` needs more
  investigation
### Fancy Methods

#### Adaptive Gradient

<!-- TODO: Expand over these -->

##### Convex Case

- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods

##### Non-Convex Case

Pay attention, here the `Hessian` may not be
`Positive Semi-Definite`, thus when the ***gradient*** is
$0$ we don't necessarily know where we are.

- Natural Gradient Methods
- Curvature Adaptive
  - [Adagrad](./Fancy-Methods/ADAGRAD.md)
  - [AdaDelta](./Fancy-Methods/ADADELTA.md)
  - [RMSprop](#rmsprop-in-detail)
  - [ADAM](./Fancy-Methods/ADAM.md)
- L-BFGS
- [heavy ball gradient](#momentum)
- [momentum](#momentum)
- Noise Injection:
  - Simulated Annealing
  - Langevin Method

#### Adagrad

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAGRAD.md)

#### Adadelta

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADADELTA.md)

#### ADAM

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM.md)

#### AdamW

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM-W.md)

#### LION

> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)

### Hessian Free[^anelli-hessian-free]

How much can we `learn` from a given
`Loss` space?

The ***best way to move*** would be along the
***gradient***, assuming the surface has
the ***same curvature*** in every direction
(e.g. a circular bowl with a local minimum).

But ***usually this is not the case***, so we need
to move ***where the ratio of gradient and curvature is
high***

#### Newton's Method

This method takes into account the ***curvature***
of the `Loss`

With this method, the update would be:

$$
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
d \, E
}{
d \, \vec{w}
}
$$

***If this were feasible we would reach the minimum in
one step*** (on a quadratic surface), but it's not, as the
***computations***
needed to get a `Hessian` ***grow quadratically with the number of `weights`***.

The thing is that whenever we ***update `weights`*** with
the `Steepest Descent` method, each update *messes up*
another, while the ***curvature*** can help to ***scale
these updates*** so that they do not disturb each other.
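As a toy sketch of that update on a quadratic loss, where a single Newton step with $\epsilon = 1$ lands exactly on the minimum (the loss and all values are my own example; for real networks $H$ is never materialized like this):

```python
import numpy as np

# Quadratic loss E(w) = 0.5 * w^T A w - b^T w, minimized at w* = A^-1 b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

w = np.zeros(2)
grad = A @ w - b                 # dE/dw
H = A                            # the Hessian of a quadratic is constant
w = w - np.linalg.inv(H) @ grad  # one full Newton step

print(w, np.linalg.solve(A, b))  # identical: the minimum in one step
```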
#### Curvature Approximations

However, since the `Hessian` is
***too expensive to compute***, we can approximate it.

- We can take only the ***diagonal elements***
- ***Other algorithms*** (e.g. Hessian Free)
- ***Conjugate Gradient*** to minimize the
  ***approximation error***

#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]

> [!CAUTION]
>
> This is an oversimplification of the topic, so reading
> the footnote material is greatly advised.

The basic idea is that, in order not to mess up previous
directions, we ***`optimize` along mutually conjugate (non-interfering) directions***.

This method is ***mathematically guaranteed to succeed
after $N$ steps, the dimension of the space***; in practice
the residual error will be minimal.

This ***method works well for `non-quadratic errors`***,
and the `Hessian Free` `optimizer` uses this method
on ***genuinely quadratic surfaces***, which are
***quadratic approximations of the real surface***

<!-- TODO: Add PDF 5 pg. 38 -->

<!-- Footnotes -->

[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)

[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4

[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1

[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)

[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)

[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81

[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)

[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76

**Modified file:** `Chapters/5-Optimization/INDEX.md` (501 → 556 lines)
# Optimization

## Beyond Full Batches

Even though full batches give the best picture of the probability distribution
of the data points, they are computationally expensive.

Since data is usually **highly redundant**, we can think of getting smaller,
class-balanced sets, **mini-batches**, to update weights.
While this doesn't give the same results as full batches, it is still reliable.

When we need to bring things to the extreme, we can even update over a single
data point, **online learning**; however, this is not as efficient as
mini-batches as **it does not use matrix multiplications, which are GPU efficient**

## Learning rate Scheduling

## Xavier-Glorot Weight initialization

> [!WARNING]
> Before Xavier-Glorot there was another initialization technique, proportional
> to fan-in:
>
> $$ W \propto \frac{rand(in, out)}{\sqrt{in}}$$
>
> Though, Xavier-Glorot is not the only available initialization, as there are
> many others[^torch-init]

Whenever we initialize weights, we need to be careful to **break symmetry**, as
**identical hidden nodes get the exact same results**, making us
lose representation power.

Another problem with weight initialization is **overshooting**. This is
caused by **many small weight changes adding up when the fan-in is large**. The idea to
solve this is by
**initializing weights proportionally to fan-in (input) and fan-out (output)**

A technique we use to initialize weights comes from Xavier and Glorot, called
Xavier-Glorot initialization:

$$
\begin{aligned}
&W \propto \frac{rand(in, out)}{\sqrt{in + out}} \\
&rand = \mathcal{U}(-a, a) \rightarrow a = g \cdot \sqrt{\frac{6}{in + out}} \\
&\,\,\,\,\text{or} \\
&rand =\mathcal{N}(0, \sigma^2) \rightarrow \sigma = g \cdot
\sqrt{\frac{2}{in + out}}
\end{aligned}
$$

In other words, Xavier-Glorot extracts weights from either a uniform distribution
or a normal one, scaled by a factor $g$ called gain

[^torch-init]: [Pytorch Official Docs | `torch.nn.init` | 18th November 2025](https://docs.pytorch.org/docs/stable/nn.init.html)
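A minimal NumPy sketch of both sampling variants (gain $g = 1$ as in the original paper; the layer sizes are my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, gain = 256, 128, 1.0

# Uniform variant: U(-a, a) with a = g * sqrt(6 / (fan_in + fan_out))
a = gain * np.sqrt(6.0 / (fan_in + fan_out))
W_uniform = rng.uniform(-a, a, size=(fan_out, fan_in))

# Normal variant: N(0, sigma^2) with sigma = g * sqrt(2 / (fan_in + fan_out))
sigma = gain * np.sqrt(2.0 / (fan_in + fan_out))
W_normal = rng.normal(0.0, sigma, size=(fan_out, fan_in))

# Both variants end up with the same standard deviation
print(W_uniform.std(), W_normal.std(), sigma)
```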
## Momentum

> [!TIP]
> For $\beta$ going from 0.9 to 0.99, the learning rate needs to be decreased by
> a factor of 10

It's a technique inspired by physics. Imagine a ball rolling over a plane. Once
it has enough speed, even if the plane changes inclination, the ball still has
energy to move along the previous way because of its momentum.

Whenever on a gradient descent we have oscillations, **momentum dampens** all
movements steering us away from the previous direction. Here momentum at time $k$
is $p_k$

$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, Y, W_{k}) \\
W_{k+1} &= W_{k} - \gamma p_{k+1} \\
\beta &\in [0, 1]
\end{aligned}
$$

Or, in a more compact way, logically equivalent to the previous one:

$$
W_{k+1} = W_{k} - \gamma \nabla L(X, Y, W_{k}) + \beta(W_{k} - W_{k-1})
$$

The larger $\beta$ the slower it curves, accumulating more of the previous directions.
To play it safe, use smaller values at the beginning, where updates
are large, and slowly turn it up to values near 1

> [!NOTE]
>
> - $\eta$: hyperparameter related to the gradient, usually equal to the learning
>   rate
> - $\gamma$: Learning rate
> - $\beta$: hyperparameter of dampening factor
> - $\nabla L(X, Y, W_{k})$: gradient of the loss
>
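A minimal sketch of those two update equations on a toy quadratic loss (the loss and the $\eta$, $\gamma$, $\beta$ values are my own choices, not from the text):

```python
import numpy as np

def grad_L(W):
    return W  # gradient of the toy loss L(W) = 0.5 * ||W||^2

eta, gamma, beta = 1.0, 0.1, 0.9
W = np.array([1.0, -2.0])
p = np.zeros_like(W)

for _ in range(100):
    p = beta * p + eta * grad_L(W)  # accumulate velocity
    W = W - gamma * p               # move along the dampened direction

print(W)  # drifts towards the minimum at [0, 0]
```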
## Nesterov Accelerated Gradient (aka NAG)

This method takes inspiration from Nesterov's optimization for convex functions and
applies it to momentum. Its quirk is that it never computes the gradient at the point
it lands on, but at a temporary look-ahead point computed before the actual update.

| Vanilla Momentum[^Akshay-medium-1] | Nesterov Momentum[^Akshay-medium-1] |
|--|--|
| *(image)* | *(image)* |

To better illustrate this quirk, here's the formulation:

$$
\begin{aligned}
\hat{W}_{k} &= W_{k} - \beta p_k \\
p_{k+1} &= \beta p_{k} + \eta\nabla L(X, Y, \hat{W}_k) \\
W_{k+1} &= W_{k} - \gamma p_{k+1}
\end{aligned}
$$

As can be seen, the loss is computed over $\hat{W}_{k}$ rather than $W_{k}$,
which will be our actual weights. The idea is to follow the previous momentum
blindly, see where it goes and then make the correction.
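The same toy setup as the momentum sketch above, with the look-ahead step added (again, all values are my own):

```python
import numpy as np

def grad_L(W):
    return W  # toy loss L(W) = 0.5 * ||W||^2

eta, gamma, beta = 1.0, 0.1, 0.9
W = np.array([1.0, -2.0])
p = np.zeros_like(W)

for _ in range(100):
    W_hat = W - beta * p                # follow the previous momentum blindly
    p = beta * p + eta * grad_L(W_hat)  # new momentum at the look-ahead point
    W = W - gamma * p                   # then make the correction

print(W)
```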
[^Akshay-medium-1]: [Akshay L Chandra | Learning Parameters, Part 2: Momentum-Based & Nesterov Accelerated Gradient Descent | 18th November 2025](https://medium.com/data-science/learning-parameters-part-2-a190bef2d12)

## Justifying Faster Optimization for Momentum Based Methods

While many people justify the speed of momentum based methods by their acceleration,
this doesn't hold true, as they are only accelerated for convex functions.

Since, most of the time, we have no idea what our loss surface looks like,
we can't make assumptions about it being convex.

So, the most compelling explanation lies in the fact that momentum based
optimization is like computing a running average of the loss gradient, smoothing
the noise introduced by the smaller sampling size. In fact, with momentum it is not
necessary to average steps like in SGD

## Separate Adaptive Learning Rate

The idea is that each weight of each layer may need its own learning rate to avoid
overshooting and to smooth the magnitude of the received gradients, high over the last
layers and low over the first ones (architecture-wise)

The trick is to have a global learning rate that is adjusted by a local gain that
is increased each time the weight update keeps the same sign, and vice versa:

$$
\Delta w_{i,j} = - \eta \cdot g_{i,j} \frac{d \,Loss}{d \, w_{i,j}} \\

g_{i,j}(n +1 ) = \begin{cases}
g_{i,j}(n) + 0.05 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) > 0 \\
g_{i,j}(n) \cdot 0.95 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) < 0
\end{cases}
$$

This method ensures that if the weight oscillates, the gain will dampen it.
Moreover, should it be totally random, it will hover near 1, keeping gradient
updates unchanged.
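A single-weight sketch of that gain rule (the $0.05$/$0.95$ constants come from the cases above; since $g > 0$, comparing the signs of successive gradients is equivalent to comparing the signs of successive $\Delta w$; the gradient stream is fake noise):

```python
import numpy as np

rng = np.random.default_rng(0)
eta, w, g = 0.01, 0.5, 1.0
prev_grad = 0.0

for _ in range(1000):
    grad = rng.normal()      # stand-in for dLoss/dw on a mini-batch
    if grad * prev_grad > 0:
        g = g + 0.05         # same sign: additive increase
    elif grad * prev_grad < 0:
        g = g * 0.95         # sign flipped: multiplicative decrease
    w = w - eta * g * grad
    prev_grad = grad

print(g)  # with purely random gradients the gain hovers near 1
```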
> [!NOTE]
> The way $g$ is updated is similar to AIMD in TCP Congestion Control

<!-- Comment for linter complaints -->

> [!TIP]
>
> - **Clip gains to some margins** - $[0.1, 10]$ or $[0.01, 100]$
> - **Use full batch or big mini-batches** - This ensures that the change in sign
>   is not due to sampling errors
> - **Combine this with momentum**
> - **Use this to deal with axis-alignment problems**
>

## Resilient Backpropagation (aka RProp)

Instead of using the magnitude of the gradient, **RProp uses the sign to derive
updates**, multiplied by a step value. Here's the formulation[^florian-1]:

$$
w_{i,j}^{(n)} = w_{i,j}^{(n-1)} - s_{i,j}^{(n-1)} \cdot \text{sign}\left(
\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}
\right) \\
s_{i,j}^{(n)} = \begin{cases}
s_{i,j}^{(n - 1)} \cdot 1.2 &
\text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right)
\cdot
\text{sign}\left(\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}\right) > 0 \\
s_{i,j}^{(n - 1)} \cdot 0.5 &
\text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right)
\cdot
\text{sign}\left(\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}\right) < 0
\end{cases} \\
s_{i,j} \in [10^{-6}, 50]
$$

It is noticeable that, like
[separate adaptive learning rates](#separate-adaptive-learning-rate), it increases
or decreases the gain. However, since it uses multiplication to increase it, this
makes it unusable for anything but full batches, because of its fast growth.
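A single-weight sketch of those rules (the $1.2$, $0.5$ and the $[10^{-6}, 50]$ clamp come from the formulation above; the loss is a toy one of mine):

```python
import numpy as np

def grad_L(w):
    return 2.0 * (w - 3.0)  # toy loss (w - 3)^2, minimum at w = 3

w, step, prev_sign = 0.0, 0.1, 0.0

for _ in range(100):
    sign = np.sign(grad_L(w))
    if sign * prev_sign > 0:
        step = min(step * 1.2, 50.0)  # same sign: grow the step
    elif sign * prev_sign < 0:
        step = max(step * 0.5, 1e-6)  # sign flipped: shrink the step
    w = w - step * sign               # the update uses only the sign
    prev_sign = sign

print(w)  # oscillates into the minimum at 3
```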
[^florian-1]: [Florian | RProp | 19th November 2025](https://florian.github.io/rprop/)

## Root Mean Square Propagation (aka RMSProp)

As the name implies, it propagates past gradient information forward, a bit like
momentum. Since
[RProp](#resilient-backpropagation-aka-rprop) uses only the sign of the gradient,
it's almost like dividing the gradient by its magnitude, which is bad in case of
mini-batches, as all the divisors are different.

RMSProp solves this by keeping the gradient magnitude similar across mini-batches
through a running average of it:

$$
L^{(k)} = \beta L^{(k-1)} + (1 - \beta) \left(
\frac{d \, Loss}{d\, W^{(k -1)}}
\right)^2 \\
W^{(k)} = W^{(k-1)} - \eta \frac{1}{\sqrt{L^{(k)}}}\frac{d \, Loss}{d\, W^{(k -1)}}\\
\text{usually } \beta = 0.9
$$

What this method does is keep a running average of the mean squared gradient,
hence the name, and use it to normalize the gradient, keeping it similar across
mini-batches.
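The same toy setup once more, now with the running average normalizing the step ($\beta = 0.9$ as suggested above; the small $\epsilon$ guard against division by zero is my addition):

```python
import numpy as np

def grad_L(W):
    return W  # toy loss L(W) = 0.5 * ||W||^2

eta, beta, eps = 0.01, 0.9, 1e-8
W = np.array([1.0, -2.0])
L_avg = np.zeros_like(W)  # running average of the squared gradient

for _ in range(300):
    g = grad_L(W)
    L_avg = beta * L_avg + (1 - beta) * g**2  # running mean square
    W = W - eta * g / (np.sqrt(L_avg) + eps)  # normalized update

print(W)
```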
needed to get a `Hessian` ***increase exponentially***.
|
|
||||||
|
|
||||||
The thing is that whenever we ***update `weights`*** with
|
> [!NOTE]
|
||||||
the `Steepest Descent` method, each update *messes up*
|
> While it can be used with momentum, it doesn't seem to add as much benefits as
|
||||||
another, while the ***curvature*** can help to ***scale
|
> using it standalone.
|
||||||
these updates*** so that they do not disturb each other.
|
>
|
||||||
|
> With Nesterov, it works best if used to normalize the correction, rather than
|
||||||
#### Curvature Approximations
|
> the jump. While for the adaptive learning rates, it still requires further
|
||||||
|
> investigations to prove the efficacy.
|
||||||
However, since the `Hessian` is
|
|
||||||
***too expensive to compute***, we can approximate it.
|
|
||||||
|
|
||||||
- We can take only the ***diagonal elements***
|
|
||||||
- ***Other algorithms*** (e.g. Hessian Free)
|
|
||||||
- ***Conjugate Gradient*** to minimize the
|
|
||||||
***approximation error***
|
|
||||||
|
|
||||||
#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]
|
|
||||||
|
|
||||||
> [!CAUTION]
|
|
||||||
>
|
>
|
||||||
> This is an oversemplification of the topic, so reading
|
|
||||||
> the footnotes material is greatly advised.
|
|
||||||
|
|
||||||
The basic idea is that, in order not to mess up previous
|
## Adaptive Gradient Methods
|
||||||
directions, we ***`optimize` along perpendicular directions***.
|
|
||||||
|
|
||||||
This method is ***guaranteed to mathematically succeed
|
<!--
|
||||||
after N steps, the dimension of the space***, in practice
|
MARK: AdaGrad
|
||||||
the error will be minimal.
|
-->
|
||||||
|
### AdaGrad[^adagrad-torch]
|
||||||
|
|
||||||
This ***method works well for `non-quadratic errors`***
|
`AdaGrad` is an ***optimization method*** aimed
|
||||||
and the `Hessian Free` `optimizer` uses this method
|
to:
|
||||||
on ***genuinely quadratic surfaces***, which are
|
|
||||||
***quadratic approximations of the real surface***
|
|
||||||
|
|
||||||
|
<ins>***"find needles in the haystack in the form of
|
||||||
|
very predictive yet rarely observed features"***
|
||||||
|
[^adagrad-official-paper]</ins>
|
||||||
|
|
||||||
<!-- TODO: Add PDF 5 pg. 38 -->
|
`AdaGrad`, opposed to a standard `SGD` that is the
|
||||||
|
***same for each gradient geometry***, tries to
|
||||||
|
***incorporate geometry from earlier iterations***.
|
||||||
|
|
||||||
|
#### AdaGrad Algorithm
|
||||||
|
|
||||||
|
Instead `AdaGrad` takes another
|
||||||
|
approach[^anelli-adagrad-2][^adagrad-official-paper]:
|
||||||
|
|
||||||
|
$$
|
||||||
|
\begin{aligned}
|
||||||
|
g_{i}^{(k + 1)} &= \frac{d \, Loss}{d \, w_{i}^{(k)}} \\
|
||||||
|
G^{(k + 1)} &= \sum_{\tau = 1}^{t} g^{(\tau)} g^{(\tau)T}\\
|
||||||
|
w_{i}^{(k + 1)} &=
|
||||||
|
w_{i}^{(k)} - \eta \cdot\frac{
|
||||||
|
1
|
||||||
|
}{
|
||||||
|
\sqrt{G_{i,i}^{(k +1)} + \epsilon}
|
||||||
|
} \cdot g_{i}^{(k+1)} \\
|
||||||
|
|
||||||
|
\end{aligned}
|
||||||
|
$$
|
||||||
|
|
||||||
|
Here $G^{(k)}$ is the ***sum of outer product*** of the
|
||||||
|
***gradient*** until time $t$, though ***usually it is
|
||||||
|
not used*** $G_t$, which is ***impractical because
|
||||||
|
of the high number of dimensions***, so we use
|
||||||
|
$diag(G_t)$ which can be
|
||||||
|
***computed in linear time***[^adagrad-official-paper]
|
||||||
|
|
||||||
|
The $\epsilon$ term here is used to
|
||||||
|
***avoid dividing by 0***[^anelli-adagrad-2] and has a
|
||||||
|
small value, usually in the order of $10^{-8}$
|
||||||
|
|
||||||
|
> [!NOTE]
|
||||||
|
>
|
||||||
|
> This example is tough to understand if we where to apply it to a matrix $W$
|
||||||
|
> instead of a vector. To make it easier to understand in matricial notation:
|
||||||
|
>
|
||||||
|
> $$
|
||||||
|
> \begin{aligned}
|
||||||
|
> \nabla L^{(k + 1)} &= \frac{d \, Loss^{(k)}}{d \, W^{(k)}} \\
|
||||||
|
> G^{(k + 1)} &= G^{(k)} +(\nabla L^{(k+1)}) ^2 \\
|
||||||
|
> W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}
|
||||||
|
{\sqrt{G^{(k+1)} + \epsilon}}
|
||||||
|
> \end{aligned}
|
||||||
|
> $$
|
||||||
|
>
|
||||||
|
> In other words, compute the gradient and scale it for the sum of its squares
|
||||||
|
> until that point
|
||||||
|
|
||||||
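
To make the diagonal variant concrete, here is a minimal NumPy sketch of the
matrix-notation update above; the function name, the toy quadratic loss and the
default hyperparameter values are illustrative assumptions, not part of the
original algorithm's specification.

```python
import numpy as np

def adagrad_step(W, grad, G, lr=0.01, eps=1e-8):
    """One AdaGrad step with the diagonal approximation diag(G).

    G accumulates the element-wise squared gradients, so each parameter
    gets its own effective step lr / sqrt(G + eps): rarely updated
    (rare-feature) parameters keep taking comparatively large steps.
    """
    G = G + grad ** 2
    W = W - lr * grad / np.sqrt(G + eps)
    return W, G

# Toy usage on L(W) = ||W||^2 / 2, whose gradient is simply W
W, G = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    W, G = adagrad_step(W, grad=W, G=G)
```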

#### AdaGrad effectiveness[^anelli-adagrad-3]

- When we have ***many dimensions, many features are irrelevant***
- ***Rarer features are more relevant***
- It adapts $\eta$ to the right metric space
  by projecting stochastic gradient updates with the
  [Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance),
  the distance of a point from a probability distribution.
|
#### AdaGrad Considerations
|
||||||
|
|
||||||
|
- It eliminates the need of manually tuning the
|
||||||
|
`learning rates`, which is usually set to
|
||||||
|
$0.01$
|
||||||
|
- The squared ***gradients*** are accumulated during
|
||||||
|
iterations, making the `learning-rate` become
|
||||||
|
***smaller and smaller***, thus becoming 0 and untrainable
|
||||||
|
|
||||||

<!-- Footnotes -->

[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)

[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf)

[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4

[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)

[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1

[^regret-definition]: [Definition of Regret | 19th April 2025](https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/tradeoff-between-exploration-and-exploitation/exploration-and-exploitation/examination-review-exploration-and-exploitation/explain-the-concept-of-regret-in-reinforcement-learning-and-how-it-is-used-to-evaluate-the-performance-of-an-algorithm/#:~:text=Regret%20quantifies%20the%20difference%20in,and%20making%20decisions%20over%20time.)

[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)

[^anelli-adagrad-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42

[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)

[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43

[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81

[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44

[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)
### AdaDelta[^adadelta-offcial-paper]

`ADADELTA` was inspired by [`AdaGrad`](./ADAGRAD.md) and
created to address some of its problems, like the
***sensitivity to the initial `parameters` and the corresponding
gradient***[^adadelta-offcial-paper]

To address all these problems, `ADADELTA` accumulates the
***squared gradients over a `window`, as a running average***, rather than
***accumulating them over all instances***:

$$
G^{(k+1)} = \beta \cdot G^{(k)} +
(1 - \beta) \cdot (\nabla L^{(k+1)})^2
$$

The update, which is very similar to the one in
[AdaGrad](./ADAGRAD.md#the-algorithm), becomes:

$$
\begin{aligned}
W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}{\sqrt{G^{(k+1)} + \epsilon}}
\end{aligned}
$$

Technically speaking, this equation is basically equivalent to the
[RMSProp](#root-mean-square-propagation-aka-rmsprop) one, as $G$ is
equivalent to the running average of the mean square.

However, as the author pointed out[^adadelta-units], this equation does not
respect units of measure. We can correct this problem
by ***considering the curvature locally smooth*** and
taking an approximation of it at the next step, becoming:

$$
\begin{aligned}
\Delta W^{(k)} &= - \frac{\sqrt{S^{(k-1)}}}{\sqrt{G^{(k)}}}
\nabla L^{(k)}\\
S^{(k)} &= \beta S^{(k - 1)} + (1 - \beta) (\Delta W^{(k)})^2 \\
W^{(k +1 )} &= W^{(k)} + \Delta W^{(k)}
\end{aligned}
$$

As we can notice, the ***`learning rate` completely
disappears from the equation, eliminating the need to
set one***

> [!WARNING]
>
> Here $\Delta W$ is already negative, which is why there's a $+$ in the last
> equation
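
A minimal NumPy sketch of the three equations above, treating all operations
element-wise on the weight matrix; the $\epsilon$ inside both square roots and
the default values of $\beta$ and $\epsilon$ are assumptions made here for
numerical stability, in the spirit of the original paper.

```python
import numpy as np

def adadelta_step(W, grad, G, S, beta=0.95, eps=1e-6):
    """One ADADELTA step: note that no learning rate is needed.

    G is the running average of squared gradients, S the running average
    of squared updates; eps avoids a division by zero on the very first
    step, when both buffers are still zero-filled.
    """
    G = beta * G + (1 - beta) * grad ** 2
    delta_W = -np.sqrt(S + eps) / np.sqrt(G + eps) * grad
    S = beta * S + (1 - beta) * delta_W ** 2
    W = W + delta_W  # delta_W is already negative
    return W, G, S
```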

<!-- Footnotes -->

[^adadelta-offcial-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)

[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
### Adaptive Moment Estimation (aka AdaM)

AdaM computes both the momentum and the squared gradients with running
averages, which are zero-filled at time $k = 0$:

$$
\begin{aligned}
M^{(k+1)} &= \beta_1 M^{(k)} + (1 - \beta_1) \nabla L \\
V^{(k+1)} &= \beta_2 V^{(k)} + (1 - \beta_2) \nabla L^2 \\
\end{aligned}
$$

> [!WARNING]
>
> The squared gradient can be thought of as the variance; however, it's not
> centered
Then it corrects them to be used in the final formulation:
|
||||||
|
|
||||||
|
$$
|
||||||
|
\begin{aligned}
|
||||||
|
\hat{M}^{(k+1)} &= \frac{M^{(k+1)}}{1 - \beta_1^{k + 1}} \\
|
||||||
|
\hat{V}^{(k+1)} &= \frac{V^{(k+1)}}{1 - \beta_2^{k + 1}} \\
|
||||||
|
\end{aligned}
|
||||||
|
$$
|
||||||
|
|
||||||

> [!WARNING]
>
> $\beta_1$ and $\beta_2$ are raised to the power of $k + 1$, the timestep.

Then it computes the update in this way:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon}
$$

Even though Adam works, it doesn't generalize well and, particularly on image
problems, it performs worse than standard SGD. Moreover, we need to keep 3
buffers instead of 1 as for SGD, two of which need parameter tuning.

> [!NOTE]
>
> Author-proposed values are $\beta_1 = 0.9$, $\beta_2 = 0.999$ and
> $\epsilon = 10^{-8}$
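
Putting the three blocks above together, a minimal NumPy sketch (the defaults
are the author-proposed values; the function name and the buffer layout are our
own illustrative choices):

```python
import numpy as np

def adam_step(W, grad, M, V, k, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaM step at timestep k (buffers M, V start zero-filled)."""
    M = b1 * M + (1 - b1) * grad        # running average of gradients
    V = b2 * V + (1 - b2) * grad ** 2   # running average of squared gradients
    M_hat = M / (1 - b1 ** (k + 1))     # bias correction for the zero start
    V_hat = V / (1 - b2 ** (k + 1))
    W = W - lr * M_hat / (np.sqrt(V_hat) + eps)
    return W, M, V
```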

### AdamW

AdamW tries to solve AdaM's problems by introducing weight decay. In all
honesty, AdaM already implements it; however, it is usually added to the
momentum, getting scaled by $\sqrt{\hat{V}}$:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)} + \alpha W^{(k)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon}
$$

The AdamW authors saw that this was inefficient, as the decay was influenced by
the uncentered variance, and thus modified the formula to this:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon} - \lambda W^{(k)}
$$
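
A minimal NumPy sketch of the decoupled update above; the decay coefficient
value is an illustrative assumption, and the decay term is subtracted so that
it shrinks the weights:

```python
import numpy as np

def adamw_step(W, grad, M, V, k, lr=0.001, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step: the weight decay never touches the moment buffers."""
    M = b1 * M + (1 - b1) * grad
    V = b2 * V + (1 - b2) * grad ** 2
    M_hat = M / (1 - b1 ** (k + 1))
    V_hat = V / (1 - b2 ** (k + 1))
    # Decay is decoupled: applied directly to W, not rescaled by sqrt(V_hat)
    W = W - lr * M_hat / (np.sqrt(V_hat) + eps) - wd * W
    return W, M, V
```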

### Lion (evoLved sIgn mOmeNtum)[^official-paper]

`Lion` is the result of a ***genetic search algorithm*** aimed at
finding the best `optimizer`.

It starts from a population of `AdamW` algorithms to
***speed up the search***. Opposed to
`Adam` and `AdamW`, it keeps track
***only of the momentum*** and uses the ***gradient sign***,
requiring ***less `memory`***.

Since ***uniform updates yield larger norms***,
`Lion` requires a ***smaller `learning-rate`***
and a ***larger decoupled `weight-decay`***
$\lambda$[^official-paper-1].

The ***advantages of `Lion` over `Adam` and `AdamW`
increase with the size of
the `mini-batch`***[^official-paper-1]
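
As a sketch, the discovered update rule in NumPy; the interpolation
coefficients and the default hyperparameters are our reading of the paper and
are illustrative only:

```python
import numpy as np

def lion_step(W, grad, M, lr=1e-4, b1=0.9, b2=0.99, wd=0.1):
    """One Lion step: a single momentum buffer plus a sign update."""
    update = np.sign(b1 * M + (1 - b1) * grad)  # uniform-magnitude update
    W = W - lr * (update + wd * W)              # decoupled weight decay
    M = b2 * M + (1 - b2) * grad                # the only state kept
    return W, M
```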

#### Symbolic Representation[^official-paper-2]

Newly ***trained algorithms*** are represented
`symbolically`, bringing these advantages:

- `Algorithms` must be ***implemented*** as `programs`
- It is ***easier to analyze, comprehend and transfer to
  new tasks*** these `algorithms`, compared to other
  representations such as `NeuralNetworks`
- We can **estimate the *complexity*** by looking
  at the ***length of the code***

#### Tournament[^official-paper-3]

The best code is found with a ***tournament-style
evolution***. Each cycle picks the ***best
`algorithm`***, which is
***copied and mutated***, while the ***oldest is removed***

<!-- Footnotes -->

[^official-paper]: [Official Lion Paper | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-1]: [Official Lion Paper | Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-2]: [Official Lion Paper | Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-3]: [Official Lion Paper | Paragraph 2 pg. 4-5 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)
## Hessian Free Optimization

Since we are moving on a function whose gradient is not constant, by looking at
the curvature, the [Hessian Matrix](./../15-Appendix-A/INDEX.md#hessian-matrix),
we can see when it starts to change.

### Newton's Method

This method would technically give us the solution in one step on a quadratic
function, but it is unfeasible due to the memory and computational requirements:

$$
\Delta W = - \epsilon H(W)^{-1} \times \frac{d\, L}{d\, W}
$$
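
To see why it is a one-step method, here is a toy NumPy illustration on a
quadratic loss (the matrix and vector are made up, and $\epsilon = 1$); on a
real network $H$ is far too large to store, let alone invert:

```python
import numpy as np

# Quadratic loss L(W) = 0.5 * W^T A W - b^T W, whose Hessian is exactly A
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # small SPD Hessian, for illustration
b = np.array([1.0, -1.0])

W = np.zeros(2)
grad = A @ W - b
W = W - np.linalg.solve(A, grad)  # solve a system instead of forming H^{-1}
assert np.allclose(A @ W, b)      # the gradient is now zero: one step
```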

### Conjugate Gradient

The idea is to correct the weights so that we reduce the gradient to 0 across
perpendicular directions. This means that, with each update, we are not messing
up previous optimizations.

While it is usually used for quadratic error surfaces, there's a non-linear
variant (non-linear conjugate gradient) that usually works well. However, it is
also possible to approximate the true error function with a quadratic one and
use the standard method.

It gives a solution after $N$ steps over an $N$-dimensional quadratic surface;
however, we need to penalize frequent changes in weights, especially for the
hidden activities of [`RNNs`](./../8-Recurrent-Networks/INDEX.md)
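
A minimal sketch of the standard (quadratic) method in NumPy, for
$L(w) = \frac{1}{2} w^T A w - b^T w$; the matrix, vector and step count are
illustrative:

```python
import numpy as np

def conjugate_gradient(A, b, steps):
    """Minimize 0.5 w^T A w - b^T w for a symmetric positive-definite A.

    Each new direction d is A-conjugate to the previous ones, so a step
    along it never undoes the progress made along earlier directions.
    """
    w = np.zeros_like(b)
    r = b - A @ w                        # residual = negative gradient
    d = r.copy()
    for _ in range(steps):
        alpha = (r @ r) / (d @ A @ d)    # exact line search along d
        w = w + alpha * d
        r_new = r - alpha * (A @ d)
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return w

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
w = conjugate_gradient(A, b, steps=2)    # N = 2 steps suffice here
```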

## Optimization Tricks

### Input decorrelation

If you have a linear neuron (think of a feed-forward layer, not of a
convolution), it's better to decorrelate the input components.

A way to achieve this is through
[PCA](./../15-Appendix-A/INDEX.md#computing-pca),
transforming the error surface from an ellipse into a circle.
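
A minimal NumPy sketch of this trick; note that here the samples sit on the
rows (unlike the PCA appendix, which puts features over rows), and the small
constant added before the division is an assumption for numerical stability:

```python
import numpy as np

def decorrelate(X):
    """Rotate the data onto the principal axes of its covariance.

    After the rotation the components are uncorrelated; dividing by the
    standard deviations also equalizes them, turning an elliptical error
    surface into a circular one (whitening).
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = Xc.T @ Xc / len(Xc)               # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> eigh
    Z = Xc @ eigvecs                        # decorrelated components
    return Z / np.sqrt(eigvals + 1e-8)      # whitened (circular) data
```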

### Recognize Plateaus

If we start with big learning rates, since the weights gain a big magnitude, the
derivative will be small and the error will not decrease significantly.

This may seem like a local minimum, but it is usually a plateau.
### Mini-Batch Speed up

To speed up mini-batch training, use these methods:

- [**Momentum**](#momentum)
- [**Separate adaptive learning rates for each parameter**](#separate-adaptive-learning-rate)
- [**rmsprop**](#root-mean-square-propagation-aka-rmsprop)
- [**Adaptive Gradient Methods**](#adaptive-gradient-methods)

### Mini-batches vs Full-Batches

The rule of thumb is to use **full batches for small datasets or datasets with
little redundancy**, while **mini-batches for redundant datasets**

<!-- Footnotes -->

[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76