# Optimization
We basically try to measure the error and minimize it by moving in the direction ***opposite to the gradient***
## Types of Learning Algorithms
In `Deep Learning` it's not unusual to face ***highly redundant*** `datasets`.
Because of this, the ***gradient*** computed on some `samples` is often nearly the ***same*** as the one computed on others.
So, we often train the `model` on a subset of samples at a time.
### Online Learning
This is the ***extreme*** of our techniques to deal with the ***redundancy*** of `data`.
On each `data point` we compute the ***gradient*** and then immediately update the `weights`.
### Mini-Batch
In this approach, we divide our `dataset` in small batches called `mini-batches`.
These need to be ***class-balanced***, so that each batch is roughly representative of the whole `dataset`.
This technique is the ***most used one***.
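As a minimal sketch (assuming numpy arrays and a caller-supplied `grad` function computing the mini-batch gradient; all names here are illustrative, not a specific API), one epoch of `mini-batch` training looks like this:
```python
import numpy as np

# Hypothetical sketch: `grad` is a placeholder for whatever
# computes the loss gradient on a batch, not a library API.
def train_epoch(X, y, weights, grad, lr=0.01, batch_size=32):
    # Shuffle once per epoch so batches stay representative
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        # One gradient estimate per mini-batch, not per sample
        g = grad(X[batch], y[batch], weights)
        weights = weights - lr * g  # plain SGD step
    return weights
```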
## Tips and Tricks
### Learning Rate
This is the `hyperparameter` we use to tune the size of our
***learning steps***.
If it is too big, it causes ***overshooting***,
so a quick solution may be to turn it down.
However, doing so ***trades speed for accuracy***, thus it's better to wait before tuning this `hyperparameter`
### Weight initialization
We need to avoid `neurons` having the same
***gradient***. This is easily achievable by using
***small random values***.
However, with a ***large `fan-in`*** it's
***easy to overshoot***, so it's better to ***shrink***
those `weights` by a factor of
$\sqrt{\text{fan-in}}$:
$$
w = \frac{
\text{randn}(N)
}{
\sqrt{N}
},
\qquad N = \text{fan-in}
$$
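A minimal numpy sketch of this initialization (assuming `randn` stands for samples from a standard normal, as in the formula above):
```python
import numpy as np

def scaled_init(fan_in, fan_out):
    # Small random values, shrunk by sqrt(fan-in) so a large
    # fan-in does not make the pre-activations overshoot
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
```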
#### Xavier-Glorot Initialization
<!-- TODO: Read Xavier-Glorot paper -->
Here the `weights` ***shrink with the `fan-in`*** as well, but we ***sample*** from a
`uniform distribution` with `std-dev`
$$
\sigma = \text{gain} \cdot \sqrt{
\frac{
2
}{
\text{fan-in} + \text{fan-out}
}
}
$$
and bounded between $-a$ and $a$
$$
a = \text{gain} \cdot \sqrt{
\frac{
6
}{
\text{fan-in} + \text{fan-out}
}
}
$$
Alternatively, one can sample from a `normal distribution`
$\mathcal{N}(0, \sigma^2)$.
Note that in the **original paper** `gain` is equal
to $1$
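In PyTorch both variants are available as built-in initializers:
```python
import torch.nn as nn

layer = nn.Linear(256, 128)
# Uniform variant, bounded in [-a, a]
nn.init.xavier_uniform_(layer.weight, gain=1.0)
# Normal variant, sampling from N(0, sigma^2)
# nn.init.xavier_normal_(layer.weight, gain=1.0)
```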
### Decorrelating input components
Since ***highly correlated features*** don't offer much
in terms of ***new information***, we probably need
to move to the ***latent space*** and find the
`latent-variables` governing those `features`.
#### PCA
> [!CAUTION]
> This topic won't be explained here as it's something
> usually learnt for `Machine Learning`, a
> ***prerequisite*** for approaching `Deep Learning`.

This is a method we can use to discard `features` that
***add little to no information***.
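Just as a refresher (the theory belongs to the `Machine Learning` material), a minimal numpy sketch of the reduction via `SVD`:
```python
import numpy as np

def pca_reduce(X, k):
    # Center the data, then find the principal components via SVD
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Keep only the k directions with the most variance; the
    # discarded ones add little to no information
    return Xc @ Vt[:k].T
```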
## Common problems in MultiLayer Networks
### Hitting a Plateau
This happens when we have a ***big `learning-rate`***
which makes the `weights` grow large in ***absolute value***.
Because this happens ***too quickly***, we may
see a ***quickly diminishing error***, which is usually
***mistaken for a minimum point***, while instead
it's a ***plateau***.
## Speeding up Mini-Batch Learning
### Momentum[^momentum]
We use this method ***mainly when we use `SGD`*** as
the ***learning technique***.
This method is better explained if we imagine
our error surface as an actual surface and we place a
ball over it.
***The ball will start rolling towards the steepest
descent*** (initially), but ***after gaining enough
velocity*** it will, ***in some measure, keep following
its previous direction***.
So the ***gradient*** now modifies the ***velocity***
rather than the ***position***, and the momentum
***dampens small variations***.
Moreover, once the ***momentum builds up***, we will
easily ***pass over plateaus***, as the
***ball will keep rolling*** until it is
stopped by an ***opposing gradient***
#### Momentum Equations
There are mainly a couple of them.
One of them keeps a running `momentum` term, $p$,
scaled by a parameter $\beta$ called the `SGD momentum` or
`momentum term` or `momentum parameter`:
$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$
The other one is ***logically equivalent*** to the
previous, but it updates the `weights` in ***one step***
and is called the `Stochastic Heavy Ball Method`:
$$
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
+ \beta ( w_k - w_{k-1})
$$
> [!NOTE]
> This is how to choose $\beta$:
>
> $0 < \beta < 1$
>
> If $\beta = 0$, then we are doing
> ***gradient descent***, if $\beta > 1$ then we
> ***will have numerical instabilities***.
>
> The ***larger*** $\beta$, the
> ***higher the `momentum`***, so the trajectory will
> ***turn more slowly***

> [!TIP]
>
> Usual values are $\beta = 0.9$ or $\beta = 0.99$,
> and usually we start from $\beta = 0.5$, raising it
> whenever we are stuck.
>
> When we increase $\beta$, then the `learning rate`
> ***must decrease accordingly***
> (e.g. from 0.9 to 0.99, `learning-rate` must be
> divided by a factor of 10)
#### Nesterov (1983) / Sutskever (2012) Accelerated Momentum
Differently from the previous
[momentum](#momentum-equations),
we take an ***intermediate*** step where we
***update the `weights`*** according to the
***previous `momentum`***; we then compute the
***new `momentum`*** at this new position, and
***update again***
$$
\begin{aligned}
\hat{w}_k & = w_k - \beta p_k \\
p_{k+1} &= \beta p_{k} +
\eta \nabla L(X, y, \hat{w}_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$
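The same sketch adapted to the look-ahead step (again with `grad_fn` as a placeholder for $\nabla L$):
```python
def nesterov_step(w, p, grad_fn, X, y,
                  beta=0.9, eta=1.0, gamma=0.01):
    # Intermediate step: follow the previous momentum first
    w_hat = w - beta * p
    # Compute the new momentum at the look-ahead position
    p = beta * p + eta * grad_fn(X, y, w_hat)
    # Then update again
    w = w - gamma * p
    return w, p
```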
#### Why Momentum Works
While it has been ***hypothesized*** that
***acceleration*** makes ***convergence faster***, this is
***only true for convex problems without much noise***,
so it may be only ***part of the story***.
The other half may be ***Noise Smoothing***: the
`momentum` averages gradients over time, smoothing the
optimization process. However, according to these
papers[^no-noise-smoothing][^no-noise-smoothing-2], this may not be the actual reason.
### Separate Adaptive Learning Rates
Since `weights` may ***vary greatly*** across `layers`,
having a ***single `learning-rate`*** might not be ideal.
So the idea is to set a `local learning-rate` that
modulates the `global` one as a ***multiplicative factor***
#### Local Learning rates
- Start from $1$ as the ***starting point*** for the
`local learning-rates`, which we'll call `gains` from
now on.
- If the `gradient` for a `weight` keeps the ***same sign
as in the previous step, additively increase*** its `gain`
- Otherwise, ***multiplicatively decrease*** it
$$
\Delta w_{i,j} = - \eta \, g_{i,j} \frac{
d \, Out
}{
d \, w_{i,j}
}
$$
$$
g_{i,j}(t) = \begin{cases}
g_{i,j}(t - 1) + \delta
& \text{if } \frac{
d \, Out
}{
d \, w_{i,j}
} (t)
\cdot
\frac{
d \, Out
}{
d \, w_{i,j}
} (t-1) > 0 \\[1ex]
g_{i,j}(t - 1) \cdot (1 - \delta)
& \text{otherwise}
\end{cases}
$$
With this method, if the gradient oscillates, the
`gains` will hover around $1$
> [!TIP]
>
> - Usually a value for $\delta$ is $0.05$
> - Limit `gains` around some values:
>
> - $[0.1, 10]$
> - $[0.01, 100]$
>
> - Use `full-batches` or `big mini-batches` so that
> the ***gradient*** doesn't oscillate because of
> sampling errors
> - Combine it with [Momentum](#momentum)
> - Remember that ***Adaptive `learning-rate`*** deals
> with ***axis-alignment***
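A numpy sketch of the `gain` update, using the values from the tip above (the function name and signature are illustrative):
```python
import numpy as np

def update_gains(g, grad, prev_grad, delta=0.05, lo=0.1, hi=10.0):
    # Additive increase where the gradient kept its sign,
    # multiplicative decrease where it flipped (AIMD)
    same_sign = grad * prev_grad > 0
    g = np.where(same_sign, g + delta, g * (1 - delta))
    # Keep the gains limited to a sane range
    return np.clip(g, lo, hi)

# The update then scales the global learning-rate element-wise:
# w -= eta * g * grad
```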
### rmsprop | Root Mean Square Propagation
#### rprop | Resilient Propagation[^rprop-torch]
This is basically the same idea as [separating learning rates](#separate-adaptive-learning-rates),
but in this case we don't use the
[AIMD](#local-learning-rates) technique and
***we don't take into account*** the
***magnitude of the gradient***, but ***only its sign***:
- If ***gradient*** has same sign:
- $step_{k} = step_{k} \cdot \eta_+$ where $\eta_+ > 1$
- else:
- $step_{k} = step_{k} \cdot \eta_-$
where $0 <\eta_- < 1$
> [!TIP]
>
> Limit the step size:
>
> - keep it ***below $50$***
> - keep it ***above $10^{-6}$*** (a millionth)

> [!CAUTION]
>
> rprop does ***not work*** with `mini-batches`, as
> the ***sign of the gradient changes frequently***
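A sketch of one rprop step (the growth/shrink factors $\eta_+ = 1.2$ and $\eta_- = 0.5$ are the PyTorch defaults; the step bounds follow the tip above):
```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    # Grow the step where the sign is stable, shrink it otherwise
    same_sign = grad * prev_grad > 0
    step = np.where(same_sign, step * eta_plus, step * eta_minus)
    step = np.clip(step, step_min, step_max)
    # Only the sign of the gradient is used, never its magnitude
    w = w - np.sign(grad) * step
    return w, step
```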
#### rmsprop in detail[^rmsprop-torch]
The idea is that [rprop](#rprop--resilient-propagation)
is ***equivalent to dividing the gradient by its
magnitude*** (you only keep the sign, $\pm 1$);
however, this means that between `mini-batches` the
***divisor*** changes each time, oscillating.
The solution is to have a ***running average*** of
the ***magnitude of the squared gradient for
each `weight`***:
$$
MeanSquare(w, t) =
\alpha \, MeanSquare(w, t-1) +
(1 - \alpha)
\left(
\frac{d\, Out}{d\, w}
\right)^2
$$
We then divide the ***gradient by the `square root`***
of that value
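A sketch of the resulting update rule (a small `eps` is assumed to avoid division by zero; $\alpha = 0.9$ is a common choice):
```python
import numpy as np

def rmsprop_step(w, grad, mean_square,
                 alpha=0.9, lr=0.001, eps=1e-8):
    # Running average of the squared gradient, per weight
    mean_square = alpha * mean_square + (1 - alpha) * grad ** 2
    # Divide the gradient by the square root of that average
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```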
#### Further Developments
- `rmsprop` with `momentum` does not work as it should
- `rmsprop` with `Nesterov momentum` works best
if used to divide the ***correction*** rather than
the ***jump***
- `rmsprop` with `adaptive learning rates` needs more
investigation
### Fancy Methods
#### Adaptive Gradient
<!-- TODO: Expand over these -->
##### Convex Case
- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods
##### Non-Convex Case
Pay attention: here the `Hessian` may not be
`Positive Semi-Definite`, thus when the ***gradient*** is
$0$ we don't necessarily know whether we are at a minimum,
a maximum, or a saddle point.
- Natural Gradient Methods
- Curvature Adaptive
- [Adagrad](./Fancy-Methods/ADAGRAD.md)
- [AdaDelta](./Fancy-Methods/ADADELTA.md)
- [RMSprop](#rmsprop-in-detail)
- [ADAM](./Fancy-Methods/ADAM.md)
- l-BFGS
- [heavy ball gradient](#momentum)
- [momentum](#momentum)
- Noise Injection:
- Simulated Annealing
- Langevin Method
#### Adagrad
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAGRAD.md)
#### Adadelta
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADADELTA.md)
#### ADAM
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM.md)
#### AdamW
> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM-W.md)
#### LION
> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)
### Hessian Free[^anelli-hessian-free]
How much can we `learn` from a given
`Loss` surface?
The ***best way to move*** would be along the
***gradient***, assuming the surface has the
***same curvature in every direction***
(e.g. it's a ***circular bowl*** with a local minimum).
But ***usually this is not the case***, so we need
to move ***where the ratio of gradient to curvature is
high***
#### Newton's Method
This method takes into account the ***curvature***
of the `Loss`
With this method, the update would be:
$$
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
d \, E
}{
d \, \vec{w}
}
$$
***If this were feasible, we would reach the minimum in
one step***, but it's not: the `Hessian` has one entry per
***pair of `weights`***, so the ***computations*** needed
to obtain and invert it quickly become infeasible.
The thing is that whenever we ***update the `weights`*** with
the `Steepest Descent` method, each update *messes up*
the others, while the ***curvature*** can help to ***scale
these updates*** so that they do not disturb each other.
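On a genuinely quadratic toy surface the one-step behaviour is easy to verify (a numpy sketch with an explicit, tiny `Hessian`):
```python
import numpy as np

# Toy quadratic E(w) = 0.5 w^T A w - b^T w:
# the Hessian is A and the gradient is A w - b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

w = np.zeros(2)
grad = A @ w - b
# Full Newton step (epsilon = 1): on a quadratic surface this
# lands exactly on the minimum in a single update
w = w - np.linalg.solve(A, grad)
print(A @ w - b)  # the gradient is now ~0
```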
#### Curvature Approximations
However, since the `Hessian` is
***too expensive to compute***, we can approximate it.
- We can take only the ***diagonal elements***
- ***Other algorithms*** (e.g. Hessian Free)
- ***Conjugate Gradient*** to minimize the
***approximation error***
#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]
> [!CAUTION]
>
> This is an oversimplification of the topic, so reading
> the footnote material is greatly advised.

The basic idea is that, in order not to undo the minimization
already done along previous directions, we ***`optimize` along
`conjugate` directions***.
This method is ***mathematically guaranteed to succeed
after $N$ steps***, where $N$ is the ***dimension of the
space***; in practice, far fewer steps already make the
error minimal.
This ***method also works well for `non-quadratic` errors***,
and the `Hessian Free` `optimizer` uses it
on ***genuinely quadratic surfaces***, which are
***quadratic approximations of the real surface***
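A textbook sketch on a quadratic surface (where $A$ plays the role of the `Hessian`): after at most $N$ steps the residual vanishes.
```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    # Minimizes 0.5 w^T A w - b^T w for positive-definite A,
    # i.e. solves A w = b, one conjugate direction at a time
    w = np.zeros_like(b)
    r = b - A @ w           # residual = negative gradient
    d = r.copy()            # first direction: steepest descent
    for _ in range(len(b)):
        alpha = (r @ r) / (d @ A @ d)   # exact line search
        w = w + alpha * d
        r_new = r - alpha * (A @ d)
        if np.linalg.norm(r_new) < tol:
            break
        # Next direction is conjugate to all previous ones
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return w
```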
<!-- TODO: Add PDF 5 pg. 38 -->
<!-- Footnotes -->
[^momentum]: [Why Momentum Really Works | Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)
[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4
[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1
[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)
[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)
[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81
[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)
[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76