# Revised Optimization Notes
## Cross Entropy Loss derivation

Cross entropy[^wiki-cross-entropy] is the measure of *"surprise"* we get from
distribution $p$ knowing results from distribution $q$. It is defined as the
entropy of $p$ plus the
[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$.
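
Written out (a standard identity, stated here for reference):

$$
H(p, q) = H(p) + D_{KL}(p \parallel q) = - \sum_{x} p(x) \log q(x)
$$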

Usually $\hat{y}$ comes from using a
[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since it uses a
logarithm and probability values are at most 1, the closer the predicted
probability is to 0, the higher the loss.

## Computing PCA[^wiki-pca]

> [!CAUTION]
> $X$ here is the dataset matrix with **<ins>features over rows</ins>**

- $\Sigma = \frac{X \times X^T}{N} \coloneqq$ Correlation Matrix approximation
- $\vec{\lambda} \coloneqq$ vector of eigenvalues of $\Sigma$
- $\Lambda \coloneqq$ eigenvector columnar matrix sorted by eigenvalues
- $\Lambda_{red} \coloneqq$ eigenvector matrix reduced to the eigenvectors of
  the $k$ highest eigenvalues
- $Z = \Lambda_{red}^T \times X \coloneqq$ Compressed representation
  (note: with features over rows, $\Lambda_{red}^T$ must multiply from the
  left for the dimensions to match)

> [!NOTE]
> You may have studied PCA in terms of SVD, Singular Value Decomposition. The
> two are closely related and apply the same concept, but through different
> mathematical formulas.
## Laplace Operator[^khan-1]

It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
sum of the unmixed second partial derivatives of $f$.

It can also be used to compute the net flow of particles in that region of space.

> This is not the **discrete Laplace operator**, which is instead a **matrix**;
> there are many other formulations.
## [Hessian Matrix](https://en.wikipedia.org/wiki/Hessian_matrix)

A Hessian Matrix represents the 2nd derivative of a function, thus it gives
us the curvature of the function.

It is also used to tell us whether a critical point is a local minimum (the
Hessian is positive definite), a local maximum (it is negative definite) or a
saddle (it is neither positive nor negative definite).

It is computed by taking the partial derivatives of the gradient along
all dimensions and transposing the result.

$$
\nabla f = \begin{bmatrix}
\frac{d \, f}{d\,x} & \frac{d \, f}{d\,y}
\end{bmatrix} \\
H(f) = \begin{bmatrix}
\frac{d^2 \, f}{d\,x^2} & \frac{d^2 \, f}{d \, x\,d\,y} \\
\frac{d^2 \, f}{d\, y \, d\,x} & \frac{d^2 \, f}{d\,y^2}
\end{bmatrix}
$$
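
As a quick sanity check (a worked example, not from the source): for
$f(x, y) = x^2 + y^2$ the Hessian is constant,

$$
H(f) = \begin{bmatrix}
2 & 0 \\
0 & 2
\end{bmatrix}
$$

which is positive definite, so the critical point at the origin is a local
(here global) minimum.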

[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)

[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)

[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))

[^wiki-pca]: [Wikipedia | Principal Component Analysis | 18th November 2025](https://en.wikipedia.org/wiki/Principal_component_analysis#Computation_using_the_covariance_method)
`Chapters/15-Appendix-A/python-experiments/pca.ipynb` (new file)
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8c14ea22",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Computing PCA\n",
|
||||
"\n",
|
||||
"Here I'll be taking data from [Geeks4Geeks](https://www.geeksforgeeks.org/machine-learning/mathematical-approach-to-pca/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0b32eb5c",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[1.8 1.87777778]\n",
|
||||
"[[ 0.7 0.52222222]\n",
|
||||
" [-1.3 -1.17777778]\n",
|
||||
" [ 0.4 1.02222222]\n",
|
||||
" [ 1.3 1.12222222]\n",
|
||||
" [ 0.5 0.82222222]\n",
|
||||
" [ 0.2 -0.27777778]\n",
|
||||
" [-0.8 -0.77777778]\n",
|
||||
" [-0.3 -0.27777778]\n",
|
||||
" [-0.7 -0.97777778]]\n",
|
||||
"[[0.6925 0.68875 ]\n",
|
||||
" [0.68875 0.79444444]]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"X : np.ndarray = np.array([\n",
|
||||
" [2.5, 2.4],\n",
|
||||
" [0.5, 0.7],\n",
|
||||
" [2.2, 2.9],\n",
|
||||
" [3.1, 3.0],\n",
|
||||
" [2.3, 2.7],\n",
|
||||
" [2.0, 1.6],\n",
|
||||
" [1.0, 1.1],\n",
|
||||
" [1.5, 1.6],\n",
|
||||
" [1.1, 0.9]\n",
|
||||
"])\n",
|
||||
"\n",
|
||||
"# Compute mean values for features\n",
|
||||
"mu_X = np.mean(X, 0)\n",
|
||||
"\n",
|
||||
"print(mu_X)\n",
|
||||
"# \"Normalize\" Features\n",
|
||||
"X = X - mu_X\n",
|
||||
"print(X)\n",
|
||||
"\n",
|
||||
"# Compute covariance matrix applying\n",
|
||||
"# Bessel's correction (n-1) instead of n\n",
|
||||
"Cov = (X.T @ X) / (X.shape[0] - 1)\n",
|
||||
"\n",
|
||||
"print(Cov)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "78e9429f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As you can notice, we did $X^T \\times X$ instead of $X \\times X^T$. This is because our \n",
|
||||
"dataset had datapoints over rows instead of features."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 84,
|
||||
"id": "f93b7a92",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[0.05283865 1.43410579]\n",
|
||||
"[[-0.73273632 -0.68051267]\n",
|
||||
" [ 0.68051267 -0.73273632]]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Computing eigenvalues\n",
|
||||
"eigen = np.linalg.eig(Cov)\n",
|
||||
"eigen_values = eigen.eigenvalues\n",
|
||||
"eigen_vectors = eigen.eigenvectors\n",
|
||||
"\n",
|
||||
"print(eigen_values)\n",
|
||||
"print(eigen_vectors)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bfbdd9c3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we'll generate the new X matrix by only using the first eigen vector"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 85,
|
||||
"id": "7ce6c540",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"(9, 1)\n",
|
||||
"Compressed\n",
|
||||
"[[-0.85901005]\n",
|
||||
" [ 1.74766702]\n",
|
||||
" [-1.02122441]\n",
|
||||
" [-1.70695945]\n",
|
||||
" [-0.94272842]\n",
|
||||
" [ 0.06743533]\n",
|
||||
" [ 1.11431616]\n",
|
||||
" [ 0.40769167]\n",
|
||||
" [ 1.19281215]]\n",
|
||||
"Reconstruction\n",
|
||||
"[[ 0.58456722 0.62942786]\n",
|
||||
" [-1.18930955 -1.28057909]\n",
|
||||
" [ 0.69495615 0.74828821]\n",
|
||||
" [ 1.16160753 1.25075117]\n",
|
||||
" [ 0.64153863 0.69077135]\n",
|
||||
" [-0.0458906 -0.04941232]\n",
|
||||
" [-0.75830626 -0.81649992]\n",
|
||||
" [-0.27743934 -0.29873049]\n",
|
||||
" [-0.81172378 -0.87401678]]\n",
|
||||
"Difference\n",
|
||||
"[[0.11543278 0.10720564]\n",
|
||||
" [0.11069045 0.10280131]\n",
|
||||
" [0.29495615 0.27393401]\n",
|
||||
" [0.13839247 0.12852895]\n",
|
||||
" [0.14153863 0.13145088]\n",
|
||||
" [0.2458906 0.22836546]\n",
|
||||
" [0.04169374 0.03872214]\n",
|
||||
" [0.02256066 0.02095271]\n",
|
||||
" [0.11172378 0.10376099]]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Computing X coming from only 1st eigen vector\n",
|
||||
"Z_pca = X @ eigen_vectors[:,1]\n",
|
||||
"Z_pca = Z_pca.reshape([Z_pca.shape[0], 1])\n",
|
||||
"\n",
|
||||
"print(Z_pca.shape)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# X reconstructed\n",
|
||||
"eigen_v = (eigen_vectors[:, 1].reshape([eigen_vectors[:, 1].shape[0], 1]))\n",
|
||||
"X_rec = Z_pca @ eigen_v.T\n",
|
||||
"\n",
|
||||
"print(\"Compressed\")\n",
|
||||
"print(Z_pca)\n",
|
||||
"\n",
|
||||
"print(\"Reconstruction\")\n",
|
||||
"print(X_rec)\n",
|
||||
"\n",
|
||||
"print(\"Difference\")\n",
|
||||
"print(abs(X - X_rec))"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "deep_learning",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.13.7"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
`Chapters/5-Optimization/INDEX-OLD.md` (new file)
|
||||
# Optimization
|
||||
|
||||
We basically look at the error and minimize it by moving against the ***gradient***
|
||||
|
||||
## Types of Learning Algorithms
|
||||
|
||||
In `Deep Learning` it's not unusual to be facing ***highly redundant*** `datasets`.
Because of this, the ***gradient*** from some `samples` is usually nearly the
***same*** as from others.

So, often we train the `model` on a subset of samples.
|
||||
|
||||
### Online Learning
|
||||
|
||||
This is the ***extreme*** of our techniques to deal with ***redundancy*** of `data`.
|
||||
|
||||
On each `point` we get the ***gradient*** and then we update `weights`.
|
||||
|
||||
### Mini-Batch
|
||||
|
||||
In this approach, we divide our `dataset` into small batches called `mini-batches`.
These need to be ***class-balanced*** so that each batch is representative of
the whole `dataset`.

This technique is the ***most used one***.
|
||||
|
||||
## Tips and Tricks
|
||||
|
||||
### Learning Rate
|
||||
|
||||
This is the `hyperparameter` we use to tune our
|
||||
***learning steps***.
|
||||
|
||||
Sometimes we have it too big and this causes
|
||||
***overshootings***. So a quick solution may be to turn
|
||||
it down.
|
||||
|
||||
However, we are ***trading speed for accuracy***, thus it's better to wait before tuning this `parameter`
|
||||
|
||||
### Weight initialization
|
||||
|
||||
We need to avoid `neurons` having the same
***gradient***. This is easily achievable by using
***small random values***.
|
||||
|
||||
However, if we have a ***large `fan-in`***, then it's
***easy to overshoot***, so it's better to scale
those `weights` ***inversely to***
$\sqrt{\text{fan-in}}$:
|
||||
|
||||
$$
|
||||
w = \frac{
|
||||
np.random(N)
|
||||
}{
|
||||
\sqrt{N}
|
||||
}
|
||||
$$
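
A minimal sketch of this rule (my own illustration; I read the pseudo-code's
`np.random(N)` as standard normal samples, and `n_in` is the fan-in):

```python
import numpy as np

def scaled_init(n_in: int, n_out: int) -> np.ndarray:
    # Small random values break symmetry; dividing by sqrt(fan-in)
    # keeps the pre-activation variance roughly constant.
    return np.random.randn(n_in, n_out) / np.sqrt(n_in)

W = scaled_init(256, 128)
print(W.std())  # ~ 1 / sqrt(256) = 0.0625
```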
|
||||
|
||||
#### Xavier-Glorot Initialization
|
||||
|
||||
<!-- TODO: Read Xavier-Glorot paper -->
|
||||
|
||||
Here `weights` are scaled by the ***fan-in and fan-out*** as well, but we
***sample*** from a `uniform distribution`, with `std-dev`

$$
\sigma = \text{gain} \cdot \sqrt{
    \frac{
        2
    }{
        \text{fan-in} + \text{fan-out}
    }
}
$$
|
||||
|
||||
and bounded between $a$ and $-a$
|
||||
|
||||
$$
|
||||
a = \text{gain} \cdot \sqrt{
|
||||
\frac{
|
||||
6
|
||||
}{
|
||||
\text{fan-in} + \text{fan-out}
|
||||
}
|
||||
}
|
||||
$$
|
||||
|
||||
Alternatively, one can use a `normal-distribution`
|
||||
$\mathcal{N}(0, \sigma^2)$.
|
||||
|
||||
Note that `gain` in the **original paper** is equal
to $1$.
|
||||
|
||||
### Decorrelating input components
|
||||
|
||||
Since ***highly correlated features*** don't offer much
|
||||
in terms of ***new information***, probably we need
|
||||
to go in the ***latent space*** to find the
|
||||
`latent-variables` governing those `features`.
|
||||
|
||||
#### PCA
|
||||
|
||||
> [!CAUTION]
|
||||
> This topic won't be explained here as it's something
|
||||
> usually learnt for `Machine Learning`, a
|
||||
> ***prerequisite*** for approaching `Deep Learning`.
|
||||
|
||||
This is a method we can use to discard `features` that
|
||||
will ***add little to no information***
|
||||
|
||||
## Common problems in MultiLayer Networks
|
||||
|
||||
### Hitting a Plateau
|
||||
|
||||
This happens when we have a ***big `learning-rate`***,
which makes `weights` grow large in ***absolute value***.
|
||||
|
||||
Because this happens ***too quickly***, we could
|
||||
see a ***quick diminishing error*** and this is usually
|
||||
***mistaken for a minimum point***, while instead
|
||||
it's a ***plateau***.
|
||||
|
||||
## Speeding up Mini-Batch Learning
|
||||
|
||||
### Momentum[^momentum]
|
||||
|
||||
We use this method ***mainly when we use `SGD`*** as
the ***learning technique***.
|
||||
|
||||
This method is better explained if we imagine
|
||||
our error surface as an actual surface and we place a
|
||||
ball over it.
|
||||
|
||||
***The ball will start rolling along the steepest
descent*** (initially), but ***after gaining enough
velocity*** it will keep following the ***previous
direction, in some measure***.
|
||||
|
||||
So, now the ***gradient*** does modify the ***velocity***
|
||||
rather than the ***position***, so the momentum will
|
||||
***dampen small variations***.
|
||||
|
||||
Moreover, once the ***momentum builds up***, we will
|
||||
easily ***pass over plateaus*** as the
|
||||
***ball will continue to roll over*** until it is
|
||||
stopped by a negative ***gradient***
|
||||
|
||||
#### Momentum Equations
|
||||
|
||||
There are a couple of them, mainly.
|
||||
|
||||
One of them uses a term to evaluate the `momentum`, $p$,
|
||||
called `SGD momentum` or `momentum term` or
|
||||
`momentum parameter`:
|
||||
|
||||
$$
|
||||
\begin{aligned}
|
||||
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
|
||||
w_{k+1} &= w_{k} - \gamma p_{k+1}
|
||||
\end{aligned}
|
||||
$$
|
||||
|
||||
The other one is ***logically equivalent*** to the
previous, but it updates the `weights` in ***one step***
and is called the `Stochastic Heavy Ball Method`:
|
||||
|
||||
$$
|
||||
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
|
||||
+ \beta ( w_k - w_{k-1})
|
||||
$$
|
||||
|
||||
> [!NOTE]
|
||||
> This is how to choose $\beta$:
|
||||
>
|
||||
> $0 < \beta < 1$
|
||||
>
|
||||
> If $\beta = 0$, then we are doing
|
||||
> ***gradient descent***, if $\beta > 1$ then we
|
||||
> ***will have numerical instabilities***.
|
||||
>
|
||||
> The ***larger*** $\beta$ the
|
||||
> ***higher the `momentum`***, so it will
|
||||
> ***turn slower***
|
||||
|
||||
> [!TIP]
|
||||
> usual values are $\beta = 0.9$ or $\beta = 0.99$
|
||||
> and usually we start from 0.5 initially, to raise it
|
||||
> whenever we are stuck.
|
||||
>
|
||||
> When we increase $\beta$, then the `learning rate`
|
||||
> ***must decrease accordingly***
|
||||
> (e.g. from 0.9 to 0.99, `learning-rate` must be
|
||||
> divided by a factor of 10)
|
||||
|
||||
#### Nesterov (1983) Sutskever (2012) Accelerated Momentum
|
||||
|
||||
Differently from the previous
|
||||
[momentum](#momentum-equations),
|
||||
we take an ***intermediate*** step where we
|
||||
***update the `weights`*** according to the
|
||||
***previous `momentum`*** and then we compute the
|
||||
***new `momentum`*** in this new position, and then
|
||||
we ***update again***
|
||||
|
||||
$$
|
||||
\begin{aligned}
|
||||
\hat{w}_k & = w_k - \beta p_k \\
|
||||
p_{k+1} &= \beta p_{k} +
|
||||
\eta \nabla L(X, y, \hat{w}_k) \\
|
||||
w_{k+1} &= w_{k} - \gamma p_{k+1}
|
||||
\end{aligned}
|
||||
$$
|
||||
|
||||
#### Why Momentum Works
|
||||
|
||||
While it has been ***hypothesized*** that
|
||||
***acceleration*** made ***convergence faster***, this
|
||||
is
|
||||
***only true for convex problems without much noise***,
|
||||
though this may be ***part of the story***
|
||||
|
||||
The other half may be ***Noise Smoothing*** by
|
||||
smoothing the optimization process, however
|
||||
according to these papers[^no-noise-smoothing][^no-noise-smoothing-2] this may not be the actual reason.
|
||||
|
||||
### Separate Adaptive Learning Rates
|
||||
|
||||
Since `weights` may ***greatly vary*** across `layers`,
having a ***single `learning-rate`*** might not be ideal.
|
||||
|
||||
So the idea is to set a `local learning-rate` to
|
||||
control the `global` one as a ***multiplicative factor***
|
||||
|
||||
#### Local Learning rates
|
||||
|
||||
- Start with $1$ as the ***starting point*** for
|
||||
`local learning-rates` which we'll call `gain` from
|
||||
now on.
|
||||
- If the `gradient` keeps the ***same sign, increase it additively***
- Otherwise, ***decrease it multiplicatively***
|
||||
|
||||
$$
\Delta w_{i,j} = - g_{i,j} \cdot \eta \frac{d \, Out}{d \, w_{i,j}} \\
g_{i,j}(t) = \begin{cases}
    g_{i,j}(t - 1) + \delta
    & \left( \frac{d \, Out}{d \, w_{i,j}}(t)
        \cdot
        \frac{d \, Out}{d \, w_{i,j}}(t-1) \right) > 0 \\
    g_{i,j}(t - 1) \cdot (1 - \delta)
    & \left( \frac{d \, Out}{d \, w_{i,j}}(t)
        \cdot
        \frac{d \, Out}{d \, w_{i,j}}(t-1) \right) \leq 0
\end{cases}
$$
|
||||
|
||||
With this method, if there are oscillations, we will have
|
||||
`gains` around $1$
|
||||
|
||||
> [!TIP]
|
||||
>
|
||||
> - Usually a value for $\delta$ is $0.05$
|
||||
> - Limit `gains` around some values:
|
||||
>
|
||||
> - $[0.1, 10]$
|
||||
> - $[0.01, 100]$
|
||||
>
|
||||
> - Use `full-batches` or `big mini-batches` so that
|
||||
> the ***gradient*** doesn't oscillate because of
|
||||
> sampling errors
|
||||
> - Combine it with [Momentum](#momentum)
|
||||
> - Remember that ***Adaptive `learning-rate`*** deals
|
||||
> with ***axis-alignment***
|
||||
|
||||
### rmsprop | Root Mean Square Propagation
|
||||
|
||||
#### rprop | Resilient Propagation[^rprop-torch]
|
||||
|
||||
This is basically the same idea of [separating learning rates](#separate-adaptive-learning-rates),
|
||||
but in this case we don't use the
|
||||
[AIMD](#local-learning-rates) technique and
|
||||
***we don't take into account*** the
|
||||
***magnitude of the gradient*** but ***only the sign***
|
||||
|
||||
- If ***gradient*** has same sign:
|
||||
- $step_{k} = step_{k} \cdot \eta_+$ where $\eta_+ > 1$
|
||||
- else:
|
||||
- $step_{k} = step_{k} \cdot \eta_-$
|
||||
where $0 <\eta_- < 1$
|
||||
|
||||
> [!TIP]
|
||||
>
|
||||
> Limit the step size to a range, e.g. the common
> $[10^{-6}, 50]$
|
||||
|
||||
> [!CAUTION]
|
||||
>
|
||||
> rprop does ***not work*** with `mini-batches` as
|
||||
> the ***sign of the gradient changes frequently***
|
||||
|
||||
#### rmsprop in detail[^rmsprop-torch]
|
||||
|
||||
The idea is that [rprop](#rprop--resilient-propagation)
|
||||
is ***equivalent to using the gradient divided by its
|
||||
value*** (as you either multiply by $1$ or $-1$),
|
||||
however it means that between `mini-batches` the
|
||||
***divisor*** changes each time, oscillating.
|
||||
|
||||
The solution is to have a ***running average*** of
|
||||
the ***magnitude of the squared gradient for
|
||||
each `weight`***:
|
||||
|
||||
$$
|
||||
MeanSquare(w, t) =
|
||||
\alpha MeanSquare(w, t-1) +
|
||||
(1 - \alpha)
|
||||
\left(
|
||||
\frac{d\, Out}{d\, w}^2
|
||||
\right)
|
||||
$$
|
||||
|
||||
We then divide the ***gradient by the `square root`***
|
||||
of that value
|
||||
|
||||
#### Further Developments
|
||||
|
||||
- `rmsprop` with `momentum` does not work as it should
|
||||
- `rmsprop` with `Nesterov momentum` works best
|
||||
if used to divide the ***correction*** rather than
|
||||
the ***jump***
|
||||
- `rmsprop` with `adaptive learnings` needs more
|
||||
investigation
|
||||
|
||||
### Fancy Methods
|
||||
|
||||
#### Adaptive Gradient
|
||||
|
||||
<!-- TODO: Expand over these -->
|
||||
|
||||
##### Convex Case
|
||||
|
||||
- Conjugate Gradient/Acceleration
|
||||
- L-BFGS
|
||||
- Quasi-Newton Methods
|
||||
|
||||
##### Non-Convex Case
|
||||
|
||||
Pay attention: here the `Hessian` may not be
`positive semi-definite`, thus when the ***gradient*** is
$0$ we don't necessarily know where we are.
|
||||
|
||||
- Natural Gradient Methods
|
||||
- Curvature Adaptive
|
||||
- [Adagrad](./Fancy-Methods/ADAGRAD.md)
|
||||
- [AdaDelta](./Fancy-Methods/ADADELTA.md)
|
||||
- [RMSprop](#rmsprop-in-detail)
|
||||
- [ADAM](./Fancy-Methods/ADAM.md)
|
||||
- l-BFGS
|
||||
- [heavy ball gradient](#momentum)
|
||||
- [momentum](#momentum)
|
||||
- Noise Injection:
|
||||
- Simulated Annealing
|
||||
- Langevin Method
|
||||
|
||||
#### Adagrad
|
||||
|
||||
> [!NOTE]
|
||||
> [Here in detail](./Fancy-Methods/ADAGRAD.md)
|
||||
|
||||
#### Adadelta
|
||||
|
||||
> [!NOTE]
|
||||
> [Here in detail](./Fancy-Methods/ADADELTA.md)
|
||||
|
||||
#### ADAM
|
||||
|
||||
> [!NOTE]
|
||||
> [Here in detail](./Fancy-Methods/ADAM.md)
|
||||
|
||||
#### AdamW
|
||||
|
||||
> [!NOTE]
|
||||
> [Here in detail](./Fancy-Methods/ADAM-W.md)
|
||||
|
||||
#### LION
|
||||
|
||||
> [!NOTE]
|
||||
> [Here in detail](./Fancy-Methods/LION.md)
|
||||
|
||||
### Hessian Free[^anelli-hessian-free]
|
||||
|
||||
How much can we `learn` from a given
|
||||
`Loss` space?
|
||||
|
||||
The ***best way to move*** would be along the
***gradient***, assuming the surface has
the ***same curvature*** everywhere
(e.g. it's quadratic and has a local minimum).
|
||||
|
||||
But ***usually this is not the case***, so we need
|
||||
to move ***where the ratio of gradient and curvature is
|
||||
high***
|
||||
|
||||
#### Newton's Method
|
||||
|
||||
This method takes into account the ***curvature***
|
||||
of the `Loss`
|
||||
|
||||
With this method, the update would be:
|
||||
|
||||
$$
|
||||
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
|
||||
d \, E
|
||||
}{
|
||||
d \, \vec{w}
|
||||
}
|
||||
$$
|
||||
|
||||
***If this were feasible we would reach the minimum in
one step***, but it's not, as the
***computation***
needed to get a `Hessian` ***grows quadratically with the number of weights***
(and inverting it is even more expensive).
|
||||
|
||||
The thing is that whenever we ***update `weights`*** with
|
||||
the `Steepest Descent` method, each update *messes up*
|
||||
another, while the ***curvature*** can help to ***scale
|
||||
these updates*** so that they do not disturb each other.
|
||||
|
||||
#### Curvature Approximations
|
||||
|
||||
However, since the `Hessian` is
|
||||
***too expensive to compute***, we can approximate it.
|
||||
|
||||
- We can take only the ***diagonal elements***
|
||||
- ***Other algorithms*** (e.g. Hessian Free)
|
||||
- ***Conjugate Gradient*** to minimize the
|
||||
***approximation error***
|
||||
|
||||
#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]
|
||||
|
||||
> [!CAUTION]
|
||||
>
|
||||
> This is an oversimplification of the topic, so reading
> the footnote material is greatly advised.
|
||||
|
||||
The basic idea is that, in order not to mess up previous
|
||||
directions, we ***`optimize` along perpendicular directions***.
|
||||
|
||||
This method is ***mathematically guaranteed to converge
after N steps, the dimension of the space***; in practice
the error becomes negligible after far fewer steps.
|
||||
|
||||
This ***method also works for `non-quadratic errors`***,
and the `Hessian Free` `optimizer` uses it
on ***genuinely quadratic surfaces***, which are
***quadratic approximations of the real surface***.
|
||||
|
||||
|
||||
<!-- TODO: Add PDF 5 pg. 38 -->
|
||||
|
||||
<!-- Footnotes -->
|
||||
|
||||
[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)
|
||||
|
||||
[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4
|
||||
|
||||
[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1
|
||||
|
||||
[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)
|
||||
|
||||
[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)
|
||||
|
||||
[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81
|
||||
|
||||
[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)
|
||||
|
||||
[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76
|
||||
`Chapters/5-Optimization/INDEX.md`
# Optimization

## Beyond Full Batches

Even though full batches give the best picture of the probability distribution
of data points, they are computationally expensive.

Since data is usually **highly redundant**, we can think of getting smaller,
class-balanced sets, **mini-batches**, to update weights.
While this doesn't give the same results as full batches, it is still reliable.

When we need to bring things to the extreme, we can even update over a single
data point, **online learning**; however it is not as efficient as
mini-batches, as **it does not use matrix multiplications, which are GPU efficient**.

## Learning rate Scheduling
## Xavier-Glorot Weight initialization

> [!WARNING]
> Before Xavier-Glorot there was another initialization technique, scaled by
> fan-in:
>
> $$ W \propto \frac{rand(in, out)}{\sqrt{in}}$$
>
> Though, Xavier-Glorot is not the only available initialization, as there are
> many others[^torch-init]

Whenever we initialize weights, we need to be careful to **break symmetry**, as
**identical hidden nodes get the exact same results**, making us
lose representation power.

Another problem with weight initialization is **overshooting**, caused by
**many small changes over weights** adding up. The idea to solve this is to
**initialize weights proportionally to fan-in (input) and fan-out (output)**.

A technique we use to initialize weights comes from Xavier and Glorot, called
Xavier-Glorot initialization:

$$
\begin{aligned}
&W \propto \frac{rand(in, out)}{\sqrt{in + out}} \\
&rand = \mathcal{U}(-a, a) \rightarrow a = g \cdot \sqrt{\frac{6}{in + out}} \\
&\,\,\,\,\text{or} \\
&rand = \mathcal{N}(0, \sigma^2) \rightarrow \sigma = g \cdot
\sqrt{\frac{2}{in + out}}
\end{aligned}
$$

In other words, Xavier-Glorot extracts weights from either a uniform
distribution or a normal one, scaled by a factor $g$ called gain.

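A small sketch of both variants (my own; it mirrors what
`torch.nn.init.xavier_uniform_` and `xavier_normal_` do, with `g` as the gain):

```python
import numpy as np

def xavier_uniform(n_in: int, n_out: int, g: float = 1.0) -> np.ndarray:
    # Bound a = g * sqrt(6 / (fan_in + fan_out))
    a = g * np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-a, a, size=(n_in, n_out))

def xavier_normal(n_in: int, n_out: int, g: float = 1.0) -> np.ndarray:
    # Std sigma = g * sqrt(2 / (fan_in + fan_out))
    sigma = g * np.sqrt(2.0 / (n_in + n_out))
    return np.random.normal(0.0, sigma, size=(n_in, n_out))
```
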
[^torch-init]: [Pytorch Official Docs | `torch.nn.init` | 18th November 2025](https://docs.pytorch.org/docs/stable/nn.init.html)

## Momentum

> [!TIP]
> Usual values are $\beta = 0.9$ or $\beta = 0.99$, and usually we start from
> 0.5 initially, raising it whenever we are stuck.
>
> When we increase $\beta$, the `learning rate`
> ***must decrease accordingly***:
> for $\beta$ going from 0.9 to 0.99, the learning rate needs to be decreased
> by a factor of 10.

It's a technique inspired by physics. Imagine a ball rolling over a plane: once
it has enough speed, even if the plane changes inclination, the ball still has
energy to move along the previous direction because of its momentum.

Whenever gradient descent oscillates, **momentum dampens** the movements
steering us away from the previous direction. Here the momentum at time $k$
is $p_k$:

$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, Y, W_{k}) \\
W_{k+1} &= W_{k} - \gamma p_{k+1} \\
\beta &\in [0, 1]
\end{aligned}
$$

Or, in a more compact way, logically equivalent to the previous one:

$$
W_{k+1} = W_{k} - \gamma \nabla L(X, Y, W_{k}) + \beta(W_{k} - W_{k-1})
$$

The larger $\beta$, the slower it curves, accumulating more of the previous
directions. To play it safe, use smaller values at the beginning, where updates
are large, and slowly turn it up to values near 1.

> [!NOTE]
>
> - $\eta$: hyperparameter scaling the gradient, usually equal to the learning
>   rate
> - $\gamma$: learning rate
> - $\beta$: dampening-factor hyperparameter
> - $\nabla L(X, Y, W_{k})$: gradient of the loss

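A minimal sketch of this update rule (my own, in NumPy; a `grad_fn` returning
$\nabla L$ on the current mini-batch is assumed):

```python
import numpy as np

def sgd_momentum_step(W, p, grad_fn, eta=0.01, gamma=1.0, beta=0.9):
    # p accumulates an exponentially-weighted history of gradients;
    # the weights then move along -p instead of the raw gradient.
    p = beta * p + eta * grad_fn(W)
    W = W - gamma * p
    return W, p
```
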
## Nesterov Accelerated Gradient (aka NAG)

This method takes inspiration from Nesterov's optimization for convex functions
and applies it to momentum. Its quirk is that it never computes the gradient on
the point it lands on, but on a temporary update of the weights made before the
actual one.

|Vanilla Momentum[^Akshay-medium-1] | Nesterov Momentum[^Akshay-medium-1] |
|--|--|
|  |  |

To illustrate its quirk better, here's the formulation:

$$
\begin{aligned}
\hat{W}_{k} &= W_{k} - \beta p_k \\
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, Y, \hat{W}_k) \\
W_{k+1} &= W_{k} - \gamma p_{k+1}
\end{aligned}
$$

As can be seen, the loss is computed over $\hat{W}_{k}$ rather than $W_{k}$,
which will be our actual weights. The idea is to follow the previous momentum
blindly, see where it goes, and then make the correction.

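A matching sketch (my own, same assumptions as the momentum snippet above):

```python
import numpy as np

def nag_step(W, p, grad_fn, eta=0.01, gamma=1.0, beta=0.9):
    # Peek ahead along the previous momentum, evaluate the gradient
    # there, then correct the real weights.
    W_hat = W - beta * p
    p = beta * p + eta * grad_fn(W_hat)
    W = W - gamma * p
    return W, p
```
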
[^Akshay-medium-1]: [Akshay L Chandra | Learning Parameters, Part 2: Momentum-Based & Nesterov Accelerated Gradient Descent | 18th November 2025](https://medium.com/data-science/learning-parameters-part-2-a190bef2d12)

## Justifying Faster Optimization for Momentum Based Methods

While many people justify the speed of momentum-based methods by their
acceleration, this doesn't hold true, as acceleration applies only to convex
functions.

Since most of the time we have no idea what our loss surface looks like,
we can't make assumptions about it being convex.

So, the most compelling explanation lies in the fact that momentum-based
optimization is like computing a running average of the loss gradient, smoothing
the noise introduced by the smaller sampling size. In fact, with momentum it is
not necessary to average steps as in SGD.

## Separate Adaptive Learning Rate

The idea is that each weight of each layer may need its own learning rate, to
avoid overshooting and to smooth the magnitude of the received gradients: high
over the last layers and low over the first ones (architecture-wise).

The trick is to have a global learning rate, adjusted by a local gain that
is increased each time the weight update keeps the same sign, and decreased
otherwise:

$$
\Delta w_{i,j} = - \eta \cdot g_{i,j} \frac{d \, Loss}{d \, w_{i,j}} \\
g_{i,j}(n + 1) = \begin{cases}
    g_{i,j}(n) + 0.05 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) > 0 \\
    g_{i,j}(n) \cdot 0.95 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) < 0
\end{cases}
$$
|
||||
|
||||
With this method, if there are oscillations, we will have
|
||||
`gains` around $1$
|
||||
This method ensures that if the weight oscillates, the gain will dampen it.
|
||||
Moreover, should it be totally random, it will hover near 1, keeping gradient
|
||||
updates unchanged.
|
||||
|
||||
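A sketch of the gain update (my own; `dw` and `dw_prev` are the current and
previous updates for one weight):

```python
def update_gain(g: float, dw: float, dw_prev: float) -> float:
    # Additive increase while the update direction agrees,
    # multiplicative decrease when it flips (AIMD-like).
    if dw * dw_prev > 0:
        return g + 0.05
    return g * 0.95
```
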
> [!NOTE]
> The way $g$ is updated is similar to AIMD in TCP congestion control

<!-- Comment to appease the markdown linter -->

> [!TIP]
>
> - **Clip gains to some margins** - $[0.1, 10]$ or $[0.01, 100]$
> - **Use full batch or big mini-batches** - This ensures that the change in sign
>   is not due to sampling errors
> - **Combine this with momentum**
> - **Use this to deal with axis-alignment problems**

## Resilient Backpropagation (aka RProp)

Instead of using the magnitude of the gradient, **RProp uses its sign to derive
updates**, multiplied by a step value. Here's the formulation[^florian-1]:

$$
w_{i,j}^{(n)} = w_{i,j}^{(n-1)} - s_{i,j}^{(n-1)} \cdot \text{sign}\left(
    \frac{d \, Loss^{(n - 1)}}{d \, w_{i,j}}
\right) \\
s_{i,j}^{(n)} = \begin{cases}
    s_{i,j}^{(n - 1)} \cdot 1.2 &
        \text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right)
        \cdot
        \text{sign}\left(\frac{d \, Loss^{(n - 1)}}{d \, w_{i,j}}\right) > 0 \\
    s_{i,j}^{(n - 1)} \cdot 0.5 &
        \text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right)
        \cdot
        \text{sign}\left(\frac{d \, Loss^{(n - 1)}}{d \, w_{i,j}}\right) < 0
\end{cases} \\
s_{i,j} \in [10^{-6}, 50]
$$

It is noticeable that, like
[separate adaptive learning rates](#separate-adaptive-learning-rate), it
increases or decreases a gain. However, since it uses multiplication to
increase the step, the step grows fast, making it unusable for anything but
full-batches.
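
A per-weight sketch (my own; the gradients `g` and `g_prev` are assumed given):

```python
import numpy as np

def rprop_step(w, s, g, g_prev, s_min=1e-6, s_max=50.0):
    # Only the sign of the gradient is used; the step size s is the
    # per-weight "learning rate", grown or shrunk multiplicatively.
    agree = np.sign(g) * np.sign(g_prev)
    s = np.where(agree > 0, s * 1.2, np.where(agree < 0, s * 0.5, s))
    s = np.clip(s, s_min, s_max)
    return w - s * np.sign(g), s
```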

[^florian-1]: [Florian | RProp | 19th November 2025](https://florian.github.io/rprop/)

## Root Mean Square Propagation (aka RMSProp)

As the name implies, it propagates a running estimate across steps, a bit like
momentum. Since [RProp](#resilient-backpropagation-aka-rprop) uses only the
sign of the gradient, it's almost like dividing the gradient by its magnitude,
which is bad in case of mini-batches, as all the divisors are different.

RMSProp solves this by keeping the gradient magnitude similar across
mini-batches, via a running average of it:

$$
L^{(k)} = \beta L^{(k-1)} + (1 - \beta) \left(
    \frac{d \, Loss}{d\, W^{(k - 1)}}
\right)^2 \\
W^{(k)} = W^{(k-1)} - \eta \frac{1}{\sqrt{L^{(k)}}} \frac{d \, Loss}{d\, W^{(k - 1)}} \\
\text{usually } \beta = 0.9
$$

What this method does is keep a running average of the mean squared gradient,
hence the name, and use it to normalize the gradient, keeping it similar across
mini-batches.
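
A compact sketch (my own; `grad` is the mini-batch gradient, and I add the
usual small `eps` guard against division by zero):

```python
import numpy as np

def rmsprop_step(W, L_avg, grad, eta=0.001, beta=0.9, eps=1e-8):
    # Running average of the squared gradient normalizes the update,
    # so its scale stays comparable across mini-batches.
    L_avg = beta * L_avg + (1 - beta) * grad**2
    W = W - eta * grad / (np.sqrt(L_avg) + eps)
    return W, L_avg
```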

> [!NOTE]
> While it can be used with momentum, it doesn't seem to add as much benefit as
> using it standalone.
>
> With Nesterov, it works best if used to normalize the correction, rather than
> the jump. As for the adaptive learning rates, it still requires further
> investigation to prove the efficacy.

## Adaptive Gradient Methods

<!--
MARK: AdaGrad
-->
### AdaGrad[^adagrad-torch]

`AdaGrad` is an ***optimization method*** aimed
to:

<ins>***"find needles in the haystack in the form of
very predictive yet rarely observed features"***
[^adagrad-official-paper]</ins>

`AdaGrad`, opposed to a standard `SGD` step that is the
***same for every gradient geometry***, tries to
***incorporate geometry from earlier iterations***.

#### AdaGrad Algorithm

`AdaGrad` takes the following
approach[^anelli-adagrad-2][^adagrad-official-paper]:

$$
\begin{aligned}
g_{i}^{(k + 1)} &= \frac{d \, Loss}{d \, w_{i}^{(k)}} \\
G^{(k + 1)} &= \sum_{\tau = 1}^{k + 1} g^{(\tau)} g^{(\tau)T} \\
w_{i}^{(k + 1)} &= w_{i}^{(k)} - \eta \cdot \frac{
    1
}{
    \sqrt{G_{i,i}^{(k + 1)} + \epsilon}
} \cdot g_{i}^{(k+1)}
\end{aligned}
$$

Here $G^{(k)}$ is the ***sum of outer products*** of the
***gradients*** up to step $k$. Usually the full $G^{(k)}$ ***is
not used***, as it is ***impractical because
of the high number of dimensions***; instead we use
$diag(G^{(k)})$, which can be
***computed in linear time***[^adagrad-official-paper].

The $\epsilon$ term here is used to
***avoid dividing by 0***[^anelli-adagrad-2] and has a
small value, usually in the order of $10^{-8}$.

> [!NOTE]
>
> This example is tough to understand if we were to apply it to a matrix $W$
> instead of a vector. To make it easier to understand, in matrix notation:
>
> $$
> \begin{aligned}
> \nabla L^{(k + 1)} &= \frac{d \, Loss^{(k)}}{d \, W^{(k)}} \\
> G^{(k + 1)} &= G^{(k)} + (\nabla L^{(k+1)})^2 \\
> W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}{\sqrt{G^{(k+1)} + \epsilon}}
> \end{aligned}
> $$
>
> In other words, compute the gradient and scale it by the sum of its squares
> up to that point
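
A sketch in that matrix notation (my own):

```python
import numpy as np

def adagrad_step(W, G, grad, eta=0.01, eps=1e-8):
    # G accumulates *all* past squared gradients (the diagonal of the
    # outer-product sum), so the effective step shrinks over time.
    G = G + grad**2
    W = W - eta * grad / np.sqrt(G + eps)
    return W, G
```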

#### AdaGrad effectiveness[^anelli-adagrad-3]

- When we have ***many dimensions, many features are
  irrelevant***
- ***Rarer features are more relevant***
- It adapts $\eta$ to the right metric space
  by projecting stochastic gradient updates with the
  [Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance),
  a distance of a point from a probability distribution

#### AdaGrad Considerations

- It eliminates the need to manually tune the
  `learning rate`, which is usually set to $0.01$
- The squared ***gradients*** are accumulated across
  iterations, making the effective `learning-rate`
  ***smaller and smaller***, eventually approaching 0 and halting training
|
||||
|
||||
<!-- Footnotes -->

[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf)

[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)

[^regret-definition]: [Definition of Regret | 19th April 2025](https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/tradeoff-between-exploration-and-exploitation/exploration-and-exploitation/examination-review-exploration-and-exploitation/explain-the-concept-of-regret-in-reinforcement-learning-and-how-it-is-used-to-evaluate-the-performance-of-an-algorithm/#:~:text=Regret%20quantifies%20the%20difference%20in,and%20making%20decisions%20over%20time.)

[^anelli-adagrad-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42

[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43

[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44

### AdaDelta[^adadelta-offcial-paper]

`ADADELTA` was inspired by [`AdaGrad`](./ADAGRAD.md) and
created to address some of its problems, like
***sensitivity to initial `parameters` and the corresponding
gradient***[^adadelta-offcial-paper].

To address these problems, `ADADELTA` accumulates
***gradients over a `window`, as a running average***, rather than
***accumulating them over all iterations***:

$$
G^{(k+1)} = \beta \cdot G^{(k)} + (1 - \beta) \cdot (\nabla L^{(k+1)})^2
$$

The update, which is very similar to the one in
[AdaGrad](./ADAGRAD.md#the-algorithm), becomes:

$$
\begin{aligned}
W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}{\sqrt{G^{(k+1)} + \epsilon}}
\end{aligned}
$$

Technically speaking, the last equation is basically equivalent to the
[RMSProp](#root-mean-square-propagation-aka-rmsprop) one, as $G$ is
equivalent to the running average of the mean square.

However, as the author pointed out[^adadelta-units], this equation does not
respect units of measure. We can correct this problem
by ***considering the curvature locally smooth*** and
taking an approximation of it at the next step, becoming:

$$
\begin{aligned}
\Delta W^{(k)} &= - \frac{\sqrt{S^{(k-1)}}}{\sqrt{G^{(k)}}}
    \nabla L^{(k)} \\
S^{(k)} &= \beta S^{(k - 1)} + (1 - \beta) \left(\Delta W^{(k)}\right)^2 \\
W^{(k + 1)} &= W^{(k)} + \Delta W^{(k)}
\end{aligned}
$$

As we can notice, the ***`learning rate` completely
disappears from the equation, eliminating the need to
set one***.
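
A sketch (my own; I add the $\epsilon$ guard under both square roots, as in
the paper's practical form):

```python
import numpy as np

def adadelta_step(W, G, S, grad, beta=0.9, eps=1e-6):
    # G tracks squared gradients, S tracks squared updates; their ratio
    # gives a per-weight step with consistent units and no learning rate.
    G = beta * G + (1 - beta) * grad**2
    dW = -np.sqrt(S + eps) / np.sqrt(G + eps) * grad
    S = beta * S + (1 - beta) * dW**2
    W = W + dW
    return W, G, S
```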

> [!WARNING]
> Here $\Delta W$ is already negative, that's why there's a $+$ in the last
> equation

<!-- Footnotes -->

[^adadelta-offcial-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)

[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)

### Adaptive Moment Estimation (aka AdaM)

AdaM computes both the momentum and the squared gradients with running
averages, which are zero-initialized at time $k = 0$:

$$
\begin{aligned}
M^{(k+1)} &= \beta_1 M^{(k)} + (1 - \beta_1) \nabla L \\
V^{(k+1)} &= \beta_2 V^{(k)} + (1 - \beta_2) (\nabla L)^2
\end{aligned}
$$

> [!WARNING]
> The squared gradient can be thought of as the variance; however, it's not
> centered

Then it corrects them to be used in the final formulation:

$$
\begin{aligned}
\hat{M}^{(k+1)} &= \frac{M^{(k+1)}}{1 - \beta_1^{k + 1}} \\
\hat{V}^{(k+1)} &= \frac{V^{(k+1)}}{1 - \beta_2^{k + 1}}
\end{aligned}
$$

> [!WARNING]
> $\beta_1$ and $\beta_2$ are raised to the power of $k + 1$, the timestep.

Then it computes the update in this way:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)}}
    {\sqrt{\hat{V}^{(k+1)}} + \epsilon}
$$

Even though Adam works, it doesn't generalize well and, particularly on image
problems, it performs worse than standard SGD. Moreover, we need to keep 3
buffers instead of 1 as for SGD, and 2 of them need parameter tuning.

> [!NOTE]
> Author-proposed values are $\beta_1 = 0.9$, $\beta_2 = 0.999$ and
> $\epsilon = 10^{-8}$

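A sketch with those values (my own; `k` starts at 0):

```python
import numpy as np

def adam_step(W, M, V, grad, k, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Running averages of the gradient (M) and its square (V),
    # bias-corrected because both start from zero.
    M = b1 * M + (1 - b1) * grad
    V = b2 * V + (1 - b2) * grad**2
    M_hat = M / (1 - b1 ** (k + 1))
    V_hat = V / (1 - b2 ** (k + 1))
    W = W - eta * M_hat / (np.sqrt(V_hat) + eps)
    return W, M, V
```
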
### AdamW

AdamW tries to solve AdaM's problems by introducing weight decay. In all
honesty, AdaM already implements it; however, it is usually added into the
momentum, getting scaled by $\sqrt{\hat{V}}$:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)} + \alpha W^{(k)}}
    {\sqrt{\hat{V}^{(k+1)}} + \epsilon}
$$

The AdamW authors saw that this was inefficient, as it was influenced by the
uncentered variance, so they modified the formula to this:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)}}
    {\sqrt{\hat{V}^{(k+1)}} + \epsilon} - \lambda W^{(k)}
$$

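Relative to the Adam sketch above, only the update line changes: the decay is
applied directly to the weights, outside the adaptive scaling (`lam` is a
hypothetical weight-decay factor, my own naming):

```python
# Decoupled weight decay (AdamW), continuing the adam_step sketch:
W = W - eta * M_hat / (np.sqrt(V_hat) + eps) - lam * W
```
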
### Lion (evoLved sIgn mOmeNtum)[^official-paper]

`Lion` is the result of a ***genetic search algorithm*** aimed to
find the best `optimizer`.

It starts from a population of `AdamW` algorithms to
***speed up the search***. Opposed to
`Adam` and `AdamW`, it keeps track
***only of the momentum*** and the ***gradient sign***,
requiring ***less `memory`***.

Since ***uniform updates yield larger norms***,
`Lion` requires a ***smaller `learning-rate`***
and a ***larger decoupled `weight-decay`***
$\lambda$[^official-paper-1].

The ***advantages of `Lion` over `Adam` and `AdamW`
increase with the size of
the `mini-batch`***[^official-paper-1].

#### Symbolic Representation[^official-paper-2]

New ***trained algorithms*** are represented
`symbolically`, bringing these advantages:

- `Algorithms` must be ***implemented*** as `programs`
- It is ***easier to analyze, comprehend and transfer to
  new tasks*** these `algorithms`, compared to other
  representations such as `NeuralNetworks`
- We can **estimate the *complexity*** by looking
  at the ***length of the code***

#### Tournament[^official-paper-3]

The best code is found with a ***tournament-style
evolution***. Each cycle picks the ***best
`algorithm`***, which is
***copied and mutated***, while the ***oldest is removed***.

<!-- Footnotes -->

[^official-paper]: [Official Lion Paper | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-1]: [Official Lion Paper | Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-2]: [Official Lion Paper | Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-3]: [Official Lion Paper | Paragraph 2 pg. 4-5 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

## Hessian Free Optimization

Since we are moving on a function whose gradient is not constant, by looking at
the curvature, the [Hessian Matrix](./../15-Appendix-A/INDEX.md#hessian-matrix),
we can see how fast it changes.

### Newton's Method

This method would technically give us the solution in one step on a quadratic
function, but it is unfeasible due to the memory and computational requirements:

$$
\Delta W = - \epsilon H(W)^{-1} \times \frac{d\, L}{d\, W}
$$

### Conjugate Gradient

The idea is to correct the weights so that we reduce the gradient to 0 along
mutually non-interfering (conjugate) directions. This means that, with each
update, we are not messing up the previous optimizations.

While it is usually used for quadratic error surfaces, there's a non-linear
variant (non-linear conjugate gradient) that usually works well. It is also
possible to approximate the true error function with a quadratic one and use
the standard method.

It gives a solution after $N$ steps over an $N$-dimensional quadratic surface;
however, we need to penalize frequent changes in weights, especially for the
hidden activities of [`RNNs`](./../8-Recurrent-Networks/INDEX.md).

## Optimization Tricks

### Input decorrelation

If you have linear neurons (think of a Feed Forward layer, not a Convolution),
it's better to decorrelate the input components.

A way to achieve this is through
[PCA](./../15-Appendix-A/INDEX.md#computing-pca),
transforming the error surface from an ellipse into a circle.

### Recognize Plateaus

If we start with big learning rates, the weights gain a big magnitude, so the
derivative becomes small and the error stops decreasing significantly.

This may seem like a local minimum, but it is usually a plateau.

### Mini-Batch Speed up

To speed up mini-batch training, use these methods:

- [**Momentum**](#momentum)
- [**Separate adaptive learning rates for each parameter**](#separate-adaptive-learning-rate)
- [**rmsprop**](#root-mean-square-propagation-aka-rmsprop)
- [**Adaptive Gradient Methods**](#adaptive-gradient-methods)
|
||||
|
||||
### Mini-batches vs Full-Batches

The rule of thumb is to use **full-batches for small datasets or low
redundancy**, and **mini-batches for redundant datasets**.

`Chapters/5-Optimization/pngs/nesterov.gif`, `Chapters/5-Optimization/pngs/vanilla-momentum.gif` (new binary files)