# Revised Optimization Notes

Commit `934c08d4c0` (parent `2a96deaebf`)

## Cross Entropy Loss derivation
Cross entropy[^wiki-cross-entropy] is the measure of *"surprise"*
we get from distribution $p$ knowing
results from distribution $q$. It is defined as the entropy of $p$ plus the
[Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$
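Spelling that definition out (a standard identity, added here for reference):

$$
H(p, q) = H(p) + D_{KL}(p \parallel q) = - \sum_{x} p(x) \log q(x)
$$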
Usually $\hat{y}$ comes from using a
[softmax](./../3-Activation-Functions/INDEX.md#softmax). Moreover, since it uses a
logarithm and probability values are at most 1, the closer the predicted probability is to 0, the higher the loss
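As a quick numeric illustration of that behaviour (a minimal sketch; the probabilities are made-up values):

```python
import numpy as np

# The contribution of the true class to cross entropy is -log(p_hat):
# near-certain correct predictions cost little, near-zero ones explode.
for p_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p_hat={p_hat:<4} -> loss={-np.log(p_hat):.4f}")
```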

## Computing PCA[^wiki-pca]

> [!CAUTION]
> $X$ here is the dataset matrix with **<ins>features over rows</ins>**
- $\Sigma = \frac{X \times X^T}{N} \coloneqq$ Covariance matrix approximation (assuming $X$ is mean-centered)
- $\vec{\lambda} \coloneqq$ vector of eigenvalues of $\Sigma$
- $\Lambda \coloneqq$ matrix with the eigenvectors of $\Sigma$ as columns, sorted by eigenvalue
- $\Lambda_{red} \coloneqq$ eigenvector matrix reduced to the $k$ eigenvectors with the
  highest eigenvalues
- $Z = \Lambda_{red}^T \times X \coloneqq$ compressed representation
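A minimal NumPy sketch of those steps (my own toy data, features over rows as the caution above requires, already mean-centered):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))          # 5 features over rows, 100 data points
X = X - X.mean(axis=1, keepdims=True)  # center each feature

Sigma = (X @ X.T) / X.shape[1]         # covariance approximation
lam, Lam = np.linalg.eigh(Sigma)       # ascending eigenvalues, eigenvectors as columns
order = np.argsort(lam)[::-1]          # re-sort by eigenvalue, descending
Lam = Lam[:, order]

k = 2
Lam_red = Lam[:, :k]                   # keep the k eigenvectors with highest eigenvalues
Z = Lam_red.T @ X                      # compressed representation, shape (k, 100)
print(Z.shape)
```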

> [!NOTE]
> You may have studied PCA in terms of SVD, Singular Value Decomposition. The two
> are closely related and capture the same concept through different
> mathematical formulations.
## Laplace Operator[^khan-1]

It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
It can also be used to compute the net flow of particles in that region of space

> This is not a **discrete Laplace operator**, which would instead be a **matrix** here;
> there are many other formulations.
## [Hessian Matrix](https://en.wikipedia.org/wiki/Hessian_matrix)

A Hessian Matrix represents the 2nd derivative of a function, thus it gives
us the curvature of the function.

It is also used to tell us whether a critical point is a local minimum (the Hessian is
positive definite), a local maximum (it is negative definite) or a saddle (neither
positive nor negative definite).

It is computed by taking the partial derivatives of the gradient along
all dimensions (the Jacobian of the gradient).

$$
\nabla f = \begin{bmatrix}
\frac{d \, f}{d\,x} & \frac{d \, f}{d\,y}
\end{bmatrix} \\
H(f) = \begin{bmatrix}
\frac{d^2 \, f}{d\,x^2} & \frac{d^2 \, f}{d \, x\,d\,y} \\
\frac{d^2 \, f}{d\, y \, d\,x} & \frac{d^2 \, f}{d\,y^2}
\end{bmatrix}
$$
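A small sketch of that classification rule via the eigenvalues of $H$ (my own example, $f(x, y) = x^2 - y^2$, which has a saddle at the origin):

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 (constant everywhere)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eig = np.linalg.eigvalsh(H)  # eigenvalues of the symmetric Hessian
if np.all(eig > 0):
    print("positive definite -> local minimum")
elif np.all(eig < 0):
    print("negative definite -> local maximum")
else:
    print("indefinite -> saddle point")  # this branch fires for this H
```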
[^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0)

[^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy)

[^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory))

[^wiki-pca]: [Wikipedia | Principal Component Analysis | 18th November 2025](https://en.wikipedia.org/wiki/Principal_component_analysis#Computation_using_the_covariance_method)

**New file:** `Chapters/15-Appendix-A/python-experiments/pca.ipynb` (199 lines)

# Computing PCA

Here I'll be taking data from [Geeks4Geeks](https://www.geeksforgeeks.org/machine-learning/mathematical-approach-to-pca/)

```python
import numpy as np

X: np.ndarray = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [3.1, 3.0],
    [2.3, 2.7],
    [2.0, 1.6],
    [1.0, 1.1],
    [1.5, 1.6],
    [1.1, 0.9]
])

# Compute mean values for features
mu_X = np.mean(X, 0)

print(mu_X)
# "Normalize" features (center them around their mean)
X = X - mu_X
print(X)

# Compute covariance matrix applying
# Bessel's correction (n-1) instead of n
Cov = (X.T @ X) / (X.shape[0] - 1)

print(Cov)
```

Output:

```text
[1.8 1.87777778]
[[ 0.7 0.52222222]
 [-1.3 -1.17777778]
 [ 0.4 1.02222222]
 [ 1.3 1.12222222]
 [ 0.5 0.82222222]
 [ 0.2 -0.27777778]
 [-0.8 -0.77777778]
 [-0.3 -0.27777778]
 [-0.7 -0.97777778]]
[[0.6925 0.68875 ]
 [0.68875 0.79444444]]
```

As you can notice, we did $X^T \times X$ instead of $X \times X^T$. This is because our
dataset had datapoints over rows instead of features.

```python
# Computing eigenvalues and eigenvectors of the covariance matrix
eigen = np.linalg.eig(Cov)
eigen_values = eigen.eigenvalues
eigen_vectors = eigen.eigenvectors

print(eigen_values)
print(eigen_vectors)
```

Output:

```text
[0.05283865 1.43410579]
[[-0.73273632 -0.68051267]
 [ 0.68051267 -0.73273632]]
```

Now we'll generate the new $X$ matrix by only using the principal eigenvector
(the one with the largest eigenvalue)

```python
# Compress X using only the principal eigenvector;
# column 1 holds the one with the largest eigenvalue (1.434...) here
Z_pca = X @ eigen_vectors[:, 1]
Z_pca = Z_pca.reshape([Z_pca.shape[0], 1])

print(Z_pca.shape)


# X reconstructed from the compressed representation
eigen_v = eigen_vectors[:, 1].reshape([eigen_vectors[:, 1].shape[0], 1])
X_rec = Z_pca @ eigen_v.T

print("Compressed")
print(Z_pca)

print("Reconstruction")
print(X_rec)

print("Difference")
print(abs(X - X_rec))
```

Output:

```text
(9, 1)
Compressed
[[-0.85901005]
 [ 1.74766702]
 [-1.02122441]
 [-1.70695945]
 [-0.94272842]
 [ 0.06743533]
 [ 1.11431616]
 [ 0.40769167]
 [ 1.19281215]]
Reconstruction
[[ 0.58456722 0.62942786]
 [-1.18930955 -1.28057909]
 [ 0.69495615 0.74828821]
 [ 1.16160753 1.25075117]
 [ 0.64153863 0.69077135]
 [-0.0458906 -0.04941232]
 [-0.75830626 -0.81649992]
 [-0.27743934 -0.29873049]
 [-0.81172378 -0.87401678]]
Difference
[[0.11543278 0.10720564]
 [0.11069045 0.10280131]
 [0.29495615 0.27393401]
 [0.13839247 0.12852895]
 [0.14153863 0.13145088]
 [0.2458906 0.22836546]
 [0.04169374 0.03872214]
 [0.02256066 0.02095271]
 [0.11172378 0.10376099]]
```

**New file:** `Chapters/5-Optimization/INDEX-OLD.md` (501 lines)
# Optimization

We basically try to see the error and minimize it by moving against the ***gradient***

## Types of Learning Algorithms

In `Deep Learning` it's not unusual to be facing ***highly redundant*** `datasets`.
Because of this, the ***gradient*** from some `samples` is usually the ***same*** as for some others.

So, often we train the `model` on a subset of samples.

### Online Learning

This is the ***extreme*** of our techniques to deal with ***redundancy*** of `data`.

On each `point` we get the ***gradient*** and then we update the `weights`.

### Mini-Batch

In this approach, we divide our `dataset` in small batches called `mini-batches`.
These need to be ***balanced***, so that no class ends up over- or under-represented.

This technique is the ***most used one***

## Tips and Tricks

### Learning Rate

This is the `hyperparameter` we use to tune our
***learning steps***.

Sometimes we set it too big and this causes
***overshooting***. So a quick solution may be to turn
it down.

However, we are ***trading speed for accuracy***, thus it's better to wait before tuning this `parameter`

### Weight initialization

We need to avoid `neurons` having the same
***gradient***. This is easily achievable by using
***small random values***.

However, if we have a ***large `fan-in`***, then it's
***easy to overshoot***, so it's better to initialize
those `weights` ***proportionally to***
$\sqrt{\text{fan-in}}$:

$$
w = \frac{np.random(N)}{\sqrt{N}}
$$

#### Xavier-Glorot Initialization

<!-- TODO: Read Xavier-Glorot paper -->

Here `weights` are ***proportional*** to $\sqrt{\text{fan-in}}$ as well, but we ***sample*** from a
`uniform distribution` with a `std-dev`

$$
\sigma = \text{gain} \cdot \sqrt{
\frac{2}{\text{fan-in} + \text{fan-out}}
}
$$

and bounded between $a$ and $-a$

$$
a = \text{gain} \cdot \sqrt{
\frac{6}{\text{fan-in} + \text{fan-out}}
}
$$

Alternatively, one can use a `normal-distribution`
$\mathcal{N}(0, \sigma^2)$.

Note that `gain` in the **original paper** is equal
to $1$

### Decorrelating input components

Since ***highly correlated features*** don't offer much
in terms of ***new information***, we probably need
to go in the ***latent space*** to find the
`latent-variables` governing those `features`.

#### PCA

> [!CAUTION]
> This topic won't be explained here as it's something
> usually learnt for `Machine Learning`, a
> ***prerequisite*** for approaching `Deep Learning`.

This is a method we can use to discard `features` that
will ***add little to no information***
## Common problems in MultiLayer Networks

### Hitting a Plateau

This happens when we have a ***big `learning-rate`***
which makes `weights` go high in ***absolute value***.

Because this happens ***too quickly***, we could
see a ***quickly diminishing error*** and this is usually
***mistaken for a minimum point***, while instead
it's a ***plateau***.

## Speeding up Mini-Batch Learning

### Momentum[^momentum]

We use this method ***mainly when we use `SGD`*** as
a ***learning technique***

This method is better explained if we imagine
our error surface as an actual surface and we place a
ball over it.

***The ball will start rolling towards the steepest
descent*** (initially), but ***after gaining enough
velocity*** it will follow the ***previous direction,
in some measure***.

So, now the ***gradient*** modifies the ***velocity***
rather than the ***position***, so the momentum will
***dampen small variations***.

Moreover, once the ***momentum builds up***, we will
easily ***pass over plateaus*** as the
***ball will continue to roll*** until it is
stopped by an opposing ***gradient***

#### Momentum Equations

There are a couple of them, mainly.

One of them uses a term to evaluate the `momentum`, $p$,
called `SGD momentum` or `momentum term` or
`momentum parameter`:

$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$

The other one is ***logically equivalent*** to the
previous, but it updates the `weights` in ***one step***
and is called the `Stochastic Heavy Ball Method`:

$$
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
+ \beta ( w_k - w_{k-1})
$$

> [!NOTE]
> This is how to choose $\beta$:
>
> $0 < \beta < 1$
>
> If $\beta = 0$, then we are doing plain
> ***gradient descent***; if $\beta > 1$ then we
> ***will have numerical instabilities***.
>
> The ***larger*** $\beta$ the
> ***higher the `momentum`***, so it will
> ***turn slower***

> [!TIP]
> Usual values are $\beta = 0.9$ or $\beta = 0.99$,
> and usually we start from 0.5 initially, raising it
> whenever we are stuck.
>
> When we increase $\beta$, the `learning rate`
> ***must decrease accordingly***
> (e.g. from 0.9 to 0.99, the `learning-rate` must be
> divided by a factor of 10)

#### Nesterov (1983) Sutskever (2012) Accelerated Momentum

Differently from the previous
[momentum](#momentum-equations),
we take an ***intermediate*** step where we
***update the `weights`*** according to the
***previous `momentum`***, then we compute the
***new `momentum`*** in this new position, and then
we ***update again***

$$
\begin{aligned}
\hat{w}_k & = w_k - \beta p_k \\
p_{k+1} &= \beta p_{k} +
\eta \nabla L(X, y, \hat{w}_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$

#### Why Momentum Works

While it has been ***hypothesized*** that
***acceleration*** makes ***convergence faster***, this
is
***only true for convex problems without much noise***,
though this may be ***part of the story***

The other half may be ***Noise Smoothing***,
smoothing the optimization process; however,
according to these papers[^no-noise-smoothing][^no-noise-smoothing-2] this may not be the actual reason.
### Separate Adaptive Learning Rates

Since `weights` may ***greatly vary*** across `layers`,
having a ***single `learning-rate`*** might not be ideal.

So the idea is to set a `local learning-rate` to
control the `global` one as a ***multiplicative factor***

#### Local Learning rates

- Start with $1$ as the ***starting point*** for the
  `local learning-rates`, which we'll call `gains` from
  now on.
- If the `gradient` keeps the ***same sign, additively increase the gain***
- Otherwise, ***multiplicatively decrease it***

$$
\Delta w_{i,j} = - g_{i,j} \cdot \eta \frac{d \, Out}{d \, w_{i,j}}
\\
g_{i,j}(t) = \begin{cases}
g_{i,j}(t - 1) + \delta
& \left( \frac{d \, Out}{d \, w_{i,j}}(t)
\cdot
\frac{d \, Out}{d \, w_{i,j}}(t-1) \right) > 0 \\
g_{i,j}(t - 1) \cdot (1 - \delta)
& \left( \frac{d \, Out}{d \, w_{i,j}}(t)
\cdot
\frac{d \, Out}{d \, w_{i,j}}(t-1) \right) \leq 0
\end{cases}
$$

With this method, if there are oscillations, we will have
`gains` around $1$

> [!TIP]
>
> - Usually a value for $\delta$ is $0.05$
> - Limit `gains` to some range:
>
>   - $[0.1, 10]$
>   - $[0.01, 100]$
>
> - Use `full-batches` or `big mini-batches` so that
>   the ***gradient*** doesn't oscillate because of
>   sampling errors
> - Combine it with [Momentum](#momentum)
> - Remember that ***Adaptive `learning-rates`*** deal
>   with ***axis-alignment***
### rmsprop | Root Mean Square Propagation

#### rprop | Resilient Propagation[^rprop-torch]

This is basically the same idea of [separating learning rates](#separate-adaptive-learning-rates),
but in this case we don't use the
[AIMD](#local-learning-rates) technique and
***we don't take into account*** the
***magnitude of the gradient***, ***only the sign***

- If the ***gradient*** has the same sign:
  - $step_{k} = step_{k} \cdot \eta_+$ where $\eta_+ > 1$
- else:
  - $step_{k} = step_{k} \cdot \eta_-$
    where $0 < \eta_- < 1$

> [!TIP]
>
> Limit the step size to a range, e.g. $[10^{-6}, 50]$

> [!CAUTION]
>
> rprop does ***not work*** with `mini-batches` as
> the ***sign of the gradient changes frequently***

#### rmsprop in detail[^rmsprop-torch]

The idea is that [rprop](#rprop--resilient-propagation)
is ***equivalent to using the gradient divided by its
magnitude*** (as you only keep the sign),
however it means that between `mini-batches` the
***divisor*** changes each time, oscillating.

The solution is to keep a ***running average*** of
the ***squared gradient for
each `weight`***:

$$
MeanSquare(w, t) =
\alpha \, MeanSquare(w, t-1) +
(1 - \alpha)
\left(
\frac{d\, Out}{d\, w}
\right)^2
$$

We then divide the ***gradient by the `square root`***
of that value

#### Further Developments

- `rmsprop` with `momentum` does not work as it should
- `rmsprop` with `Nesterov momentum` works best
  if used to divide the ***correction*** rather than
  the ***jump***
- `rmsprop` with `adaptive learning rates` needs more
  investigation
### Fancy Methods

#### Adaptive Gradient

<!-- TODO: Expand over these -->

##### Convex Case

- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods

##### Non-Convex Case

Pay attention, here the `Hessian` may not be
`Positive Semi-Definite`, thus when the ***gradient*** is
$0$ we don't necessarily know where we are.

- Natural Gradient Methods
- Curvature Adaptive
  - [Adagrad](./Fancy-Methods/ADAGRAD.md)
  - [AdaDelta](./Fancy-Methods/ADADELTA.md)
  - [RMSprop](#rmsprop-in-detail)
  - [ADAM](./Fancy-Methods/ADAM.md)
- L-BFGS
- [heavy ball gradient](#momentum)
- [momentum](#momentum)
- Noise Injection:
  - Simulated Annealing
  - Langevin Method

#### Adagrad

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAGRAD.md)

#### Adadelta

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADADELTA.md)

#### ADAM

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM.md)

#### AdamW

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM-W.md)

#### LION

> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)

### Hessian Free[^anelli-hessian-free]

How much can we `learn` from a given
`Loss` space?

The ***best way to move*** would be along the
***gradient***, assuming the surface has
the ***same curvature*** in every direction
(e.g. a circular bowl with a local minimum).

But ***usually this is not the case***, so we need
to move ***where the ratio of gradient and curvature is
high***

#### Newton's Method

This method takes into account the ***curvature***
of the `Loss`

With this method, the update would be:

$$
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
d \, E
}{
d \, \vec{w}
}
$$

***If this were feasible we would reach the minimum in
one step*** (on a quadratic surface), but it's not, as the
***computations***
needed to get a `Hessian` ***grow quadratically with the number of `weights`***.

The thing is that whenever we ***update `weights`*** with
the `Steepest Descent` method, each update *messes up*
another, while the ***curvature*** can help to ***scale
these updates*** so that they do not disturb each other.
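As a toy sketch of that update on a quadratic loss, where a single Newton step with $\epsilon = 1$ lands exactly on the minimum (the loss and all values are my own example; for real networks $H$ is never materialized like this):

```python
import numpy as np

# Quadratic loss E(w) = 0.5 * w^T A w - b^T w, minimized at w* = A^-1 b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

w = np.zeros(2)
grad = A @ w - b                 # dE/dw
H = A                            # the Hessian of a quadratic is constant
w = w - np.linalg.inv(H) @ grad  # one full Newton step

print(w, np.linalg.solve(A, b))  # identical: the minimum in one step
```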
#### Curvature Approximations

However, since the `Hessian` is
***too expensive to compute***, we can approximate it.

- We can take only the ***diagonal elements***
- ***Other algorithms*** (e.g. Hessian Free)
- ***Conjugate Gradient*** to minimize the
  ***approximation error***

#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]

> [!CAUTION]
>
> This is an oversimplification of the topic, so reading
> the footnote material is greatly advised.

The basic idea is that, in order not to mess up previous
directions, we ***`optimize` along mutually conjugate (non-interfering) directions***.

This method is ***mathematically guaranteed to succeed
after $N$ steps, the dimension of the space***; in practice
the residual error will be minimal.

This ***method works well for `non-quadratic errors`***,
and the `Hessian Free` `optimizer` uses this method
on ***genuinely quadratic surfaces***, which are
***quadratic approximations of the real surface***

<!-- TODO: Add PDF 5 pg. 38 -->

<!-- Footnotes -->

[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)

[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4

[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1

[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)

[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)

[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81

[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)

[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76

**Modified file:** `Chapters/5-Optimization/INDEX.md` (501 → 556 lines)
# Optimization

## Beyond Full Batches

Even though full batches give the best picture of the probability distribution
of the data points, they are computationally expensive.

Since data is usually **highly redundant**, we can think of getting smaller,
class-balanced sets, **mini-batches**, to update weights.
While this doesn't give the same results as full batches, it is still reliable.

When we need to bring things to the extreme, we can even update over a single
data point, **online learning**; however, this is not as efficient as
mini-batches as **it does not use matrix multiplications, which are GPU efficient**

## Learning rate Scheduling

## Xavier-Glorot Weight initialization

> [!WARNING]
> Before Xavier-Glorot there was another initialization technique, proportional
> to fan-in:
>
> $$ W \propto \frac{rand(in, out)}{\sqrt{in}}$$
>
> Though, Xavier-Glorot is not the only available initialization, as there are
> many others[^torch-init]

Whenever we initialize weights, we need to be careful to **break symmetry**, as
**identical hidden nodes get the exact same results**, making us
lose representation power.

Another problem with weight initialization is **overshooting**. This is
caused by **many small weight changes adding up when the fan-in is large**. The idea to
solve this is by
**initializing weights proportionally to fan-in (input) and fan-out (output)**

A technique we use to initialize weights comes from Xavier and Glorot, called
Xavier-Glorot initialization:

$$
\begin{aligned}
&W \propto \frac{rand(in, out)}{\sqrt{in + out}} \\
&rand = \mathcal{U}(-a, a) \rightarrow a = g \cdot \sqrt{\frac{6}{in + out}} \\
&\,\,\,\,\text{or} \\
&rand =\mathcal{N}(0, \sigma^2) \rightarrow \sigma = g \cdot
\sqrt{\frac{2}{in + out}}
\end{aligned}
$$

In other words, Xavier-Glorot extracts weights from either a uniform distribution
or a normal one, scaled by a factor $g$ called gain

[^torch-init]: [Pytorch Official Docs | `torch.nn.init` | 18th November 2025](https://docs.pytorch.org/docs/stable/nn.init.html)
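A minimal NumPy sketch of both sampling variants (gain $g = 1$ as in the original paper; the layer sizes are my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, gain = 256, 128, 1.0

# Uniform variant: U(-a, a) with a = g * sqrt(6 / (fan_in + fan_out))
a = gain * np.sqrt(6.0 / (fan_in + fan_out))
W_uniform = rng.uniform(-a, a, size=(fan_out, fan_in))

# Normal variant: N(0, sigma^2) with sigma = g * sqrt(2 / (fan_in + fan_out))
sigma = gain * np.sqrt(2.0 / (fan_in + fan_out))
W_normal = rng.normal(0.0, sigma, size=(fan_out, fan_in))

# Both variants end up with the same standard deviation
print(W_uniform.std(), W_normal.std(), sigma)
```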
## Momentum

> [!TIP]
> For $\beta$ going from 0.9 to 0.99, the learning rate needs to be decreased by
> a factor of 10

It's a technique inspired by physics. Imagine a ball rolling over a plane. Once
it has enough speed, even if the plane changes inclination, the ball still has
energy to move along the previous way because of its momentum.

Whenever on a gradient descent we have oscillations, **momentum dampens** all
movements steering us away from the previous direction. Here momentum at time $k$
is $p_k$

$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, Y, W_{k}) \\
W_{k+1} &= W_{k} - \gamma p_{k+1} \\
\beta &\in [0, 1]
\end{aligned}
$$

Or, in a more compact way, logically equivalent to the previous one:

$$
W_{k+1} = W_{k} - \gamma \nabla L(X, Y, W_{k}) + \beta(W_{k} - W_{k-1})
$$

The larger $\beta$ the slower it curves, accumulating more of the previous directions.
To play it safe, use smaller values at the beginning, where updates
are large, and slowly turn it up to values near 1

> [!NOTE]
>
> - $\eta$: hyperparameter related to the gradient, usually equal to the learning
>   rate
> - $\gamma$: Learning rate
> - $\beta$: hyperparameter of dampening factor
> - $\nabla L(X, Y, W_{k})$: gradient of the loss
>
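A minimal sketch of those two update equations on a toy quadratic loss (the loss and the $\eta$, $\gamma$, $\beta$ values are my own choices, not from the text):

```python
import numpy as np

def grad_L(W):
    return W  # gradient of the toy loss L(W) = 0.5 * ||W||^2

eta, gamma, beta = 1.0, 0.1, 0.9
W = np.array([1.0, -2.0])
p = np.zeros_like(W)

for _ in range(100):
    p = beta * p + eta * grad_L(W)  # accumulate velocity
    W = W - gamma * p               # move along the dampened direction

print(W)  # drifts towards the minimum at [0, 0]
```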
## Nesterov Accelerated Gradient (aka NAG)

This method takes inspiration from Nesterov's optimization for convex functions and
applies it to momentum. Its quirk is that it never computes the gradient at the point
it lands on, but at a temporary look-ahead point computed before the actual update.

| Vanilla Momentum[^Akshay-medium-1] | Nesterov Momentum[^Akshay-medium-1] |
|--|--|
| *(image)* | *(image)* |

To better illustrate this quirk, here's the formulation:

$$
\begin{aligned}
\hat{W}_{k} &= W_{k} - \beta p_k \\
p_{k+1} &= \beta p_{k} + \eta\nabla L(X, Y, \hat{W}_k) \\
W_{k+1} &= W_{k} - \gamma p_{k+1}
\end{aligned}
$$

As can be seen, the loss is computed over $\hat{W}_{k}$ rather than $W_{k}$,
which will be our actual weights. The idea is to follow the previous momentum
blindly, see where it goes and then make the correction.
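The same toy setup as the momentum sketch above, with the look-ahead step added (again, all values are my own):

```python
import numpy as np

def grad_L(W):
    return W  # toy loss L(W) = 0.5 * ||W||^2

eta, gamma, beta = 1.0, 0.1, 0.9
W = np.array([1.0, -2.0])
p = np.zeros_like(W)

for _ in range(100):
    W_hat = W - beta * p                # follow the previous momentum blindly
    p = beta * p + eta * grad_L(W_hat)  # new momentum at the look-ahead point
    W = W - gamma * p                   # then make the correction

print(W)
```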
[^Akshay-medium-1]: [Akshay L Chandra | Learning Parameters, Part 2: Momentum-Based & Nesterov Accelerated Gradient Descent | 18th November 2025](https://medium.com/data-science/learning-parameters-part-2-a190bef2d12)

## Justifying Faster Optimization for Momentum Based Methods

While many people justify the speed of momentum based methods by their acceleration,
this doesn't hold true, as they are only accelerated for convex functions.

Since, most of the time, we have no idea what our loss surface looks like,
we can't make assumptions about it being convex.

So, the most compelling explanation lies in the fact that momentum based
optimization is like computing a running average of the loss gradient, smoothing
the noise introduced by the smaller sampling size. In fact, with momentum it is not
necessary to average steps like in SGD

## Separate Adaptive Learning Rate

The idea is that each weight of each layer may need its own learning rate to avoid
overshooting and to smooth the magnitude of the received gradients, high over the last
layers and low over the first ones (architecture-wise)

The trick is to have a global learning rate that is adjusted by a local gain that
is increased each time the weight update keeps the same sign, and vice versa:

$$
\Delta w_{i,j} = - \eta \cdot g_{i,j} \frac{d \,Loss}{d \, w_{i,j}} \\

g_{i,j}(n +1 ) = \begin{cases}
g_{i,j}(n) + 0.05 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) > 0 \\
g_{i,j}(n) \cdot 0.95 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) < 0
\end{cases}
$$

This method ensures that if the weight oscillates, the gain will dampen it.
Moreover, should it be totally random, it will hover near 1, keeping gradient
updates unchanged.
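A single-weight sketch of that gain rule (the $0.05$/$0.95$ constants come from the cases above; since $g > 0$, comparing the signs of successive gradients is equivalent to comparing the signs of successive $\Delta w$; the gradient stream is fake noise):

```python
import numpy as np

rng = np.random.default_rng(0)
eta, w, g = 0.01, 0.5, 1.0
prev_grad = 0.0

for _ in range(1000):
    grad = rng.normal()      # stand-in for dLoss/dw on a mini-batch
    if grad * prev_grad > 0:
        g = g + 0.05         # same sign: additive increase
    elif grad * prev_grad < 0:
        g = g * 0.95         # sign flipped: multiplicative decrease
    w = w - eta * g * grad
    prev_grad = grad

print(g)  # with purely random gradients the gain hovers near 1
```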
> [!NOTE]
> The way $g$ is updated is similar to AIMD in TCP Congestion Control

<!-- Comment for linter complaints -->

> [!TIP]
>
> - **Clip gains to some margins** - $[0.1, 10]$ or $[0.01, 100]$
> - **Use full batch or big mini-batches** - This ensures that the change in sign
>   is not due to sampling errors
> - **Combine this with momentum**
> - **Use this to deal with axis-alignment problems**
>

## Resilient Backpropagation (aka RProp)

Instead of using the magnitude of the gradient, **RProp uses the sign to derive
updates**, multiplied by a step value. Here's the formulation[^florian-1]:

$$
w_{i,j}^{(n)} = w_{i,j}^{(n-1)} - s_{i,j}^{(n-1)} \cdot \text{sign}\left(
\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}
\right) \\
s_{i,j}^{(n)} = \begin{cases}
s_{i,j}^{(n - 1)} \cdot 1.2 &
\text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right)
\cdot
\text{sign}\left(\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}\right) > 0 \\
s_{i,j}^{(n - 1)} \cdot 0.5 &
\text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right)
\cdot
\text{sign}\left(\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}\right) < 0
\end{cases} \\
s_{i,j} \in [10^{-6}, 50]
$$

It is noticeable that, like
[separate adaptive learning rates](#separate-adaptive-learning-rate), it increases
or decreases the gain. However, since it uses multiplication to increase it, this
makes it unusable for anything but full batches, because of its fast growth.
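A single-weight sketch of those rules (the $1.2$, $0.5$ and the $[10^{-6}, 50]$ clamp come from the formulation above; the loss is a toy one of mine):

```python
import numpy as np

def grad_L(w):
    return 2.0 * (w - 3.0)  # toy loss (w - 3)^2, minimum at w = 3

w, step, prev_sign = 0.0, 0.1, 0.0

for _ in range(100):
    sign = np.sign(grad_L(w))
    if sign * prev_sign > 0:
        step = min(step * 1.2, 50.0)  # same sign: grow the step
    elif sign * prev_sign < 0:
        step = max(step * 0.5, 1e-6)  # sign flipped: shrink the step
    w = w - step * sign               # the update uses only the sign
    prev_sign = sign

print(w)  # oscillates into the minimum at 3
```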
[^florian-1]: [Florian | RProp | 19th November 2025](https://florian.github.io/rprop/)

## Root Mean Square Propagation (aka RMSProp)

As the name implies, it propagates past gradient information forward, a bit like
momentum. Since
[RProp](#resilient-backpropagation-aka-rprop) uses only the sign of the gradient,
it's almost like dividing the gradient by its magnitude, which is bad in case of
mini-batches, as all the divisors are different.

RMSProp solves this by keeping the gradient magnitude similar across mini-batches
through a running average of it:

$$
L^{(k)} = \beta L^{(k-1)} + (1 - \beta) \left(
\frac{d \, Loss}{d\, W^{(k -1)}}
\right)^2 \\
W^{(k)} = W^{(k-1)} - \eta \frac{1}{\sqrt{L^{(k)}}}\frac{d \, Loss}{d\, W^{(k -1)}}\\
\text{usually } \beta = 0.9
$$

What this method does is keep a running average of the mean squared gradient,
hence the name, and use it to normalize the gradient, keeping it similar across
mini-batches.
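The same toy setup once more, now with the running average normalizing the step ($\beta = 0.9$ as suggested above; the small $\epsilon$ guard against division by zero is my addition):

```python
import numpy as np

def grad_L(W):
    return W  # toy loss L(W) = 0.5 * ||W||^2

eta, beta, eps = 0.01, 0.9, 1e-8
W = np.array([1.0, -2.0])
L_avg = np.zeros_like(W)  # running average of the squared gradient

for _ in range(300):
    g = grad_L(W)
    L_avg = beta * L_avg + (1 - beta) * g**2  # running mean square
    W = W - eta * g / (np.sqrt(L_avg) + eps)  # normalized update

print(W)
```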
needed to get a `Hessian` ***increase exponentially***.
|
|
||||||
|
|
||||||
The thing is that whenever we ***update `weights`*** with
|
> [!NOTE]
|
||||||
the `Steepest Descent` method, each update *messes up*
|
> While it can be used with momentum, it doesn't seem to add as much benefits as
|
||||||
another, while the ***curvature*** can help to ***scale
|
> using it standalone.
|
||||||
these updates*** so that they do not disturb each other.
|
>
|
||||||
|
> With Nesterov, it works best if used to normalize the correction, rather than
|
||||||
#### Curvature Approximations
|
> the jump. While for the adaptive learning rates, it still requires further
|
||||||
|
> investigations to prove the efficacy.
|
||||||
However, since the `Hessian` is
|
|
||||||
***too expensive to compute***, we can approximate it.
|
|
||||||
|
|
||||||
- We can take only the ***diagonal elements***
|
|
||||||
- ***Other algorithms*** (e.g. Hessian Free)
|
|
||||||
- ***Conjugate Gradient*** to minimize the
|
|
||||||
***approximation error***
|
|
||||||
|
|
||||||
#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]
|
|
||||||
|
|
||||||
> [!CAUTION]
|
|
||||||
>
|
>
|
||||||
> This is an oversemplification of the topic, so reading
|
|
||||||
> the footnotes material is greatly advised.
|
|
||||||
|
|
||||||
The basic idea is that, in order not to mess up previous
|
## Adaptive Gradient Methods
|
||||||
directions, we ***`optimize` along perpendicular directions***.
|
|
||||||
|
|
||||||
This method is ***guaranteed to mathematically succeed
|
<!--
|
||||||
after N steps, the dimension of the space***, in practice
|
MARK: AdaGrad
|
||||||
the error will be minimal.
|
-->
|
||||||
|
### AdaGrad[^adagrad-torch]
|
||||||
|
|
||||||
This ***method works well for `non-quadratic errors`***
|
`AdaGrad` is an ***optimization method*** aimed
|
||||||
and the `Hessian Free` `optimizer` uses this method
|
to:
|
||||||
on ***genuinely quadratic surfaces***, which are
|
|
||||||
***quadratic approximations of the real surface***
|
|
||||||
|
|
||||||
|
<ins>***"find needles in the haystack in the form of
|
||||||
|
very predictive yet rarely observed features"***
|
||||||
|
[^adagrad-official-paper]</ins>
|
||||||
|
|
||||||
<!-- TODO: Add PDF 5 pg. 38 -->
|
`AdaGrad`, opposed to a standard `SGD` that is the
|
||||||
|
***same for each gradient geometry***, tries to
|
||||||
|
***incorporate geometry from earlier iterations***.
|
||||||
|
|
||||||
|
#### AdaGrad Algorithm
|
||||||
|
|
||||||
|
Instead `AdaGrad` takes another
|
||||||
|
approach[^anelli-adagrad-2][^adagrad-official-paper]:
|
||||||
|
|
||||||
|
$$
|
||||||
|
\begin{aligned}
|
||||||
|
g_{i}^{(k + 1)} &= \frac{d \, Loss}{d \, w_{i}^{(k)}} \\
|
||||||
|
G^{(k + 1)} &= \sum_{\tau = 1}^{t} g^{(\tau)} g^{(\tau)T}\\
|
||||||
|
w_{i}^{(k + 1)} &=
|
||||||
|
w_{i}^{(k)} - \eta \cdot\frac{
|
||||||
|
1
|
||||||
|
}{
|
||||||
|
\sqrt{G_{i,i}^{(k +1)} + \epsilon}
|
||||||
|
} \cdot g_{i}^{(k+1)} \\
|
||||||
|
|
||||||
|
\end{aligned}
|
||||||
|
$$
|
||||||
|
|
||||||
|
Here $G^{(k)}$ is the ***sum of outer product*** of the
|
||||||
|
***gradient*** until time $t$, though ***usually it is
|
||||||
|
not used*** $G_t$, which is ***impractical because
|
||||||
|
of the high number of dimensions***, so we use
|
||||||
|
$diag(G_t)$ which can be
|
||||||
|
***computed in linear time***[^adagrad-official-paper]
|
||||||
|
|
||||||
|
The $\epsilon$ term here is used to
|
||||||
|
***avoid dividing by 0***[^anelli-adagrad-2] and has a
|
||||||
|
small value, usually in the order of $10^{-8}$
|
||||||
|
|
||||||
|
> [!NOTE]
|
||||||
|
>
|
||||||
|
> This example is tough to understand if we where to apply it to a matrix $W$
|
||||||
|
> instead of a vector. To make it easier to understand in matricial notation:
|
||||||
|
>
|
||||||
|
> $$
|
||||||
|
> \begin{aligned}
|
||||||
|
> \nabla L^{(k + 1)} &= \frac{d \, Loss^{(k)}}{d \, W^{(k)}} \\
|
||||||
|
> G^{(k + 1)} &= G^{(k)} +(\nabla L^{(k+1)}) ^2 \\
|
||||||
|
> W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}
|
||||||
|
{\sqrt{G^{(k+1)} + \epsilon}}
|
||||||
|
> \end{aligned}
|
||||||
|
> $$
|
||||||
|
>
|
||||||
|
> In other words, compute the gradient and scale it for the sum of its squares
|
||||||
|
> until that point
|
||||||
|
|
||||||
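
To make the diagonal variant concrete, here is a minimal NumPy sketch of the
matrix-notation update above; the function name, the toy quadratic loss and the
default hyperparameter values are illustrative assumptions, not part of the
original algorithm's specification.

```python
import numpy as np

def adagrad_step(W, grad, G, lr=0.01, eps=1e-8):
    """One AdaGrad step with the diagonal approximation diag(G).

    G accumulates the element-wise squared gradients, so each parameter
    gets its own effective step lr / sqrt(G + eps): rarely updated
    (rare-feature) parameters keep taking comparatively large steps.
    """
    G = G + grad ** 2
    W = W - lr * grad / np.sqrt(G + eps)
    return W, G

# Toy usage on L(W) = ||W||^2 / 2, whose gradient is simply W
W, G = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    W, G = adagrad_step(W, grad=W, G=G)
```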

#### AdaGrad effectiveness[^anelli-adagrad-3]

- When we have ***many dimensions, many features are irrelevant***
- ***Rarer features are more relevant***
- It adapts $\eta$ to the right metric space
  by projecting stochastic gradient updates with the
  [Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance),
  the distance of a point from a probability distribution.
|
#### AdaGrad Considerations
|
||||||
|
|
||||||
|
- It eliminates the need of manually tuning the
|
||||||
|
`learning rates`, which is usually set to
|
||||||
|
$0.01$
|
||||||
|
- The squared ***gradients*** are accumulated during
|
||||||
|
iterations, making the `learning-rate` become
|
||||||
|
***smaller and smaller***, thus becoming 0 and untrainable
|
||||||
|
|
||||||

<!-- Footnotes -->

[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)

[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf)

[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4

[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)

[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1

[^regret-definition]: [Definition of Regret | 19th April 2025](https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/tradeoff-between-exploration-and-exploitation/exploration-and-exploitation/examination-review-exploration-and-exploitation/explain-the-concept-of-regret-in-reinforcement-learning-and-how-it-is-used-to-evaluate-the-performance-of-an-algorithm/#:~:text=Regret%20quantifies%20the%20difference%20in,and%20making%20decisions%20over%20time.)

[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)

[^anelli-adagrad-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 42

[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)

[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43

[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81

[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44

[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)
### AdaDelta[^adadelta-offcial-paper]

`ADADELTA` was inspired by [`AdaGrad`](./ADAGRAD.md) and
created to address some of its problems, like the
***sensitivity to the initial `parameters` and the corresponding
gradient***[^adadelta-offcial-paper]

To address all these problems, `ADADELTA` accumulates the
***squared gradients over a `window`, as a running average***, rather than
***accumulating them over all instances***:

$$
G^{(k+1)} = \beta \cdot G^{(k)} +
(1 - \beta) \cdot (\nabla L^{(k+1)})^2
$$

The update, which is very similar to the one in
[AdaGrad](./ADAGRAD.md#the-algorithm), becomes:

$$
\begin{aligned}
W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}{\sqrt{G^{(k+1)} + \epsilon}}
\end{aligned}
$$

Technically speaking, this equation is basically equivalent to the
[RMSProp](#root-mean-square-propagation-aka-rmsprop) one, as $G$ is
equivalent to the running average of the mean square.

However, as the author pointed out[^adadelta-units], this equation does not
respect units of measure. We can correct this problem
by ***considering the curvature locally smooth*** and
taking an approximation of it at the next step, becoming:

$$
\begin{aligned}
\Delta W^{(k)} &= - \frac{\sqrt{S^{(k-1)}}}{\sqrt{G^{(k)}}}
\nabla L^{(k)}\\
S^{(k)} &= \beta S^{(k - 1)} + (1 - \beta) (\Delta W^{(k)})^2 \\
W^{(k +1 )} &= W^{(k)} + \Delta W^{(k)}
\end{aligned}
$$

As we can notice, the ***`learning rate` completely
disappears from the equation, eliminating the need to
set one***

> [!WARNING]
>
> Here $\Delta W$ is already negative, which is why there's a $+$ in the last
> equation
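
A minimal NumPy sketch of the three equations above, treating all operations
element-wise on the weight matrix; the $\epsilon$ inside both square roots and
the default values of $\beta$ and $\epsilon$ are assumptions made here for
numerical stability, in the spirit of the original paper.

```python
import numpy as np

def adadelta_step(W, grad, G, S, beta=0.95, eps=1e-6):
    """One ADADELTA step: note that no learning rate is needed.

    G is the running average of squared gradients, S the running average
    of squared updates; eps avoids a division by zero on the very first
    step, when both buffers are still zero-filled.
    """
    G = beta * G + (1 - beta) * grad ** 2
    delta_W = -np.sqrt(S + eps) / np.sqrt(G + eps) * grad
    S = beta * S + (1 - beta) * delta_W ** 2
    W = W + delta_W  # delta_W is already negative
    return W, G, S
```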

<!-- Footnotes -->

[^adadelta-offcial-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)

[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
### Adaptive Moment Estimation (aka AdaM)

AdaM computes both the momentum and the squared gradients with running
averages, which are zero-filled at time $k = 0$:

$$
\begin{aligned}
M^{(k+1)} &= \beta_1 M^{(k)} + (1 - \beta_1) \nabla L \\
V^{(k+1)} &= \beta_2 V^{(k)} + (1 - \beta_2) \nabla L^2 \\
\end{aligned}
$$

> [!WARNING]
>
> The squared gradient can be thought of as the variance; however, it's not
> centered
Then it corrects them to be used in the final formulation:
|
||||||
|
|
||||||
|
$$
|
||||||
|
\begin{aligned}
|
||||||
|
\hat{M}^{(k+1)} &= \frac{M^{(k+1)}}{1 - \beta_1^{k + 1}} \\
|
||||||
|
\hat{V}^{(k+1)} &= \frac{V^{(k+1)}}{1 - \beta_2^{k + 1}} \\
|
||||||
|
\end{aligned}
|
||||||
|
$$
|
||||||
|
|
||||||

> [!WARNING]
>
> $\beta_1$ and $\beta_2$ are raised to the power of $k + 1$, the timestep.

Then it computes the update in this way:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon}
$$

Even though Adam works, it doesn't generalize well and, particularly on image
problems, it performs worse than standard SGD. Moreover, we need to keep 3
buffers instead of 1 as for SGD, two of which need parameter tuning.

> [!NOTE]
>
> Author-proposed values are $\beta_1 = 0.9$, $\beta_2 = 0.999$ and
> $\epsilon = 10^{-8}$
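
Putting the three blocks above together, a minimal NumPy sketch (the defaults
are the author-proposed values; the function name and the buffer layout are our
own illustrative choices):

```python
import numpy as np

def adam_step(W, grad, M, V, k, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaM step at timestep k (buffers M, V start zero-filled)."""
    M = b1 * M + (1 - b1) * grad        # running average of gradients
    V = b2 * V + (1 - b2) * grad ** 2   # running average of squared gradients
    M_hat = M / (1 - b1 ** (k + 1))     # bias correction for the zero start
    V_hat = V / (1 - b2 ** (k + 1))
    W = W - lr * M_hat / (np.sqrt(V_hat) + eps)
    return W, M, V
```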

### AdamW

AdamW tries to solve AdaM's problems by introducing weight decay. In all
honesty, AdaM already implements it; however, it is usually added to the
momentum, getting scaled by $\sqrt{\hat{V}}$:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)} + \alpha W^{(k)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon}
$$

The AdamW authors saw that this was inefficient, as the decay was influenced by
the uncentered variance, and thus modified the formula to this:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon} - \lambda W^{(k)}
$$
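
A minimal NumPy sketch of the decoupled update above; the decay coefficient
value is an illustrative assumption, and the decay term is subtracted so that
it shrinks the weights:

```python
import numpy as np

def adamw_step(W, grad, M, V, k, lr=0.001, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step: the weight decay never touches the moment buffers."""
    M = b1 * M + (1 - b1) * grad
    V = b2 * V + (1 - b2) * grad ** 2
    M_hat = M / (1 - b1 ** (k + 1))
    V_hat = V / (1 - b2 ** (k + 1))
    # Decay is decoupled: applied directly to W, not rescaled by sqrt(V_hat)
    W = W - lr * M_hat / (np.sqrt(V_hat) + eps) - wd * W
    return W, M, V
```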

### Lion (evoLved sIgn mOmeNtum)[^official-paper]

`Lion` is the result of a ***genetic search algorithm*** aimed at
finding the best `optimizer`.

It starts from a population of `AdamW` algorithms to
***speed up the search***. Opposed to
`Adam` and `AdamW`, it keeps track
***only of the momentum*** and uses the ***gradient sign***,
requiring ***less `memory`***.

Since ***uniform updates yield larger norms***,
`Lion` requires a ***smaller `learning-rate`***
and a ***larger decoupled `weight-decay`***
$\lambda$[^official-paper-1].

The ***advantages of `Lion` over `Adam` and `AdamW`
increase with the size of
the `mini-batch`***[^official-paper-1]
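
As a sketch, the discovered update rule in NumPy; the interpolation
coefficients and the default hyperparameters are our reading of the paper and
are illustrative only:

```python
import numpy as np

def lion_step(W, grad, M, lr=1e-4, b1=0.9, b2=0.99, wd=0.1):
    """One Lion step: a single momentum buffer plus a sign update."""
    update = np.sign(b1 * M + (1 - b1) * grad)  # uniform-magnitude update
    W = W - lr * (update + wd * W)              # decoupled weight decay
    M = b2 * M + (1 - b2) * grad                # the only state kept
    return W, M
```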

#### Symbolic Representation[^official-paper-2]

Newly ***trained algorithms*** are represented
`symbolically`, bringing these advantages:

- `Algorithms` must be ***implemented*** as `programs`
- It is ***easier to analyze, comprehend and transfer to
  new tasks*** these `algorithms`, compared to other
  representations such as `NeuralNetworks`
- We can **estimate the *complexity*** by looking
  at the ***length of the code***

#### Tournament[^official-paper-3]

The best code is found with a ***tournament-style
evolution***. Each cycle picks the ***best
`algorithm`***, which is
***copied and mutated***, while the ***oldest is removed***

<!-- Footnotes -->

[^official-paper]: [Official Lion Paper | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-1]: [Official Lion Paper | Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-2]: [Official Lion Paper | Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-3]: [Official Lion Paper | Paragraph 2 pg. 4-5 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)
## Hessian Free Optimization

Since we are moving on a function whose gradient is not constant, by looking at
the curvature, the [Hessian Matrix](./../15-Appendix-A/INDEX.md#hessian-matrix),
we can see when it starts to change.

### Newton's Method

This method would technically give us the solution in one step on a quadratic
function, but it is unfeasible due to the memory and computational requirements:

$$
\Delta W = - \epsilon H(W)^{-1} \times \frac{d\, L}{d\, W}
$$
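
To see why it is a one-step method, here is a toy NumPy illustration on a
quadratic loss (the matrix and vector are made up, and $\epsilon = 1$); on a
real network $H$ is far too large to store, let alone invert:

```python
import numpy as np

# Quadratic loss L(W) = 0.5 * W^T A W - b^T W, whose Hessian is exactly A
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # small SPD Hessian, for illustration
b = np.array([1.0, -1.0])

W = np.zeros(2)
grad = A @ W - b
W = W - np.linalg.solve(A, grad)  # solve a system instead of forming H^{-1}
assert np.allclose(A @ W, b)      # the gradient is now zero: one step
```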

### Conjugate Gradient

The idea is to correct the weights so that we reduce the gradient to 0 across
perpendicular directions. This means that, with each update, we are not messing
up previous optimizations.

While it is usually used for quadratic error surfaces, there's a non-linear
variant (non-linear conjugate gradient) that usually works well. However, it is
also possible to approximate the true error function with a quadratic one and
use the standard method.

It gives a solution after $N$ steps over an $N$-dimensional quadratic surface;
however, we need to penalize frequent changes in weights, especially for the
hidden activities of [`RNNs`](./../8-Recurrent-Networks/INDEX.md)
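
A minimal sketch of the standard (quadratic) method in NumPy, for
$L(w) = \frac{1}{2} w^T A w - b^T w$; the matrix, vector and step count are
illustrative:

```python
import numpy as np

def conjugate_gradient(A, b, steps):
    """Minimize 0.5 w^T A w - b^T w for a symmetric positive-definite A.

    Each new direction d is A-conjugate to the previous ones, so a step
    along it never undoes the progress made along earlier directions.
    """
    w = np.zeros_like(b)
    r = b - A @ w                        # residual = negative gradient
    d = r.copy()
    for _ in range(steps):
        alpha = (r @ r) / (d @ A @ d)    # exact line search along d
        w = w + alpha * d
        r_new = r - alpha * (A @ d)
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return w

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
w = conjugate_gradient(A, b, steps=2)    # N = 2 steps suffice here
```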

## Optimization Tricks

### Input decorrelation

If you have a linear neuron (think of a feed-forward layer, not of a
convolution), it's better to decorrelate the input components.

A way to achieve this is through
[PCA](./../15-Appendix-A/INDEX.md#computing-pca),
transforming the error surface from an ellipse into a circle.
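
A minimal NumPy sketch of this trick; note that here the samples sit on the
rows (unlike the PCA appendix, which puts features over rows), and the small
constant added before the division is an assumption for numerical stability:

```python
import numpy as np

def decorrelate(X):
    """Rotate the data onto the principal axes of its covariance.

    After the rotation the components are uncorrelated; dividing by the
    standard deviations also equalizes them, turning an elliptical error
    surface into a circular one (whitening).
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = Xc.T @ Xc / len(Xc)               # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> eigh
    Z = Xc @ eigvecs                        # decorrelated components
    return Z / np.sqrt(eigvals + 1e-8)      # whitened (circular) data
```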

### Recognize Plateaus

If we start with big learning rates, since the weights gain a big magnitude, the
derivative will be small and the error will not decrease significantly.

This may seem like a local minimum, but it is usually a plateau.
### Mini-Batch Speed up

To speed up mini-batch training, use these methods:

- [**Momentum**](#momentum)
- [**Separate adaptive learning rates for each parameter**](#separate-adaptive-learning-rate)
- [**rmsprop**](#root-mean-square-propagation-aka-rmsprop)
- [**Adaptive Gradient Methods**](#adaptive-gradient-methods)

### Mini-batches vs Full-Batches

The rule of thumb is to use **full batches for small datasets or datasets with
little redundancy**, while **mini-batches for redundant datasets**

<!-- Footnotes -->

[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76