diff --git a/Chapters/15-Appendix-A/INDEX.md b/Chapters/15-Appendix-A/INDEX.md
index d488d25..1d68956 100644
--- a/Chapters/15-Appendix-A/INDEX.md
+++ b/Chapters/15-Appendix-A/INDEX.md
@@ -33,7 +33,8 @@ $$
 
 ## Cross Entropy Loss derivation
 
-A cross entropy is the measure of *"surprise"* we get from distribution $p$ knowing
+Cross entropy[^wiki-cross-entropy] is the measure of *"surprise"*
+we get from distribution $p$ knowing
 results from distribution $q$. It is defined as the entropy of $p$ plus the
 [Kullback-Leibler Divergence](#kullback-leibler-divergence) between $p$ and $q$
 
@@ -62,6 +63,23 @@ Usually $\hat{y}$ comes from using a
 [softmax](./../3-Activation-Functions/INDEX.md#softmax).
 Moreover, since it uses a logaritm and probability values are at most 1, the
 closer to 0, the higher the loss
 
+## Computing PCA[^wiki-pca]
+
+> [!CAUTION]
+> $X$ here is the data matrix with **features over rows**
+
+- $\Sigma = \frac{X \times X^T}{N} \coloneqq$ Covariance Matrix approximation
+- $\vec{\lambda} \coloneqq$ vector of eigenvalues of $\Sigma$
+- $\Lambda \coloneqq$ eigenvector columnar matrix sorted by eigenvalues
+- $\Lambda_{red} \coloneqq$ eigenvector matrix reduced to the $k$ eigenvectors
+  with the highest eigenvalues
+- $Z = \Lambda_{red}^T \times X \coloneqq$ Compressed representation
+
+> [!NOTE]
+> You may have studied PCA in terms of SVD, the Singular Value Decomposition. The
+> two are closely related and apply the same concept, just through different
+> mathematical formulas.
+
 ## Laplace Operator[^khan-1]
 
 It is defined as $\nabla \cdot \nabla f \in \R$ and is equivalent to the
@@ -80,8 +98,32 @@ It can also be used to compute the net flow of particles in that region of space
 > This is not a **discrete laplace operator**, which is instead a **matrix** here,
 > as there are many other formulations.
 
+## [Hessian Matrix](https://en.wikipedia.org/wiki/Hessian_matrix)
+
+A Hessian matrix represents the second derivative of a function, thus it gives
+us the curvature of that function.
+
+It is also used to tell whether a critical point is a local minimum (the Hessian
+is positive definite), a local maximum (negative definite) or a saddle point
+(neither positive nor negative definite).
+
+It is computed by taking the partial derivatives of the gradient along
+all dimensions and transposing the result.
+ +$$ +\nabla f = \begin{bmatrix} + \frac{d \, f}{d\,x} & \frac{d \, f}{d\,y} +\end{bmatrix} \\ +H(f) = \begin{bmatrix} + \frac{d \, f}{d\,x^2} & \frac{d \, f}{d \, x\,d\,y} \\ + \frac{d \, f}{d\, y \, d\,x} & \frac{d \, f}{d\,y^2} +\end{bmatrix} +$$ + [^khan-1]: [Khan Academy | Laplace Intuition | 9th November 2025](https://www.youtube.com/watch?v=EW08rD-GFh0) [^wiki-cross-entropy]: [Wikipedia | Cross Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Cross-entropy) [^wiki-entropy]: [Wikipedia | Entropy | 17th November 2025](https://en.wikipedia.org/wiki/Entropy_(information_theory)) + +[^wiki-pca]: [Wikipedia | Principal Component Analysis | 18th November 2025](https://en.wikipedia.org/wiki/Principal_component_analysis#Computation_using_the_covariance_method) diff --git a/Chapters/15-Appendix-A/python-experiments/pca.ipynb b/Chapters/15-Appendix-A/python-experiments/pca.ipynb new file mode 100644 index 0000000..eac9af0 --- /dev/null +++ b/Chapters/15-Appendix-A/python-experiments/pca.ipynb @@ -0,0 +1,199 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "8c14ea22", + "metadata": {}, + "source": [ + "# Computing PCA\n", + "\n", + "Here I'll be taking data from [Geeks4Geeks](https://www.geeksforgeeks.org/machine-learning/mathematical-approach-to-pca/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b32eb5c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[1.8 1.87777778]\n", + "[[ 0.7 0.52222222]\n", + " [-1.3 -1.17777778]\n", + " [ 0.4 1.02222222]\n", + " [ 1.3 1.12222222]\n", + " [ 0.5 0.82222222]\n", + " [ 0.2 -0.27777778]\n", + " [-0.8 -0.77777778]\n", + " [-0.3 -0.27777778]\n", + " [-0.7 -0.97777778]]\n", + "[[0.6925 0.68875 ]\n", + " [0.68875 0.79444444]]\n" + ] + } + ], + "source": [ + "import numpy as np\n", + "\n", + "X : np.ndarray = np.array([\n", + " [2.5, 2.4],\n", + " [0.5, 0.7],\n", + " [2.2, 2.9],\n", + " [3.1, 3.0],\n", + " [2.3, 2.7],\n", + " [2.0, 1.6],\n", + " [1.0, 1.1],\n", + " [1.5, 1.6],\n", + " [1.1, 0.9]\n", + "])\n", + "\n", + "# Compute mean values for features\n", + "mu_X = np.mean(X, 0)\n", + "\n", + "print(mu_X)\n", + "# \"Normalize\" Features\n", + "X = X - mu_X\n", + "print(X)\n", + "\n", + "# Compute covariance matrix applying\n", + "# Bessel's correction (n-1) instead of n\n", + "Cov = (X.T @ X) / (X.shape[0] - 1)\n", + "\n", + "print(Cov)" + ] + }, + { + "cell_type": "markdown", + "id": "78e9429f", + "metadata": {}, + "source": [ + "As you can notice, we did $X^T \\times X$ instead of $X \\times X^T$. This is because our \n", + "dataset had datapoints over rows instead of features." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "f93b7a92", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0.05283865 1.43410579]\n", + "[[-0.73273632 -0.68051267]\n", + " [ 0.68051267 -0.73273632]]\n" + ] + } + ], + "source": [ + "# Computing eigenvalues\n", + "eigen = np.linalg.eig(Cov)\n", + "eigen_values = eigen.eigenvalues\n", + "eigen_vectors = eigen.eigenvectors\n", + "\n", + "print(eigen_values)\n", + "print(eigen_vectors)" + ] + }, + { + "cell_type": "markdown", + "id": "bfbdd9c3", + "metadata": {}, + "source": [ + "Now we'll generate the new X matrix by only using the first eigen vector" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "7ce6c540", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(9, 1)\n", + "Compressed\n", + "[[-0.85901005]\n", + " [ 1.74766702]\n", + " [-1.02122441]\n", + " [-1.70695945]\n", + " [-0.94272842]\n", + " [ 0.06743533]\n", + " [ 1.11431616]\n", + " [ 0.40769167]\n", + " [ 1.19281215]]\n", + "Reconstruction\n", + "[[ 0.58456722 0.62942786]\n", + " [-1.18930955 -1.28057909]\n", + " [ 0.69495615 0.74828821]\n", + " [ 1.16160753 1.25075117]\n", + " [ 0.64153863 0.69077135]\n", + " [-0.0458906 -0.04941232]\n", + " [-0.75830626 -0.81649992]\n", + " [-0.27743934 -0.29873049]\n", + " [-0.81172378 -0.87401678]]\n", + "Difference\n", + "[[0.11543278 0.10720564]\n", + " [0.11069045 0.10280131]\n", + " [0.29495615 0.27393401]\n", + " [0.13839247 0.12852895]\n", + " [0.14153863 0.13145088]\n", + " [0.2458906 0.22836546]\n", + " [0.04169374 0.03872214]\n", + " [0.02256066 0.02095271]\n", + " [0.11172378 0.10376099]]\n" + ] + } + ], + "source": [ + "# Computing X coming from only 1st eigen vector\n", + "Z_pca = X @ eigen_vectors[:,1]\n", + "Z_pca = Z_pca.reshape([Z_pca.shape[0], 1])\n", + "\n", + "print(Z_pca.shape)\n", + "\n", + "\n", + "# X reconstructed\n", + "eigen_v = (eigen_vectors[:, 1].reshape([eigen_vectors[:, 1].shape[0], 1]))\n", + "X_rec = Z_pca @ eigen_v.T\n", + "\n", + "print(\"Compressed\")\n", + "print(Z_pca)\n", + "\n", + "print(\"Reconstruction\")\n", + "print(X_rec)\n", + "\n", + "print(\"Difference\")\n", + "print(abs(X - X_rec))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "deep_learning", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/Chapters/5-Optimization/INDEX-OLD.md b/Chapters/5-Optimization/INDEX-OLD.md new file mode 100644 index 0000000..9a2fa04 --- /dev/null +++ b/Chapters/5-Optimization/INDEX-OLD.md @@ -0,0 +1,501 @@ +# Optimization + +We basically try to see the error and minimize it by moving towards the ***gradient*** + +## Types of Learning Algorithms + +In `Deep Learning` it's not unusual to be facing ***highly redundant*** `datasets`. +Because of this, usually ***gradient*** from some `samples` is the ***same*** for some others. + +So, often we train the `model` on a subset of samples. + +### Online Learning + +This is the ***extreme*** of our techniques to deal with ***redundancy*** of `data`. + +On each `point` we get the ***gradient*** and then we update `weights`. 
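+
+As a minimal sketch of what this looks like (the linear model, squared loss and
+learning rate below are illustrative assumptions, not part of the course material):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+X = rng.normal(size=(100, 3))        # toy dataset: 100 samples, 3 features
+y = X @ np.array([1.0, -2.0, 0.5])   # hypothetical linear target
+
+w = np.zeros(3)                      # weights to learn
+lr = 0.01                            # learning rate
+
+# Online learning: one gradient step per data point
+for x_i, y_i in zip(X, y):
+    grad = 2 * (x_i @ w - y_i) * x_i  # gradient of the squared error on this sample
+    w -= lr * grad
+
+print(w)  # drifts towards [1, -2, 0.5]
+```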
+ +### Mini-Batch + +In this approach, we divide our `dataset` in small batches called `mini-batches`. +These need to be ***balanced*** in order not to have ***imbalances***. + +This technique is the ***most used one*** + +## Tips and Tricks + +### Learning Rate + +This is the `hyperparameter` we use to tune our +***learning steps***. + +Sometimes we have it too big and this causes +***overshootings***. So a quick solution may be to turn +it down. + +However, we are ***trading speed for accuracy***, thus it's better to wait before tuning this `parameter` + +### Weight initialization + +We need to avoid `neurons` to have the same +***gradient***. This is easily achievable by using +***small random values***. + +However, if we have a ***large `fan-in`***, then it's +***easy to overshoot***, then it's better to initialize +those `weights` ***proportionally to*** +$\sqrt{\text{fan-in}}$: + +$$ +w = \frac{ + np.random(N) +}{ + \sqrt{N} +} +$$ + +#### Xavier-Glorot Initialization + + + +Here `weights` are ***proportional*** to $\sqrt{\text{fan-in}}$ as well, but we ***sample*** from a +`uniform distribution` with a `std-dev` + +$$ +\sigma^2 = \text{gain} \cdot \sqrt{ + \frac{ + 2 + }{ + \text{fan-in} + \text{fan-out} + } +} +$$ + +and bounded between $a$ and $-a$ + +$$ +a = \text{gain} \cdot \sqrt{ + \frac{ + 6 + }{ + \text{fan-in} + \text{fan-out} + } +} +$$ + +Alternatively, one can use a `normal-distribution` +$\mathcal{N}(0, \sigma^2)$. + +Note that `gain` is in the **original paper** is equal +to $1$ + +### Decorrelating input components + +Since ***highly correlated features*** don't offer much +in terms of ***new information***, probably we need +to go in the ***latent space*** to find the +`latent-variables` governing those `features`. + +#### PCA + +> [!CAUTION] +> This topic won't be explained here as it's something +> usually learnt for `Machine Learning`, a +> ***prerequisite*** for approaching `Deep Learning`. + +This is a method we can use to discard `features` that +will ***add little to no information*** + +## Common problems in MultiLayer Networks + +### Hitting a Plateau + +This happenes wehn we have a ***big `learning-rate`*** +which makes `weights` go high in ***absolute value***. + +Because this happens ***too quickly***, we could +see a ***quick diminishing error*** and this is usually +***mistaken for a minimum point***, while instead +it's a ***plateau***. + +## Speeding up Mini-Batch Learning + +### Momentum[^momentum] + +We use this method ***mainly when we use `SGD`*** as +a ***learning techniques*** + +This method is better explained if we imagine +our error surface as an actual surface and we place a +ball over it. + +***The ball will start rolling towards the steepest +descent*** (initially), but ***after gaining enough +velocity*** it will follow the ***previous direction +, in some measure***. + +So, now the ***gradient*** does modify the ***velocity*** +rather than the ***position***, so the momentum will +***dampen small variations***. + +Moreover, once the ***momentum builds up***, we will +easily ***pass over plateaus*** as the +***ball will continue to roll over*** until it is +stopped by a negative ***gradient*** + +#### Momentum Equations + +There are a couple of them, mainly. 
+ +One of them uses a term to evaluate the `momentum`, $p$, +called `SGD momentum` or `momentum term` or +`momentum parameter`: + +$$ +\begin{aligned} + p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\ + w_{k+1} &= w_{k} - \gamma p_{k+1} +\end{aligned} +$$ + +The other one is ***logically equivalent*** to the +previous, but it update the `weights` in ***one step*** +and is called `Stochastic Heavy Ball Method`: + +$$ + w_{k+1} = w_k - \gamma \nabla L(X, y, w_k) + + \beta ( w_k - w_{k-1}) +$$ + +> [!NOTE] +> This is how to choose $\beta$: +> +> $0 < \beta < 1$ +> +> If $\beta = 0$, then we are doing +> ***gradient descent***, if $\beta > 1$ then we +> ***will have numerical instabilities***. +> +> The ***larger*** $\beta$ the +> ***higher the `momentum`***, so it will +> ***turn slower*** + +> [!TIP] +> usual values are $\beta = 0.9$ or $\beta = 0.99$ +> and usually we start from 0.5 initially, to raise it +> whenever we are stuck. +> +> When we increase $\beta$, then the `learning rate` +> ***must decrease accordingly*** +> (e.g. from 0.9 to 0.99, `learning-rate` must be +> divided by a factor of 10) + +#### Nesterov (1983) Sutskever (2012) Accelerated Momentum + +Differently from the previous +[momentum](#momentum-equations), +we take an ***intermediate*** step where we +***update the `weights`*** according to the +***previous `momentum`*** and then we compute the +***new `momentum`*** in this new position, and then +we ***update again*** + +$$ +\begin{aligned} + \hat{w}_k & = w_k - \beta p_k \\ + p_{k+1} &= \beta p_{k} + + \eta \nabla L(X, y, \hat{w}_k) \\ + w_{k+1} &= w_{k} - \gamma p_{k+1} +\end{aligned} +$$ + +#### Why Momentum Works + +While it has been ***hypothesized*** that +***acceleration*** made ***convergence faster***, this +is +***only true for convex problems without much noise***, +though this may be ***part of the story*** + +The other half may be ***Noise Smoothing*** by +smoothing the optimization process, however +according to these papers[^no-noise-smoothing][^no-noise-smoothing-2] this may not be the actual reason. + +### Separate Adaptive Learning Rates + +Since `weights` may ***greatly vary*** across `layers`, +having a ***single `learning-rate` might not be ideal. + +So the idea is to set a `local learning-rate` to +control the `global` one as a ***multiplicative factor*** + +#### Local Learning rates + +- Start with $1$ as the ***starting point*** for + `local learning-rates` which we'll call `gain` from + now on. 
+- If the `gradient` has the ***same sign, increase it*** +- Otherwise, ***multiplicatively decrease it*** + +$$ + w_{i,j} = - g_{i,j} \cdot \eta \frac{ + d \, Out + }{ + d \, w_{i,j} + } + + \\ + g_{i,j}(t) = \begin{cases} + + g_{i,j}(t - 1) + \delta + & \left( \frac{ + d \, Out + }{ + d \, w_{i,j} + } (t) + \cdot + \frac{ + d \, Out + }{ + d \, w_{i,j} + } (t-1) \right) > 0 \\ + + + g_{i,j}(t - 1) \cdot (1 - \delta) + & \left( \frac{ + d \, Out + }{ + d \, w_{i,j} + } (t) + \cdot + \frac{ + d \, Out + }{ + d \, w_{i,j} + } (t-1) \right) \leq 0 +\end{cases} +$$ + +With this method, if there are oscillations, we will have +`gains` around $1$ + +> [!TIP] +> +> - Usually a value for $d$ is $0.05$ +> - Limit `gains` around some values: +> +> - $[0.1, 10]$ +> - $[0.01, 100]$ +> +> - Use `full-batches` or `big mini-batches` so that +> the ***gradient*** doesn't oscillate because of +> sampling errors +> - Combine it with [Momentum](#momentum) +> - Remember that ***Adaptive `learning-rate`*** deals +> with ***axis-alignment*** + +### rmsprop | Root Mean Square Propagation + +#### rprop | Resilient Propagation[^rprop-torch] + +This is basically the same idea of [separating learning rates](#separate-adaptive-learning-rates), +but in this case we don't use the +[AIMD](#local-learning-rates) technique and +***we don't take into account*** the +***magnitude of the gradient*** but ***only the sign*** + +- If ***gradient*** has same sign: + - $step_{k} = step_{k} \cdot \eta_+$ where $\eta_+ > 1$ +- else: + - $step_{k} = step_{k} \cdot \eta_-$ + where $0 <\eta_- < 1$ + +> [!TIP] +> +> Limit the step size in a range where: +> +> - $\inf < 50$ +> - $\sup > 1 \text{M}$ + +> [!CAUTION] +> +> rprop does ***not work*** with `mini-batches` as +> the ***sign of the gradient changes frequently*** + +#### rmsprop in detail[^rmsprop-torch] + +The idea is that [rprop](#rprop--resilient-propagation) +is ***equivalent to using the gradient divided by its +value*** (as you either multiply for $1$ or $-1$), +however it means that between `mini-batches` the +***divisor*** changes each time, oscillating. + +The solution is to have a ***running average*** of +the ***magnitude of the squared gradient for +each `weight`***: + +$$ + MeanSquare(w, t) = + \alpha MeanSquare(w, t-1) + + (1 - \alpha) + \left( + \frac{d\, Out}{d\, w}^2 + \right) +$$ + +We then divide the ***gradient by the `square root`*** +of that value + +#### Further Developments + +- `rmsprop` with `momentum` does not work as it should +- `rmsprop` with `Nesterov momentum` works best + if usedto divide the ***correction*** rather than + the ***jump*** +- `rmsprop` with `adaptive learnings` needs more + investigation + +### Fancy Methods + +#### Adaptive Gradient + + + +##### Convex Case + +- Conjugate Gradient/Acceleration +- L-BFGS +- Quasi-Newton Methods + +##### Non-Convex Case + +Pay attention, here the `Hessian` may not be +`Positive Semi Defined`, thus when the ***gradient*** is +$0$ we don't necessarily know where we are. 
+ +- Natural Gradient Methods +- Curvature Adaptive + - [Adagrad](./Fancy-Methods/ADAGRAD.md) + - [AdaDelta](./Fancy-Methods/ADADELTA.md) + - [RMSprop](#rmsprop-in-detail) + - [ADAM](./Fancy-Methods/ADAM.md) + - l-BFGS + - [heavy ball gradient](#momentum) + - [momemtum](#momentum) +- Noise Injection: + - Simulated Annealing + - Langevin Method + +#### Adagrad + +> [!NOTE] +> [Here in detail](./Fancy-Methods/ADAGRAD.md) + +#### Adadelta + +> [!NOTE] +> [Here in detail](./Fancy-Methods/ADADELTA.md) + +#### ADAM + +> [!NOTE] +> [Here in detail](./Fancy-Methods/ADAM.md) + +#### AdamW + +> [!NOTE] +> [Here in detail](./Fancy-Methods/ADAM-W.md) + +#### LION + +> [!NOTE] +> [Here in detail](./Fancy-Methods/LION.md) + +### Hessian Free[^anelli-hessian-free] + +How much can we `learn` from a given +`Loss` space? + +The ***best way to move*** would be along the +***gradient***, assuming it has +the ***same curvature*** +(e.g. It's and has a local minimum). + +But ***usually this is not the case***, so we need +to move ***where the ratio of gradient and curvature is +high*** + +#### Newton's Method + +This method takes into account the ***curvature*** +of the `Loss` + +With this method, the update would be: + +$$ +\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{ + d \, E +}{ + d \, \vec{w} +} +$$ + +***If this could be feasible we'll go on the minimum in +one step***, but it's not, as the +***computations*** +needed to get a `Hessian` ***increase exponentially***. + +The thing is that whenever we ***update `weights`*** with +the `Steepest Descent` method, each update *messes up* +another, while the ***curvature*** can help to ***scale +these updates*** so that they do not disturb each other. + +#### Curvature Approximations + +However, since the `Hessian` is +***too expensive to compute***, we can approximate it. + +- We can take only the ***diagonal elements*** +- ***Other algorithms*** (e.g. Hessian Free) +- ***Conjugate Gradient*** to minimize the + ***approximation error*** + +#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient] + +> [!CAUTION] +> +> This is an oversemplification of the topic, so reading +> the footnotes material is greatly advised. + +The basic idea is that, in order not to mess up previous +directions, we ***`optimize` along perpendicular directions***. + +This method is ***guaranteed to mathematically succeed +after N steps, the dimension of the space***, in practice +the error will be minimal. + +This ***method works well for `non-quadratic errors`*** +and the `Hessian Free` `optimizer` uses this method +on ***genuinely quadratic surfaces***, which are +***quadratic approximations of the real surface*** + + + + + + +[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/) + +[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4 + +[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1 + +[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html) + +[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html) + +[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 
67-81 + +[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method) + +[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76 diff --git a/Chapters/5-Optimization/INDEX.md b/Chapters/5-Optimization/INDEX.md index 9a2fa04..97b8aa1 100644 --- a/Chapters/5-Optimization/INDEX.md +++ b/Chapters/5-Optimization/INDEX.md @@ -1,501 +1,556 @@ # Optimization -We basically try to see the error and minimize it by moving towards the ***gradient*** +## Beyond Full Batches -## Types of Learning Algorithms +Even though full batches give the best picture of a probability dristribution +of data points, it's computationally expensive. -In `Deep Learning` it's not unusual to be facing ***highly redundant*** `datasets`. -Because of this, usually ***gradient*** from some `samples` is the ***same*** for some others. +Since data is usually **highly redundant**, we can think of getting smaller +sets that are classes balanced, **mini-batches**, to update weights. +While this doesn't give the same results as full batches, is still reliable. -So, often we train the `model` on a subset of samples. +When we need to bring things to the extreme, we can even update over a single +data point, **online learning**, however they are not as efficient as +mini-batches as **they do not use matrix multiplications, which are GPU efficient** -### Online Learning +## Learning rate Scheduling -This is the ***extreme*** of our techniques to deal with ***redundancy*** of `data`. +## Xavier-Glorot Weight initialization -On each `point` we get the ***gradient*** and then we update `weights`. +> [!WARNING] +> Before Xavier-Glorot there was another initialization technique proportional +> to fan-in: +> +> $$ W \propto \frac{rand(in, out)}{\sqrt{in}}$$ +> +> Though, Xavier-Glorot is not the only available initialization as there are +> many others[^torch-init] -### Mini-Batch +Whenever we initialize weights, we need to be careful to **break simmetry**, as +**identical hiddden nodes gets the exact same results**, making us +lose representation power. -In this approach, we divide our `dataset` in small batches called `mini-batches`. -These need to be ***balanced*** in order not to have ***imbalances***. +Another problem with weight initialization is the **overshooting**. This is +caused by **many small changes over weights**. The idea to solve this is by +**initializing weights proprotionally to fan-in (input) and fan-out (output)** -This technique is the ***most used one*** - -## Tips and Tricks - -### Learning Rate - -This is the `hyperparameter` we use to tune our -***learning steps***. - -Sometimes we have it too big and this causes -***overshootings***. So a quick solution may be to turn -it down. - -However, we are ***trading speed for accuracy***, thus it's better to wait before tuning this `parameter` - -### Weight initialization - -We need to avoid `neurons` to have the same -***gradient***. This is easily achievable by using -***small random values***. 
- -However, if we have a ***large `fan-in`***, then it's -***easy to overshoot***, then it's better to initialize -those `weights` ***proportionally to*** -$\sqrt{\text{fan-in}}$: - -$$ -w = \frac{ - np.random(N) -}{ - \sqrt{N} -} -$$ - -#### Xavier-Glorot Initialization - - - -Here `weights` are ***proportional*** to $\sqrt{\text{fan-in}}$ as well, but we ***sample*** from a -`uniform distribution` with a `std-dev` - -$$ -\sigma^2 = \text{gain} \cdot \sqrt{ - \frac{ - 2 - }{ - \text{fan-in} + \text{fan-out} - } -} -$$ - -and bounded between $a$ and $-a$ - -$$ -a = \text{gain} \cdot \sqrt{ - \frac{ - 6 - }{ - \text{fan-in} + \text{fan-out} - } -} -$$ - -Alternatively, one can use a `normal-distribution` -$\mathcal{N}(0, \sigma^2)$. - -Note that `gain` is in the **original paper** is equal -to $1$ - -### Decorrelating input components - -Since ***highly correlated features*** don't offer much -in terms of ***new information***, probably we need -to go in the ***latent space*** to find the -`latent-variables` governing those `features`. - -#### PCA - -> [!CAUTION] -> This topic won't be explained here as it's something -> usually learnt for `Machine Learning`, a -> ***prerequisite*** for approaching `Deep Learning`. - -This is a method we can use to discard `features` that -will ***add little to no information*** - -## Common problems in MultiLayer Networks - -### Hitting a Plateau - -This happenes wehn we have a ***big `learning-rate`*** -which makes `weights` go high in ***absolute value***. - -Because this happens ***too quickly***, we could -see a ***quick diminishing error*** and this is usually -***mistaken for a minimum point***, while instead -it's a ***plateau***. - -## Speeding up Mini-Batch Learning - -### Momentum[^momentum] - -We use this method ***mainly when we use `SGD`*** as -a ***learning techniques*** - -This method is better explained if we imagine -our error surface as an actual surface and we place a -ball over it. - -***The ball will start rolling towards the steepest -descent*** (initially), but ***after gaining enough -velocity*** it will follow the ***previous direction -, in some measure***. - -So, now the ***gradient*** does modify the ***velocity*** -rather than the ***position***, so the momentum will -***dampen small variations***. - -Moreover, once the ***momentum builds up***, we will -easily ***pass over plateaus*** as the -***ball will continue to roll over*** until it is -stopped by a negative ***gradient*** - -#### Momentum Equations - -There are a couple of them, mainly. 
- -One of them uses a term to evaluate the `momentum`, $p$, -called `SGD momentum` or `momentum term` or -`momentum parameter`: +A technique we use to initialize weights comes from Xavier and Glorot, called +Xavier-Glorot initialization: $$ \begin{aligned} - p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\ - w_{k+1} &= w_{k} - \gamma p_{k+1} + &W \propto \frac{rand(in, out)}{in + out} \\ + &rand = \mathcal{U}(-a, a) \rightarrow a = g \cdot \sqrt{\frac{6}{in + out}} \\ + &\,\,\,\,\text{or} \\ + &rand =\mathcal{N}(0, \sigma^2) \rightarrow \sigma = g \cdot + \sqrt{\frac{2}{in + out}} \end{aligned} $$ -The other one is ***logically equivalent*** to the -previous, but it update the `weights` in ***one step*** -and is called `Stochastic Heavy Ball Method`: +In other words, xavier glorot extracts weights from either a uniform distribution, +or a normal one, scaled by a factor $g$ called gain -$$ - w_{k+1} = w_k - \gamma \nabla L(X, y, w_k) - + \beta ( w_k - w_{k-1}) -$$ +[^torch-init]: [Pytorch Official Docs | `torch.nn.init` | 18th November 2025](https://docs.pytorch.org/docs/stable/nn.init.html) -> [!NOTE] -> This is how to choose $\beta$: -> -> $0 < \beta < 1$ -> -> If $\beta = 0$, then we are doing -> ***gradient descent***, if $\beta > 1$ then we -> ***will have numerical instabilities***. -> -> The ***larger*** $\beta$ the -> ***higher the `momentum`***, so it will -> ***turn slower*** +## Momentum > [!TIP] -> usual values are $\beta = 0.9$ or $\beta = 0.99$ -> and usually we start from 0.5 initially, to raise it -> whenever we are stuck. -> -> When we increase $\beta$, then the `learning rate` -> ***must decrease accordingly*** -> (e.g. from 0.9 to 0.99, `learning-rate` must be -> divided by a factor of 10) +> For $\beta$ going from 0.9 to 0.99, the learning rate needs to be decreased by +> a factor of 10 -#### Nesterov (1983) Sutskever (2012) Accelerated Momentum +It's a technique inspired by physics. Imagine a ball rolling over a plane. Once +it has enough speed, even if the plane changes inclination, the ball has +still energy to move along the previous way because of its momentum. -Differently from the previous -[momentum](#momentum-equations), -we take an ***intermediate*** step where we -***update the `weights`*** according to the -***previous `momentum`*** and then we compute the -***new `momentum`*** in this new position, and then -we ***update again*** +Whenever on a gradient descent we have oscillations, **momentum dampens** all +movements steering us from the previous direction. Here momentum at time $k$ +is $p_k$ $$ \begin{aligned} - \hat{w}_k & = w_k - \beta p_k \\ - p_{k+1} &= \beta p_{k} + - \eta \nabla L(X, y, \hat{w}_k) \\ - w_{k+1} &= w_{k} - \gamma p_{k+1} + p_{k+1} &= \beta p_{k} + \eta \nabla L(X, Y, W_{k}) \\ + W_{k+1} &= W_{k} - \gamma p_{k+1} \\ + \beta &\in [0, 1] \end{aligned} $$ -#### Why Momentum Works - -While it has been ***hypothesized*** that -***acceleration*** made ***convergence faster***, this -is -***only true for convex problems without much noise***, -though this may be ***part of the story*** - -The other half may be ***Noise Smoothing*** by -smoothing the optimization process, however -according to these papers[^no-noise-smoothing][^no-noise-smoothing-2] this may not be the actual reason. - -### Separate Adaptive Learning Rates - -Since `weights` may ***greatly vary*** across `layers`, -having a ***single `learning-rate` might not be ideal. 
- -So the idea is to set a `local learning-rate` to -control the `global` one as a ***multiplicative factor*** - -#### Local Learning rates - -- Start with $1$ as the ***starting point*** for - `local learning-rates` which we'll call `gain` from - now on. -- If the `gradient` has the ***same sign, increase it*** -- Otherwise, ***multiplicatively decrease it*** +Or, in a more compact way, logically equivalent to the previous one: $$ - w_{i,j} = - g_{i,j} \cdot \eta \frac{ - d \, Out - }{ - d \, w_{i,j} - } + W_{k+1} = W_{k} - \gamma \nabla L(X, Y, W_{k}) + \beta(W_{k} - W_{k-1}) +$$ - \\ - g_{i,j}(t) = \begin{cases} +The larger $\beta$ the slower it curves, accumulating more of previous directions. +To play it safe, use smaller values once you are at the beginning where updates +are large and slowly turn it up to values near 1 - g_{i,j}(t - 1) + \delta - & \left( \frac{ - d \, Out - }{ - d \, w_{i,j} - } (t) - \cdot - \frac{ - d \, Out - }{ - d \, w_{i,j} - } (t-1) \right) > 0 \\ +> [!NOTE] +> +> - $\eta$: hyperparameter related to the gradient, usually equal to the learnign +> rate +> - $\gamma$: Learning rate +> - $\beta$: hyperparameter of dampening factor +> - $\nabla L(X, Y, W_{k})$: gradient of the loss +> +## Nesterov Acceleated Gradient (aka NAG) - g_{i,j}(t - 1) \cdot (1 - \delta) - & \left( \frac{ - d \, Out - }{ - d \, w_{i,j} - } (t) - \cdot - \frac{ - d \, Out - }{ - d \, w_{i,j} - } (t-1) \right) \leq 0 +This method takes inpiration from Nesterov's optimization for convex functions and +applies it to momentum. Its quirk is that it never computes the gradient where it +lands on, but on a temporary computation of them before the actual update. + +|Vanilla Momentum[^Akshay-medium-1] | Nesterov Momentum[^Akshay-medium-1] | +|--|--| +| ![momentum descent](./pngs/vanilla-momentum.gif) | ![nesterov momentum descent](./pngs/nesterov.gif) | + +To illustrate better its quirk, here's the formulation: + +$$ +\begin{aligned} + \hat{W}_{k} &= W_{k} - \beta p_k \\ + p_{k+1} &= \beta p_{k} + \eta\nabla L(X, Y, \hat{W}_k) \\ + W_{k+1} &= W_{k} - \gamma p_{k+1} +\end{aligned} +$$ + +As it can be seen, the loss is computer over $\hat{W}_{k}$ rather than $W_{k}$ +which will be our actual weights. The idea is to follow the previous momentum +blindly, see where it goes and then make the correction. + +[^Akshay-medium-1]: [Akshay L Chandra | Learning Parameters, Part 2: Momentum-Based & Nesterov Accelerated Gradient Descent | 18th November 2025](https://medium.com/data-science/learning-parameters-part-2-a190bef2d12) + +## Justifying Faster Optimization for Momentum Based Methods + +While many people justify the speed of momentum based methods for its acceleration, +this doesn't hold true as it's only accelerated for convex functions. + +Since we have no idea, most of the times, how our gradient function looks like, +we can't make assumptions about it being convex. + +So, the most compelling explanation lies in the fact that a momentum based +optimization is like computing a running average of the loss gradient, smoothing +the noise introduced by the smaller sampling size. 
In fact, with momentum is not necessary to average steps like in SGD + +## Separate Adaptive Learning Rate + +The idea is that each weight of each layer may need its own learnig rate to avoid +overshooting and smooth the magnitude of received gradients, high over last layers +and low over first ones (architecture wise) + +The trick is to have a global learning rate that is adjusted by a local gain that +is increased each time the weight keeps the same sign and viceversa: + +$$ +\Delta w_{i,j} = - \eta \cdot g_{i,j} \frac{d \,Loss}{d \, w_{i,j}} \\ + +g_{i,j}(n +1 ) = \begin{cases} + g_{i,j}(n) + 0.05 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) > 0 \\ + g_{i,j}(n) \cdot 0.95 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) < 0 \end{cases} $$ -With this method, if there are oscillations, we will have -`gains` around $1$ +This method ensures that if the weight oscillates, the gain will dampen it. +Moreover, should it be totally random, it will hover near 1, keeping gradient +updates unchanged. + +> [!NOTE] +> The way $g$ is updated is similar to AIMD in TCP Congestion + + > [!TIP] > -> - Usually a value for $d$ is $0.05$ -> - Limit `gains` around some values: +> - **Clip gains to some margins** - $[0.1, 10]$ or $[0.01, 100]$ +> - **Use full batch or big mini-batches** - This ensures that the change in sign +> is not due to sampling errors +> - **Combine this with momentum** +> - **Use this to deal with axis-alignment problems** > -> - $[0.1, 10]$ -> - $[0.01, 100]$ -> -> - Use `full-batches` or `big mini-batches` so that -> the ***gradient*** doesn't oscillate because of -> sampling errors -> - Combine it with [Momentum](#momentum) -> - Remember that ***Adaptive `learning-rate`*** deals -> with ***axis-alignment*** -### rmsprop | Root Mean Square Propagation +## Resilient Backpropagation (aka RProp) -#### rprop | Resilient Propagation[^rprop-torch] - -This is basically the same idea of [separating learning rates](#separate-adaptive-learning-rates), -but in this case we don't use the -[AIMD](#local-learning-rates) technique and -***we don't take into account*** the -***magnitude of the gradient*** but ***only the sign*** - -- If ***gradient*** has same sign: - - $step_{k} = step_{k} \cdot \eta_+$ where $\eta_+ > 1$ -- else: - - $step_{k} = step_{k} \cdot \eta_-$ - where $0 <\eta_- < 1$ - -> [!TIP] -> -> Limit the step size in a range where: -> -> - $\inf < 50$ -> - $\sup > 1 \text{M}$ - -> [!CAUTION] -> -> rprop does ***not work*** with `mini-batches` as -> the ***sign of the gradient changes frequently*** - -#### rmsprop in detail[^rmsprop-torch] - -The idea is that [rprop](#rprop--resilient-propagation) -is ***equivalent to using the gradient divided by its -value*** (as you either multiply for $1$ or $-1$), -however it means that between `mini-batches` the -***divisor*** changes each time, oscillating. - -The solution is to have a ***running average*** of -the ***magnitude of the squared gradient for -each `weight`***: +Instead of using the magnitude of the gradient, **RProp uses the sign to derive +updates** that is multiplied by a step value. 
Here's the formulation[^florian-1]: $$ - MeanSquare(w, t) = - \alpha MeanSquare(w, t-1) + - (1 - \alpha) - \left( - \frac{d\, Out}{d\, w}^2 - \right) +w_{i,j}^{(n)} =w_{i,j}^{(n-1)} - s_{i,j}^{(n-1)} \cdot \text{sign}\left( + \frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}} +\right) \\ +s_{i,j}^{(n)} = \begin{cases} + s_{i,j}^{(n - 1)} \cdot 1.2 & + \text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right) + \cdot + \text{sign}\left(\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}\right) > 0 \\ + s_{i,j}^{(n - 1)} \cdot 0.5 & + \text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right) + \cdot + \text{sign}\left(\frac{d \, Loss^{(n- 1)}}{d \, w_{i,j}}\right) < 0 +\end{cases} \\ +s_{i,j} \in [10^{-6}, 50] $$ -We then divide the ***gradient by the `square root`*** -of that value +It is noticeable that , like +[separate adaptive learning rates](#separate-adaptive-learning-rate) it increase +or decreases the gain. However, since it uses multiplication to increase it, makes +it unusable for anything but full-batches, beacause of its fast growth. -#### Further Developments +[^florian-1]: [Florian | RProp | 19th november 2025](https://florian.github.io/rprop/) -- `rmsprop` with `momentum` does not work as it should -- `rmsprop` with `Nesterov momentum` works best - if usedto divide the ***correction*** rather than - the ***jump*** -- `rmsprop` with `adaptive learnings` needs more - investigation +## Root Mean Square Propagation (aka RMSProp) -### Fancy Methods +As the name implies, it propagates the loss over, a bit like momentum. Since +[RProp](#resilient-backpropagation-aka-rprop) uses only the sign of the gradient, +it's almost like dividing the gradient by its magnitude, which is bad in case of +mini-batches, as all divisors are different. -#### Adaptive Gradient - - - -##### Convex Case - -- Conjugate Gradient/Acceleration -- L-BFGS -- Quasi-Newton Methods - -##### Non-Convex Case - -Pay attention, here the `Hessian` may not be -`Positive Semi Defined`, thus when the ***gradient*** is -$0$ we don't necessarily know where we are. - -- Natural Gradient Methods -- Curvature Adaptive - - [Adagrad](./Fancy-Methods/ADAGRAD.md) - - [AdaDelta](./Fancy-Methods/ADADELTA.md) - - [RMSprop](#rmsprop-in-detail) - - [ADAM](./Fancy-Methods/ADAM.md) - - l-BFGS - - [heavy ball gradient](#momentum) - - [momemtum](#momentum) -- Noise Injection: - - Simulated Annealing - - Langevin Method - -#### Adagrad - -> [!NOTE] -> [Here in detail](./Fancy-Methods/ADAGRAD.md) - -#### Adadelta - -> [!NOTE] -> [Here in detail](./Fancy-Methods/ADADELTA.md) - -#### ADAM - -> [!NOTE] -> [Here in detail](./Fancy-Methods/ADAM.md) - -#### AdamW - -> [!NOTE] -> [Here in detail](./Fancy-Methods/ADAM-W.md) - -#### LION - -> [!NOTE] -> [Here in detail](./Fancy-Methods/LION.md) - -### Hessian Free[^anelli-hessian-free] - -How much can we `learn` from a given -`Loss` space? - -The ***best way to move*** would be along the -***gradient***, assuming it has -the ***same curvature*** -(e.g. It's and has a local minimum). 
- -But ***usually this is not the case***, so we need -to move ***where the ratio of gradient and curvature is -high*** - -#### Newton's Method - -This method takes into account the ***curvature*** -of the `Loss` - -With this method, the update would be: +RMSProp solves this by keeping the gradient magnitude similar across mini-batches +by keeping a running average of it: $$ -\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{ - d \, E -}{ - d \, \vec{w} -} +L^{(k)} = \beta L^{(k-1)} + (1 - \beta) \left( + \frac{d \, Loss}{d\, W^{(k -1)}} + \right)^2 \\ +W^{(k)} = W^{(k-1)} - \eta \frac{1}{\sqrt{L^{(k)}}}\frac{d \, Loss}{d\, W^{(k -1)}}\\ + \text{usually } \beta = 0.9 $$ -***If this could be feasible we'll go on the minimum in -one step***, but it's not, as the -***computations*** -needed to get a `Hessian` ***increase exponentially***. +What this method does is keeping a running average of the measn square error, +hence the name, and use it to normalize the gradient keeping it similar across +mini-batches. -The thing is that whenever we ***update `weights`*** with -the `Steepest Descent` method, each update *messes up* -another, while the ***curvature*** can help to ***scale -these updates*** so that they do not disturb each other. - -#### Curvature Approximations - -However, since the `Hessian` is -***too expensive to compute***, we can approximate it. - -- We can take only the ***diagonal elements*** -- ***Other algorithms*** (e.g. Hessian Free) -- ***Conjugate Gradient*** to minimize the - ***approximation error*** - -#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient] - -> [!CAUTION] +> [!NOTE] +> While it can be used with momentum, it doesn't seem to add as much benefits as +> using it standalone. +> +> With Nesterov, it works best if used to normalize the correction, rather than +> the jump. While for the adaptive learning rates, it still requires further +> investigations to prove the efficacy. > -> This is an oversemplification of the topic, so reading -> the footnotes material is greatly advised. -The basic idea is that, in order not to mess up previous -directions, we ***`optimize` along perpendicular directions***. +## Adaptive Gradient Methods -This method is ***guaranteed to mathematically succeed -after N steps, the dimension of the space***, in practice -the error will be minimal. + +### AdaGrad[^adagrad-torch] -This ***method works well for `non-quadratic errors`*** -and the `Hessian Free` `optimizer` uses this method -on ***genuinely quadratic surfaces***, which are -***quadratic approximations of the real surface*** +`AdaGrad` is an ***optimization method*** aimed +to: +***"find needles in the haystack in the form of +very predictive yet rarely observed features"*** +[^adagrad-official-paper] - +`AdaGrad`, opposed to a standard `SGD` that is the +***same for each gradient geometry***, tries to +***incorporate geometry from earlier iterations***. 
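+
+As a quick preview of the update rule formalized in the next subsection, here is a
+minimal diagonal-AdaGrad sketch (the quadratic toy loss is an illustrative
+assumption, not from the paper; $\eta = 0.01$ and $\epsilon = 10^{-8}$ follow the
+values mentioned below):
+
+```python
+import numpy as np
+
+def adagrad_step(w, grad, G, lr=0.01, eps=1e-8):
+    """One diagonal-AdaGrad update: accumulate the squared gradient per
+    coordinate and shrink that coordinate's step accordingly."""
+    G = G + grad ** 2                    # running sum of squared gradients (diag of G)
+    w = w - lr * grad / np.sqrt(G + eps)
+    return w, G
+
+# Toy quadratic loss L(w) = 0.5 * ||w||^2, so the gradient is simply w
+w = np.array([1.0, -3.0])
+G = np.zeros_like(w)
+for _ in range(100):
+    w, G = adagrad_step(w, grad=w, G=G)
+print(w)  # each coordinate decays at its own, history-dependent rate
+```
+
+Note how coordinates with a large accumulated squared gradient take smaller and
+smaller steps, which is also the weakness discussed in the considerations below.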
+ +#### AdaGrad Algorithm + +Instead `AdaGrad` takes another +approach[^anelli-adagrad-2][^adagrad-official-paper]: + +$$ +\begin{aligned} + g_{i}^{(k + 1)} &= \frac{d \, Loss}{d \, w_{i}^{(k)}} \\ + G^{(k + 1)} &= \sum_{\tau = 1}^{t} g^{(\tau)} g^{(\tau)T}\\ + w_{i}^{(k + 1)} &= + w_{i}^{(k)} - \eta \cdot\frac{ + 1 + }{ + \sqrt{G_{i,i}^{(k +1)} + \epsilon} + } \cdot g_{i}^{(k+1)} \\ + +\end{aligned} +$$ + +Here $G^{(k)}$ is the ***sum of outer product*** of the +***gradient*** until time $t$, though ***usually it is +not used*** $G_t$, which is ***impractical because +of the high number of dimensions***, so we use +$diag(G_t)$ which can be +***computed in linear time***[^adagrad-official-paper] + +The $\epsilon$ term here is used to +***avoid dividing by 0***[^anelli-adagrad-2] and has a +small value, usually in the order of $10^{-8}$ + +> [!NOTE] +> +> This example is tough to understand if we where to apply it to a matrix $W$ +> instead of a vector. To make it easier to understand in matricial notation: +> +> $$ +> \begin{aligned} +> \nabla L^{(k + 1)} &= \frac{d \, Loss^{(k)}}{d \, W^{(k)}} \\ +> G^{(k + 1)} &= G^{(k)} +(\nabla L^{(k+1)}) ^2 \\ +> W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}} + {\sqrt{G^{(k+1)} + \epsilon}} +> \end{aligned} +> $$ +> +> In other words, compute the gradient and scale it for the sum of its squares +> until that point + +#### AdaGrad effectiveness[^anelli-adagrad-3] + +- When we have ***many dimensions, many features are + irrelevant*** +- ***Rarer Features are more relevant*** +- It adapts $\eta$ to the right metric space + by projecting gradient stochastic updates with + [Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance), a distance of a point from + a probability distribution. + +#### AdaGrad Considerations + +- It eliminates the need of manually tuning the + `learning rates`, which is usually set to + $0.01$ +- The squared ***gradients*** are accumulated during + iterations, making the `learning-rate` become + ***smaller and smaller***, thus becoming 0 and untrainable -[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/) +[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf) -[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4 +[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html) -[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1 +[^regret-definition]: [Definition of Regret | 19th April 2025](https://eitca.org/artificial-intelligence/eitc-ai-arl-advanced-reinforcement-learning/tradeoff-between-exploration-and-exploitation/exploration-and-exploitation/examination-review-exploration-and-exploitation/explain-the-concept-of-regret-in-reinforcement-learning-and-how-it-is-used-to-evaluate-the-performance-of-an-algorithm/#:~:text=Regret%20quantifies%20the%20difference%20in,and%20making%20decisions%20over%20time.) -[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html) +[^anelli-adagrad-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 
42 -[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html) +[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43 -[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81 +[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44 -[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method) +### AdaDelta[^adadelta-offcial-paper] + +`ADADELTA` was inspired by [`AdaGrad`](./ADAGRAD.md) and +created to address some problems of it, like +***sensitivity to initial `parameters` and corresponding +gradient***[^adadelta-offcial-paper] + +To address all these problems, `ADADELTA` accumulates +***gradients over a `window` as a running average***, rather than ***accumulating +it over all instances***: + +$$ +G^{(k+1)} = \beta \cdot G^{(k)} + + (1 - \beta) \cdot \nabla L^{(k+1)} +$$ + +The update, which is very similar to the one in +[AdaGrad](./ADAGRAD.md#the-algorithm), becomes: + +$$ +\begin{aligned} + W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}{\sqrt{G^{(k+1)} + \epsilon}} +\end{aligned} +$$ + +Technically speaking, the last equation is basically equivalent to the +[RMSProp](#root-mean-square-propagation-aka-rmsprop) one, as $G$ is +equivalent to the running average of the mean square. + +However, as the author pointed out[^adadelta-units], this equation does not +respect units of measures. We should correct this problem +by ***considering the curvature locally smooth*** and +taking an approximation of it at the next step, becoming: + +$$ +\begin{aligned} + \Delta W^{(k)} &= - \frac{\sqrt{S^{(k-1)}}}{\sqrt{G^{(k)}}} + \nabla L^{(k)}\\ + S^{(k)} &= \beta S^{(k - 1)} + (1 - \beta) \Delta W^{2(k)} \\ + W^{(k +1 )} &= W^{(k)} + \Delta W^{(k)} +\end{aligned} +$$ + +As we can notice, the ***`learning rate` completely +disappears from the equation, eliminating the need to +set one*** + +> [!WARNING] +> Here $\Delta W$ is already negative, that's why there's a $+$ in the last +> equation + + + +[^adadelta-offcial-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701) + +[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701) + +### Adaptive Moment Estimation (aka AdaM) + +AdaM computes both the momentum and the squared gradients with running +averages, which are 0 filled at time $k = 0$: + +$$ +\begin{aligned} + M^{(k+1)} &= \beta_1 M^{(k)} + (1 - \beta_1) \nabla L \\ + V^{(k+1)} &= \beta_2 V^{(k)} + (1 - \beta_2) \nabla L^2 \\ +\end{aligned} +$$ + +> [!WARNING] +> The squared gradient can be thought as the variance, however it's not centered + +Then it corrects them to be used in the final formulation: + +$$ +\begin{aligned} + \hat{M}^{(k+1)} &= \frac{M^{(k+1)}}{1 - \beta_1^{k + 1}} \\ + \hat{V}^{(k+1)} &= \frac{V^{(k+1)}}{1 - \beta_2^{k + 1}} \\ +\end{aligned} +$$ + +> [!WARNING] +> $\beta_1$ and $\beta_2$ are put to the power of $k + 1$, the timestep. + +Then it computes the update in this way: + +$$ +W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)}} + {\sqrt{\hat{V}^{(k+1)}} + \epsilon} +$$ + +Even though Adam works, it doesn't generalize well and, particularly in image +problems, it perform worse than standard SGD. 
Moreover, we need to keep 3 buffers +instead of 1 as for SGD, which 2 of them need parameters tuning. + +> [!NOTE] +> Author proposed values are $\beta_1 = 0.9$, $\beta_2 = 0.999$ and +> $\epsilon = 10^-8$ + +### AdamW + +AdamW, tries to solve AdaM problems by introducing weight decay. In all honesty, +AdaM already implements it, however it is usually added to momentum, getting +scaled by $\sqrt{\hat{V}}$: + +$$ +W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)} + \alpha W^{(k)}} + {\sqrt{\hat{V}^{(k+1)}} + \epsilon} +$$ + +AdamW authors saw that this was inefficient as it was influences by the uncentered +variance, thus modified the formula to this: + +$$ +W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)} } + {\sqrt{\hat{V}^{(k+1)}} + \epsilon} + \lambda W^{(k)} +$$ + +### Lion (evoLved sIgn mOmeNtum)[^official-paper] + +`Lion` is the result of a ***genetic search algorithm*** aimed to +find the best `optimizer`. + +It starts from a population of `AdamW` algorithms to +***speed up the search***. Opposed to +`Adam` and `AdamW`, it keeps track +***only for the momentum*** and ***gradient sign***, +requiring ***less `memory`***. + +Since ***uniform updates yields larger norms***, +`Lion` requires a ***smaller `learning-rate`*** +and a ***larger decoupled `weight-decay`*** +$\lambda$[^official-paper-1]. + +The ***advantages of `Lion` over `Adam` and `AdamW` +increase with the size of +the `mini-batch`***[^official-paper-1] + +#### Symbolic Representation[^official-paper-2] + +New ***trained algorithms*** are represented +`simbolically`, bringing these advantages: + +- `Algorithms` must be ***implemented*** as `programs` +- It ***easier to analyze, comprehend and transfer to + new task*** these `algorithms`, rather than other + `algorithms` such as `NeuralNetworks` +- We can **estimate the *complexity*** by looking + at the ***length of code*** + +#### Tournament[^official-paper-3] + +The best code is found with a ***tournament style +evolution***. Each cycle it picks the ***best +`algorithm`*** which will be +***copied and mutated*** and the ***oldest is removed*** + + + +[^official-paper]: [Official Lion Paper | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675) + +[^official-paper-1]: [Official Lion Paper| Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675) + +[^official-paper-2]: [Official Lion Paper| Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675) + +[^official-paper-3]: [Official Lion Paper| Paragraph 2 pg. 4-5 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675) + +## Hessian Free Optimization + +Since we are moving on a function which gradient is not constant, by looking at +the curvature, [Hessian Matrix](./../15-Appendix-A/INDEX.md#hessian-matrix), +we can see when it starts to change. + +### Newton's Method + +This method would technically give us the solution in one step on a quadratic +function, but it is unfeasible due to the memory and computational requirements: + +$$ +\Delta W = - \epsilon H(W)^{-1} \times \frac{d\, L}{d\, W} +$$ + +### Conjugate Gradient + +The idea is to correct the weights so that we reduce the gradient to 0 across +perpendicular directions. This means that, for each update, we are not messing up +previous optimizations. + +While it is usually used for quadratic error surfaces, there's a non linear variant +(non-linear conjugate gradient) that usually works well. However it is also +possible to approximate the true error function with a quadratic one, using the +standard method. 
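+
+As a rough sketch of the standard method on a genuinely quadratic surface
+(minimizing $\frac{1}{2} w^T A w - b^T w$ for a small, hypothetical positive
+definite $A$):
+
+```python
+import numpy as np
+
+def conjugate_gradient(A, b, tol=1e-10):
+    """Minimize 0.5 * w^T A w - b^T w for a symmetric positive definite A
+    by taking exact steps along mutually conjugate directions."""
+    w = np.zeros_like(b)
+    r = b - A @ w              # residual = negative gradient
+    d = r.copy()               # first search direction: steepest descent
+    for _ in range(len(b)):    # at most N steps on an N-dimensional quadratic
+        if np.linalg.norm(r) < tol:
+            break
+        alpha = (r @ r) / (d @ A @ d)   # exact line search along d
+        w = w + alpha * d
+        r_new = r - alpha * (A @ d)
+        beta = (r_new @ r_new) / (r @ r)
+        d = r_new + beta * d            # new direction, conjugate to the previous ones
+        r = r_new
+    return w
+
+A = np.array([[4.0, 1.0], [1.0, 3.0]])  # small, hypothetical SPD curvature
+b = np.array([1.0, 2.0])
+print(conjugate_gradient(A, b))         # ~ [0.0909, 0.6364]
+```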
+ +It gives a solution after $N$ steps over an $N$ dimensional quadratic surface, +however we need to penalize frequent changes in weights, especially for hidden +activities of [`RNNs`](./../8-Recurrent-Networks/INDEX.md) + +## Optimization Tricks + +### Input decorrelation + +If you have a linear neuron, think of a Feed Forward and not of a Convolution, +it's better to decorrelate input components. + +A way to achieve this is through a +[PCA](./../15-Appendix-A/INDEX.md#computing-pca), +transforming the error surface from an ellipse to a circle. + +### Recognize Plateaus + +If we start with big learning rates, since weights gain a big magnitude, the +derivative will be small and the error will not decrease significantly. + +This may seem a local minima, but this is usually a plateau. + +### Mini-Batch Speed up + +To speed up mini batch training use these methods: + +- [**Momentum**](#momentum) +- [**Separate adaptive learing rates for each parameter**](#separate-adaptive-learning-rate) +- [**rmsprop**](#root-mean-square-propagation-aka-rmsprop) +- [**Adaptive Gradients Methods**](#adaptive-gradient-methods) + +### Mini-batches vs Full-Batches + +The rule of thumb is to use **full-batches for small datasets or small redundancy** +, while **mini-batches for redundant datasets** -[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76 diff --git a/Chapters/5-Optimization/pngs/nesterov.gif b/Chapters/5-Optimization/pngs/nesterov.gif new file mode 100644 index 0000000..c95233b Binary files /dev/null and b/Chapters/5-Optimization/pngs/nesterov.gif differ diff --git a/Chapters/5-Optimization/pngs/vanilla-momentum.gif b/Chapters/5-Optimization/pngs/vanilla-momentum.gif new file mode 100644 index 0000000..51d7d2b Binary files /dev/null and b/Chapters/5-Optimization/pngs/vanilla-momentum.gif differ