diff --git a/Chapters/5-Optimization/INDEX.md b/Chapters/5-Optimization/INDEX.md
new file mode 100644
index 0000000..6df1fdc
--- /dev/null
+++ b/Chapters/5-Optimization/INDEX.md
@@ -0,0 +1,427 @@
# Optimization

The core idea: measure the error and minimize it by moving the `weights` in the direction ***opposite to the gradient***.

## Types of Learning Algorithms

In `Deep Learning` it's not unusual to face ***highly redundant*** `datasets`.
Because of this, the ***gradient*** computed on some `samples` is often nearly ***the same*** as the one computed on others.

So, often we train the `model` on a subset of samples at a time.

### Online Learning

This is the ***extreme*** of our techniques to deal with the ***redundancy*** of `data`.

For each `point` we compute the ***gradient*** and immediately update the `weights`.

### Mini-Batch

In this approach, we divide our `dataset` into small batches called `mini-batches`.
These need to be ***balanced***: each `mini-batch` should be roughly representative of the whole `dataset` (e.g. of its class proportions).

This technique is the ***most used one***.

## Tips and Tricks

### Learning Rate

This is the `hyperparameter` we use to tune the size of our ***learning steps***.

If it is too big it causes ***overshooting***, so a quick solution may be to turn it down.

However, doing so ***trades speed for accuracy***, thus it's better to wait before tuning this `hyperparameter`.

### Weight initialization

We need to avoid `neurons` having the same ***gradient***. This is easily achieved by using ***small random values***.

However, with a ***large `fan-in`*** it's ***easy to overshoot***, so it's better to scale those `weights` ***inversely with*** $\sqrt{\text{fan-in}}$:

$$
w = \frac{
  \texttt{np.random.randn}(N)
}{
  \sqrt{N}
}
$$

#### Xavier-Glorot Initialization

Here `weights` scale ***inversely with*** the `fan-in` as well, but we ***sample*** from a `uniform distribution` bounded between $-a$ and $a$

$$
a = \text{gain} \cdot \sqrt{
  \frac{
    6
  }{
    \text{fan-in} + \text{fan-out}
  }
}
$$

Alternatively, one can use a `normal distribution` $\mathcal{N}(0, \sigma^2)$ with `std-dev`

$$
\sigma = \text{gain} \cdot \sqrt{
  \frac{
    2
  }{
    \text{fan-in} + \text{fan-out}
  }
}
$$

Note that in the **original paper** `gain` is equal to $1$.
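The following is a minimal NumPy sketch of these initializations (the layer shape, the seed, and the helper names are illustrative assumptions, not part of the original formulas):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, purely for reproducibility

def scaled_init(fan_in: int, fan_out: int) -> np.ndarray:
    """Small random weights shrunk by sqrt(fan-in) to avoid overshooting."""
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

def xavier_uniform(fan_in: int, fan_out: int, gain: float = 1.0) -> np.ndarray:
    """Xavier-Glorot: sample uniformly in [-a, a]."""
    a = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

def xavier_normal(fan_in: int, fan_out: int, gain: float = 1.0) -> np.ndarray:
    """Xavier-Glorot: sample from N(0, sigma^2)."""
    sigma = gain * np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))

W = xavier_uniform(fan_in=784, fan_out=128)
print(W.std())  # ~ sqrt(2 / (784 + 128)): a uniform on [-a, a] has std a / sqrt(3)
```

Both Xavier variants target the same standard deviation; they differ only in the distribution they sample from.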
### Decorrelating input components

Since ***highly correlated features*** don't offer much in terms of ***new information***, we may want to move to the ***latent space*** and find the `latent-variables` governing those `features`.

#### PCA

> [!CAUTION]
> This topic won't be explained here as it's usually learnt in `Machine Learning`, a ***prerequisite*** for approaching `Deep Learning`.

This is a method we can use to discard `features` that ***add little to no information***.

## Common problems in MultiLayer Networks

### Hitting a Plateau

This happens when we have a ***big `learning-rate`*** which makes the `weights` grow large in ***absolute value***.

Because this happens ***too quickly***, we could see a ***quickly diminishing error***, which is usually ***mistaken for a minimum point***, while instead it's a ***plateau***.

## Speeding up Mini-Batch Learning

### Momentum[^momentum]

We use this method ***mainly together with `SGD`*** as a ***learning technique***.

This method is better explained if we imagine our error surface as an actual surface and place a ball on it.

***The ball will start rolling towards the steepest descent*** (initially), but ***after gaining enough velocity*** it will, ***in some measure, keep following its previous direction***.

So the ***gradient*** now modifies the ***velocity*** rather than the ***position*** directly, and the momentum ***dampens small oscillations***.

Moreover, once the ***momentum builds up***, we will easily ***pass over plateaus***: the ***ball keeps rolling*** until it is stopped by an opposing ***gradient***.

#### Momentum Equations

There are mainly two formulations.

One of them keeps an explicit `momentum` buffer $p$ and is often called `SGD momentum`; $\beta$ is the `momentum term` or `momentum parameter`:

$$
\begin{aligned}
  p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
  w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$

The other one is ***logically equivalent*** to the previous, but it updates the `weights` in ***one step*** and is called the `Stochastic Heavy Ball Method`:

$$
  w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
  + \beta ( w_k - w_{k-1})
$$

> [!NOTE]
> This is how to choose $\beta$:
>
> $0 < \beta < 1$
>
> If $\beta = 0$, then we are doing plain
> ***gradient descent***; if $\beta \geq 1$ we
> ***will have numerical instabilities***.
>
> The ***larger*** $\beta$ the
> ***higher the `momentum`***, so the ball will
> ***turn more slowly***.

> [!TIP]
> Usual values are $\beta = 0.9$ or $\beta = 0.99$;
> we usually start from $0.5$ and raise it
> whenever learning gets stuck.
>
> When we increase $\beta$, the `learning rate`
> ***must decrease accordingly***
> (e.g. going from $0.9$ to $0.99$, the `learning-rate`
> must be divided by a factor of $10$).

#### Nesterov (1983) Sutskever (2012) Accelerated Momentum

Differently from the previous
[momentum](#momentum-equations),
we take an ***intermediate*** step where we
***update the `weights`*** according to the
***previous `momentum`***, then we compute the
***new `momentum`*** at this new position, and finally
we ***update again***:

$$
\begin{aligned}
  \hat{w}_k & = w_k - \beta p_k \\
  p_{k+1} &= \beta p_{k}
  + \eta \nabla L(X, y, \hat{w}_k) \\
  w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$
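Below is a minimal sketch of both updates on a toy quadratic loss (the loss, the hyperparameter values, and the step count are illustrative assumptions):

```python
import numpy as np

def grad_L(w: np.ndarray) -> np.ndarray:
    """Toy gradient: L(w) = 0.5 * ||w||^2, whose gradient is w itself."""
    return w

def sgd_momentum(w, steps=100, beta=0.9, eta=1.0, gamma=0.1, nesterov=False):
    """Two-step momentum update; with nesterov=True the gradient is
    evaluated at the look-ahead point w - beta * p."""
    p = np.zeros_like(w)
    for _ in range(steps):
        w_eval = w - beta * p if nesterov else w  # hat{w}_k in the Nesterov equations
        p = beta * p + eta * grad_L(w_eval)       # p_{k+1} = beta p_k + eta grad
        w = w - gamma * p                         # w_{k+1} = w_k - gamma p_{k+1}
    return w

w0 = np.array([5.0, -3.0])
print(sgd_momentum(w0.copy()))                 # plain momentum
print(sgd_momentum(w0.copy(), nesterov=True))  # Nesterov look-ahead variant
```

Both runs converge towards the minimum at the origin; setting `beta=0` recovers plain gradient descent, as noted above.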
#### Why Momentum Works

It has been ***hypothesized*** that ***acceleration*** makes ***convergence faster***, but this is
***only true for convex problems without much noise***; still, it may be ***part of the story***.

The other half may be ***Noise Smoothing***: momentum averages gradients over time, smoothing the optimization process. However, according to these papers[^no-noise-smoothing][^no-noise-smoothing-2], this may not be the actual reason.

### Separate Adaptive Learning Rates

Since `weights` may ***greatly vary*** across `layers`, having a ***single `learning-rate`*** might not be ideal.

So the idea is to set a `local learning-rate` that modulates the `global` one as a ***multiplicative factor***.

#### Local Learning rates

- Start from $1$ for each `local learning-rate`, which we'll call `gain` from now on.
- If the `gradient` for a weight keeps the ***same sign***, ***additively increase its `gain`***.
- Otherwise, ***multiplicatively decrease it***.

$$
\Delta w_{i,j} = - g_{i,j} \cdot \eta \frac{
  \partial \, Out
}{
  \partial \, w_{i,j}
}
$$

$$
g_{i,j}(t) = \begin{cases}
  g_{i,j}(t - 1) + \delta
  & \text{if } \left( \dfrac{\partial \, Out}{\partial \, w_{i,j}}(t)
  \cdot
  \dfrac{\partial \, Out}{\partial \, w_{i,j}}(t-1) \right) > 0 \\[2ex]
  g_{i,j}(t - 1) \cdot (1 - \delta)
  & \text{otherwise}
\end{cases}
$$

With this method, if the `gradient` oscillates, the `gains` will hover around $1$.

> [!TIP]
>
> - A usual value for $\delta$ is $0.05$
> - Clamp the `gains` to a range such as:
>
>   - $[0.1, 10]$
>   - $[0.01, 100]$
>
> - Use `full-batches` or `big mini-batches` so that
>   the ***gradient*** doesn't change sign because of
>   sampling errors
> - Combine it with [Momentum](#momentum)
> - Remember that ***adaptive `learning-rates`*** only deal
>   with ***axis-aligned*** effects

### rmsprop | Root Mean Square Propagation

#### rprop | Resilient Propagation[^rprop-torch]

This is basically the same idea as [separating learning rates](#separate-adaptive-learning-rates),
but in this case we don't use the
[AIMD](#local-learning-rates) technique and
***we don't take into account*** the
***magnitude of the gradient***, ***only its sign***:

- If the ***gradient*** has the same sign as before:
  - $step_{k} = step_{k} \cdot \eta_+$ where $\eta_+ > 1$
- else:
  - $step_{k} = step_{k} \cdot \eta_-$
    where $0 < \eta_- < 1$

> [!TIP]
>
> Limit the step sizes, e.g. keep them below $50$
> and above one millionth ($10^{-6}$).

> [!CAUTION]
>
> rprop does ***not work*** with `mini-batches`, as
> the ***sign of the gradient changes frequently*** between them.

#### rmsprop in detail[^rmsprop-torch]

The idea is that [rprop](#rprop--resilient-propagation)
is ***equivalent to dividing the gradient by its magnitude***
(only the sign survives), which means that between
`mini-batches` the ***divisor*** changes each time, oscillating.

The solution is to keep a ***running average*** of
the ***squared gradient for each `weight`***:

$$
  MeanSquare(w, t) =
  \alpha \, MeanSquare(w, t-1)
  + (1 - \alpha)
  \left(
    \frac{\partial \, Out}{\partial \, w}
  \right)^2
$$

We then divide the ***gradient by the `square root`*** of that value.
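A minimal sketch of this update (the toy loss, the hyperparameter values, and the small `eps` added to the divisor for numerical stability are assumptions beyond the formula above):

```python
import numpy as np

def rmsprop_step(w, grad, mean_square, alpha=0.9, lr=1e-3, eps=1e-8):
    """One rmsprop update: keep a running average of the squared gradient
    per weight, then divide the gradient by its square root."""
    mean_square = alpha * mean_square + (1 - alpha) * grad**2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square

# Toy usage on L(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([5.0, -3.0])
ms = np.zeros_like(w)
for _ in range(1000):
    w, ms = rmsprop_step(w, grad=w, mean_square=ms)
print(w)  # shrinking towards the minimum at 0
```

Once `mean_square` tracks the squared gradient, each step has magnitude close to `lr`, which recovers the sign-only behaviour of rprop while remaining well-defined across `mini-batches`.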
#### Further Developments

- `rmsprop` with `momentum` does not work as it should
- `rmsprop` with `Nesterov momentum` works best
  if used to divide the ***correction*** rather than
  the ***jump***
- `rmsprop` with `adaptive learning-rates` needs more
  investigation

### Fancy Methods

#### Adaptive Gradient

##### Convex Case

- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods

##### Non-Convex Case

Pay attention: here the `Hessian` may not be
`Positive Semi-Definite`, thus when the ***gradient*** is
$0$ we don't necessarily know whether we are at a minimum,
a maximum, or a saddle point.

- Natural Gradient Methods
- Curvature Adaptive
  - [Adagrad](./Fancy-Methods/ADAGRAD.md)
  - [AdaDelta](./Fancy-Methods/ADADELTA.md)
  - [RMSprop](#rmsprop-in-detail)
  - [ADAM](./Fancy-Methods/ADAM.md)
  - L-BFGS
  - [heavy ball gradient](#momentum)
  - [momentum](#momentum)
- Noise Injection:
  - Simulated Annealing
  - Langevin Method

#### Adagrad

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAGRAD.md)

#### Adadelta

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADADELTA.md)

#### ADAM

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM.md)

#### AdamW

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM-W.md)

#### LION

> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)

### Hessian Free

[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)

[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4

[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1

[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)

[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)