# Optimization

We basically compute the error and minimize it by moving ***against the gradient***, i.e. along the direction of steepest descent.

## Types of Learning Algorithms

In `Deep Learning` it's not unusual to face ***highly redundant*** `datasets`. Because of this, the ***gradient*** computed on some `samples` is often nearly the ***same*** as the one computed on others. So, we often train the `model` on a subset of samples.

### Online Learning

This is the ***extreme*** of our techniques to deal with ***redundancy*** of `data`: for each `point` we compute the ***gradient*** and immediately update the `weights`.

### Mini-Batch

In this approach, we divide our `dataset` into small batches called `mini-batches`. These need to be ***balanced***, i.e. each mini-batch should be roughly representative of the whole `dataset` (for instance, containing the classes in similar proportions). This technique is the ***most used one***.

## Tips and Tricks

### Learning Rate

This is the `hyperparameter` we use to tune the size of our ***learning steps***. If it is too big it causes ***overshooting***, so a quick solution may be to turn it down. However, we would be ***trading speed for accuracy***, thus it's better to wait before tuning this `parameter`.

### Weight initialization

We need to avoid different `neurons` getting the same ***gradient***. This is easily achievable by using ***small random values***. However, if we have a ***large `fan-in`***, it's ***easy to overshoot***, so it's better to scale those `weights` ***inversely proportional to*** $\sqrt{\text{fan-in}}$:

$$
w = \frac{ \texttt{np.random.randn}(N) }{ \sqrt{N} }
$$

#### Xavier-Glorot Initialization

Here too the scale of the `weights` shrinks with the `fan-in` (and `fan-out`), but we ***sample*** from a `uniform distribution` with standard deviation

$$
\sigma = \text{gain} \cdot \sqrt{ \frac{ 2 }{ \text{fan-in} + \text{fan-out} } }
$$

bounded between $-a$ and $a$, where

$$
a = \text{gain} \cdot \sqrt{ \frac{ 6 }{ \text{fan-in} + \text{fan-out} } }
$$

Alternatively, one can use a `normal distribution` $\mathcal{N}(0, \sigma^2)$ with the same $\sigma$. Note that in the **original paper** `gain` is equal to $1$.
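As a reference, here is a minimal `NumPy` sketch of both schemes (the function names, shapes and fixed seed are illustrative assumptions, not taken from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def sqrt_fan_in_init(fan_in, fan_out):
    """Small random values, scaled by 1 / sqrt(fan-in)."""
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

def xavier_uniform(fan_in, fan_out, gain=1.0):
    """Uniform in [-a, a] with a = gain * sqrt(6 / (fan-in + fan-out))."""
    a = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out, gain=1.0):
    """N(0, sigma^2) with sigma = gain * sqrt(2 / (fan-in + fan-out))."""
    sigma = gain * np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))

W = xavier_uniform(fan_in=256, fan_out=128)  # weight matrix for a 256 -> 128 layer
```

In `PyTorch` the same schemes are provided by `torch.nn.init.xavier_uniform_` and `torch.nn.init.xavier_normal_`.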
### Decorrelating input components

Since ***highly correlated features*** don't offer much in terms of ***new information***, we probably need to go into the ***latent space*** and find the `latent-variables` governing those `features`.

#### PCA

> [!CAUTION]
> This topic won't be explained here as it's something
> usually learnt in `Machine Learning`, a
> ***prerequisite*** for approaching `Deep Learning`.

This is a method we can use to discard `features` that ***add little to no information***.

## Common problems in MultiLayer Networks

### Hitting a Plateau

This happens when we have a ***big `learning-rate`*** which makes the `weights` grow large in ***absolute value***. Because this happens ***too quickly***, we may see the ***error diminish rapidly***, and this is usually ***mistaken for a minimum point***, while instead it's a ***plateau***.

## Speeding up Mini-Batch Learning

### Momentum[^momentum]

We use this method ***mainly with `SGD`*** as the ***learning technique***.

This method is better explained if we imagine our error surface as an actual surface and we place a ball on it. ***The ball will start rolling towards the steepest descent*** (initially), but ***after gaining enough velocity*** it will keep following its ***previous direction, to some extent***. So the ***gradient*** now modifies the ***velocity*** rather than the ***position***, and the momentum ***dampens small oscillations***.

Moreover, once the ***momentum builds up***, we will easily ***pass over plateaus***, as the ***ball will keep rolling*** until an opposing ***gradient*** stops it.

#### Momentum Equations

There are mainly two formulations. The first uses a term that accumulates the `momentum`, $p$, called the `SGD momentum`, `momentum term` or `momentum parameter`:

$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$

The other one is ***logically equivalent*** to the previous, but it updates the `weights` in ***one step*** and is called the `Stochastic Heavy Ball Method`:

$$
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k) + \beta ( w_k - w_{k-1})
$$

> [!NOTE]
> This is how to choose $\beta$:
>
> $0 < \beta < 1$
>
> If $\beta = 0$ we are doing plain
> ***gradient descent***; if $\beta > 1$ we
> ***will have numerical instabilities***.
>
> The ***larger*** $\beta$, the
> ***higher the `momentum`***, so the trajectory
> ***changes direction more slowly***.

> [!TIP]
> Usual values are $\beta = 0.9$ or $\beta = 0.99$;
> usually we start from $0.5$ and raise it
> whenever we are stuck.
>
> When we increase $\beta$, the `learning rate`
> ***must decrease accordingly***
> (e.g. going from $0.9$ to $0.99$, the `learning-rate` must be
> divided by a factor of $10$).

#### Nesterov (1983) Sutskever (2012) Accelerated Momentum

Differently from the previous [momentum](#momentum-equations), we take an ***intermediate*** step where we ***update the `weights`*** according to the ***previous `momentum`***, then we compute the ***new `momentum`*** at this new position, and then we ***update again***:

$$
\begin{aligned}
\hat{w}_k &= w_k - \beta p_k \\
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, \hat{w}_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$
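A minimal `NumPy` sketch of both updates, mirroring the equations above (the `grad_fn` callback, the toy loss and the default hyperparameters are illustrative assumptions):

```python
import numpy as np

def momentum_step(w, p, grad, beta=0.9, eta=1.0, gamma=0.01):
    """p_{k+1} = beta * p_k + eta * grad ;  w_{k+1} = w_k - gamma * p_{k+1}"""
    p = beta * p + eta * grad
    w = w - gamma * p
    return w, p

def nesterov_step(w, p, grad_fn, beta=0.9, eta=1.0, gamma=0.01):
    """Same update, but the gradient is evaluated at the look-ahead point w - beta * p."""
    w_hat = w - beta * p                  # intermediate step along the previous momentum
    p = beta * p + eta * grad_fn(w_hat)   # new momentum, computed at the new position
    w = w - gamma * p
    return w, p

# Toy usage: minimize L(w) = 0.5 * ||w||^2, whose gradient is simply w.
w, p = np.ones(3), np.zeros(3)
for _ in range(200):
    w, p = nesterov_step(w, p, grad_fn=lambda v: v)
print(w)  # close to the minimum at 0
```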
#### Why Momentum Works

While it has been ***hypothesized*** that ***acceleration*** makes ***convergence faster***, this is ***only true for convex problems without much noise***, so it may be only ***part of the story***.

The other half may be ***Noise Smoothing***, i.e. smoothing the optimization process; however, according to these papers[^no-noise-smoothing][^no-noise-smoothing-2], this may not be the actual reason.

### Separate Adaptive Learning Rates

Since `weights` may ***vary greatly*** across `layers`, having a ***single `learning-rate`*** might not be ideal. So the idea is to set a `local learning-rate` that modulates the `global` one as a ***multiplicative factor***.

#### Local Learning rates

- Start with $1$ as the ***starting point*** for the `local learning-rates`, which we'll call `gains` from now on.
- If the `gradient` keeps the ***same sign, additively increase*** the `gain`
- Otherwise, ***multiplicatively decrease it***

$$
\Delta w_{i,j} = - \eta \cdot g_{i,j} \cdot \frac{ d \, Out }{ d \, w_{i,j} } \\
g_{i,j}(t) =
\begin{cases}
g_{i,j}(t - 1) + \delta & \left( \frac{ d \, Out }{ d \, w_{i,j} } (t) \cdot \frac{ d \, Out }{ d \, w_{i,j} } (t-1) \right) > 0 \\
g_{i,j}(t - 1) \cdot (1 - \delta) & \left( \frac{ d \, Out }{ d \, w_{i,j} } (t) \cdot \frac{ d \, Out }{ d \, w_{i,j} } (t-1) \right) \leq 0
\end{cases}
$$

With this method, if the `gradient` oscillates, the `gains` hover around $1$ (the additive increases and multiplicative decreases roughly cancel out).

> [!TIP]
>
> - A usual value for $\delta$ is $0.05$
> - Limit the `gains` to some range, e.g.:
>   - $[0.1, 10]$
>   - $[0.01, 100]$
> - Use `full-batches` or `big mini-batches` so that
>   the ***gradient*** doesn't oscillate because of
>   sampling errors
> - Combine it with [Momentum](#momentum)
> - Remember that ***Adaptive `learning-rates`*** only deal
>   with ***axis-aligned*** effects (each `weight` gets its own scale)

### rmsprop | Root Mean Square Propagation

#### rprop | Resilient Propagation[^rprop-torch]

This is basically the same idea as [separate learning rates](#separate-adaptive-learning-rates), but in this case we don't use the [AIMD](#local-learning-rates) technique and ***we don't take into account*** the ***magnitude of the gradient***, ***only its sign***:

- If the ***gradient*** has the same sign as before:
  - $step_{k} = step_{k} \cdot \eta_+$ where $\eta_+ > 1$
- else:
  - $step_{k} = step_{k} \cdot \eta_-$ where $0 < \eta_- < 1$

> [!TIP]
>
> Limit the step size to a range, e.g.:
>
> - upper bound below $50$
> - lower bound above $10^{-6}$ (a millionth)

> [!CAUTION]
>
> rprop does ***not work*** with `mini-batches`, as
> the ***sign of the gradient changes frequently***.

#### rmsprop in detail[^rmsprop-torch]

The idea is that [rprop](#rprop--resilient-propagation) is ***equivalent to using the gradient divided by its magnitude*** (as only the sign, $\pm 1$, is kept); however, this means that between `mini-batches` the ***divisor*** changes each time, oscillating.

The solution is to keep a ***running average*** of the ***squared gradient for each `weight`***:

$$
MeanSquare(w, t) = \alpha \, MeanSquare(w, t-1) + (1 - \alpha) \left( \frac{d\, Out}{d\, w} \right)^2
$$

We then divide the ***gradient by the `square root`*** of that value.
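A minimal `NumPy` sketch of that update (the `eps` term for numerical stability and the default hyperparameters are common choices, not taken from these notes):

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, lr=1e-3, alpha=0.9, eps=1e-8):
    """Keep a running average of the squared gradient, per weight, and
    divide the gradient by its square root (eps avoids division by zero)."""
    mean_square = alpha * mean_square + (1.0 - alpha) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square

# Toy usage on L(w) = 0.5 * ||w||^2 (gradient = w): w approaches 0
# taking roughly constant-size steps, since grad / sqrt(mean of grad^2) is about +-1.
w, ms = np.ones(3), np.zeros(3)
for _ in range(1000):
    w, ms = rmsprop_step(w, ms, grad=w)
```

This is essentially what `torch.optim.RMSprop` implements (see the footnote above for its documentation).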
#### Further Developments

- `rmsprop` with `momentum` does not work as well as it should
- `rmsprop` with `Nesterov momentum` works best if used to divide the ***correction*** rather than the ***jump***
- `rmsprop` with `adaptive learning rates` needs more investigation

### Fancy Methods

#### Adaptive Gradient

##### Convex Case

- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods

##### Non-Convex Case

Pay attention: here the `Hessian` may not be `positive semi-definite`, thus when the ***gradient*** is $0$ we don't necessarily know whether we are at a minimum, a maximum or a saddle point.

- Natural Gradient Methods
- Curvature Adaptive
  - [Adagrad](./Fancy-Methods/ADAGRAD.md)
  - [AdaDelta](./Fancy-Methods/ADADELTA.md)
  - [RMSprop](#rmsprop-in-detail)
  - [ADAM](./Fancy-Methods/ADAM.md)
  - L-BFGS
  - [heavy ball gradient](#momentum)
  - [momentum](#momentum)
- Noise Injection:
  - Simulated Annealing
  - Langevin Method

#### Adagrad

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAGRAD.md)

#### Adadelta

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADADELTA.md)

#### ADAM

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM.md)

#### AdamW

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM-W.md)

#### LION

> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)

### Hessian Free

[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)
[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4
[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1
[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)
[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)