# Optimization

We basically look at the error and try to minimize it by moving ***against*** the ***gradient***.

## Types of Learning Algorithms

In `Deep Learning` it's not unusual to be facing ***highly redundant*** `datasets`.
Because of this, the ***gradient*** computed on some `samples` is usually nearly the ***same*** as on many others.

So, often we train the `model` on a subset of samples at a time.

### Online Learning

This is the ***extreme*** of our techniques to deal with ***redundancy*** of `data`.

For each `point` we compute the ***gradient*** and then update the `weights`.

### Mini-Batch

In this approach, we divide our `dataset` into small batches called `mini-batches`.
These need to be ***balanced***, so that each `mini-batch` is roughly representative of the whole `dataset`.

This technique is the ***most used one***.
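
As a rough sketch of the difference between the two regimes (the `grad_fn`, `X`, and `y` names are hypothetical stand-ins; only the batching logic is the point):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # One gradient-descent update: move against the gradient
    return w - lr * grad

def train_minibatch(w, X, y, grad_fn, batch_size=32, lr=0.01):
    # Shuffle once per epoch so each mini-batch is a random,
    # hopefully balanced, subset of the data
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = sgd_step(w, grad_fn(w, X[batch], y[batch]), lr)
    return w

# Online learning is the extreme case: batch_size = 1
```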

## Tips and Tricks

### Learning Rate

This is the `hyperparameter` we use to tune our
***learning steps***.

Sometimes it is too big, and this causes
***overshooting***. So a quick solution may be to turn
it down.

However, we are ***trading speed for accuracy***, thus it's better to wait before tuning this `parameter`.
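
A tiny worked example of overshooting on the quadratic loss $L(w) = w^2$ (the values are illustrative): the update $w \leftarrow w - \eta \cdot 2w$ shrinks $w$ for small $\eta$, but makes $|w|$ grow once $\eta > 1$.

```python
def gd_quadratic(w=1.0, lr=0.4, steps=5):
    # Gradient of L(w) = w^2 is 2w, so each step multiplies w by (1 - 2*lr)
    for _ in range(steps):
        w = w - lr * 2 * w
        print(f"w = {w:+.4f}")
    return w

gd_quadratic(lr=0.4)  # converges: |w| shrinks each step
gd_quadratic(lr=1.1)  # overshoots: |w| grows, diverging
```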

### Weight Initialization

We need to avoid `neurons` having the same
***gradient***. This is easily achievable by using
***small random values***.

However, if we have a ***large `fan-in`***, then it's
***easy to overshoot***, so it's better to scale
those `weights` ***down by***
$\sqrt{\text{fan-in}}$:

$$
w = \frac{
    np.random.randn(N)
}{
    \sqrt{N}
}
$$
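
A minimal NumPy sketch of this fan-in scaling (the layer sizes are made up):

```python
import numpy as np

def init_layer(fan_in, fan_out):
    # Small random values, shrunk by sqrt(fan-in) so the
    # pre-activation variance doesn't blow up with many inputs
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

W1 = init_layer(784, 256)  # hypothetical layer sizes
print(W1.std())            # ~ 1 / sqrt(784) ≈ 0.036
```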

#### Xavier-Glorot Initialization

<!-- TODO: Read Xavier-Glorot paper -->

Here too the `weights` are scaled down by the `fan`
sizes, but we ***sample*** from a
`uniform distribution` with `std-dev`

$$
\sigma = \text{gain} \cdot \sqrt{
    \frac{
        2
    }{
        \text{fan-in} + \text{fan-out}
    }
}
$$

and bounded between $-a$ and $a$, where

$$
a = \text{gain} \cdot \sqrt{
    \frac{
        6
    }{
        \text{fan-in} + \text{fan-out}
    }
}
$$

Alternatively, one can use a `normal-distribution`
$\mathcal{N}(0, \sigma^2)$.

Note that `gain` in the **original paper** is equal
to $1$.
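
A sketch of the uniform variant under these formulas ($\text{gain} = 1$ as in the original paper; this mirrors what e.g. `torch.nn.init.xavier_uniform_` computes):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, gain=1.0):
    # Bound a = gain * sqrt(6 / (fan_in + fan_out)); sampling U(-a, a)
    # then gives std-dev gain * sqrt(2 / (fan_in + fan_out))
    a = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-a, a, size=(fan_in, fan_out))
```

The two formulas agree because a uniform on $[-a, a]$ has variance $a^2 / 3$.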

### Decorrelating Input Components

Since ***highly correlated features*** don't offer much
in terms of ***new information***, we probably need
to go into the ***latent space*** to find the
`latent-variables` governing those `features`.

#### PCA

> [!CAUTION]
> This topic won't be explained here as it's something
> usually learnt in `Machine Learning`, a
> ***prerequisite*** for approaching `Deep Learning`.

This is a method we can use to discard `features` that
***add little to no information***.

## Common Problems in MultiLayer Networks

### Hitting a Plateau

This happens when we have a ***big `learning-rate`***
which makes the `weights` grow large in ***absolute value***.

Because this happens ***too quickly***, we may
see a ***quickly diminishing error***, which is usually
***mistaken for a minimum point***, while instead
it's a ***plateau***.

## Speeding up Mini-Batch Learning

### Momentum[^momentum]

We use this method ***mainly when we use `SGD`*** as
the ***learning technique***.

This method is better explained if we imagine
our error surface as an actual surface and place a
ball on it.

***The ball will start rolling in the direction of steepest
descent*** (initially), but ***after gaining enough
velocity*** it will keep following the ***previous
direction, in some measure***.

So now the ***gradient*** modifies the ***velocity***
rather than the ***position***, and the momentum will
***dampen small variations***.

Moreover, once the ***momentum builds up***, we will
easily ***pass over plateaus***, as the
***ball will continue to roll*** until it is
stopped by an opposing ***gradient***.

#### Momentum Equations

There are mainly a couple of them.

One of them uses a term, $p$, to track the `momentum`,
called the `SGD momentum`, `momentum term`, or
`momentum parameter`:

$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$
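
A sketch of this update rule with NumPy arrays (the hypothetical `grad_fn(w, X, y)` stands in for $\nabla L$):

```python
import numpy as np

def momentum_step(w, p, grad, beta=0.9, eta=1.0, gamma=0.01):
    # p_{k+1} = beta * p_k + eta * grad(w_k)
    p = beta * p + eta * grad
    # w_{k+1} = w_k - gamma * p_{k+1}
    w = w - gamma * p
    return w, p

# Usage: start with zero momentum and iterate over batches, e.g.
#   w, p = w0, np.zeros_like(w0)
#   for X_b, y_b in batches:
#       w, p = momentum_step(w, p, grad_fn(w, X_b, y_b))
```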

The other one is ***logically equivalent*** to the
previous, but it updates the `weights` in ***one step***
and is called the `Stochastic Heavy Ball Method`:

$$
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
+ \beta ( w_k - w_{k-1})
$$

> [!NOTE]
> This is how to choose $\beta$:
>
> $0 < \beta < 1$
>
> If $\beta = 0$, then we are doing plain
> ***gradient descent***; if $\beta > 1$, then we
> ***will have numerical instabilities***.
>
> The ***larger*** $\beta$, the
> ***higher the `momentum`***, so it will
> ***turn slower***.

> [!TIP]
> Usual values are $\beta = 0.9$ or $\beta = 0.99$;
> usually we start from $0.5$, raising it
> whenever we are stuck.
>
> When we increase $\beta$, the `learning rate`
> ***must decrease accordingly***
> (e.g. going from $0.9$ to $0.99$, the `learning-rate` must be
> divided by a factor of $10$).

#### Nesterov (1983) Sutskever (2012) Accelerated Momentum

Differently from the previous
[momentum](#momentum-equations),
we take an ***intermediate*** step where we
***update the `weights`*** according to the
***previous `momentum`***, then we compute the
***new `momentum`*** at this new position, and then
we ***update again***:

$$
\begin{aligned}
\hat{w}_k &= w_k - \beta p_k \\
p_{k+1} &= \beta p_{k} +
    \eta \nabla L(X, y, \hat{w}_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$
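
The same sketch adapted to the look-ahead step (again with a hypothetical `grad_fn`):

```python
def nesterov_step(w, p, grad_fn, X, y, beta=0.9, eta=1.0, gamma=0.01):
    # Look ahead with the previous momentum before evaluating the gradient
    w_hat = w - beta * p
    # The new momentum uses the gradient at the look-ahead point
    p = beta * p + eta * grad_fn(w_hat, X, y)
    # Final position update
    w = w - gamma * p
    return w, p
```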

#### Why Momentum Works

While it has been ***hypothesized*** that
***acceleration*** makes ***convergence faster***, this
is
***only true for convex problems without much noise***,
though it may be ***part of the story***.

The other half may be ***noise smoothing***: the
momentum average smooths out noise in the optimization
process. However, according to these papers[^no-noise-smoothing][^no-noise-smoothing-2], this may not be the actual reason.

### Separate Adaptive Learning Rates

Since suitable `weights` may ***greatly vary*** across `layers`,
having a ***single `learning-rate`*** might not be ideal.

So the idea is to set a `local learning-rate` that
modulates the `global` one as a ***multiplicative factor***.

#### Local Learning Rates

- Start from $1$ as the ***starting point*** for the
  `local learning-rates`, which we'll call `gains` from
  now on.
- If the `gradient` keeps the ***same sign***, ***additively increase*** the `gain`
- Otherwise, ***multiplicatively decrease it***

In formulas (a code sketch follows them below):

$$
\Delta w_{i,j} = - g_{i,j} \cdot \eta \frac{
    d \, Out
}{
    d \, w_{i,j}
}
$$

$$
g_{i,j}(t) = \begin{cases}
    g_{i,j}(t - 1) + \delta
    & \text{if } \frac{
        d \, Out
    }{
        d \, w_{i,j}
    } (t)
    \cdot
    \frac{
        d \, Out
    }{
        d \, w_{i,j}
    } (t-1) > 0 \\[2ex]
    g_{i,j}(t - 1) \cdot (1 - \delta)
    & \text{otherwise}
\end{cases}
$$

With this method, if there are oscillations, the
`gains` will hover around $1$.
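
A rough sketch of this additive-increase / multiplicative-decrease rule with NumPy (element-wise over all weights; `grad_prev` holds the previous step's gradient, and the clipping range anticipates the tip below):

```python
import numpy as np

def update_gains(gains, grad, grad_prev, delta=0.05, lo=0.1, hi=10.0):
    # Same sign (positive product): add delta; otherwise multiply by (1 - delta)
    agree = (grad * grad_prev) > 0
    gains = np.where(agree, gains + delta, gains * (1 - delta))
    # Keep the gains in a sane range, as suggested below
    return np.clip(gains, lo, hi)

# Weight update is then: w -= eta * gains * grad
```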

> [!TIP]
>
> - Usually a good value for $\delta$ is $0.05$
> - Limit the `gains` to some range, e.g.:
>
>   - $[0.1, 10]$
>   - $[0.01, 100]$
>
> - Use `full-batches` or `big mini-batches` so that
>   the ***gradient*** doesn't oscillate because of
>   sampling errors
> - Combine it with [Momentum](#momentum)
> - Remember that ***adaptive `learning-rates`*** only deal
>   with ***axis-aligned*** effects

### rmsprop | Root Mean Square Propagation

#### rprop | Resilient Propagation[^rprop-torch]

This is basically the same idea as [separating learning rates](#separate-adaptive-learning-rates),
but in this case we don't use the
[AIMD](#local-learning-rates) technique and
***we don't take into account*** the
***magnitude of the gradient***, but ***only its sign***:

- If the ***gradient*** keeps the same sign:
  - $\text{step}_{k} = \text{step}_{k} \cdot \eta_+$, where $\eta_+ > 1$
- else:
  - $\text{step}_{k} = \text{step}_{k} \cdot \eta_-$,
    where $0 < \eta_- < 1$

A sketch of one rprop step is given below, after the notes.

> [!TIP]
>
> Limit the step sizes to a range where:
>
> - the upper bound is around $50$
> - the lower bound is around $10^{-6}$ (a millionth)

> [!CAUTION]
>
> rprop does ***not work*** with `mini-batches`, as
> the ***sign of the gradient changes frequently***
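
A minimal element-wise sketch of one rprop step under these rules ($\eta_+ = 1.2$, $\eta_- = 0.5$ and the step bounds match PyTorch's `Rprop` defaults; `grad_prev` is the previous gradient):

```python
import numpy as np

def rprop_step(w, step, grad, grad_prev,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    # Grow the step where the gradient kept its sign, shrink it elsewhere
    agree = (grad * grad_prev) > 0
    step = np.where(agree, step * eta_plus, step * eta_minus)
    step = np.clip(step, step_min, step_max)
    # Move by the step size, using only the sign of the gradient
    w = w - np.sign(grad) * step
    return w, step
```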

#### rmsprop in detail[^rmsprop-torch]

The idea is that [rprop](#rprop--resilient-propagation)
is ***equivalent to using the gradient divided by its
magnitude*** (as you only keep the sign, $1$ or $-1$);
however, this means that between `mini-batches` the
***divisor*** changes each time, oscillating.

The solution is to keep a ***running average*** of
the ***squared gradient for each `weight`***:

$$
MeanSquare(w, t) =
\alpha \, MeanSquare(w, t-1) +
(1 - \alpha)
\left(
    \frac{d \, Out}{d \, w}
\right)^2
$$

We then divide the ***gradient*** by the ***`square root`***
of that value.
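
A sketch of the resulting update ($\alpha = 0.99$ and $\epsilon$, the small constant that avoids division by zero, follow PyTorch's `RMSprop` defaults; `lr` is illustrative):

```python
import numpy as np

def rmsprop_step(w, mean_sq, grad, alpha=0.99, lr=0.01, eps=1e-8):
    # Running average of the squared gradient, per weight
    mean_sq = alpha * mean_sq + (1 - alpha) * grad ** 2
    # Divide the gradient by the square root of that average
    w = w - lr * grad / (np.sqrt(mean_sq) + eps)
    return w, mean_sq
```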

#### Further Developments

- `rmsprop` with `momentum` does not work as it should
- `rmsprop` with `Nesterov momentum` works best
  if used to divide the ***correction*** rather than
  the ***jump***
- `rmsprop` with `adaptive learning-rates` needs more
  investigation

### Fancy Methods

#### Adaptive Gradient

<!-- TODO: Expand over these -->

##### Convex Case

- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods

##### Non-Convex Case

Pay attention: here the `Hessian` may not be
`Positive Semi-Definite`, thus when the ***gradient*** is
$0$ we don't necessarily know where we are.

- Natural Gradient Methods
- Curvature Adaptive
  - [Adagrad](./Fancy-Methods/ADAGRAD.md)
  - [AdaDelta](./Fancy-Methods/ADADELTA.md)
  - [RMSprop](#rmsprop-in-detail)
  - [ADAM](./Fancy-Methods/ADAM.md)
  - L-BFGS
  - [heavy ball gradient](#momentum)
  - [momentum](#momentum)
- Noise Injection:
  - Simulated Annealing
  - Langevin Method

#### Adagrad

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAGRAD.md)

#### Adadelta

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADADELTA.md)

#### ADAM

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM.md)

#### AdamW

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM-W.md)

#### LION

> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)

### Hessian Free

<!-- TODO: Add PDF 5 pg. 38 -->

[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)

[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4

[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1

[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)

[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)