# Optimization

We basically look at the error and try to minimize it by moving ***against*** the ***gradient***.

## Types of Learning Algorithms

In `Deep Learning` it's not unusual to be facing ***highly redundant*** `datasets`.
Because of this, the ***gradient*** computed on some `samples` is usually nearly the ***same*** as on many others.

So, often we train the `model` on a subset of samples at a time.

### Online Learning

This is the ***extreme*** of our techniques to deal with ***redundancy*** of `data`.

For each `point` we compute the ***gradient*** and then update the `weights`.

### Mini-Batch

In this approach, we divide our `dataset` into small batches called `mini-batches`.
These need to be ***balanced***, so that each `mini-batch` is roughly representative of the whole `dataset`.

This technique is the ***most used one***.
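
As a rough sketch of the difference between the two regimes (the `grad_fn`, `X`, and `y` names are hypothetical stand-ins; only the batching logic is the point):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # One gradient-descent update: move against the gradient
    return w - lr * grad

def train_minibatch(w, X, y, grad_fn, batch_size=32, lr=0.01):
    # Shuffle once per epoch so each mini-batch is a random,
    # hopefully balanced, subset of the data
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = sgd_step(w, grad_fn(w, X[batch], y[batch]), lr)
    return w

# Online learning is the extreme case: batch_size = 1
```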

## Tips and Tricks

### Learning Rate

This is the `hyperparameter` we use to tune our
***learning steps***.

Sometimes it is too big, and this causes
***overshooting***. So a quick solution may be to turn
it down.

However, we are ***trading speed for accuracy***, thus it's better to wait before tuning this `parameter`.
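
A tiny worked example of overshooting on the quadratic loss $L(w) = w^2$ (the values are illustrative): the update $w \leftarrow w - \eta \cdot 2w$ shrinks $w$ for small $\eta$, but makes $|w|$ grow once $\eta > 1$.

```python
def gd_quadratic(w=1.0, lr=0.4, steps=5):
    # Gradient of L(w) = w^2 is 2w, so each step multiplies w by (1 - 2*lr)
    for _ in range(steps):
        w = w - lr * 2 * w
        print(f"w = {w:+.4f}")
    return w

gd_quadratic(lr=0.4)  # converges: |w| shrinks each step
gd_quadratic(lr=1.1)  # overshoots: |w| grows, diverging
```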

### Weight Initialization

We need to avoid `neurons` having the same
***gradient***. This is easily achievable by using
***small random values***.

However, if we have a ***large `fan-in`***, then it's
***easy to overshoot***, so it's better to scale
those `weights` ***down by***
$\sqrt{\text{fan-in}}$:

$$
w = \frac{
    np.random.randn(N)
}{
    \sqrt{N}
}
$$
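
A minimal NumPy sketch of this fan-in scaling (the layer sizes are made up):

```python
import numpy as np

def init_layer(fan_in, fan_out):
    # Small random values, shrunk by sqrt(fan-in) so the
    # pre-activation variance doesn't blow up with many inputs
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

W1 = init_layer(784, 256)  # hypothetical layer sizes
print(W1.std())            # ~ 1 / sqrt(784) ≈ 0.036
```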

#### Xavier-Glorot Initialization

<!-- TODO: Read Xavier-Glorot paper -->

Here too the `weights` are scaled down by the `fan`
sizes, but we ***sample*** from a
`uniform distribution` with `std-dev`

$$
\sigma = \text{gain} \cdot \sqrt{
    \frac{
        2
    }{
        \text{fan-in} + \text{fan-out}
    }
}
$$

and bounded between $-a$ and $a$, where

$$
a = \text{gain} \cdot \sqrt{
    \frac{
        6
    }{
        \text{fan-in} + \text{fan-out}
    }
}
$$

Alternatively, one can use a `normal-distribution`
$\mathcal{N}(0, \sigma^2)$.

Note that `gain` in the **original paper** is equal
to $1$.
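
A sketch of the uniform variant under these formulas ($\text{gain} = 1$ as in the original paper; this mirrors what e.g. `torch.nn.init.xavier_uniform_` computes):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, gain=1.0):
    # Bound a = gain * sqrt(6 / (fan_in + fan_out)); sampling U(-a, a)
    # then gives std-dev gain * sqrt(2 / (fan_in + fan_out))
    a = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-a, a, size=(fan_in, fan_out))
```

The two formulas agree because a uniform on $[-a, a]$ has variance $a^2 / 3$.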

### Decorrelating Input Components

Since ***highly correlated features*** don't offer much
in terms of ***new information***, we probably need
to go into the ***latent space*** to find the
`latent-variables` governing those `features`.

#### PCA

> [!CAUTION]
> This topic won't be explained here as it's something
> usually learnt in `Machine Learning`, a
> ***prerequisite*** for approaching `Deep Learning`.

This is a method we can use to discard `features` that
***add little to no information***.

## Common Problems in MultiLayer Networks

### Hitting a Plateau

This happens when we have a ***big `learning-rate`***
which makes the `weights` grow large in ***absolute value***.

Because this happens ***too quickly***, we may
see a ***quickly diminishing error***, which is usually
***mistaken for a minimum point***, while instead
it's a ***plateau***.

## Speeding up Mini-Batch Learning

### Momentum[^momentum]

We use this method ***mainly when we use `SGD`*** as
the ***learning technique***.

This method is better explained if we imagine
our error surface as an actual surface and place a
ball on it.

***The ball will start rolling in the direction of steepest
descent*** (initially), but ***after gaining enough
velocity*** it will keep following the ***previous
direction, in some measure***.

So now the ***gradient*** modifies the ***velocity***
rather than the ***position***, and the momentum will
***dampen small variations***.

Moreover, once the ***momentum builds up***, we will
easily ***pass over plateaus***, as the
***ball will continue to roll*** until it is
stopped by an opposing ***gradient***.

#### Momentum Equations

There are mainly a couple of them.

One of them uses a term, $p$, to track the `momentum`,
called the `SGD momentum`, `momentum term`, or
`momentum parameter`:

$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$
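
A sketch of this update rule with NumPy arrays (the hypothetical `grad_fn(w, X, y)` stands in for $\nabla L$):

```python
import numpy as np

def momentum_step(w, p, grad, beta=0.9, eta=1.0, gamma=0.01):
    # p_{k+1} = beta * p_k + eta * grad(w_k)
    p = beta * p + eta * grad
    # w_{k+1} = w_k - gamma * p_{k+1}
    w = w - gamma * p
    return w, p

# Usage: start with zero momentum and iterate over batches, e.g.
#   w, p = w0, np.zeros_like(w0)
#   for X_b, y_b in batches:
#       w, p = momentum_step(w, p, grad_fn(w, X_b, y_b))
```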

The other one is ***logically equivalent*** to the
previous, but it updates the `weights` in ***one step***
and is called the `Stochastic Heavy Ball Method`:

$$
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
+ \beta ( w_k - w_{k-1})
$$

> [!NOTE]
> This is how to choose $\beta$:
>
> $0 < \beta < 1$
>
> If $\beta = 0$, then we are doing plain
> ***gradient descent***; if $\beta > 1$, then we
> ***will have numerical instabilities***.
>
> The ***larger*** $\beta$, the
> ***higher the `momentum`***, so it will
> ***turn slower***.

> [!TIP]
> Usual values are $\beta = 0.9$ or $\beta = 0.99$;
> usually we start from $0.5$, raising it
> whenever we are stuck.
>
> When we increase $\beta$, the `learning rate`
> ***must decrease accordingly***
> (e.g. going from $0.9$ to $0.99$, the `learning-rate` must be
> divided by a factor of $10$).

#### Nesterov (1983) Sutskever (2012) Accelerated Momentum

Differently from the previous
[momentum](#momentum-equations),
we take an ***intermediate*** step where we
***update the `weights`*** according to the
***previous `momentum`***, then we compute the
***new `momentum`*** at this new position, and then
we ***update again***:

$$
\begin{aligned}
\hat{w}_k &= w_k - \beta p_k \\
p_{k+1} &= \beta p_{k} +
    \eta \nabla L(X, y, \hat{w}_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
$$
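
The same sketch adapted to the look-ahead step (again with a hypothetical `grad_fn`):

```python
def nesterov_step(w, p, grad_fn, X, y, beta=0.9, eta=1.0, gamma=0.01):
    # Look ahead with the previous momentum before evaluating the gradient
    w_hat = w - beta * p
    # The new momentum uses the gradient at the look-ahead point
    p = beta * p + eta * grad_fn(w_hat, X, y)
    # Final position update
    w = w - gamma * p
    return w, p
```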

#### Why Momentum Works

While it has been ***hypothesized*** that
***acceleration*** makes ***convergence faster***, this
is
***only true for convex problems without much noise***,
though it may be ***part of the story***.

The other half may be ***noise smoothing***: the
momentum average smooths out noise in the optimization
process. However, according to these papers[^no-noise-smoothing][^no-noise-smoothing-2], this may not be the actual reason.

### Separate Adaptive Learning Rates

Since suitable `weights` may ***greatly vary*** across `layers`,
having a ***single `learning-rate`*** might not be ideal.

So the idea is to set a `local learning-rate` that
modulates the `global` one as a ***multiplicative factor***.

#### Local Learning Rates

- Start from $1$ as the ***starting point*** for the
  `local learning-rates`, which we'll call `gains` from
  now on.
- If the `gradient` keeps the ***same sign***, ***additively increase*** the `gain`
- Otherwise, ***multiplicatively decrease it***

In formulas (a code sketch follows them below):

$$
\Delta w_{i,j} = - g_{i,j} \cdot \eta \frac{
    d \, Out
}{
    d \, w_{i,j}
}
$$

$$
g_{i,j}(t) = \begin{cases}
    g_{i,j}(t - 1) + \delta
    & \text{if } \frac{
        d \, Out
    }{
        d \, w_{i,j}
    } (t)
    \cdot
    \frac{
        d \, Out
    }{
        d \, w_{i,j}
    } (t-1) > 0 \\[2ex]
    g_{i,j}(t - 1) \cdot (1 - \delta)
    & \text{otherwise}
\end{cases}
$$

With this method, if there are oscillations, the
`gains` will hover around $1$.
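
A rough sketch of this additive-increase / multiplicative-decrease rule with NumPy (element-wise over all weights; `grad_prev` holds the previous step's gradient, and the clipping range anticipates the tip below):

```python
import numpy as np

def update_gains(gains, grad, grad_prev, delta=0.05, lo=0.1, hi=10.0):
    # Same sign (positive product): add delta; otherwise multiply by (1 - delta)
    agree = (grad * grad_prev) > 0
    gains = np.where(agree, gains + delta, gains * (1 - delta))
    # Keep the gains in a sane range, as suggested below
    return np.clip(gains, lo, hi)

# Weight update is then: w -= eta * gains * grad
```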

> [!TIP]
>
> - Usually a good value for $\delta$ is $0.05$
> - Limit the `gains` to some range, e.g.:
>
>   - $[0.1, 10]$
>   - $[0.01, 100]$
>
> - Use `full-batches` or `big mini-batches` so that
>   the ***gradient*** doesn't oscillate because of
>   sampling errors
> - Combine it with [Momentum](#momentum)
> - Remember that ***adaptive `learning-rates`*** only deal
>   with ***axis-aligned*** effects

### rmsprop | Root Mean Square Propagation

#### rprop | Resilient Propagation[^rprop-torch]

This is basically the same idea as [separating learning rates](#separate-adaptive-learning-rates),
but in this case we don't use the
[AIMD](#local-learning-rates) technique and
***we don't take into account*** the
***magnitude of the gradient***, but ***only its sign***:

- If the ***gradient*** keeps the same sign:
  - $\text{step}_{k} = \text{step}_{k} \cdot \eta_+$, where $\eta_+ > 1$
- else:
  - $\text{step}_{k} = \text{step}_{k} \cdot \eta_-$,
    where $0 < \eta_- < 1$

A sketch of one rprop step is given below, after the notes.

> [!TIP]
>
> Limit the step sizes to a range where:
>
> - the upper bound is around $50$
> - the lower bound is around $10^{-6}$ (a millionth)

> [!CAUTION]
>
> rprop does ***not work*** with `mini-batches`, as
> the ***sign of the gradient changes frequently***
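
A minimal element-wise sketch of one rprop step under these rules ($\eta_+ = 1.2$, $\eta_- = 0.5$ and the step bounds match PyTorch's `Rprop` defaults; `grad_prev` is the previous gradient):

```python
import numpy as np

def rprop_step(w, step, grad, grad_prev,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    # Grow the step where the gradient kept its sign, shrink it elsewhere
    agree = (grad * grad_prev) > 0
    step = np.where(agree, step * eta_plus, step * eta_minus)
    step = np.clip(step, step_min, step_max)
    # Move by the step size, using only the sign of the gradient
    w = w - np.sign(grad) * step
    return w, step
```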

#### rmsprop in detail[^rmsprop-torch]

The idea is that [rprop](#rprop--resilient-propagation)
is ***equivalent to using the gradient divided by its
magnitude*** (as you only keep the sign, $1$ or $-1$);
however, this means that between `mini-batches` the
***divisor*** changes each time, oscillating.

The solution is to keep a ***running average*** of
the ***squared gradient for each `weight`***:

$$
MeanSquare(w, t) =
\alpha \, MeanSquare(w, t-1) +
(1 - \alpha)
\left(
    \frac{d \, Out}{d \, w}
\right)^2
$$

We then divide the ***gradient*** by the ***`square root`***
of that value.
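
A sketch of the resulting update ($\alpha = 0.99$ and $\epsilon$, the small constant that avoids division by zero, follow PyTorch's `RMSprop` defaults; `lr` is illustrative):

```python
import numpy as np

def rmsprop_step(w, mean_sq, grad, alpha=0.99, lr=0.01, eps=1e-8):
    # Running average of the squared gradient, per weight
    mean_sq = alpha * mean_sq + (1 - alpha) * grad ** 2
    # Divide the gradient by the square root of that average
    w = w - lr * grad / (np.sqrt(mean_sq) + eps)
    return w, mean_sq
```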

#### Further Developments

- `rmsprop` with `momentum` does not work as it should
- `rmsprop` with `Nesterov momentum` works best
  if used to divide the ***correction*** rather than
  the ***jump***
- `rmsprop` with `adaptive learning-rates` needs more
  investigation

### Fancy Methods

#### Adaptive Gradient

<!-- TODO: Expand over these -->

##### Convex Case

- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods

##### Non-Convex Case

Pay attention: here the `Hessian` may not be
`Positive Semi-Definite`, thus when the ***gradient*** is
$0$ we don't necessarily know where we are.

- Natural Gradient Methods
- Curvature Adaptive
  - [Adagrad](./Fancy-Methods/ADAGRAD.md)
  - [AdaDelta](./Fancy-Methods/ADADELTA.md)
  - [RMSprop](#rmsprop-in-detail)
  - [ADAM](./Fancy-Methods/ADAM.md)
  - L-BFGS
  - [heavy ball gradient](#momentum)
  - [momentum](#momentum)
- Noise Injection:
  - Simulated Annealing
  - Langevin Method

#### Adagrad

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAGRAD.md)

#### Adadelta

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADADELTA.md)

#### ADAM

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM.md)

#### AdamW

> [!NOTE]
> [Here in detail](./Fancy-Methods/ADAM-W.md)

#### LION

> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)

### Hessian Free

<!-- TODO: Add PDF 5 pg. 38 -->

[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)

[^no-noise-smoothing]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent | arXiv:2402.02325v4

[^no-noise-smoothing-2]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization | arXiv:2402.02325v1

[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)

[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)