# Optimization
## Beyond Full Batches
Even though full batches give the best picture of the probability distribution
of the data points, they are computationally expensive.

Since data is usually **highly redundant**, we can instead draw smaller,
class-balanced subsets, **mini-batches**, to update the weights.
While this doesn't give the same results as full batches, it is still reliable.

When we want to push this to the extreme, we can even update over a single
data point, **online learning**; however, it is not as efficient as
mini-batches because **it does not use matrix multiplications, which are GPU
efficient**.

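To make the batching concrete, here is a minimal sketch of a mini-batch SGD loop
in NumPy; the linear least-squares model and all the names in it are hypothetical
placeholders, chosen only to show the sampling and update logic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: linear least squares y = X @ w_true + noise
X = rng.normal(size=(1024, 16))
y = X @ rng.normal(size=16) + 0.1 * rng.normal(size=1024)

w = np.zeros(16)
lr, batch_size = 0.01, 32

for epoch in range(10):
    idx = rng.permutation(len(X))          # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient on the mini-batch
        w -= lr * grad
```
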
## Learning Rate Scheduling

## Xavier-Glorot Weight Initialization

> [!WARNING]
> Before Xavier-Glorot there was another initialization technique, proportional
> to fan-in:
>
> $$ W \propto \frac{rand(in, out)}{\sqrt{in}}$$
>
> Xavier-Glorot is not the only available initialization, though; there are
> many others[^torch-init]

Whenever we initialize weights, we need to be careful to **break symmetry**:
**identical hidden nodes get the exact same results**, making us
lose representation power.

Another problem with weight initialization is **overshooting**, which is
caused by **many small changes over the weights**. The idea to solve this is to
**initialize weights proportionally to fan-in (input) and fan-out (output)**

A technique we use to initialize weights comes from Xavier and Glorot, called
Xavier-Glorot initialization:

$$
\begin{aligned}
&W \propto \frac{rand(in, out)}{in + out} \\
&rand = \mathcal{U}(-a, a) \rightarrow a = g \cdot \sqrt{\frac{6}{in + out}} \\
&\,\,\,\,\text{or} \\
&rand = \mathcal{N}(0, \sigma^2) \rightarrow \sigma = g \cdot \sqrt{\frac{2}{in + out}}
\end{aligned}
$$

In other words, Xavier-Glorot draws weights from either a uniform distribution
or a normal one, scaled by a factor $g$ called the gain.

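As a sketch, here is the uniform variant implemented by hand, next to the
equivalent PyTorch built-in from the docs cited in the footnote:

```python
import math
import torch

def xavier_uniform(fan_in: int, fan_out: int, gain: float = 1.0) -> torch.Tensor:
    # a = g * sqrt(6 / (in + out)), then W ~ U(-a, a)
    a = gain * math.sqrt(6.0 / (fan_in + fan_out))
    return torch.empty(fan_out, fan_in).uniform_(-a, a)

g = torch.nn.init.calculate_gain("tanh")   # gain factor for a tanh non-linearity
W = xavier_uniform(256, 128, gain=g)

# Equivalent built-in:
W2 = torch.empty(128, 256)
torch.nn.init.xavier_uniform_(W2, gain=g)
```
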
[^torch-init]: [Pytorch Official Docs | `torch.nn.init` | 18th November 2025](https://docs.pytorch.org/docs/stable/nn.init.html)
## Momentum
> [!TIP]
> For $\beta$ going from 0.9 to 0.99, the learning rate needs to be decreased by
> a factor of 10

It's a technique inspired by physics. Imagine a ball rolling over a plane: once
it has enough speed, even if the plane changes inclination, the ball still has
energy to keep moving along the previous direction because of its momentum.

Whenever gradient descent oscillates, **momentum dampens** all the movements
steering us away from the previous direction. Here the momentum at time $k$
is $p_k$:

$$
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, Y, W_{k}) \\
W_{k+1} &= W_{k} - \gamma p_{k+1} \\
\beta &\in [0, 1]
\end{aligned}
$$

Or, in a more compact form, logically equivalent to the previous one:

$$
W_{k+1} = W_{k} - \gamma \nabla L(X, Y, W_{k}) + \beta(W_{k} - W_{k-1})
$$

The larger $\beta$, the slower it curves, accumulating more of the previous
directions. To play it safe, use smaller values at the beginning, where updates
are large, and slowly turn it up to values near 1

> [!NOTE]
>
> - $\eta$: hyperparameter related to the gradient, usually equal to the learning
>   rate
> - $\gamma$: learning rate
> - $\beta$: dampening factor hyperparameter
> - $\nabla L(X, Y, W_{k})$: gradient of the loss

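A minimal NumPy sketch of this update rule; the quadratic loss and its gradient
are hypothetical stand-ins for whatever $\nabla L$ is in your model.

```python
import numpy as np

def momentum_step(w, p, grad, gamma=0.01, beta=0.9, eta=1.0):
    """One momentum update: p <- beta p + eta grad, W <- W - gamma p."""
    p = beta * p + eta * grad
    w = w - gamma * p
    return w, p

# Hypothetical quadratic loss L(w) = 0.5 ||w||^2, whose gradient is w itself
w, p = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    w, p = momentum_step(w, p, grad=w)
```
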
## Nesterov Accelerated Gradient (aka NAG)

This method takes inspiration from Nesterov's optimization for convex functions
and applies it to momentum. Its quirk is that it never computes the gradient at
the point it lands on, but at a temporary estimate of the weights, before the
actual update.

|Vanilla Momentum[^Akshay-medium-1] | Nesterov Momentum[^Akshay-medium-1] |
|--|--|
|  |  |

To better illustrate this quirk, here's the formulation:

$$
\begin{aligned}
\hat{W}_{k} &= W_{k} - \beta p_k \\
p_{k+1} &= \beta p_{k} + \eta\nabla L(X, Y, \hat{W}_k) \\
W_{k+1} &= W_{k} - \gamma p_{k+1}
\end{aligned}
$$

As can be seen, the loss is computed over $\hat{W}_{k}$ rather than over $W_{k}$,
which will be our actual weights. The idea is to follow the previous momentum
blindly, see where it goes, and then make the correction.

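Extending the momentum sketch above, a NAG step might look like this (again with
the same hypothetical gradient):

```python
import numpy as np

def nag_step(w, p, grad_fn, gamma=0.01, beta=0.9, eta=1.0):
    w_hat = w - beta * p                  # follow the previous momentum blindly
    p = beta * p + eta * grad_fn(w_hat)   # gradient at the look-ahead point
    w = w - gamma * p                     # then make the correction
    return w, p

# Same toy quadratic as before: grad(w) = w
w, p = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    w, p = nag_step(w, p, lambda v: v)
```
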
[^Akshay-medium-1]: [Akshay L Chandra | Learning Parameters, Part 2: Momentum-Based & Nesterov Accelerated Gradient Descent | 18th November 2025](https://medium.com/data-science/learning-parameters-part-2-a190bef2d12)
## Justifying Faster Optimization for Momentum Based Methods
While many people justify the speed of momentum-based methods by acceleration,
this doesn't hold true, as acceleration is only guaranteed for convex functions.

Since, most of the time, we have no idea how our loss function looks, we can't
make assumptions about it being convex.

So the most compelling explanation lies in the fact that a momentum-based
optimization is like computing a running average of the loss gradient, smoothing
the noise introduced by the smaller sampling size. In fact, with momentum it is
not necessary to average steps like in SGD

## Separate Adaptive Learning Rate
The idea is that each weight of each layer may need its own learning rate to avoid
overshooting and to smooth the magnitude of the received gradients, which is high
over the last layers and low over the first ones (architecture-wise)

The trick is to have a global learning rate that is adjusted by a local gain,
increased each time the weight update keeps the same sign and decreased otherwise:

$$
\Delta w_{i,j} = - \eta \cdot g_{i,j} \frac{d \,Loss}{d \, w_{i,j}} \\
g_{i,j}(n + 1) = \begin{cases}
g_{i,j}(n) + 0.05 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) > 0 \\
g_{i,j}(n) \cdot 0.95 & \Delta w_{i,j}(n + 1) \cdot \Delta w_{i,j}(n) < 0
\end{cases}
$$

This method ensures that if a weight oscillates, the gain will dampen it.
Moreover, should the oscillation be totally random, the gain will hover near 1,
keeping gradient updates unchanged.

> [!NOTE]
> The way $g$ is updated is similar to AIMD in TCP congestion control

<!-- Comment for linter complaints -->

> [!TIP]
>
> - **Clip gains to some margins** - $[0.1, 10]$ or $[0.01, 100]$
> - **Use full batches or big mini-batches** - this ensures that a change in sign
>   is not due to sampling errors
> - **Combine this with momentum**
> - **Use this to deal with axis-alignment problems**

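A sketch of one update with the $0.05/0.95$ constants from the formulation above;
since $\Delta w$ always carries the sign of the gradient, comparing consecutive
gradient signs is equivalent to comparing consecutive $\Delta w$.

```python
import numpy as np

def adaptive_gain_step(w, gain, prev_grad, grad, eta=0.01):
    same_sign = grad * prev_grad > 0                       # elementwise sign agreement
    gain = np.where(same_sign, gain + 0.05, gain * 0.95)   # additive increase, multiplicative decrease
    gain = np.clip(gain, 0.1, 10.0)                        # clip gains to some margins
    w = w - eta * gain * grad                              # per-weight effective learning rate
    return w, gain, grad
```
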
## Resilient Backpropagation (aka RProp)
Instead of using the magnitude of the gradient, **RProp uses only its sign to
derive updates**, multiplied by a per-weight step value. Here's the
formulation[^florian-1]:

$$
w_{i,j}^{(n)} = w_{i,j}^{(n-1)} - s_{i,j}^{(n-1)} \cdot \text{sign}\left(
\frac{d \, Loss^{(n-1)}}{d \, w_{i,j}}
\right) \\
s_{i,j}^{(n)} = \begin{cases}
s_{i,j}^{(n - 1)} \cdot 1.2 &
\text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right)
\cdot
\text{sign}\left(\frac{d \, Loss^{(n-1)}}{d \, w_{i,j}}\right) > 0 \\
s_{i,j}^{(n - 1)} \cdot 0.5 &
\text{sign}\left(\frac{d \, Loss^{(n)}}{d \, w_{i,j}}\right)
\cdot
\text{sign}\left(\frac{d \, Loss^{(n-1)}}{d \, w_{i,j}}\right) < 0
\end{cases} \\
s_{i,j} \in [10^{-6}, 50]
$$

It is noticeable that, like
[separate adaptive learning rates](#separate-adaptive-learning-rate), it increases
or decreases the gain. However, since it uses multiplication to increase it, its
fast growth makes it unusable for anything but full batches.

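A NumPy sketch of the update, with the $1.2/0.5$ factors and the step bounds
above (full-batch gradients assumed):

```python
import numpy as np

def rprop_step(w, step, prev_grad, grad):
    agreement = np.sign(grad) * np.sign(prev_grad)
    step = np.where(agreement > 0, step * 1.2, step)   # same sign: grow the step
    step = np.where(agreement < 0, step * 0.5, step)   # sign flip: shrink the step
    step = np.clip(step, 1e-6, 50.0)                   # s in [1e-6, 50]
    w = w - step * np.sign(grad)                       # only the sign of the gradient is used
    return w, step, grad
```
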
[^florian-1]: [Florian | RProp | 19th november 2025](https://florian.github.io/rprop/)
## Root Mean Square Propagation (aka RMSProp)
As the name implies, it propagates the loss over time, a bit like momentum. Since
[RProp](#resilient-backpropagation-aka-rprop) uses only the sign of the gradient,
it's almost like dividing the gradient by its magnitude, which is bad in case of
mini-batches, as every batch gets a different divisor.

RMSProp solves this by keeping the gradient magnitude similar across mini-batches
through a running average of it:

$$
L^{(k)} = \beta L^{(k-1)} + (1 - \beta) \left(
\frac{d \, Loss}{d\, W^{(k-1)}}
\right)^2 \\
W^{(k)} = W^{(k-1)} - \eta \frac{1}{\sqrt{L^{(k)}}} \frac{d \, Loss}{d\, W^{(k-1)}} \\
\text{usually } \beta = 0.9
$$

What this method does is keep a running average of the mean square of the
gradient, hence the name, and use it to normalize the gradient, keeping it
similar across mini-batches.

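A minimal sketch of the update in NumPy (the small `eps` in the denominator is an
implementation detail, added here to avoid division by zero):

```python
import numpy as np

def rmsprop_step(w, avg_sq, grad, eta=0.001, beta=0.9, eps=1e-8):
    avg_sq = beta * avg_sq + (1 - beta) * grad**2    # running average of the squared gradient
    w = w - eta * grad / (np.sqrt(avg_sq) + eps)     # normalize by the RMS of past gradients
    return w, avg_sq
```
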
> [!NOTE]
> While it can be used with momentum, that doesn't seem to add as much benefit as
> using it standalone.
>
> With Nesterov, it works best if used to normalize the correction rather than
> the jump. As for the separate adaptive learning rates, further investigation is
> still required to prove the efficacy.

## Adaptive Gradient Methods
<!--
MARK: AdaGrad
-->

### AdaGrad[^adagrad-torch]
`AdaGrad` is an ***optimization method*** aimed to:

<ins>***"find needles in the haystack in the form of
very predictive yet rarely observed features"***[^adagrad-official-paper]</ins>

`AdaGrad`, as opposed to standard `SGD`, which is the
***same for every gradient geometry***, tries to
***incorporate the geometry seen in earlier iterations***.

#### AdaGrad Algorithm
`AdaGrad` takes the following
approach[^anelli-adagrad-2][^adagrad-official-paper]:

$$
\begin{aligned}
g_{i}^{(k + 1)} &= \frac{d \, Loss}{d \, w_{i}^{(k)}} \\
G^{(k + 1)} &= \sum_{\tau = 1}^{k + 1} g^{(\tau)} g^{(\tau)T} \\
w_{i}^{(k + 1)} &= w_{i}^{(k)} - \eta \cdot \frac{1}{\sqrt{G_{i,i}^{(k + 1)} + \epsilon}} \cdot g_{i}^{(k+1)}
\end{aligned}
$$

Here $G^{(k)}$ is the ***sum of outer products*** of the
***gradients*** up to time $k$. The full $G$ is ***usually
not used***, as it is ***impractical because
of the high number of dimensions***, so we use
$diag(G)$, which can be
***computed in linear time***[^adagrad-official-paper]

The $\epsilon$ term here is used to
***avoid dividing by 0***[^anelli-adagrad-2] and has a
small value, usually in the order of $10^{-8}$

> [!NOTE]
>
> This example is tough to understand if we were to apply it to a matrix $W$
> instead of a vector. To make it easier to understand, in matrix notation:

$$
\begin{aligned}
\nabla L^{(k + 1)} &= \frac{d \, Loss^{(k)}}{d \, W^{(k)}} \\
G^{(k + 1)} &= G^{(k)} + (\nabla L^{(k+1)})^2 \\
W^{(k+1)} &= W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}{\sqrt{G^{(k+1)} + \epsilon}}
\end{aligned}
$$

> [!NOTE]
> In other words, compute the gradient and scale it by the sum of its squares
> up to that point

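A sketch of the diagonal (matrix-notation) variant above:

```python
import numpy as np

def adagrad_step(w, G, grad, eta=0.01, eps=1e-8):
    G = G + grad**2                        # accumulate squared gradients over all steps
    w = w - eta * grad / np.sqrt(G + eps)  # rare features keep a larger effective rate
    return w, G
```
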
#### AdaGrad effectiveness[^anelli-adagrad-3]
- When we have ***many dimensions, many features are irrelevant***
- ***Rarer features are more relevant***
- It adapts $\eta$ to the right metric space
  by projecting stochastic gradient updates with the
  [Mahalanobis norm](https://en.wikipedia.org/wiki/Mahalanobis_distance),
  a distance of a point from a probability distribution.

#### AdaGrad Considerations
- It eliminates the need to manually tune the
  `learning rate`, which is usually set to $0.01$
- The squared ***gradients*** are accumulated during
  the iterations, making the `learning-rate` become
  ***smaller and smaller***, eventually reaching 0 and leaving the network
  untrainable

<!-- Footnotes -->
[^adagrad-official-paper]: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization](https://web.stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf)

[^adagrad-torch]: [Adagrad | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)

[^anelli-adagrad-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 43

[^anelli-adagrad-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 44

### AdaDelta[^adadelta-offcial-paper]
`ADADELTA` was inspired by [`AdaGrad`](#adagrad) and
created to address some of its problems, like
***sensitivity to initial `parameters` and the corresponding
gradient***[^adadelta-offcial-paper]

To address all these problems, `ADADELTA` accumulates
***gradients over a `window`, as a running average***, rather than ***accumulating
them over all iterations***:

$$
G^{(k+1)} = \beta \cdot G^{(k)} +
(1 - \beta) \cdot (\nabla L^{(k+1)})^2
$$

The update, which is very similar to the one in
[AdaGrad](#adagrad-algorithm), becomes:

$$
W^{(k+1)} = W^{(k)} - \eta \frac{\nabla L^{(k + 1)}}{\sqrt{G^{(k+1)} + \epsilon}}
$$

Technically speaking, the last equation is basically equivalent to the
[RMSProp](#root-mean-square-propagation-aka-rmsprop) one, as $G$ is
equivalent to the running average of the mean square.

However, as the author pointed out[^adadelta-units], this equation does not
respect units of measure. We can correct this problem
by ***considering the curvature locally smooth*** and
taking an approximation of it at the next step, becoming:

$$
\begin{aligned}
\Delta W^{(k)} &= - \frac{\sqrt{S^{(k-1)}}}{\sqrt{G^{(k)}}} \nabla L^{(k)} \\
S^{(k)} &= \beta S^{(k - 1)} + (1 - \beta) (\Delta W^{(k)})^2 \\
W^{(k + 1)} &= W^{(k)} + \Delta W^{(k)}
\end{aligned}
$$

As we can notice, the ***`learning rate` completely
disappears from the equation, eliminating the need to
set one***

> [!WARNING]
> Here $\Delta W$ is already negative, which is why there's a $+$ in the last
> equation

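A sketch of the full, learning-rate-free variant; adding `eps` inside both square
roots is an assumption borrowed from the paper's practical form:

```python
import numpy as np

def adadelta_step(w, G, S, grad, beta=0.9, eps=1e-6):
    G = beta * G + (1 - beta) * grad**2                    # running average of squared gradients
    delta = -(np.sqrt(S + eps) / np.sqrt(G + eps)) * grad
    S = beta * S + (1 - beta) * delta**2                   # running average of squared updates
    w = w + delta                                          # delta is already negative
    return w, G, S
```
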
<!-- Footnotes -->
[^adadelta-offcial-paper]: [Official ADADELTA Paper | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
[^adadelta-units]: [Official ADADELTA Paper | Paragraph 3.2 Idea 2: Correct Units with Hessian Approximation | arXiv:1212.5701v1](https://arxiv.org/pdf/1212.5701)
### Adaptive Moment Estimation (aka AdaM)
AdaM computes both the momentum and the squared gradients with running
averages, which are zero-initialized at time $k = 0$:

$$
\begin{aligned}
M^{(k+1)} &= \beta_1 M^{(k)} + (1 - \beta_1) \nabla L \\
V^{(k+1)} &= \beta_2 V^{(k)} + (1 - \beta_2) \nabla L^2
\end{aligned}
$$

> [!WARNING]
> The squared gradient can be thought of as the variance; however, it's not
> centered

Then it corrects them for the bias introduced by the zero initialization, to be
used in the final formulation:

$$
\begin{aligned}
\hat{M}^{(k+1)} &= \frac{M^{(k+1)}}{1 - \beta_1^{k + 1}} \\
\hat{V}^{(k+1)} &= \frac{V^{(k+1)}}{1 - \beta_2^{k + 1}}
\end{aligned}
$$

> [!WARNING]
> $\beta_1$ and $\beta_2$ are raised to the power of $k + 1$, the timestep.

Then it computes the update in this way:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon}
$$

Even though Adam works, it doesn't generalize well and, particularly on image
problems, it performs worse than standard SGD. Moreover, we need to keep 3 buffers
instead of 1 as for SGD, 2 of which need parameter tuning.

> [!NOTE]
> Author-proposed values are $\beta_1 = 0.9$, $\beta_2 = 0.999$ and
> $\epsilon = 10^{-8}$

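Putting the pieces together, a minimal Adam step in NumPy with the
author-proposed defaults:

```python
import numpy as np

def adam_step(w, M, V, grad, k, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at timestep k (k starts at 1)."""
    M = beta1 * M + (1 - beta1) * grad       # running average of the gradient
    V = beta2 * V + (1 - beta2) * grad**2    # running average of the squared gradient
    M_hat = M / (1 - beta1**k)               # correct the zero-initialization bias
    V_hat = V / (1 - beta2**k)
    w = w - eta * M_hat / (np.sqrt(V_hat) + eps)
    return w, M, V
```
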
### AdamW
AdamW tries to solve AdaM's problems by introducing weight decay. In all honesty,
AdaM already implements it; however, it is usually added to the momentum term,
getting scaled by $\sqrt{\hat{V}}$:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)} + \alpha W^{(k)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon}
$$

The AdamW authors saw that this was inefficient, as the decay was influenced by
the uncentered variance, and thus modified the formula to this:

$$
W^{(k + 1)} = W^{(k)} - \eta \frac{\hat{M}^{(k + 1)}}
{\sqrt{\hat{V}^{(k+1)}} + \epsilon} - \lambda W^{(k)}
$$

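Compared to the Adam sketch above, only the last line changes; the decay term is
applied directly to the weights, decoupled from the adaptive scaling (`lam` is
the decoupled weight-decay coefficient):

```python
import numpy as np

def adamw_step(w, M, V, grad, k, eta=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.01):
    M = beta1 * M + (1 - beta1) * grad
    V = beta2 * V + (1 - beta2) * grad**2
    M_hat = M / (1 - beta1**k)
    V_hat = V / (1 - beta2**k)
    # Decoupled weight decay: not scaled by sqrt(V_hat)
    w = w - eta * M_hat / (np.sqrt(V_hat) + eps) - lam * w
    return w, M, V
```
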
### Lion (evoLved sIgn mOmeNtum)[^official-paper]
`Lion` is the result of a ***genetic search algorithm*** aimed at
finding the best `optimizer`.

It starts from a population of `AdamW` algorithms to
***speed up the search***. As opposed to
`Adam` and `AdamW`, it keeps track
***only of the momentum*** and the ***gradient sign***,
requiring ***less `memory`***.

Since ***uniform updates yield larger norms***,
`Lion` requires a ***smaller `learning-rate`***
and a ***larger decoupled `weight-decay`***
$\lambda$[^official-paper-1].

The ***advantages of `Lion` over `Adam` and `AdamW`
increase with the size of
the `mini-batch`***[^official-paper-1]

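For reference, a sketch of the update rule the search discovered, as described in
the paper cited above (the default $\beta$ values are the paper's):

```python
import numpy as np

def lion_step(w, m, grad, eta=1e-4, beta1=0.9, beta2=0.99, lam=0.01):
    update = np.sign(beta1 * m + (1 - beta1) * grad)  # only the sign is used
    w = w - eta * (update + lam * w)                  # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad                # momentum is the only state kept
    return w, m
```
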
#### Symbolic Representation[^official-paper-2]
New ***trained algorithms*** are represented
***symbolically***, bringing these advantages:

- `Algorithms` must be ***implemented*** as `programs`
- These `algorithms` are ***easier to analyze, comprehend and transfer to
  new tasks*** than other representations such as `NeuralNetworks`
- We can ***estimate the complexity*** by looking
  at the ***length of the code***

#### Tournament[^official-paper-3]
The best code is found with a ***tournament-style
evolution***. Each cycle picks the ***best
`algorithm`***, which is then
***copied and mutated***, while the ***oldest is removed***

<!-- Footnotes -->
[^official-paper]: [Official Lion Paper | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-1]: [Official Lion Paper | Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-2]: [Official Lion Paper | Paragraph 1 pg. 3 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

[^official-paper-3]: [Official Lion Paper | Paragraph 2 pg. 4-5 | arXiv:2302.06675v4](https://arxiv.org/pdf/2302.06675)

## Hessian Free Optimization
Since we are moving on a function whose gradient is not constant, by looking at
the curvature, the [Hessian Matrix](./../15-Appendix-A/INDEX.md#hessian-matrix),
we can see when it starts to change.

### Newton's Method
This method would technically give us the solution in one step on a quadratic
function, but it is unfeasible due to its memory and computational requirements:

$$
\Delta W = - \epsilon H(W)^{-1} \times \frac{d\, L}{d\, W}
$$

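To see the one-step claim, a toy example on a 2D quadratic with $\epsilon = 1$
(solving the linear system instead of explicitly inverting $H$):

```python
import numpy as np

# Hypothetical quadratic loss L(w) = 0.5 w^T A w - b^T w,
# whose gradient is A w - b and whose Hessian is A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

w = np.zeros(2)
grad = A @ w - b
w = w - np.linalg.solve(A, grad)  # one Newton step: solve H dw = grad

assert np.allclose(A @ w, b)      # already at the exact minimum
```
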
### Conjugate Gradient
The idea is to correct the weights so that we reduce the gradient to 0 across
perpendicular directions. This means that, with each update, we are not messing up
the previous optimizations.

While it is usually used for quadratic error surfaces, there's a non-linear variant
(non-linear conjugate gradient) that usually works well. However, it is also
possible to approximate the true error function with a quadratic one and use the
standard method.

It gives a solution after $N$ steps over an $N$-dimensional quadratic surface;
however, we need to penalize frequent changes in weights, especially for the hidden
activities of [`RNNs`](./../8-Recurrent-Networks/INDEX.md)

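A sketch of the standard (quadratic) method, which reaches the exact minimum of
an $N$-dimensional quadratic in $N$ steps:

```python
import numpy as np

def conjugate_gradient(A, b, x, iters):
    """Minimize 0.5 x^T A x - b^T x for a symmetric positive-definite A."""
    r = b - A @ x              # residual = negative gradient
    d = r.copy()               # first search direction
    for _ in range(iters):
        alpha = (r @ r) / (d @ A @ d)      # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        beta = (r_new @ r_new) / (r @ r)   # keeps d conjugate to previous directions
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = conjugate_gradient(A, b, np.zeros(2), iters=2)  # N = 2 steps suffice
```
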
## Optimization Tricks
### Input decorrelation
If you have a linear neuron (think of a feed-forward layer, not a convolution),
it's better to decorrelate the input components.

A way to achieve this is through a
[PCA](./../15-Appendix-A/INDEX.md#computing-pca),
transforming the error surface from an ellipse into a circle.

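A sketch of PCA whitening, i.e. decorrelation plus variance normalization, which
is what turns the elliptical error surface into a circular one:

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Rotate inputs onto the principal components and scale them to unit variance."""
    Xc = X - X.mean(axis=0)                                  # center the data
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # covariance eigendecomposition
    Z = Xc @ evecs                                           # decorrelated components
    return Z / np.sqrt(evals + eps)                          # equal variance in every direction
```
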
### Recognize Plateaus
If we start with big learning rates, the weights gain a big magnitude, the
derivatives become small, and the error stops decreasing significantly.

This may seem like a local minimum, but it is usually a plateau.

### Mini-Batch Speed up
To speed up mini-batch training, use these methods:

- [**Momentum**](#momentum)
- [**Separate adaptive learning rates for each parameter**](#separate-adaptive-learning-rate)
- [**RMSProp**](#root-mean-square-propagation-aka-rmsprop)
- [**Adaptive Gradient Methods**](#adaptive-gradient-methods)

### Mini-batches vs Full-Batches
The rule of thumb is to use **full-batches for small datasets or datasets with
little redundancy**, and **mini-batches for redundant datasets**