Optimization
We essentially observe the error and minimize it by moving against the gradient, i.e. along the direction of steepest descent
Types of Learning Algorithms
In Deep Learning it's not unusual to be facing highly redundant datasets.
Because of this, the gradient computed on some samples is often nearly the same as on others.
So, often we train the model on a subset of samples.
Online Learning
This is the extreme end of our techniques to deal with redundancy of data:
for each single data point we compute the gradient and immediately update the weights.
Mini-Batch
In this approach, we divide our dataset into small batches called mini-batches.
These should be balanced, so that each batch is roughly representative of the whole dataset.
This technique is the most widely used one.
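A minimal sketch of such a mini-batch loop, under my own assumptions (the least-squares setup, `grad_fn`, and all names here are illustrative, not from the original notes):

```python
import numpy as np

def minibatch_sgd(X, y, w, grad_fn, lr=0.01, batch_size=32, epochs=10):
    """Plain mini-batch SGD: shuffle, slice into batches, update.
    `grad_fn` is assumed to return dL/dw on the given batch."""
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad_fn(X[batch], y[batch], w)
    return w

# Toy usage: mean-squared-error gradient for linear regression
X = np.random.randn(1000, 5)
true_w = np.arange(5.0)
y = X @ true_w
mse_grad = lambda Xb, yb, w: 2 * Xb.T @ (Xb @ w - yb) / len(yb)
w = minibatch_sgd(X, y, np.zeros(5), mse_grad, lr=0.05, epochs=50)
```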
Tips and Tricks
Learning Rate
This is the hyperparameter we use to tune the size of our
learning steps.
If it's too big, it causes overshooting, so a quick solution may be to turn it down.
However, doing so trades speed for accuracy, so it's better to wait before tuning this parameter.
Weight initialization
We need to avoid neurons having the same
gradient. This is easily achieved by using
small random values.
However, if we have a large fan-in, it's
easy to overshoot, so it's better to scale
those weights by the inverse of
\sqrt{\text{fan-in}}:
w = \frac{
\texttt{np.random.randn}(N)
}{
\sqrt{N}
}
where N is the fan-in.
Xavier-Glorot Initialization
Here weights are also scaled by the inverse square root of the fans, but we sample from a
uniform distribution whose standard deviation is
\sigma = \text{gain} \cdot \sqrt{
\frac{
2
}{
\text{fan-in} + \text{fan-out}
}
}
and the samples are bounded between -a and a, where
a = \text{gain} \cdot \sqrt{
\frac{
6
}{
\text{fan-in} + \text{fan-out}
}
}
Alternatively, one can sample from a normal distribution
\mathcal{N}(0, \sigma^2).
Note that in the original paper the gain is equal to 1.
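A minimal NumPy sketch of these initializations, assuming the formulas above (function names are mine; in practice one would typically use a framework's built-ins, e.g. PyTorch's `torch.nn.init.xavier_uniform_`):

```python
import numpy as np

def simple_init(fan_in, fan_out):
    # Small random values scaled by 1/sqrt(fan-in), so a large fan-in
    # doesn't make the pre-activations (and gradients) blow up.
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

def xavier_uniform(fan_in, fan_out, gain=1.0):
    # Xavier/Glorot: uniform in [-a, a] with a = gain * sqrt(6/(fan_in+fan_out))
    a = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-a, a, size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out, gain=1.0):
    # Normal variant: N(0, sigma^2) with sigma = gain * sqrt(2/(fan_in+fan_out))
    sigma = gain * np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * sigma
```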
Decorrelating input components
Since highly correlated features don't offer much
in terms of new information, we probably need
to move to the latent space to find the
latent variables governing those features.
PCA
Caution
This topic won't be explained here, as it's usually learnt in
Machine Learning, a prerequisite for approaching Deep Learning.
PCA is a method we can use to discard components that
add little to no information.
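A minimal PCA sketch via SVD, assuming we just want to keep the top-k components (a library implementation such as scikit-learn's `PCA` would be the usual choice):

```python
import numpy as np

def pca(X, k):
    """Project onto the k directions of largest variance; the discarded
    components are assumed to add little information."""
    Xc = X - X.mean(axis=0)                         # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                            # top-k projection
```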
Common problems in MultiLayer Networks
Hitting a Plateau
This happens when we have a big learning rate,
which makes the weights grow large in absolute value.
Because this happens too quickly, we may see the error diminish fast and mistake it for a minimum point, while instead it's a plateau.
Speeding up Mini-Batch Learning
Momentum[^1]
We use this method mainly when we use SGD as
the learning technique.
This method is better explained if we imagine our error surface as an actual surface and place a ball on it.
The ball will start rolling towards the steepest descent (initially), but after gaining enough velocity it will follow, to some degree, its previous direction.
So the gradient now modifies the velocity rather than the position, and the momentum dampens small variations.
Moreover, once the momentum builds up, we will easily pass over plateaus, as the ball will continue to roll until it is stopped by an opposing gradient.
Momentum Equations
There are mainly a couple of them.
One of them uses a term p to track the momentum,
called the SGD momentum, momentum term, or
momentum parameter:
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
The other one is logically equivalent to the
previous, but it updates the weights in one step
and is called the Stochastic Heavy Ball Method:
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
+ \beta ( w_k - w_{k-1})
Note
This is how to choose \beta: 0 < \beta < 1.
If \beta = 0, we are doing plain gradient descent; if \beta > 1, we get numerical instabilities.
The larger \beta, the higher the momentum, and the slower the update direction turns.
Tip
Usual values are \beta = 0.9 or \beta = 0.99; usually we start from 0.5 and raise it whenever we get stuck.
When we increase \beta, the learning rate must decrease accordingly (e.g. going from \beta = 0.9 to 0.99, the learning rate must be divided by a factor of 10).
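A minimal sketch of the two-step formulation above; `grad_fn` and `data_iter` are assumed helpers returning the mini-batch gradient and the next mini-batch:

```python
import numpy as np

def sgd_momentum(w, grad_fn, data_iter, beta=0.9, eta=1.0, gamma=0.01, steps=100):
    """Momentum form from the equations above:
    p <- beta * p + eta * grad(L);  w <- w - gamma * p."""
    p = np.zeros_like(w)
    for _ in range(steps):
        X, y = next(data_iter)
        p = beta * p + eta * grad_fn(X, y, w)  # the gradient modifies the velocity...
        w = w - gamma * p                      # ...and the velocity moves the weights
    return w
```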
Nesterov (1983) / Sutskever (2012) Accelerated Momentum
Differently from the previous
momentum,
we take an intermediate step: we first
update the weights according to the
previous momentum, then we compute the
new momentum at this new position, and finally
we update again:
\begin{aligned}
\hat{w}_k & = w_k - \beta p_k \\
p_{k+1} &= \beta p_{k} +
\eta \nabla L(X, y, \hat{w}_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
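The same sketch adapted to the Nesterov equations above, with the gradient evaluated at the look-ahead point \hat{w} (same assumed helpers as before):

```python
import numpy as np

def nesterov_momentum(w, grad_fn, data_iter, beta=0.9, eta=1.0, gamma=0.01, steps=100):
    """Nesterov/Sutskever momentum: evaluate the gradient at the
    look-ahead point w_hat = w - beta * p, then update the velocity."""
    p = np.zeros_like(w)
    for _ in range(steps):
        X, y = next(data_iter)
        w_hat = w - beta * p                        # intermediate step
        p = beta * p + eta * grad_fn(X, y, w_hat)   # momentum at the look-ahead point
        w = w - gamma * p                           # final update
    return w
```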
Why Momentum Works
While it has been hypothesized that acceleration makes convergence faster, this is only true for convex problems without much noise, so it may be just part of the story.
The other half may be Noise Smoothing, i.e. momentum smoothing the optimization process; however, according to these papers[^2][^3], this may not be the actual reason.
Separate Adaptive Learning Rates
Since weights may vary greatly across layers,
having a single learning rate might not be ideal.
So the idea is to set a local learning rate that
modulates the global one as a multiplicative factor.
Local Learning rates
- Start from 1 as the initial value for the local learning rates, which we'll call gains from now on.
- If the gradient keeps the same sign, increase the gain additively.
- Otherwise, decrease it multiplicatively.
\Delta w_{i,j} = - \eta \, g_{i,j} \cdot \frac{
d \, Out
}{
d \, w_{i,j}
}
\\
g_{i,j}(t) = \begin{cases}
g_{i,j}(t - 1) + \delta
& \text{if } \frac{d \, Out}{d \, w_{i,j}}(t)
\cdot
\frac{d \, Out}{d \, w_{i,j}}(t-1) > 0 \\
g_{i,j}(t - 1) \cdot (1 - \delta)
& \text{otherwise}
\end{cases}
With this method, if there are oscillations, we will have
gains around 1
Tip
- A usual value for \delta is 0.05.
- Limit the gains to a range such as [0.1, 10] or [0.01, 100].
- Use full batches or big mini-batches, so that the gradient doesn't change sign merely because of sampling errors.
- Combine it with Momentum.
Remember that adaptive learning rates only deal
with axis-aligned effects. A minimal sketch of the gain update follows.
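This sketch implements the equations above; the clipping range [0.1, 10] follows the tip, and all names are illustrative:

```python
import numpy as np

def adaptive_gains_step(w, gain, grad, prev_grad, eta=0.01, delta=0.05):
    """One per-weight adaptive-gain update: additive increase while the
    gradient keeps its sign, multiplicative decrease when it flips."""
    same_sign = grad * prev_grad > 0
    gain = np.where(same_sign, gain + delta, gain * (1 - delta))
    gain = np.clip(gain, 0.1, 10.0)   # limit gains as suggested in the tip
    w = w - eta * gain * grad         # the gain multiplies the global learning rate
    return w, gain
```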
rmsprop | Root Mean Square Propagation
rprop | Resilient Propagation[^4]
This is basically the same idea of separate learning rates, but here we don't use the AIMD (additive increase, multiplicative decrease) scheme, and we don't take into account the magnitude of the gradient, only its sign:
- If the gradient keeps the same sign: step_{k+1} = step_k \cdot \eta_+ where \eta_+ > 1
- Otherwise: step_{k+1} = step_k \cdot \eta_- where 0 < \eta_- < 1
Tip
Limit the step sizes to a range, e.g. smaller than 50 and larger than a millionth (10^{-6}).
Caution
rprop does not work with
mini-batches, as the sign of the gradient changes too frequently across batches.
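A minimal rprop sketch; \eta_+ = 1.2 and \eta_- = 0.5 are commonly used defaults (an assumption, not stated in the notes), and the step-size clamp follows the tip above:

```python
import numpy as np

def rprop_step(w, step, grad, prev_grad, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One rprop update: only the sign of the gradient is used; the
    per-weight step grows by eta_plus while the sign is stable and
    shrinks by eta_minus when it flips."""
    same_sign = grad * prev_grad > 0
    step = np.where(same_sign, step * eta_plus, step * eta_minus)
    step = np.clip(step, step_min, step_max)  # clamp per the tip above
    w = w - np.sign(grad) * step
    return w, step
```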
rmsprop in detail[^5]
The idea is that rprop
is equivalent to dividing the gradient by its
magnitude (you effectively multiply by ±1);
however, this means that between mini-batches the
divisor changes each time, oscillating.
The solution is to keep a running average of
the squared gradient for
each weight:
MeanSquare(w, t) =
\alpha \, MeanSquare(w, t-1) +
(1 - \alpha)
\left(
\frac{d\, Out}{d\, w}
\right)^2
We then divide the gradient by the square root
of that value
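A minimal rmsprop sketch of the two steps above; the small `eps` in the denominator is a standard numerical-stability trick I've added, not part of the formula in the notes:

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, lr=0.001, alpha=0.9, eps=1e-8):
    """One rmsprop update: exponential running average of the squared
    gradient, then divide the gradient by its square root."""
    mean_square = alpha * mean_square + (1 - alpha) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)  # eps avoids division by zero
    return w, mean_square
```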
Further Developments
- rmsprop with momentum does not work as well as it should
- rmsprop with Nesterov momentum works best if used to divide the correction rather than the jump
- rmsprop with adaptive learning rates needs more investigation
Fancy Methods
Adaptive Gradient
Convex Case
- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods
Non-Convex Case
Pay attention: here the Hessian may not be
positive semi-definite, so when the gradient is
0 we don't necessarily know where we are (minimum, maximum, or saddle point).
- Natural Gradient Methods
- Curvature Adaptive
- Noise Injection:
- Simulated Annealing
- Langevin Method
Adagrad
Adadelta
ADAM
AdamW
LION
Hessian Free[^6]
How much can we learn from a given
Loss surface?
The best way to move would be along the gradient, assuming the curvature is the same in every direction (e.g. the surface is a circular bowl with a local minimum).
But usually this is not the case, so we should move where the ratio of gradient to curvature is high.
Newton's Method
This method takes into account the curvature
of the Loss
With this method, the update would be:
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{d \, E}{d \, \vec{w}}
If this were feasible we would reach the minimum in
one step, but it isn't, as the number of terms in the
Hessian grows quadratically with the number of
weights, and inverting it is even more expensive.
The thing is that whenever we update weights with
the Steepest Descent method, each update messes up
the others, while the curvature can help scale
these updates so that they do not disturb each other.
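A minimal sketch of one Newton step on a toy quadratic error, where the method does reach the minimum in a single step as stated above (the matrices here are illustrative):

```python
import numpy as np

def newton_step(w, grad, hessian, eps=1.0):
    """Newton update: scale the gradient by the inverse Hessian so the
    step accounts for curvature. Solving the linear system is cheaper
    and more stable than explicitly inverting H."""
    return w - eps * np.linalg.solve(hessian, grad)

# Toy quadratic E(w) = 0.5 w^T A w - b^T w, so grad = A w - b and H = A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w = np.zeros(2)
w = newton_step(w, A @ w - b, A)   # one step lands on the minimum A^{-1} b
```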
Curvature Approximations
However, since the Hessian is
too expensive to compute, we can approximate it.
- We can take only the diagonal elements
- Other algorithms (e.g. Hessian Free)
- Conjugate Gradient to minimize the approximation error
Conjugate Gradient78
Caution
This is an oversimplification of the topic, so reading the material in the footnotes is greatly advised.
The basic idea is that, in order not to mess up previous
directions, we optimize along conjugate directions: moving along a new direction must not change the gradient along the previously optimized ones.
This method is mathematically guaranteed to reach the minimum of a quadratic surface after N steps, N being the dimension of the space; in practice the error becomes minimal after far fewer steps.
This method works well for non-quadratic errors too,
and the Hessian Free optimizer uses this method
on genuinely quadratic surfaces, namely
quadratic approximations of the real surface.
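A minimal conjugate gradient sketch for a quadratic error \frac{1}{2} x^T A x - b^T x with A symmetric positive definite (this is the standard Fletcher-Reeves form; names and the tolerance are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, x=None, tol=1e-10):
    """Minimize 0.5 x^T A x - b^T x along conjugate directions:
    converges in at most N = len(b) steps exactly, far fewer in practice."""
    x = np.zeros_like(b) if x is None else x
    r = b - A @ x                        # residual = negative gradient
    d = r.copy()                         # first direction: steepest descent
    for _ in range(len(b)):
        if r @ r < tol:
            break
        alpha = r @ r / (d @ A @ d)      # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        beta = (r_new @ r_new) / (r @ r) # Fletcher-Reeves coefficient
        d = r_new + beta * d             # new direction, conjugate to the old ones
        r = r_new
    return x
```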
[^2]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent, arXiv:2402.02325v4
[^3]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization, arXiv:2402.02325v1
[^5]: RMSprop, Official PyTorch Documentation, 19 April 2025
[^6]: Vito Walter Anelli, Deep Learning Material 2024/2025, PDF 5, pp. 67-81
[^7]: Vito Walter Anelli, Deep Learning Material 2024/2025, PDF 5, pp. 74-76