Optimization
We essentially observe the error and minimize it by moving against the gradient, i.e. along the direction of steepest descent
Types of Learning Algorithms
In Deep Learning it's not unusual to be facing highly redundant datasets.
Because of this, the gradient computed on some samples is often nearly the same as on others.
So, often we train the model on a subset of samples.
Online Learning
This is the extreme end of our techniques to deal with redundancy of data:
for each single data point we compute the gradient and immediately update the weights.
Mini-Batch
In this approach, we divide our dataset into small batches called mini-batches.
These should be balanced, so that each batch is roughly representative of the whole dataset.
This technique is the most widely used one.
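A minimal sketch of such a mini-batch loop, under my own assumptions (the least-squares setup, `grad_fn`, and all names here are illustrative, not from the original notes):

```python
import numpy as np

def minibatch_sgd(X, y, w, grad_fn, lr=0.01, batch_size=32, epochs=10):
    """Plain mini-batch SGD: shuffle, slice into batches, update.
    `grad_fn` is assumed to return dL/dw on the given batch."""
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad_fn(X[batch], y[batch], w)
    return w

# Toy usage: mean-squared-error gradient for linear regression
X = np.random.randn(1000, 5)
true_w = np.arange(5.0)
y = X @ true_w
mse_grad = lambda Xb, yb, w: 2 * Xb.T @ (Xb @ w - yb) / len(yb)
w = minibatch_sgd(X, y, np.zeros(5), mse_grad, lr=0.05, epochs=50)
```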
Tips and Tricks
Learning Rate
This is the hyperparameter we use to tune the size of our
learning steps.
If it's too big, it causes overshooting, so a quick solution may be to turn it down.
However, doing so trades speed for accuracy, so it's better to wait before tuning this parameter.
Weight initialization
We need to avoid neurons having the same
gradient. This is easily achieved by using
small random values.
However, if we have a large fan-in, it's
easy to overshoot, so it's better to scale
those weights by the inverse of
\sqrt{\text{fan-in}}:
w = \frac{
\texttt{np.random.randn}(N)
}{
\sqrt{N}
}
where N is the fan-in.
Xavier-Glorot Initialization
Here weights are also scaled by the inverse square root of the fans, but we sample from a
uniform distribution whose standard deviation is
\sigma = \text{gain} \cdot \sqrt{
\frac{
2
}{
\text{fan-in} + \text{fan-out}
}
}
and the samples are bounded between -a and a, where
a = \text{gain} \cdot \sqrt{
\frac{
6
}{
\text{fan-in} + \text{fan-out}
}
}
Alternatively, one can sample from a normal distribution
\mathcal{N}(0, \sigma^2).
Note that in the original paper the gain is equal to 1.
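A minimal NumPy sketch of these initializations, assuming the formulas above (function names are mine; in practice one would typically use a framework's built-ins, e.g. PyTorch's `torch.nn.init.xavier_uniform_`):

```python
import numpy as np

def simple_init(fan_in, fan_out):
    # Small random values scaled by 1/sqrt(fan-in), so a large fan-in
    # doesn't make the pre-activations (and gradients) blow up.
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

def xavier_uniform(fan_in, fan_out, gain=1.0):
    # Xavier/Glorot: uniform in [-a, a] with a = gain * sqrt(6/(fan_in+fan_out))
    a = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-a, a, size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out, gain=1.0):
    # Normal variant: N(0, sigma^2) with sigma = gain * sqrt(2/(fan_in+fan_out))
    sigma = gain * np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * sigma
```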
Decorrelating input components
Since highly correlated features don't offer much
in terms of new information, we probably need
to move to the latent space to find the
latent variables governing those features.
PCA
Caution
This topic won't be explained here, as it's usually learnt in
Machine Learning, a prerequisite for approaching Deep Learning.
PCA is a method we can use to discard components that
add little to no information.
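A minimal PCA sketch via SVD, assuming we just want to keep the top-k components (a library implementation such as scikit-learn's `PCA` would be the usual choice):

```python
import numpy as np

def pca(X, k):
    """Project onto the k directions of largest variance; the discarded
    components are assumed to add little information."""
    Xc = X - X.mean(axis=0)                         # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                            # top-k projection
```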
Common problems in MultiLayer Networks
Hitting a Plateau
This happens when we have a big learning rate,
which makes the weights grow large in absolute value.
Because this happens too quickly, we may see the error diminish fast and mistake it for a minimum point, while instead it's a plateau.
Speeding up Mini-Batch Learning
Momentum[^1]
We use this method mainly when we use SGD as
the learning technique.
This method is better explained if we imagine our error surface as an actual surface and place a ball on it.
The ball will start rolling towards the steepest descent (initially), but after gaining enough velocity it will follow, to some degree, its previous direction.
So the gradient now modifies the velocity rather than the position, and the momentum dampens small variations.
Moreover, once the momentum builds up, we will easily pass over plateaus, as the ball will continue to roll until it is stopped by an opposing gradient.
Momentum Equations
There are mainly a couple of them.
One of them uses a term p to track the momentum,
called the SGD momentum, momentum term, or
momentum parameter:
\begin{aligned}
p_{k+1} &= \beta p_{k} + \eta \nabla L(X, y, w_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
The other one is logically equivalent to the
previous, but it updates the weights in one step
and is called the Stochastic Heavy Ball Method:
w_{k+1} = w_k - \gamma \nabla L(X, y, w_k)
+ \beta ( w_k - w_{k-1})
Note
This is how to choose \beta: 0 < \beta < 1.
If \beta = 0, we are doing plain gradient descent; if \beta > 1, we get numerical instabilities.
The larger \beta, the higher the momentum, and the slower the update direction turns.
Tip
Usual values are \beta = 0.9 or \beta = 0.99; usually we start from 0.5 and raise it whenever we get stuck.
When we increase \beta, the learning rate must decrease accordingly (e.g. going from \beta = 0.9 to 0.99, the learning rate must be divided by a factor of 10).
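A minimal sketch of the two-step formulation above; `grad_fn` and `data_iter` are assumed helpers returning the mini-batch gradient and the next mini-batch:

```python
import numpy as np

def sgd_momentum(w, grad_fn, data_iter, beta=0.9, eta=1.0, gamma=0.01, steps=100):
    """Momentum form from the equations above:
    p <- beta * p + eta * grad(L);  w <- w - gamma * p."""
    p = np.zeros_like(w)
    for _ in range(steps):
        X, y = next(data_iter)
        p = beta * p + eta * grad_fn(X, y, w)  # the gradient modifies the velocity...
        w = w - gamma * p                      # ...and the velocity moves the weights
    return w
```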
Nesterov (1983) / Sutskever (2012) Accelerated Momentum
Differently from the previous
momentum,
we take an intermediate step: we first
update the weights according to the
previous momentum, then we compute the
new momentum at this new position, and finally
we update again:
\begin{aligned}
\hat{w}_k & = w_k - \beta p_k \\
p_{k+1} &= \beta p_{k} +
\eta \nabla L(X, y, \hat{w}_k) \\
w_{k+1} &= w_{k} - \gamma p_{k+1}
\end{aligned}
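The same sketch adapted to the Nesterov equations above, with the gradient evaluated at the look-ahead point \hat{w} (same assumed helpers as before):

```python
import numpy as np

def nesterov_momentum(w, grad_fn, data_iter, beta=0.9, eta=1.0, gamma=0.01, steps=100):
    """Nesterov/Sutskever momentum: evaluate the gradient at the
    look-ahead point w_hat = w - beta * p, then update the velocity."""
    p = np.zeros_like(w)
    for _ in range(steps):
        X, y = next(data_iter)
        w_hat = w - beta * p                        # intermediate step
        p = beta * p + eta * grad_fn(X, y, w_hat)   # momentum at the look-ahead point
        w = w - gamma * p                           # final update
    return w
```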
Why Momentum Works
While it has been hypothesized that acceleration makes convergence faster, this is only true for convex problems without much noise, so it may be just part of the story.
The other half may be Noise Smoothing, i.e. momentum smoothing the optimization process; however, according to these papers[^2][^3], this may not be the actual reason.
Separate Adaptive Learning Rates
Since weights may vary greatly across layers,
having a single learning rate might not be ideal.
So the idea is to set a local learning rate that
modulates the global one as a multiplicative factor.
Local Learning rates
- Start from 1 as the initial value for the local learning rates, which we'll call gains from now on.
- If the gradient keeps the same sign, increase the gain additively.
- Otherwise, decrease it multiplicatively.
\Delta w_{i,j} = - \eta \, g_{i,j} \cdot \frac{
d \, Out
}{
d \, w_{i,j}
}
\\
g_{i,j}(t) = \begin{cases}
g_{i,j}(t - 1) + \delta
& \text{if } \frac{d \, Out}{d \, w_{i,j}}(t)
\cdot
\frac{d \, Out}{d \, w_{i,j}}(t-1) > 0 \\
g_{i,j}(t - 1) \cdot (1 - \delta)
& \text{otherwise}
\end{cases}
With this method, if there are oscillations, we will have
gains around 1
Tip
- A usual value for \delta is 0.05.
- Limit the gains to a range such as [0.1, 10] or [0.01, 100].
- Use full batches or big mini-batches, so that the gradient doesn't change sign merely because of sampling errors.
- Combine it with Momentum.
Remember that adaptive learning rates only deal
with axis-aligned effects. A minimal sketch of the gain update follows.
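This sketch implements the equations above; the clipping range [0.1, 10] follows the tip, and all names are illustrative:

```python
import numpy as np

def adaptive_gains_step(w, gain, grad, prev_grad, eta=0.01, delta=0.05):
    """One per-weight adaptive-gain update: additive increase while the
    gradient keeps its sign, multiplicative decrease when it flips."""
    same_sign = grad * prev_grad > 0
    gain = np.where(same_sign, gain + delta, gain * (1 - delta))
    gain = np.clip(gain, 0.1, 10.0)   # limit gains as suggested in the tip
    w = w - eta * gain * grad         # the gain multiplies the global learning rate
    return w, gain
```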
rmsprop | Root Mean Square Propagation
rprop | Resilient Propagation[^4]
This is basically the same idea of separate learning rates, but here we don't use the AIMD (additive increase, multiplicative decrease) scheme, and we don't take into account the magnitude of the gradient, only its sign:
- If the gradient keeps the same sign: step_{k+1} = step_k \cdot \eta_+ where \eta_+ > 1
- Otherwise: step_{k+1} = step_k \cdot \eta_- where 0 < \eta_- < 1
Tip
Limit the step sizes to a range, e.g. smaller than 50 and larger than a millionth (10^{-6}).
Caution
rprop does not work with
mini-batches, as the sign of the gradient changes too frequently across batches.
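A minimal rprop sketch; \eta_+ = 1.2 and \eta_- = 0.5 are commonly used defaults (an assumption, not stated in the notes), and the step-size clamp follows the tip above:

```python
import numpy as np

def rprop_step(w, step, grad, prev_grad, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One rprop update: only the sign of the gradient is used; the
    per-weight step grows by eta_plus while the sign is stable and
    shrinks by eta_minus when it flips."""
    same_sign = grad * prev_grad > 0
    step = np.where(same_sign, step * eta_plus, step * eta_minus)
    step = np.clip(step, step_min, step_max)  # clamp per the tip above
    w = w - np.sign(grad) * step
    return w, step
```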
rmsprop in detail[^5]
The idea is that rprop
is equivalent to dividing the gradient by its
magnitude (you effectively multiply by ±1);
however, this means that between mini-batches the
divisor changes each time, oscillating.
The solution is to keep a running average of
the squared gradient for
each weight:
MeanSquare(w, t) =
\alpha \, MeanSquare(w, t-1) +
(1 - \alpha)
\left(
\frac{d\, Out}{d\, w}
\right)^2
We then divide the gradient by the square root
of that value
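A minimal rmsprop sketch of the two steps above; the small `eps` in the denominator is a standard numerical-stability trick I've added, not part of the formula in the notes:

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, lr=0.001, alpha=0.9, eps=1e-8):
    """One rmsprop update: exponential running average of the squared
    gradient, then divide the gradient by its square root."""
    mean_square = alpha * mean_square + (1 - alpha) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)  # eps avoids division by zero
    return w, mean_square
```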
Further Developments
- rmsprop with momentum does not work as well as it should
- rmsprop with Nesterov momentum works best if used to divide the correction rather than the jump
- rmsprop with adaptive learning rates needs more investigation
Fancy Methods
Adaptive Gradient
Convex Case
- Conjugate Gradient/Acceleration
- L-BFGS
- Quasi-Newton Methods
Non-Convex Case
Pay attention: here the Hessian may not be
positive semi-definite, so when the gradient is
0 we don't necessarily know where we are (minimum, maximum, or saddle point).
- Natural Gradient Methods
- Curvature Adaptive
- Noise Injection:
- Simulated Annealing
- Langevin Method
Adagrad
Adadelta
ADAM
AdamW
LION
Hessian Free[^6]
How much can we learn from a given
Loss surface?
The best way to move would be along the gradient, assuming the curvature is the same in every direction (e.g. the surface is a circular bowl with a local minimum).
But usually this is not the case, so we should move where the ratio of gradient to curvature is high.
Newton's Method
This method takes into account the curvature
of the Loss
With this method, the update would be:
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{d \, E}{d \, \vec{w}}
If this were feasible we would reach the minimum in
one step, but it isn't, as the number of terms in the
Hessian grows quadratically with the number of
weights, and inverting it is even more expensive.
The thing is that whenever we update weights with
the Steepest Descent method, each update messes up
the others, while the curvature can help scale
these updates so that they do not disturb each other.
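A minimal sketch of one Newton step on a toy quadratic error, where the method does reach the minimum in a single step as stated above (the matrices here are illustrative):

```python
import numpy as np

def newton_step(w, grad, hessian, eps=1.0):
    """Newton update: scale the gradient by the inverse Hessian so the
    step accounts for curvature. Solving the linear system is cheaper
    and more stable than explicitly inverting H."""
    return w - eps * np.linalg.solve(hessian, grad)

# Toy quadratic E(w) = 0.5 w^T A w - b^T w, so grad = A w - b and H = A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w = np.zeros(2)
w = newton_step(w, A @ w - b, A)   # one step lands on the minimum A^{-1} b
```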
Curvature Approximations
However, since the Hessian is
too expensive to compute, we can approximate it.
- We can take only the diagonal elements
- Other algorithms (e.g. Hessian Free)
- Conjugate Gradient to minimize the approximation error
Conjugate Gradient78
Caution
This is an oversimplification of the topic, so reading the material in the footnotes is greatly advised.
The basic idea is that, in order not to mess up previous
directions, we optimize along conjugate directions: moving along a new direction must not change the gradient along the previously optimized ones.
This method is mathematically guaranteed to reach the minimum of a quadratic surface after N steps, N being the dimension of the space; in practice the error becomes minimal after far fewer steps.
This method works well for non-quadratic errors too,
and the Hessian Free optimizer uses this method
on genuinely quadratic surfaces, namely
quadratic approximations of the real surface.
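A minimal conjugate gradient sketch for a quadratic error \frac{1}{2} x^T A x - b^T x with A symmetric positive definite (this is the standard Fletcher-Reeves form; names and the tolerance are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, x=None, tol=1e-10):
    """Minimize 0.5 x^T A x - b^T x along conjugate directions:
    converges in at most N = len(b) steps exactly, far fewer in practice."""
    x = np.zeros_like(b) if x is None else x
    r = b - A @ x                        # residual = negative gradient
    d = r.copy()                         # first direction: steepest descent
    for _ in range(len(b)):
        if r @ r < tol:
            break
        alpha = r @ r / (d @ A @ d)      # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        beta = (r_new @ r_new) / (r @ r) # Fletcher-Reeves coefficient
        d = r_new + beta * d             # new direction, conjugate to the old ones
        r = r_new
    return x
```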
[^2]: Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent, arXiv:2402.02325v4
[^3]: Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization, arXiv:2402.02325v1
[^5]: RMSprop, Official PyTorch Documentation, 19 April 2025
[^6]: Vito Walter Anelli, Deep Learning Material 2024/2025, PDF 5, pp. 67-81
[^7]: Vito Walter Anelli, Deep Learning Material 2024/2025, PDF 5, pp. 74-76