Added Hessian Free and Conjugate method

This commit is contained in:
Christian Risi 2025-04-20 12:35:09 +02:00
parent 26e0f11b42
commit 8d8266059b


@ -411,10 +411,78 @@ $0$ we don't necessarily know where we are.
> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)
### Hessian Free[^anelli-hessian-free]
How much can we `learn` from a given
`Loss` space?

The ***best way to move*** would be along the
***gradient***, but only if the surface has the
***same curvature in every direction***
(e.g. it's a quadratic bowl and has a local minimum).

Since ***usually this is not the case***, we should
instead move ***where the ratio of gradient to
curvature is high***.
#### Newton's Method

This method takes the ***curvature***
of the `Loss` into account.

With this method, the update would be:
$$
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
d \, E
}{
d \, \vec{w}
}
$$
***If this were feasible, we would reach the minimum in
one step***, but it is not: the ***computations***
needed to get the `Hessian` ***grow quadratically with
the number of `weights`***, and inverting it costs
even more.

The thing is that whenever we ***update `weights`*** with
the `Steepest Descent` method, each update *messes up*
the others, while the ***curvature*** can help to ***scale
these updates*** so that they do not disturb each other.
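
As a concrete illustration, here is a minimal `numpy`
sketch of a single Newton step on a toy quadratic loss
(the matrix `A`, the vector `b` and all values are
illustrative assumptions, not from the course material):

```python
import numpy as np

# Toy quadratic loss E(w) = 0.5 * w^T A w - b^T w, whose gradient is
# A w - b and whose Hessian is the constant matrix A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # Hessian of the quadratic (positive definite)
b = np.array([1.0, 1.0])

w = np.zeros(2)                 # initial weights
grad = A @ w - b                # dE/dw at the current point

# Newton update: delta_w = -H^{-1} grad (epsilon = 1 on a true quadratic)
delta_w = -np.linalg.solve(A, grad)
w = w + delta_w

print(w, A @ w - b)             # gradient is now ~0: the minimum in one step
```

On a genuinely quadratic surface a single such step lands
exactly on the minimum; the cost is hidden in building and
solving with the `Hessian`.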
#### Curvature Approximations

However, since the `Hessian` is
***too expensive to compute***, we can approximate it:

- We can take only the ***diagonal elements***
  (see the sketch after this list)
- ***Other algorithms*** (e.g. Hessian Free)
- ***Conjugate Gradient*** to minimize the
  ***approximation error***
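
A minimal sketch of the diagonal approximation, assuming
we already have the gradient and the `Hessian` diagonal
(the function name and the `damping` term are illustrative
choices, not from the course material):

```python
import numpy as np

def diagonal_newton_step(grad, hess_diag, epsilon=0.1, damping=1e-4):
    """Per-weight update using only the Hessian diagonal: directions
    with large curvature get small steps, flat directions get large
    ones. Off-diagonal interactions between weights are ignored."""
    return -epsilon * grad / (hess_diag + damping)

# Illustrative usage on made-up gradient / curvature values
grad = np.array([0.5, 0.5])
hess_diag = np.array([10.0, 0.1])   # steep vs. nearly flat direction
print(diagonal_newton_step(grad, hess_diag))
```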
#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]
> [!CAUTION]
>
> This is an oversimplification of the topic, so reading
> the material in the footnotes is greatly advised.
The basic idea is that, in order not to mess up previous
directions, we ***`optimize` along mutually conjugate
directions*** (intuitively, perpendicular ones): each new
direction is chosen so that it does not undo the
minimization already achieved along the previous ones.

On a quadratic surface this method is ***mathematically
guaranteed to succeed after N steps***, where ***N is the
dimension of the space***; in practice far fewer steps
already leave only a minimal error.

This ***method also works well for `non-quadratic errors`***,
and the `Hessian Free` `optimizer` uses it on
***genuinely quadratic surfaces***, which are
***quadratic approximations of the real surface***.
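
Below is a minimal `numpy` sketch of conjugate gradient on
a quadratic surface, i.e. solving $A\vec{x} = \vec{b}$ for
a symmetric positive-definite `A` (the function name and
parameters are illustrative choices):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Minimize the quadratic 0.5 x^T A x - b^T x, i.e. solve A x = b,
    for a symmetric positive-definite A. Mathematically exact after
    N = len(b) steps; in practice it stops much earlier."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual = negative gradient
    d = r.copy()                   # first direction: steepest descent
    rs = r @ r
    for _ in range(len(b)):
        Ad = A @ d
        alpha = rs / (d @ Ad)      # exact line search along d
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        # The next direction is conjugate (A-orthogonal) to all the
        # previous ones, so it never undoes earlier minimization.
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x
```

Note that the loop touches `A` only through the product
`A @ d`: this is why `Hessian Free` can afford it, since
`Hessian`-vector products can be computed without ever
materializing the full `Hessian`.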
<!-- TODO: Add PDF 5 pg. 38 -->
<!-- Footnotes -->
[^momentum]: [Why Momentum Really Works | Distill | 18th April 2025](https://distill.pub/2017/momentum/)
@ -425,3 +493,9 @@ $0$ we don't necessarily know where we are.
[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)
[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)
[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81
[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)
[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76