Added Hessian Free and Conjugate Gradient methods
> [!NOTE]
>
> [Here in detail](./Fancy-Methods/LION.md)

### Hessian Free[^anelli-hessian-free]

How much can we `learn` from a given `Loss` space?

The ***best way to move*** would be along the
***gradient***, assuming the surface has the
***same curvature*** in every direction
(e.g. it is quadratic and has a local minimum).

But ***usually this is not the case***, so we need
to move ***where the ratio of gradient and curvature
is high***.

#### Newton's Method

This method takes into account the ***curvature***
of the `Loss`.

With this method, the update would be:

$$
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
    d \, E
}{
    d \, \vec{w}
}
$$

***If this were feasible we would reach the minimum in
one step***, but it is not, as the ***computations***
needed to get and invert the `Hessian` ***grow very
quickly with the number of `weights`*** (the matrix
alone has $N^2$ entries).

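To make the one-step claim concrete, here is a minimal
sketch (the $2 \times 2$ matrix below is made up) of a
Newton update on a genuinely quadratic `Loss`
$E(\vec{w}) = \frac{1}{2}\vec{w}^T A \vec{w} - \vec{b}^T \vec{w}$,
whose `Hessian` is the constant matrix $A$:

```python
# A minimal sketch, assuming the toy quadratic loss
# E(w) = 1/2 w^T A w - b^T w, whose Hessian is the constant matrix A.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # Hessian of E (made-up values)
b = np.array([1.0, 1.0])

w = np.zeros(2)                     # starting weights
grad = A @ w - b                    # dE/dw at the current weights
w = w - np.linalg.solve(A, grad)    # -H^{-1} @ gradient, epsilon = 1

print(np.allclose(A @ w - b, 0.0))  # True: one step, gradient is zero
```

Note that `np.linalg.solve` stands in for the inverse
here; on real networks even this is out of reach, which
is what the approximations below address.
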
The thing is that whenever we ***update `weights`*** with
the `Steepest Descent` method, each update *messes up*
the others, while the ***curvature*** can help to ***scale
these updates*** so that they do not disturb each other.

#### Curvature Approximations

However, since the `Hessian` is
***too expensive to compute***, we can approximate it.

- We can take only the ***diagonal elements***
  (see the sketch after this list)
- ***Other algorithms*** (e.g. Hessian Free)
- ***Conjugate Gradient*** to minimize the
  ***approximation error***

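As a sketch of the first bullet (reusing the made-up
quadratic from above), keeping only the ***diagonal
elements*** means each `weight`'s gradient is divided by
its own curvature, while the cross-weight terms are
simply ignored:

```python
# A minimal sketch of the diagonal approximation: only diag(H) is kept,
# so each update is cheap, but the method is iterative again.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                # full Hessian (made-up values)
b = np.array([1.0, 1.0])
epsilon = 0.5

w = np.zeros(2)
for _ in range(20):
    grad = A @ w - b
    w = w - epsilon * grad / np.diag(A)   # per-weight curvature scaling
```
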
#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]

> [!CAUTION]
>
> This is an oversimplification of the topic, so reading
> the material in the footnotes is greatly advised.

The basic idea is that, in order not to mess up previous
directions, we ***`optimize` along conjugate directions***:
directions that are "perpendicular" with respect to the
curvature, i.e. $\vec{d}_i^{\,T} H \vec{d}_j = 0$.

This method is ***mathematically guaranteed to succeed
after N steps, the dimension of the space***; in practice,
after far fewer steps the error will already be minimal.

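A hedged sketch of the algorithm itself, assuming a
symmetric positive definite `H` (the function name is
mine): it solves $H \cdot \Delta\vec{w} = -\frac{d \, E}{d \, \vec{w}}$
with no inversion, and on an $N$-dimensional quadratic it
terminates in at most $N$ steps:

```python
# A minimal Conjugate Gradient sketch: solve H @ x = -g iteratively,
# choosing each new direction conjugate to all previous ones.
import numpy as np

def conjugate_gradient(H, g, tol=1e-10):
    x = np.zeros_like(g)
    r = -g - H @ x                       # residual of H @ x = -g
    d = r.copy()                         # first direction: steepest descent
    for _ in range(len(g)):              # at most N steps on a quadratic
        Hd = H @ d
        alpha = (r @ r) / (d @ Hd)       # exact step length along d
        x += alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # conjugate, not merely perpendicular
        r = r_new
    return x

H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([-1.0, -1.0])               # gradient at the current weights
delta_w = conjugate_gradient(H, g)       # Newton-like step, no inverse
```
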
This ***method works well for `non-quadratic errors`***,
and the `Hessian Free` `optimizer` uses it on
***genuinely quadratic surfaces***, which are
***quadratic approximations of the real surface***.

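One reason this is tractable (a sketch, with a made-up
toy loss): `Conjugate Gradient` never needs `H` itself,
only `Hessian`-vector products, and double
backpropagation yields those at roughly the cost of one
extra gradient pass:

```python
# A minimal sketch of a Hessian-vector product via double backprop:
# H @ v is obtained without ever materializing the N x N Hessian.
import torch

w = torch.randn(3, requires_grad=True)    # toy weight vector
loss = (w ** 2).sum() + w.prod()          # made-up non-quadratic loss

grad = torch.autograd.grad(loss, w, create_graph=True)[0]
v = torch.randn(3)                        # direction CG is probing
Hv = torch.autograd.grad(grad @ v, w)[0]  # H @ v in O(N) memory
```
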
<!-- TODO: Add PDF 5 pg. 38 -->

<!-- Footnotes -->

[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)

[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)

[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)

[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81

[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)

[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76