Added Hessian Free and Conjugate Gradient methods
> [!NOTE]
>
> [Here in detail](./Fancy-Methods/LION.md)

### Hessian Free[^anelli-hessian-free]

How much can we `learn` from a given `Loss` space?

The ***best way to move*** would be along the
***gradient***, assuming the surface has the
***same curvature*** in every direction
(e.g. it is quadratic and has a local minimum).

But ***usually this is not the case***, so we need
to move ***where the ratio of gradient and curvature
is high***.

#### Newton's Method

This method takes into account the ***curvature***
of the `Loss`.

With this method, the update would be:

$$
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
    d \, E
}{
    d \, \vec{w}
}
$$

***If this were feasible we would reach the minimum in
one step***, but it is not, as the ***computations***
needed to get and invert the `Hessian` ***grow very
quickly with the number of `weights`*** (the matrix
alone has $N^2$ entries).

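To make the one-step claim concrete, here is a minimal
sketch (the $2 \times 2$ matrix below is made up) of a
Newton update on a genuinely quadratic `Loss`
$E(\vec{w}) = \frac{1}{2}\vec{w}^T A \vec{w} - \vec{b}^T \vec{w}$,
whose `Hessian` is the constant matrix $A$:

```python
# A minimal sketch, assuming the toy quadratic loss
# E(w) = 1/2 w^T A w - b^T w, whose Hessian is the constant matrix A.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # Hessian of E (made-up values)
b = np.array([1.0, 1.0])

w = np.zeros(2)                     # starting weights
grad = A @ w - b                    # dE/dw at the current weights
w = w - np.linalg.solve(A, grad)    # -H^{-1} @ gradient, epsilon = 1

print(np.allclose(A @ w - b, 0.0))  # True: one step, gradient is zero
```

Note that `np.linalg.solve` stands in for the inverse
here; on real networks even this is out of reach, which
is what the approximations below address.
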
The thing is that whenever we ***update `weights`*** with
the `Steepest Descent` method, each update *messes up*
the others, while the ***curvature*** can help to ***scale
these updates*** so that they do not disturb each other.

#### Curvature Approximations

However, since the `Hessian` is
***too expensive to compute***, we can approximate it.

- We can take only the ***diagonal elements***
  (see the sketch after this list)
- ***Other algorithms*** (e.g. Hessian Free)
- ***Conjugate Gradient*** to minimize the
  ***approximation error***

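As a sketch of the first bullet (reusing the made-up
quadratic from above), keeping only the ***diagonal
elements*** means each `weight`'s gradient is divided by
its own curvature, while the cross-weight terms are
simply ignored:

```python
# A minimal sketch of the diagonal approximation: only diag(H) is kept,
# so each update is cheap, but the method is iterative again.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                # full Hessian (made-up values)
b = np.array([1.0, 1.0])
epsilon = 0.5

w = np.zeros(2)
for _ in range(20):
    grad = A @ w - b
    w = w - epsilon * grad / np.diag(A)   # per-weight curvature scaling
```
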
#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]

> [!CAUTION]
>
> This is an oversimplification of the topic, so reading
> the material in the footnotes is greatly advised.

The basic idea is that, in order not to mess up previous
directions, we ***`optimize` along conjugate directions***:
directions that are "perpendicular" with respect to the
curvature, i.e. $\vec{d}_i^{\,T} H \vec{d}_j = 0$.

This method is ***mathematically guaranteed to succeed
after N steps, the dimension of the space***; in practice,
after far fewer steps the error will already be minimal.

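A hedged sketch of the algorithm itself, assuming a
symmetric positive definite `H` (the function name is
mine): it solves $H \cdot \Delta\vec{w} = -\frac{d \, E}{d \, \vec{w}}$
with no inversion, and on an $N$-dimensional quadratic it
terminates in at most $N$ steps:

```python
# A minimal Conjugate Gradient sketch: solve H @ x = -g iteratively,
# choosing each new direction conjugate to all previous ones.
import numpy as np

def conjugate_gradient(H, g, tol=1e-10):
    x = np.zeros_like(g)
    r = -g - H @ x                       # residual of H @ x = -g
    d = r.copy()                         # first direction: steepest descent
    for _ in range(len(g)):              # at most N steps on a quadratic
        Hd = H @ d
        alpha = (r @ r) / (d @ Hd)       # exact step length along d
        x += alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # conjugate, not merely perpendicular
        r = r_new
    return x

H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([-1.0, -1.0])               # gradient at the current weights
delta_w = conjugate_gradient(H, g)       # Newton-like step, no inverse
```
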
This ***method works well for `non-quadratic errors`***,
and the `Hessian Free` `optimizer` uses it on
***genuinely quadratic surfaces***, which are
***quadratic approximations of the real surface***.

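One reason this is tractable (a sketch, with a made-up
toy loss): `Conjugate Gradient` never needs `H` itself,
only `Hessian`-vector products, and double
backpropagation yields those at roughly the cost of one
extra gradient pass:

```python
# A minimal sketch of a Hessian-vector product via double backprop:
# H @ v is obtained without ever materializing the N x N Hessian.
import torch

w = torch.randn(3, requires_grad=True)    # toy weight vector
loss = (w ** 2).sum() + w.prod()          # made-up non-quadratic loss

grad = torch.autograd.grad(loss, w, create_graph=True)[0]
v = torch.randn(3)                        # direction CG is probing
Hv = torch.autograd.grad(grad @ v, w)[0]  # H @ v in O(N) memory
```
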
<!-- TODO: Add PDF 5 pg. 38 -->

<!-- Footnotes -->

[^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)

[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)

[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)

[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81

[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)

[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76