Added Hessian Free and Conjugate method

This commit is contained in:
Christian Risi 2025-04-20 12:35:09 +02:00
parent 26e0f11b42
commit 8d8266059b


@ -411,10 +411,78 @@ $0$ we don't necessarily know where we are.
> [!NOTE]
> [Here in detail](./Fancy-Methods/LION.md)
### Hessian Free[^anelli-hessian-free]
How much can we `learn` from a given
`Loss` space?

The ***best way to move*** would be along the
***gradient***, but only if the surface has the
***same curvature in every direction***
(e.g. it's a quadratic bowl and has a local minimum).

Since ***usually this is not the case***, we should
instead move ***where the ratio of gradient to
curvature is high***.
#### Newton's Method

This method takes the ***curvature***
of the `Loss` into account.

With this method, the update would be:
$$
\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
d \, E
}{
d \, \vec{w}
}
$$
***If this were feasible, we would reach the minimum in
one step***, but it is not: the ***computations***
needed to get the `Hessian` ***grow quadratically with
the number of `weights`***, and inverting it costs
even more.

The thing is that whenever we ***update `weights`*** with
the `Steepest Descent` method, each update *messes up*
the others, while the ***curvature*** can help to ***scale
these updates*** so that they do not disturb each other.
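
As a concrete illustration, here is a minimal `numpy`
sketch of a single Newton step on a toy quadratic loss
(the matrix `A`, the vector `b` and all values are
illustrative assumptions, not from the course material):

```python
import numpy as np

# Toy quadratic loss E(w) = 0.5 * w^T A w - b^T w, whose gradient is
# A w - b and whose Hessian is the constant matrix A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # Hessian of the quadratic (positive definite)
b = np.array([1.0, 1.0])

w = np.zeros(2)                 # initial weights
grad = A @ w - b                # dE/dw at the current point

# Newton update: delta_w = -H^{-1} grad (epsilon = 1 on a true quadratic)
delta_w = -np.linalg.solve(A, grad)
w = w + delta_w

print(w, A @ w - b)             # gradient is now ~0: the minimum in one step
```

On a genuinely quadratic surface a single such step lands
exactly on the minimum; the cost is hidden in building and
solving with the `Hessian`.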
#### Curvature Approximations

However, since the `Hessian` is
***too expensive to compute***, we can approximate it:

- We can take only the ***diagonal elements***
  (see the sketch after this list)
- ***Other algorithms*** (e.g. Hessian Free)
- ***Conjugate Gradient*** to minimize the
  ***approximation error***
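
A minimal sketch of the diagonal approximation, assuming
we already have the gradient and the `Hessian` diagonal
(the function name and the `damping` term are illustrative
choices, not from the course material):

```python
import numpy as np

def diagonal_newton_step(grad, hess_diag, epsilon=0.1, damping=1e-4):
    """Per-weight update using only the Hessian diagonal: directions
    with large curvature get small steps, flat directions get large
    ones. Off-diagonal interactions between weights are ignored."""
    return -epsilon * grad / (hess_diag + damping)

# Illustrative usage on made-up gradient / curvature values
grad = np.array([0.5, 0.5])
hess_diag = np.array([10.0, 0.1])   # steep vs. nearly flat direction
print(diagonal_newton_step(grad, hess_diag))
```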
#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]
> [!CAUTION]
>
> This is an oversimplification of the topic, so reading
> the material in the footnotes is greatly advised.
The basic idea is that, in order not to mess up previous
directions, we ***`optimize` along mutually conjugate
directions*** (intuitively, perpendicular ones): each new
direction is chosen so that it does not undo the
minimization already achieved along the previous ones.

On a quadratic surface this method is ***mathematically
guaranteed to succeed after N steps***, where ***N is the
dimension of the space***; in practice far fewer steps
already leave only a minimal error.

This ***method also works well for `non-quadratic errors`***,
and the `Hessian Free` `optimizer` uses it on
***genuinely quadratic surfaces***, which are
***quadratic approximations of the real surface***.
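
Below is a minimal `numpy` sketch of conjugate gradient on
a quadratic surface, i.e. solving $A\vec{x} = \vec{b}$ for
a symmetric positive-definite `A` (the function name and
parameters are illustrative choices):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Minimize the quadratic 0.5 x^T A x - b^T x, i.e. solve A x = b,
    for a symmetric positive-definite A. Mathematically exact after
    N = len(b) steps; in practice it stops much earlier."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual = negative gradient
    d = r.copy()                   # first direction: steepest descent
    rs = r @ r
    for _ in range(len(b)):
        Ad = A @ d
        alpha = rs / (d @ Ad)      # exact line search along d
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        # The next direction is conjugate (A-orthogonal) to all the
        # previous ones, so it never undoes earlier minimization.
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x
```

Note that the loop touches `A` only through the product
`A @ d`: this is why `Hessian Free` can afford it, since
`Hessian`-vector products can be computed without ever
materializing the full `Hessian`.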
<!-- TODO: Add PDF 5 pg. 38 -->
<!-- Footnotes -->
[^momentum]: [Why Momentum Really Works | Distill | 18th April 2025](https://distill.pub/2017/momentum/)
@ -425,3 +493,9 @@ $0$ we don't necessarily know where we are.
[^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)
[^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)
[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81
[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)
[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76