diff --git a/Chapters/5-Optimization/INDEX.md b/Chapters/5-Optimization/INDEX.md
index 6df1fdc..9a2fa04 100644
--- a/Chapters/5-Optimization/INDEX.md
+++ b/Chapters/5-Optimization/INDEX.md
@@ -411,10 +411,78 @@ $0$ we don't necessarily know where we are.
 > [!NOTE]
 > [Here in detail](./Fancy-Methods/LION.md)

-### Hessian Free
+### Hessian Free[^anelli-hessian-free]
+
+How much can we `learn` from a given
+`Loss` space?
+
+The ***best way to move*** would be along the
+***gradient***, assuming the surface has the
+***same curvature in every direction***
+(e.g. a circular bowl with a single local minimum).
+
+But ***usually this is not the case***, so we need
+to move ***where the ratio of gradient to curvature is
+high***.
+
+#### Newton's Method
+
+This method takes the ***curvature***
+of the `Loss` surface into account.
+
+With this method, the update would be:
+
+$$
+\Delta\vec{w} = - \epsilon H(\vec{w})^{-1} \cdot \frac{
+    d \, E
+}{
+    d \, \vec{w}
+}
+$$
+
+***If this were feasible we would land on the minimum (of a
+quadratic surface) in one step***, but it is not, as the
+***computations*** needed to build and invert the `Hessian`
+***grow very quickly with the number of `weights`***
+(the `Hessian` has one entry per pair of `weights`,
+and inverting it is even more expensive).
+
+The problem is that whenever we ***update the `weights`*** with
+the `Steepest Descent` method, each update *messes up*
+the others, while the ***curvature*** can be used to ***scale
+these updates*** so that they do not disturb each other.
+
+#### Curvature Approximations
+
+Since the full `Hessian` is
+***too expensive to compute***, we can approximate it:
+
+- We can keep only the ***diagonal elements***
+- ***Other algorithms*** (e.g. Hessian Free)
+- ***Conjugate Gradient*** to minimize the
+  ***approximation error***
+
+#### Conjugate Gradient[^conjugate-wikipedia][^anelli-conjugate-gradient]
+
+> [!CAUTION]
+>
+> This is an oversimplification of the topic, so reading
+> the footnote material is strongly advised.
+
+The basic idea is that, in order not to mess up previous
+directions, we ***`optimize` along mutually conjugate directions***,
+which are orthogonal with respect to the `Hessian`
+rather than merely perpendicular.
+
+This method is ***mathematically guaranteed to converge
+after N steps***, where N is the dimension of the space;
+in practice the error is already very small after far fewer steps.
+
+This ***method also works well for `non-quadratic errors`***,
+and the `Hessian Free` `optimizer` applies it to
+***genuinely quadratic surfaces***, which are
+***quadratic approximations of the real surface***.
+
+
 [^momentum]: [Distill Pub | 18th April 2025](https://distill.pub/2017/momentum/)

@@ -425,3 +493,9 @@ $0$ we don't necessarily know where we are.
 [^rprop-torch]: [Rprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html)

 [^rmsprop-torch]: [RMSprop | Official PyTorch Documentation | 19th April 2025](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)
+
+[^anelli-hessian-free]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 67-81
+
+[^conjugate-wikipedia]: [Conjugate Gradient Method | Wikipedia | 20th April 2025](https://en.wikipedia.org/wiki/Conjugate_gradient_method)
+
+[^anelli-conjugate-gradient]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 5 pg. 74-76
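
As a small illustration of the Newton update added above, here is a minimal PyTorch sketch (not part of the course material; the toy `loss_fn`, the matrix `A`, and the step size are made-up assumptions). It computes one exact Newton step with an explicit `Hessian`, which is only feasible because the example has just three `weights`:

```python
import torch

# Toy loss over 3 "weights": a slightly non-quadratic bowl (made-up example).
def loss_fn(w):
    A = torch.tensor([[3.0, 0.2, 0.0],
                      [0.2, 2.0, 0.1],
                      [0.0, 0.1, 1.0]])
    return 0.5 * w @ A @ w + 0.1 * torch.sin(w).sum()

w = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)

grad = torch.autograd.grad(loss_fn(w), w)[0]        # dE/dw
H = torch.autograd.functional.hessian(loss_fn, w)   # full N x N Hessian

eps = 1.0                                           # epsilon in the formula above
delta_w = -eps * torch.linalg.solve(H, grad)        # -eps * H^-1 * dE/dw

with torch.no_grad():
    w += delta_w                                    # one (damped) Newton step
```

On a genuinely quadratic `Loss` this single step would land on the minimum; the small `sin` term is there only to show that on a real surface it merely gets close.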
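
The reason `Hessian Free` methods scale is that they never build the `Hessian` at all: they only need `Hessian`-vector products, which autodiff provides at roughly the cost of two backward passes (double backprop). A minimal sketch, reusing the hypothetical `loss_fn` and `w` from the previous snippet:

```python
import torch

def hessian_vector_product(loss_fn, w, v):
    """Compute H(w) @ v without ever materialising H (double backprop)."""
    grad = torch.autograd.grad(loss_fn(w), w, create_graph=True)[0]
    # d/dw (grad . v) = H @ v, because v is treated as a constant
    return torch.autograd.grad(grad @ v, w)[0]
```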
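
Finally, `Conjugate Gradient` can solve the Newton system `H Δw = -dE/dw` using only those `Hessian`-vector products, moving along directions that do not undo previous progress; this is, in very reduced form, the inner loop of a `Hessian Free` step. The function below is a simplified sketch (fixed iteration budget, no damping, no preconditioning), not the full algorithm from the references:

```python
import torch

def conjugate_gradient(hvp, b, iters=10, tol=1e-6):
    """Approximately solve H x = b, given only a Hessian-vector product `hvp`."""
    x = torch.zeros_like(b)
    r = b.clone()                       # residual b - H x (x starts at 0)
    p = r.clone()                       # first search direction
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs_old) * p   # next direction, conjugate to the previous ones
        rs_old = rs_new
    return x

# One Hessian-free style update, reusing loss_fn, w and
# hessian_vector_product from the sketches above:
grad = torch.autograd.grad(loss_fn(w), w)[0]
delta_w = conjugate_gradient(lambda v: hessian_vector_product(loss_fn, w, v), -grad)
with torch.no_grad():
    w += delta_w                        # approximately -H^-1 * dE/dw
```

On a quadratic surface the loop above would converge in at most N (here 3) iterations; practical implementations add damping and stop early once the quadratic model stops improving.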