diff --git a/Chapters/9-Improve-Generalization/INDEX.md b/Chapters/9-Improve-Generalization/INDEX.md
index e69de29..3e4a881 100644
--- a/Chapters/9-Improve-Generalization/INDEX.md
+++ b/Chapters/9-Improve-Generalization/INDEX.md
@@ -0,0 +1,202 @@
+# Improve Generalization[^anelli-generalization]
+
+## Problems of Datasets[^anelli-generalization-1]
+
+While `datasets` try to mimic reality as closely as
+possible, we only deal with an ***approximation of the
+real probability distribution***.
+
+This means that our `samples` are affected by
+`sampling errors`, thus ***our `model` needs to be
+robust against these `errors`***.
+
+## Preventing Overfitting[^anelli-generalization-2]
+
+- Get more `data`
+- Take a `model` that has a
+  ***[`Capacity`](#capacity-of-a-nn) big enough
+  to fit `regularities`, but not `spurious
+  regularities`***
+- Take the `average` of ***many `models`***
+  - [`Ensemble Averaging`](#ensemble-averaging)[^wikipedia-ensemble]
+  - [`Boosting`](#boosting)[^wikipedia-boosting]
+  - [`Bagging`](#bagging)[^wikipedia-bagging]
+- `Bayesian`[^bayesian-nn][^bayesian-toronto] approach: \
+  Take the same `NN`, but use different `weights`
+
+## Capacity of a `NN`[^anelli-generalization-3]
+
+We have many methods to ***limit `capacity` in a `NN`***
+
+### Architecture
+
+- ***limit the number of `layers` and the number of
+  `units` per `layer`***
+
+- Control `meta parameters` using `Cross-Validation`
+
+### Early Stopping
+
+Start with ***small `weights`*** and ***stop before
+`overfitting`***.
+
+This is very useful when we don't want to `re-train` our
+`model` to `fine-tune` `meta-parameters`.
+
+> [!NOTE]
+>
+> Here the capacity is ***usually*** reduced because
+> with ***small `weights`*** the
+> [`activation functions`](./../3-Activation-Functions/INDEX.md) are ***usually*** in their ***`linear` range***.
+>
+> In simpler terms, whenever we are in a `linear range`,
+> it's almost like all of these `layers` are squashed
+> into a ***single big `linear layer`***, thus ***the
+> capacity is reduced***
+
+### Weight Decay
+
+***Penalize large `weights` with `L2` or `L1`
+penalties***, keeping them
+***small*** unless they ***have a big error
+`gradient`***
+
+### Adding Noise
+
+***Add noise to `weights` or `activities`***.
+
+Whenever we add `Gaussian noise`, its
+`variance` gets ***scaled by the squared `weights`***
+as it passes through them[^gaussian-noise-squared-variance].
+
+Thus the `noise`, by making our result `noisy` as well,
+in a sense acts like a `regularizer`[^toronto-noise]
+
+## Ensemble Averaging[^anelli-generalization-4]
+
+Since in our `models` we ***need to trade `Bias` for
+`Variance`***, we may prefer to have ***many `models`
+with a low `bias` and a higher `variance`***.
+
+Now, both the `bias` and the `variance` can be related
+to the ***learning capacity*** of our `model`:
+
+- ***higher `bias`, lower `capacity`***
+- ***higher `variance`, higher `capacity`***
+
+> [!WARNING]
+>
+> Remember that a ***high `capacity` makes each single
+> `model` prone to `overfitting`***; however,
+> this is ***desired here***.
+
+In this way, since we have ***many `models`, the
+`variance` will be averaged out***, reducing it.
+
+However, this method is ***more powerful when individual
+`models` disagree***
+
+> [!WARNING]
+>
+> Some single `model` or `predictor` among all of them
+> will be better than the ***combined one*** on some
+> `data points`, but this is expected, as the combination
+> is an average result and thus lower than the best.
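+
+Below is a minimal `NumPy` sketch of why this works (the
+setup and names are made up for illustration, they are not
+from the slides): averaging the `predictions` of several
+low-`bias`, high-`variance` `predictors` with independent
+`errors` shrinks the `error variance` roughly by a factor
+equal to the number of `models`.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Hypothetical setup: a scalar target and K low-bias / high-variance
+# predictors, each returning the truth plus independent noise.
+true_value = 2.0
+n_models, n_trials = 10, 10_000
+predictions = true_value + rng.normal(0.0, 1.0, size=(n_models, n_trials))
+
+# Error of a single predictor vs. the error of the ensemble average.
+single_mse = np.mean((predictions[0] - true_value) ** 2)
+ensemble_mse = np.mean((predictions.mean(axis=0) - true_value) ** 2)
+
+print(f"single model MSE: {single_mse:.3f}")  # about 1.0
+print(f"ensemble MSE:     {ensemble_mse:.3f}")  # about 1.0 / n_models
+```
+
+If the `models` agree (their `errors` are correlated), the
+reduction is much smaller, which is exactly why we want the
+individual `models` to disagree.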
+
+### Making Predictors Disagree[^anelli-generalization-5]
+
+- Rely on the `algorithm` ending up in a different
+  `local minimum` on each run
+- Use different `models` that may not be `NNs`
+  - (e.g. a `Classification Tree` together with a `NN`)
+  - (e.g. change the `NN` architecture or `parameters`)
+- Change their `training data` via [`boosting`](#boosting)
+  or [`bagging`](#bagging)
+
+#### Boosting
+
+This technique is based on ***`training` many
+`low-capacity` `models` on ALL of the `training-set`***
+
+> [!NOTE]
+>
+> - It worked very well on MNIST alongside `NNs`
+> - It allows each `model` to ***excel on a small subset
+>   of `points`***
+
+#### Bagging
+
+This technique is based on ***`training` some `models` on
+different `subsets` of the `training-set`***
+
+> [!NOTE]
+>
+> - ***Too expensive for `NNs`*** in many cases
+> - Mainly used to train many `decision-trees` for
+>   `random-forests`
+
+### Averaging `models`
+
+- ***`average` the probabilities***
+- ***`geometric mean`[^wikipedia-geometric] of the
+  probabilities***
+- ***[`dropout`](#dropout)***
+
+### Dropout[^dropout-paper]
+
+`Dropout` is a technique ***where each `hidden-unit`
+has a probability of 0.5 of being `omitted`
+at `training-time`
+(not considered during computation)***; however, all the
+sampled `NNs` share the same `weights`.
+
+Whenever we `train` with this technique, we take a
+`NN` with a given architecture. Then we ***sample
+some `NNs` by omitting each `hidden-unit` with
+p = 0.5***.
+
+This is equivalent to ***`training` (potentially)*** $2^n$
+***`NNs` `sampled` from the original***, where $n$ is the
+number of `hidden-units`. However, at
+`inference` and `test` time, ***we `average` the
+results by multiplying each `weight` by p, thus 0.5,
+using only the original `NN`***
+
+Since the number of `sampled NNs` is very high, and very
+often ***higher than the number of `datapoints`, each
+sampled `NN` is usually `trained` on only 1 `datapoint`,
+if it is `trained` at all***.
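+
+A minimal `NumPy` sketch of this scheme (the single
+`hidden-layer`, its sizes, and the names are hypothetical,
+they are not taken from the paper): each `hidden-unit` is
+`omitted` with p = 0.5 at `training-time`, while at
+`test-time` the original `NN` is used with the outgoing
+`weights` of every `hidden-unit` multiplied by 0.5.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Hypothetical 1-hidden-layer NN (sizes and names are made up).
+n_in, n_hidden, n_out = 8, 16, 4
+W = rng.normal(0.0, 0.1, size=(n_hidden, n_in))   # input  -> hidden
+V = rng.normal(0.0, 0.1, size=(n_out, n_hidden))  # hidden -> output
+x = rng.normal(size=n_in)
+
+p = 0.5  # probability of a hidden-unit being omitted
+
+def forward_train(x):
+    """Training-time: sample one 'thinned' NN by omitting each
+    hidden-unit with probability p; all samples share W and V."""
+    h = np.maximum(0.0, W @ x)        # hidden activities (ReLU)
+    mask = rng.random(n_hidden) >= p  # True = unit kept
+    return V @ (h * mask)             # omitted units contribute 0
+
+def forward_test(x):
+    """Inference/test-time: only the original NN, with each outgoing
+    weight of a hidden-unit multiplied by 0.5, approximating the
+    average over the 2^n sampled NNs."""
+    h = np.maximum(0.0, W @ x)
+    return (V * (1.0 - p)) @ h
+
+print(forward_train(x))  # changes between calls (random mask)
+print(forward_test(x))   # deterministic, "averaged" prediction
+```
+
+Modern frameworks usually implement the equivalent
+`inverted dropout` (scale the kept `activities` by
+1 / (1 - p) at `training-time`, no scaling at `test-time`),
+but the version above follows the `weight`-scaling
+described here.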
+
+[^anelli-generalization]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9
+
+[^anelli-generalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 2
+
+[^anelli-generalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3
+
+[^anelli-generalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3
+
+[^anelli-generalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 11 to 16
+
+[^anelli-generalization-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 14 to 15
+
+[^bayesian-nn]: [Bayesian Neural Networks | C.S. Toronto | 3rd May 2025](https://www.cs.toronto.edu/~duvenaud/distill_bayes_net/public/)
+
+[^toronto-noise]: [C.S. Toronto | Lecture 9 pg. 16 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)
+
+[^bayesian-toronto]: [C.S. Toronto | Lecture 9 pg. 19 to 27 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)
+
+[^gaussian-noise-squared-variance]: [Scalar multiplying standard deviation | Math StackExchange | 3rd May 2025](https://math.stackexchange.com/a/2793257)
+
+[^wikipedia-ensemble]: [Ensemble Averaging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Ensemble_averaging_(machine_learning))
+
+[^wikipedia-boosting]: [Boosting | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Boosting_(machine_learning))
+
+[^wikipedia-bagging]: [Bagging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Bootstrap_aggregating)
+
+[^wikipedia-geometric]: [Geometric Mean | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Geometric_mean)
+
+[^dropout-paper]: [Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov | 3rd May 2025](https://jmlr.csail.mit.edu/papers/volume15/srivastava14a.old/srivastava14a.pdf)