# Improve Generalization[^anelli-generalization]

## Problems of Datasets[^anelli-generalization-1]

While `datasets` try to mimic reality as closely as possible, we only deal with an
***approximation of the real probability distribution***.

This means that our `samples` are affected by `sampling errors`, thus
***our `model` needs to be robust against these `errors`***

## Preventing Overfitting[^anelli-generalization-2]

- Get more `data`
- Take a `model` that has a ***[`Capacity`](#capacity-of-a-nn) big enough to fit
  `regularities`, but not `spurious regularities`***
- Take the `average` of ***many `models`***
  - [`Ensemble Averaging`](#ensemble-averaging)[^wikipedia-ensemble]
  - [`Boosting`](#boosting)[^wikipedia-boosting]
  - [`Bagging`](#bagging)[^wikipedia-bagging]
- `Bayesian`[^bayesian-nn][^bayesian-toronto] approach: \
  Take the same `NN`, but use different `weights`

## Capacity of a `NN`[^anelli-generalization-3]

We have many methods to ***limit the `capacity` of a `NN`***

### Architecture

- ***Limit the number of `layers` and the number of `units` per `layer`***
- Control `meta-parameters` using `Cross-Validation`

### Early Stopping

Start with ***small `weights`*** and ***stop before `overfitting`***

This is very useful when we don't want to `re-train` our `model` to `fine-tune` `meta-parameters`.

> [!NOTE]
>
> Here the capacity is ***usually*** reduced because
> ***usually*** with ***small `weights`***
> [`activation functions`](./../3-Activation-Functions/INDEX.md) are in their ***`linear` range***.
>
> In simpler terms, whenever we are in a `linear range`,
> it's almost like all of these `layers` are squashed
> into a ***single big `linear layer`***, thus ***the
> capacity is reduced***

### Weight Decay

***Penalize large `weights` with `L2` or `L1` penalties***, keeping them ***small***
unless they ***have a large error gradient***

### Adding Noise

***Add noise to `weights` or `activities`***.

Whenever we add `Gaussian noise`, its `variance` is ***amplified by the squared
`weights`***[^gaussian-noise-squared-variance]. Thus the `noise`, by making our result
`noisy` as well, in a sense acts like a `regularizer`[^toronto-noise]

## Ensemble Averaging[^anelli-generalization-4]

Since in our `models` we ***need to trade `Bias` for `Variance`***, we may prefer to have
***many `models` with a low `bias` and a higher `variance`***.

Now, both `bias` and `variance` can be related to the ***learning capacity*** of our `model`:

- ***higher `bias`, lower `capacity`***
- ***higher `variance`, higher `capacity`***

> [!WARNING]
>
> Remember that a ***high `capacity` makes a single
> `model` prone to `overfitting`***, however
> this is ***desired here***.

In this way, since we have ***many `models`, the `variance` will be averaged out***,
reducing it. However, this method is ***more powerful when the individual `models` disagree***

> [!WARNING]
>
> Some single `model` or `predictor` among all of them will be
> better than the ***combined one*** on some
> `data points`, but this is expected: the combination is an
> average result, thus lower than the best.
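To make the averaging concrete, here is a minimal sketch, assuming a handful of toy 3-class predictors that each output probabilities perturbed in their own way (the predictors, class counts and noise scale are made up for illustration, not the lecture's models). It combines them with both a plain `average` and a `geometric mean` of the probabilities, the two options listed again under *Averaging `models`* below.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noisy_predictor(noise_scale):
    """Toy 3-class predictor: the true distribution plus its own fixed perturbation."""
    offset = rng.normal(0, noise_scale, size=3)      # per-model disagreement

    def predict():
        true_probs = np.array([0.7, 0.2, 0.1])       # the "regularity" to capture
        logits = np.log(true_probs) + offset
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()                        # softmax back to probabilities

    return predict

models = [make_noisy_predictor(noise_scale=0.8) for _ in range(10)]
probs = np.stack([m() for m in models])               # shape: (n_models, n_classes)

arithmetic = probs.mean(axis=0)                       # average the probabilities
geometric = np.exp(np.log(probs).mean(axis=0))        # geometric mean ...
geometric /= geometric.sum()                          # ... renormalised

print("one model      :", np.round(probs[0], 3))
print("arithmetic mean:", np.round(arithmetic, 3))
print("geometric mean :", np.round(geometric, 3))
```

Since each predictor is perturbed independently, the combined estimate has a lower `variance` than a single one, which is exactly the effect described above.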
### Making Predictors Disagree[^anelli-generalization-5]

- Rely on the `algorithm` to optimize into a `local minimum`
- Use different `models` that may not be `NNs`
  - (e.g. a `Classification Tree` with a `NN`)
  - (e.g. change the `NN` architecture or `parameters`)
- Change their `training data` via [`boosting`](#boosting) or [`bagging`](#bagging)

#### Boosting

This technique is based on ***`training` many `low-capacity` `models` on ALL of the `training-set`***

> [!NOTE]
>
> - It worked very well on MNIST alongside `NNs`
> - It allows ***each `model` to excel on a small subset
>   of `points`***

#### Bagging

This technique is based on ***`training` some `models` on different `subsets` of the `training-set`***

> [!NOTE]
>
> - ***Too expensive for `NNs`*** in many cases
> - Mainly used to train many `decision-trees` for
>   `random-forests`

### Averaging `models`

- ***`average` the probabilities***
- ***take the `geometric mean`[^wikipedia-geometric] of the probabilities***
- ***use [`dropout`](#dropout)***

### Dropout[^dropout-paper]

`Dropout` is a technique ***where each `hidden-unit` has a probability of 0.5 of being
`omitted` at `training-time` (not considered during computation)***, while all the
sampled `NNs` share the same `weights`.

Whenever we `train` with this technique, we take a `NN` with a given architecture. Then we
***sample some `NNs` by omitting each `hidden-unit` with p = 0.5***. This is equivalent to
***`training` (potentially)*** $2^n$ ***`NNs` `sampled` from the original*** (where $n$ is
the number of `hidden-units`).

However, at `inference` and `test` time, ***we `average` the results by using only the
original `NN`, with each `weight` multiplied by p, thus 0.5***

Since the number of `sampled NNs` is very high, and very often ***higher than the number of
`datapoints`, each `NN` is usually `trained` on only 1 `datapoint`, if `trained` at all***.

[^anelli-generalization]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9
[^anelli-generalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 2
[^anelli-generalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3
[^anelli-generalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3
[^anelli-generalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 11 to 16
[^anelli-generalization-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 14 to 15
[^bayesian-nn]: [Bayesian Neural Networks | C.S. Toronto | 3rd May 2025](https://www.cs.toronto.edu/~duvenaud/distill_bayes_net/public/)
[^toronto-noise]: [C.S. Toronto | Lecture 9 pg. 16 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)
[^bayesian-toronto]: [C.S. Toronto | Lecture 9 pg. 19 to 27 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)
[^gaussian-noise-squared-variance]: [Scalar multiplying standard deviation | Math StackExchange | 3rd May 2025](https://math.stackexchange.com/a/2793257)
[^wikipedia-ensemble]: [Ensemble Averaging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Ensemble_averaging_(machine_learning))
[^wikipedia-boosting]: [Boosting | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Boosting_(machine_learning))
[^wikipedia-bagging]: [Bagging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Bootstrap_aggregating)
[^wikipedia-geometric]: [Geometric Mean | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Geometric_mean)
[^dropout-paper]: [Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov | 3rd May 2025](https://jmlr.csail.mit.edu/papers/volume15/srivastava14a.old/srivastava14a.pdf)
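Below is a minimal sketch of the `dropout` scheme described above, assuming a tiny two-layer network in plain `numpy` (the layer sizes, weights and `ReLU` activation are arbitrary choices for illustration): at `training-time` each `hidden-unit` is kept with p = 0.5, while at `test-time` the full network is used with the `weights` going out of the `hidden-units` scaled by p.

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5                                    # each hidden-unit is kept with p = 0.5

# Hypothetical layer sizes and weights, for illustration only.
W1 = rng.normal(0.0, 0.1, size=(4, 8))          # input  -> hidden
W2 = rng.normal(0.0, 0.1, size=(8, 3))          # hidden -> output
x = rng.normal(size=(1, 4))                     # a single datapoint

def forward_train(x):
    """Training time: sample one of the 2^n thinned networks."""
    h = np.maximum(0.0, x @ W1)                 # ReLU hidden activations
    mask = rng.random(h.shape) < p_keep         # Bernoulli(p) mask, one entry per unit
    return (h * mask) @ W2                      # omitted units contribute nothing

def forward_test(x):
    """Test time: keep every unit, but scale its outgoing weights by p."""
    h = np.maximum(0.0, x @ W1)
    return h @ (W2 * p_keep)                    # approximates averaging all thinned nets

print("train (one sampled NN):", forward_train(x))
print("test  (weight scaling):", forward_test(x))
```

Every call to `forward_train` samples a different thinned `NN`, but all of them reuse the same `W1` and `W2`, which is exactly the `weight` sharing mentioned above.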