# Improve Generalization[^anelli-generalization]

## Problems of Datasets[^anelli-generalization-1]

While `datasets` try to mimic reality as closely as
possible, we only deal with an ***approximation of the
real probability distribution***.

This means that our `samples` are affected by
`sampling errors`, thus ***our `model` needs to be
robust against these `errors`***.

## Preventing Overfitting[^anelli-generalization-2]

- Get more `data`
- Take a `model` that has a
  ***[`Capacity`](#capacity-of-an-nn) big enough
  to fit `regularities`, but not `spurious
  regularities`***
- Take the `average` of ***many `models`***
  - [`Ensemble Averaging`](#ensemble-averaging)[^wikipedia-ensemble]
  - [`Boosting`](#boosting)[^wikipedia-boosting]
  - [`Bagging`](#bagging)[^wikipedia-bagging]
- `Bayesian`[^bayesian-nn][^bayesian-toronto] approach: \
  Take the same `NN`, but use different `weights`

## Capacity of an `NN`[^anelli-generalization-3]

We have many methods to ***limit `capacity` in an `NN`***

### Architecture

- ***Limit the number of `layers` and the number of
  `units` per `layer`***

- Control `meta-parameters` using `Cross-Validation`
  (see the sketch below)
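
As a rough sketch of how `Cross-Validation` can pick a
`meta-parameter` such as the number of `units` per `layer`
(the `train_and_score` helper below is a hypothetical
placeholder that trains the `NN` and returns a validation
score, it is not part of the course material):

```python
import numpy as np

def k_fold_score(X, y, hidden_units, train_and_score, k=5, seed=0):
    """Average validation score of one meta-parameter value over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(hidden_units,
                                      X[train_idx], y[train_idx],
                                      X[val_idx], y[val_idx]))
    return float(np.mean(scores))

# Pick the capacity (units per layer) with the best averaged validation score:
# best = max([16, 32, 64, 128], key=lambda h: k_fold_score(X, y, h, train_and_score))
```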

### Early Stopping

Start with ***small `weights`*** and ***stop before
`overfitting`***

This is very useful when we don't want to `re-train` our
`model` to `fine-tune` `meta-parameters`.

> [!NOTE]
>
> Here the capacity is ***usually*** reduced because
> ***usually*** with ***small `weights`***
> [`activation functions`](./../3-Activation-Functions/INDEX.md) are in their ***`linear` range***.
>
> In simpler terms, whenever we are in a `linear range`,
> it's almost like all of these `layers` are squashed
> into a ***single big `linear layer`***, thus ***the
> capacity is reduced***
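
A minimal `early stopping` loop could look like the sketch
below (the `model`, `train_one_epoch` and `validation_loss`
objects are hypothetical placeholders, not part of the
course material):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop as soon as the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_weights = copy.deepcopy(model.weights)  # start from the small initial weights
    bad_epochs = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss < best_loss:
            best_loss = loss
            best_weights = copy.deepcopy(model.weights)
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                            # overfitting is starting: stop here
    model.weights = best_weights                 # roll back to the best checkpoint
    return model
```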

### Weight Decay

***Penalize large `weights` with `L2` or `L1`
penalties***, keeping them
***small*** unless they ***have a large `error`
`gradient`***
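
A minimal sketch of a single gradient step with `weight
decay` on plain `NumPy` arrays (the learning rate and
penalty strengths are made-up values):

```python
import numpy as np

def sgd_step_l2(w, grad, lr=0.01, l2=1e-4):
    """L2 weight decay: the penalty gradient is proportional to the weight itself,
    so large weights shrink unless their error gradient is large enough."""
    return w - lr * (grad + l2 * w)

def sgd_step_l1(w, grad, lr=0.01, l1=1e-4):
    """L1 penalty: the penalty gradient is the sign of the weight, pushing many
    weights all the way to exactly zero."""
    return w - lr * (grad + l1 * np.sign(w))
```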

### Adding Noise

***Add noise to `weights` or `activities`***.

Whenever we add `Gaussian noise` to the input, its
`variance` gets ***amplified by the squared
weights***[^gaussian-noise-squared-variance].

Thus the `noise`, by making our output `noisy` as well,
in a sense acts like a `regularizer`[^toronto-noise]
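
A quick `NumPy` check of this fact: `Gaussian noise` with
standard deviation $\sigma$ that passes through a weight
$w$ reaches the unit with variance $w^2 \sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
w, sigma = 3.0, 0.5
noise = rng.normal(0.0, sigma, size=1_000_000)

# The noise that reaches the unit is w * noise, so its variance is w**2 * sigma**2.
print(np.var(w * noise))   # ~2.25
print(w**2 * sigma**2)     # 2.25
```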

## Ensemble Averaging[^anelli-generalization-4]

Since in our `models` we ***need to trade `Bias` for
`Variance`***, we may prefer to have ***many `models`
with a low `bias` and a higher `variance`***.

Now, both the `bias` and the `variance` can be related to
the ***learning capacity*** of our `model`:

- ***higher `bias`, lower `capacity`***
- ***higher `variance`, higher `capacity`***

> [!WARNING]
>
> Remember that a ***high `capacity` makes the single
> `model` prone to `overfitting`***, however
> this is ***desired here***.

In this way, since we have ***many `models`, the
`variance` will be averaged out***, reducing it.

However, this method is ***more powerful when the
individual `models` disagree***

> [!WARNING]
>
> Some single `model` or `predictor` will be better than
> the ***combined one*** on some `data points`, but this
> is expected: the combined result is an average, thus
> lower than the best.
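
A toy `NumPy` sketch of the idea, assuming `unbiased`
predictors with `independent` noise: averaging 10 `models`
divides the `variance` of the combined prediction by
roughly 10, even though each single `model` still wins on
some `data points`:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_models, n_points = 10, 1_000

# Each model is unbiased but has a high variance (low bias / high capacity regime).
predictions = true_value + rng.normal(0.0, 1.0, size=(n_models, n_points))
ensemble = predictions.mean(axis=0)

print(np.var(predictions[0] - true_value))  # ~1.0  single-model error variance
print(np.var(ensemble - true_value))        # ~0.1  averaged over 10 independent models
```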

### Making Predictors Disagree[^anelli-generalization-5]

- Rely on the `algorithm` converging to a different
  `local minimum` on each run
- Use different `models` that may not be `NNs`
  - (e.g. a `Classification Tree` together with a `NN`)
  - (e.g. change the `NN` architecture or `parameters`)
- Change their `training data` via [`boosting`](#boosting)
  or [`bagging`](#bagging)

#### Boosting

This technique is based on ***`training` many
`low-capacity` `models` on the WHOLE `training-set`***,
re-weighting at each round the `points` that the previous
`models` got wrong.

> [!NOTE]
>
> - It worked very well on MNIST together with `NNs`
> - It allows each `model` to excel on a small subset
>   of `points`
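
A minimal `AdaBoost`-style sketch of this idea (the
`fit_weak_learner` trainer is a hypothetical placeholder
that fits a `low-capacity` `model` on weighted `points` and
returns its prediction function; labels are assumed to be
in $\{-1, +1\}$):

```python
import numpy as np

def boost(X, y, fit_weak_learner, n_rounds=10):
    """Each round re-weights the points that the previous models got wrong,
    so every low-capacity model ends up focusing on a small subset of hard points."""
    weights = np.full(len(X), 1.0 / len(X))
    learners, alphas = [], []
    for _ in range(n_rounds):
        predict = fit_weak_learner(X, y, weights)  # hypothetical weak-learner trainer
        pred = predict(X)
        err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)      # better learners get a bigger vote
        weights *= np.exp(-alpha * y * pred)       # boost the misclassified points
        weights /= weights.sum()
        learners.append(predict)
        alphas.append(alpha)
    # The ensemble predicts the sign of the weighted vote of all rounds.
    return lambda X_new: np.sign(sum(a * m(X_new) for a, m in zip(alphas, learners)))
```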

#### Bagging

This technique is based on ***`training` some `models` on
different `subsets` of the `training-set`***, each one
drawn with replacement.

> [!NOTE]
>
> - ***Too expensive for `NNs`*** in many cases
> - Mainly used to train many `decision-trees` for
>   `random-forests`
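
A minimal `bagging` sketch (the `fit_model` trainer is a
hypothetical placeholder that returns a prediction
function):

```python
import numpy as np

def bag(X, y, fit_model, n_models=10, seed=0):
    """Train each model on a bootstrap sample (drawn with replacement) of the training set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # same size, sampled with replacement
        models.append(fit_model(X[idx], y[idx]))    # hypothetical trainer, returns a predictor
    return lambda X_new: np.mean([m(X_new) for m in models], axis=0)
```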

### Averaging `models`

- ***`average` the probabilities***
- ***take the `geometric mean`[^wikipedia-geometric]
  of the probabilities***
- ***[`dropout`](#dropout)***
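
A quick comparison of the first two averaging schemes on
made-up class probabilities from three `models` (the
numbers are only illustrative):

```python
import numpy as np

# Predicted class probabilities from three models for the same input.
p = np.array([[0.7, 0.3],
              [0.6, 0.4],
              [0.9, 0.1]])

arithmetic = p.mean(axis=0)                     # ≈ [0.733, 0.267]
geometric = np.prod(p, axis=0) ** (1 / len(p))  # geometric mean per class...
geometric /= geometric.sum()                    # ...renormalized: ≈ [0.760, 0.240]

print(arithmetic, geometric)  # the geometric mean punishes near-zero probabilities harder
```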

### Dropout[^dropout-paper]

<!-- TODO: Finish understanding this -->

`Dropout` is a technique ***where each `hidden-unit`
has a probability of 0.5 of being `omitted`
at `training-time`
(not considered during computation)***; however, all the
`sampled NNs` share the same `weights`.

Whenever we `train` with this technique, we take an
`NN` with a given architecture. Then we ***sample
some `NNs` by omitting each `hidden-unit` with
p = 0.5***.

This is equivalent to ***`training` (potentially)*** $2^n$
***`NNs` `sampled` from the original***. However, at
`inference` and `test` time, ***we `average` the
results by multiplying each `weight` by p, thus 0.5,
using only the original `NN`***

Since the number of `sampled NNs` is very high, and very
often ***higher than the number of `datapoints`, each `NN`
is usually `trained` on only 1 `datapoint`, if `trained`
at all***.
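
A minimal sketch of one `hidden-layer` with `dropout` (the
`ReLU` `activation` here is only an illustrative choice):
at `training-time` each `unit` is omitted with p = 0.5,
while at `test-time` all the `units` stay active and their
outputs are scaled by the keep probability, which is
equivalent to multiplying the outgoing `weights` by p:

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5

def hidden_layer(x, W, b, train=True):
    """One hidden layer with dropout."""
    h = np.maximum(0.0, x @ W + b)            # ReLU activations
    if train:
        keep = rng.random(h.shape) >= p_drop  # sample one of the 2**n sub-networks
        return h * keep
    return h * (1.0 - p_drop)                 # test time: scale instead of sampling
```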

<!-- TODO: Add PDF 9 pg 20 to 23 -->

<!-- Footnotes -->

[^anelli-generalization]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9

[^anelli-generalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 2

[^anelli-generalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3

[^anelli-generalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3

[^anelli-generalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 11 to 16

[^anelli-generalization-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 14 - 15

[^bayesian-nn]: [Bayesian Neural Networks | C.S. Toronto | 3rd May 2025](https://www.cs.toronto.edu/~duvenaud/distill_bayes_net/public/)

[^toronto-noise]: [C.S. Toronto | Lecture 9 pg. 16 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)

[^bayesian-toronto]: [C.S. Toronto | Lecture 9 pg. 19 to 27 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)

[^gaussian-noise-squared-variance]: [Scalar multiplying standard deviation | Math StackExchange | 3rd May 2025](https://math.stackexchange.com/a/2793257)

[^wikipedia-ensemble]: [Ensemble Averaging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Ensemble_averaging_(machine_learning))

[^wikipedia-boosting]: [Boosting | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Boosting_(machine_learning))

[^wikipedia-bagging]: [Bagging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Bootstrap_aggregating)

[^wikipedia-geometric]: [Geometric Mean | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Geometric_mean)

[^dropout-paper]: [Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov | 3rd May 2025](https://jmlr.csail.mit.edu/papers/volume15/srivastava14a.old/srivastava14a.pdf)