# Improve Generalization[^anelli-generalization]

## Problems of Datasets[^anelli-generalization-1]

While `datasets` try to mimic reality as closely as
possible, we only deal with an ***approximation of the
real probability distribution***.

This means that our `samples` are affected by
`sampling errors`, thus ***our `model` needs to be
robust against these `errors`***.

## Preventing Overfitting[^anelli-generalization-2]

- Get more `data`
- Take a `model` that has a
  ***[`Capacity`](#capacity-of-a-nn) big enough
  to fit `regularities`, but not `spurious
  regularities`***
- Take the `average` of ***many `models`***
  - [`Ensemble Averaging`](#ensemble-averaging)[^wikipedia-ensemble]
  - [`Boosting`](#boosting)[^wikipedia-boosting]
  - [`Bagging`](#bagging)[^wikipedia-bagging]
- `Bayesian`[^bayesian-nn][^bayesian-toronto] approach: \
  Take the same `NN`, but use different `weights`

## Capacity of a `NN`[^anelli-generalization-3]

We have many methods to ***limit `capacity` in a `NN`***

### Architecture

- ***limit the number of `layers` and the number of
  `units` per `layer`***

- Control `meta-parameters` using `Cross-Validation`
  (see the sketch below)

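A minimal sketch of this, assuming `scikit-learn` is available; the dataset, the candidate `layer` sizes, and the number of folds are invented for illustration:

```python
# Hypothetical sketch: choose the hidden-layer size (a capacity
# meta-parameter) by 5-fold cross-validation rather than a single split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for hidden in [(4,), (16,), (64,), (64, 64)]:
    model = MLPClassifier(hidden_layer_sizes=hidden, max_iter=1000, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)  # validation accuracy on each fold
    print(hidden, scores.mean())  # prefer the smallest capacity that still scores well
```
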
### Early Stopping

Start with ***small `weights`*** and ***stop before
`overfitting`***.

This is very useful when we don't want to `re-train` our
`model` to `fine-tune` `meta-parameters`.

> [!NOTE]
>
> Here the capacity is ***usually*** reduced because with
> ***small `weights`*** the
> [`activation functions`](./../3-Activation-Functions/INDEX.md)
> are ***usually*** in their ***`linear` range***.
>
> In simpler terms, whenever we are in a `linear range`,
> it's almost like all of these `layers` are squashed
> into a ***single big `linear layer`***, thus ***the
> capacity is reduced***.

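A minimal sketch of early stopping, assuming `PyTorch`; the model, the fake data, and the patience value are invented:

```python
# Hypothetical sketch: stop training as soon as the validation loss
# stops improving for `patience` consecutive epochs.
import copy
import torch
from torch import nn

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x_tr, y_tr = torch.randn(256, 20), torch.randn(256, 1)  # fake training data
x_va, y_va = torch.randn(64, 20), torch.randn(64, 1)    # fake validation data

best_loss, best_state, patience, bad_epochs = float("inf"), None, 5, 0
for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x_tr), y_tr)
    loss.backward()
    opt.step()

    with torch.no_grad():
        val_loss = nn.functional.mse_loss(model(x_va), y_va).item()
    if val_loss < best_loss:
        best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # overfitting is starting: stop here
            break

model.load_state_dict(best_state)  # keep the weights from the best epoch
```
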
### Weight Decay

***Penalize large `weights` with an `L2` or `L1`
penalty***, so that they stay ***small*** unless they
***produce a large error gradient***.

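A minimal sketch (assuming `PyTorch`) of an explicit `L2` penalty; the model, the fake data, and the decay strength are invented:

```python
# Hypothetical sketch: add an explicit L2 penalty on the weights to the loss,
# so large weights are only kept if they really reduce the data error.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(128, 20), torch.randn(128, 1)  # fake data
lam = 1e-3                                        # decay strength (hyper-parameter)

for step in range(100):
    opt.zero_grad()
    data_loss = nn.functional.mse_loss(model(x), y)
    l2 = sum((w ** 2).sum() for w in model.parameters())
    (data_loss + lam * l2).backward()             # gradient pushes weights toward 0
    opt.step()

# PyTorch can also apply a similar penalty for you through the optimizer's
# `weight_decay` argument, e.g. torch.optim.SGD(..., weight_decay=1e-3).
```
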
### Adding Noise

***Add noise to `weights` or `activities`***.

Whenever we add `Gaussian noise`, its `variance` gets
***scaled by the squared weights*** as it passes through
the `network`[^gaussian-noise-squared-variance].

Thus the `noise`, by making our result `noisy` as well,
in a sense acts like a `regularizer`[^toronto-noise].

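A small numeric check, in plain `NumPy` with invented numbers, of the footnoted claim that a `weight` scales the `noise` `variance` by its square:

```python
# Hypothetical sketch: Gaussian noise added to a unit and passed through a
# weight w contributes variance w**2 * sigma**2 downstream.
import numpy as np

rng = np.random.default_rng(0)
w, sigma = 3.0, 0.5
noise = rng.normal(0.0, sigma, size=1_000_000)

print(np.var(w * noise))    # ~ 2.25
print(w ** 2 * sigma ** 2)  # 2.25 = w**2 * sigma**2
```
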
## Ensemble Averaging[^anelli-generalization-4]

Since in our `models` we ***need to trade `Bias` for
`Variance`***, we may prefer to have ***more `models`
with a low `bias` and a higher `variance`***.

Now, both the `bias` and the `variance` can be related
to the ***learning capacity*** of our `model`:

- ***higher `bias`, lower `capacity`***
- ***higher `variance`, higher `capacity`***

> [!WARNING]
>
> Remember that a ***high `capacity` makes the single
> `model` prone to `overfitting`***; however, this is
> ***desired here***.

In this way, since we have ***many `models`, the
`variance` will be averaged out***, reducing it.

However, this method is ***more powerful when individual
`models` disagree***.

> [!WARNING]
>
> Some single `model` or `predictor` will be better than
> the ***combined one*** on some `data points`, but this
> is expected: the combination is an `average` result,
> thus lower than the best.

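A quick `NumPy` check (invented numbers) of the claim that `averaging` many `predictors` reduces the `variance`, provided their errors are independent enough:

```python
# Hypothetical sketch: average K independent predictors whose errors have
# variance sigma**2; the variance of the average shrinks toward sigma**2 / K.
import numpy as np

rng = np.random.default_rng(0)
K, sigma, trials = 10, 1.0, 100_000

single = rng.normal(0.0, sigma, size=trials)                  # error of one predictor
averaged = rng.normal(0.0, sigma, size=(trials, K)).mean(axis=1)

print(np.var(single))    # ~ 1.0
print(np.var(averaged))  # ~ 0.1, i.e. sigma**2 / K (only if the errors are independent)
```
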
### Making Predictors Disagree[^anelli-generalization-5]

- Rely on the `algorithm` to settle in a different
  `local minimum`
- Use different `models` that may not be `NNs`
  - (e.g. a `Classification Tree` alongside a `NN`)
  - (e.g. change the `NN` architecture or `parameters`)
- Change their `training data` via [`boosting`](#boosting)
  or [`bagging`](#bagging)

#### Boosting

This technique is based on ***`training` many
`low-capacity` `models` on ALL of the `training-set`***.

> [!NOTE]
>
> - It worked very well on MNIST alongside `NNs`
> - Allows the ***same `model` to excel in a small subset
>   of `points`***

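A minimal sketch using `scikit-learn`'s `AdaBoostClassifier` (one classical form of `boosting`, not necessarily the variant from the slides); the dataset and parameters are invented:

```python
# Hypothetical sketch: boosting trains many low-capacity models (here depth-1
# decision stumps) on the FULL training set, re-weighting the points that the
# previous models got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

booster = AdaBoostClassifier(n_estimators=100, random_state=0)  # stumps by default
booster.fit(X_tr, y_tr)
print(booster.score(X_te, y_te))  # accuracy of the weighted ensemble
```
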
#### Bagging

This technique is based on ***`training` some `models` on
different `subsets` of the `training-set`***.

> [!NOTE]
>
> - ***Too expensive for `NNs`*** in many cases
> - Mainly used to train many `decision-trees` for
>   `random-forests`

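A minimal sketch of the idea with a plain `NumPy` bootstrap and `scikit-learn` `decision-trees`; everything here is illustrative:

```python
# Hypothetical sketch: bagging trains each model on a bootstrap sample
# (drawn with replacement) of the training set, then averages the votes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap subset
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.mean([t.predict(X) for t in trees], axis=0)   # average the 0/1 votes
y_hat = (votes > 0.5).astype(int)                        # majority vote
print((y_hat == y).mean())                               # accuracy of the bag
```
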
### Averaging `models`

- ***`average` probabilities***
- ***`geometric mean`[^wikipedia-geometric] of
  probabilities*** (see the sketch below)
- ***[`dropout`](#dropout)***

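A small sketch in plain `NumPy`, with invented probabilities, of the first two combination rules:

```python
# Hypothetical sketch: combine the class probabilities of three models either
# by their arithmetic mean or by their (renormalized) geometric mean.
import numpy as np

# rows = models, columns = classes
p = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.3, 0.2],
              [0.9, 0.05, 0.05]])

arithmetic = p.mean(axis=0)
geometric = np.exp(np.log(p).mean(axis=0))
geometric /= geometric.sum()  # renormalize so it is a distribution again

print(arithmetic)  # ~ [0.70, 0.183, 0.117]
print(geometric)   # pulled down for any class that some model rates as unlikely
```
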
### Dropout[^dropout-paper]

<!-- TODO: Finish understanding this -->

`Dropout` is a technique ***where each `hidden-unit`
has a probability of 0.5 of being `omitted`
at `training-time`
(not considered during the computation)***; however, all
the `sampled NNs` share the same `weights`.

Whenever we `train` with this technique, we take a
`NN` with a fixed architecture. Then we ***sample
some `NNs` by omitting each `hidden-unit` with
probability p = 0.5***.

This is equivalent to ***`training` (potentially)*** $2^n$
***`NNs` `sampled` from the original***. At `inference`
and `test` time, ***we approximate the `average` of their
results by multiplying each `weight` by p, thus 0.5,
and using only the original `NN`***.

Since the number of `sampled NNs` is very high, and very
often ***higher than the number of `datapoints`, each
sampled `NN` is usually `trained` on only 1 `datapoint`,
if it is `trained` at all***.

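A minimal sketch of this behaviour in plain `NumPy` (64 `hidden-units`, so $2^{64}$ possible `sampled NNs`; all sizes are invented):

```python
# Hypothetical sketch of "classic" dropout on one hidden layer:
# at training time each hidden unit is kept with probability p (here 0.5),
# at test time no units are dropped and the outgoing weights are scaled by p.
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                # probability of KEEPING a hidden unit
W1 = rng.normal(size=(20, 64))         # input  -> hidden weights
W2 = rng.normal(size=(64, 1))          # hidden -> output weights
x = rng.normal(size=(8, 20))           # a fake mini-batch

def forward(x, train):
    h = np.maximum(x @ W1, 0.0)        # hidden activations (ReLU)
    if train:
        mask = rng.random(h.shape) < p # sample one "thinned" network
        return (h * mask) @ W2
    return h @ (W2 * p)                # averages the thinned networks' outputs

y_train = forward(x, train=True)       # a different sub-network every call
y_test = forward(x, train=False)       # single pass with scaled weights
```
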
<!-- TODO: Add PDF 9 pg 20 to 23 -->
<!-- Footnotes -->

[^anelli-generalization]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9

[^anelli-generalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 2

[^anelli-generalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3

[^anelli-generalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3

[^anelli-generalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 11 to 16

[^anelli-generalization-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 14 - 15

[^bayesian-nn]: [Bayesian Neural Networks | C.S. Toronto | 3rd May 2025](https://www.cs.toronto.edu/~duvenaud/distill_bayes_net/public/)

[^toronto-noise]: [C.S. Toronto | Lecture 9 pg. 16 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)

[^bayesian-toronto]: [C.S. Toronto | Lecture 9 pg. 19 to 27 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)

[^gaussian-noise-squared-variance]: [Scalar multiplying standard deviation | Math StackExchange | 3rd May 2025](https://math.stackexchange.com/a/2793257)

[^wikipedia-ensemble]: [Ensemble Averaging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Ensemble_averaging_(machine_learning))

[^wikipedia-boosting]: [Boosting | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Boosting_(machine_learning))

[^wikipedia-bagging]: [Bagging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Bootstrap_aggregating)

[^wikipedia-geometric]: [Geometric Mean | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Geometric_mean)

[^dropout-paper]: [Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov | 3rd May 2025](https://jmlr.csail.mit.edu/papers/volume15/srivastava14a.old/srivastava14a.pdf)