# Improve Generalization[^anelli-generalization]

## Problems of Datasets[^anelli-generalization-1]

While `datasets` try to mimic reality as closely as
possible, we only deal with an ***approximation of the
real probability distribution***.

This means that our `samples` are affected by
`sampling errors`, thus ***our `model` needs to be
robust against these `errors`***.

## Preventing Overfitting[^anelli-generalization-2]

- Get more `data`
- Take a `model` that has a
  ***[`Capacity`](#capacity-of-an-nn) big enough
  to fit `regularities`, but not `spurious
  regularities`***
- Take the `average` of ***many `models`***
  - [`Ensemble Averaging`](#ensemble-averaging)[^wikipedia-ensemble]
  - [`Boosting`](#boosting)[^wikipedia-boosting]
  - [`Bagging`](#bagging)[^wikipedia-bagging]
- `Bayesian`[^bayesian-nn][^bayesian-toronto] approach: \
  Take the same `NN`, but use different `weights`

## Capacity of an `NN`[^anelli-generalization-3]

We have many methods to ***limit `capacity` in an `NN`***

### Architecture

- ***Limit the number of `layers` and the number of
  `units` per `layer`***

- Control `meta-parameters` using `Cross-Validation`
  (see the sketch below)
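
As a rough sketch of how `Cross-Validation` can pick a
`meta-parameter` such as the number of `units` per `layer`
(the `train_and_score` helper below is a hypothetical
placeholder that trains the `NN` and returns a validation
score, it is not part of the course material):

```python
import numpy as np

def k_fold_score(X, y, hidden_units, train_and_score, k=5, seed=0):
    """Average validation score of one meta-parameter value over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(hidden_units,
                                      X[train_idx], y[train_idx],
                                      X[val_idx], y[val_idx]))
    return float(np.mean(scores))

# Pick the capacity (units per layer) with the best averaged validation score:
# best = max([16, 32, 64, 128], key=lambda h: k_fold_score(X, y, h, train_and_score))
```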

### Early Stopping

Start with ***small `weights`*** and ***stop before
`overfitting`***

This is very useful when we don't want to `re-train` our
`model` to `fine-tune` `meta-parameters`.

> [!NOTE]
>
> Here the capacity is ***usually*** reduced because
> ***usually*** with ***small `weights`***
> [`activation functions`](./../3-Activation-Functions/INDEX.md) are in their ***`linear` range***.
>
> In simpler terms, whenever we are in a `linear range`,
> it's almost like all of these `layers` are squashed
> into a ***single big `linear layer`***, thus ***the
> capacity is reduced***
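
A minimal `early stopping` loop could look like the sketch
below (the `model`, `train_one_epoch` and `validation_loss`
objects are hypothetical placeholders, not part of the
course material):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop as soon as the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_weights = copy.deepcopy(model.weights)  # start from the small initial weights
    bad_epochs = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss < best_loss:
            best_loss = loss
            best_weights = copy.deepcopy(model.weights)
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                            # overfitting is starting: stop here
    model.weights = best_weights                 # roll back to the best checkpoint
    return model
```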

### Weight Decay

***Penalize large `weights` with `L2` or `L1`
penalties***, keeping them
***small*** unless they ***have a large `error`
`gradient`***
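
A minimal sketch of a single gradient step with `weight
decay` on plain `NumPy` arrays (the learning rate and
penalty strengths are made-up values):

```python
import numpy as np

def sgd_step_l2(w, grad, lr=0.01, l2=1e-4):
    """L2 weight decay: the penalty gradient is proportional to the weight itself,
    so large weights shrink unless their error gradient is large enough."""
    return w - lr * (grad + l2 * w)

def sgd_step_l1(w, grad, lr=0.01, l1=1e-4):
    """L1 penalty: the penalty gradient is the sign of the weight, pushing many
    weights all the way to exactly zero."""
    return w - lr * (grad + l1 * np.sign(w))
```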

### Adding Noise

***Add noise to `weights` or `activities`***.

Whenever we add `Gaussian noise` to the input, its
`variance` gets ***amplified by the squared
weights***[^gaussian-noise-squared-variance].

Thus the `noise`, by making our output `noisy` as well,
in a sense acts like a `regularizer`[^toronto-noise]
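
A quick `NumPy` check of this fact: `Gaussian noise` with
standard deviation $\sigma$ that passes through a weight
$w$ reaches the unit with variance $w^2 \sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
w, sigma = 3.0, 0.5
noise = rng.normal(0.0, sigma, size=1_000_000)

# The noise that reaches the unit is w * noise, so its variance is w**2 * sigma**2.
print(np.var(w * noise))   # ~2.25
print(w**2 * sigma**2)     # 2.25
```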

## Ensemble Averaging[^anelli-generalization-4]

Since in our `models` we ***need to trade `Bias` for
`Variance`***, we may prefer to have ***many `models`
with a low `bias` and a higher `variance`***.

Now, both the `bias` and the `variance` can be related to
the ***learning capacity*** of our `model`:

- ***higher `bias`, lower `capacity`***
- ***higher `variance`, higher `capacity`***

> [!WARNING]
>
> Remember that a ***high `capacity` makes the single
> `model` prone to `overfitting`***, however
> this is ***desired here***.

In this way, since we have ***many `models`, the
`variance` will be averaged out***, reducing it.

However, this method is ***more powerful when the
individual `models` disagree***

> [!WARNING]
>
> Some single `model` or `predictor` will be better than
> the ***combined one*** on some `data points`, but this
> is expected: the combined result is an average, thus
> lower than the best.
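
A toy `NumPy` sketch of the idea, assuming `unbiased`
predictors with `independent` noise: averaging 10 `models`
divides the `variance` of the combined prediction by
roughly 10, even though each single `model` still wins on
some `data points`:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_models, n_points = 10, 1_000

# Each model is unbiased but has a high variance (low bias / high capacity regime).
predictions = true_value + rng.normal(0.0, 1.0, size=(n_models, n_points))
ensemble = predictions.mean(axis=0)

print(np.var(predictions[0] - true_value))  # ~1.0  single-model error variance
print(np.var(ensemble - true_value))        # ~0.1  averaged over 10 independent models
```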

### Making Predictors Disagree[^anelli-generalization-5]

- Rely on the `algorithm` converging to a different
  `local minimum` on each run
- Use different `models` that may not be `NNs`
  - (e.g. a `Classification Tree` together with a `NN`)
  - (e.g. change the `NN` architecture or `parameters`)
- Change their `training data` via [`boosting`](#boosting)
  or [`bagging`](#bagging)

#### Boosting

This technique is based on ***`training` many
`low-capacity` `models` on the WHOLE `training-set`***,
re-weighting at each round the `points` that the previous
`models` got wrong.

> [!NOTE]
>
> - It worked very well on MNIST together with `NNs`
> - It allows each `model` to excel on a small subset
>   of `points`
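
A minimal `AdaBoost`-style sketch of this idea (the
`fit_weak_learner` trainer is a hypothetical placeholder
that fits a `low-capacity` `model` on weighted `points` and
returns its prediction function; labels are assumed to be
in $\{-1, +1\}$):

```python
import numpy as np

def boost(X, y, fit_weak_learner, n_rounds=10):
    """Each round re-weights the points that the previous models got wrong,
    so every low-capacity model ends up focusing on a small subset of hard points."""
    weights = np.full(len(X), 1.0 / len(X))
    learners, alphas = [], []
    for _ in range(n_rounds):
        predict = fit_weak_learner(X, y, weights)  # hypothetical weak-learner trainer
        pred = predict(X)
        err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)      # better learners get a bigger vote
        weights *= np.exp(-alpha * y * pred)       # boost the misclassified points
        weights /= weights.sum()
        learners.append(predict)
        alphas.append(alpha)
    # The ensemble predicts the sign of the weighted vote of all rounds.
    return lambda X_new: np.sign(sum(a * m(X_new) for a, m in zip(alphas, learners)))
```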

#### Bagging

This technique is based on ***`training` some `models` on
different `subsets` of the `training-set`***, each one
drawn with replacement.

> [!NOTE]
>
> - ***Too expensive for `NNs`*** in many cases
> - Mainly used to train many `decision-trees` for
>   `random-forests`
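
A minimal `bagging` sketch (the `fit_model` trainer is a
hypothetical placeholder that returns a prediction
function):

```python
import numpy as np

def bag(X, y, fit_model, n_models=10, seed=0):
    """Train each model on a bootstrap sample (drawn with replacement) of the training set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # same size, sampled with replacement
        models.append(fit_model(X[idx], y[idx]))    # hypothetical trainer, returns a predictor
    return lambda X_new: np.mean([m(X_new) for m in models], axis=0)
```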

### Averaging `models`

- ***`average` the probabilities***
- ***take the `geometric mean`[^wikipedia-geometric]
  of the probabilities***
- ***[`dropout`](#dropout)***
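
A quick comparison of the first two averaging schemes on
made-up class probabilities from three `models` (the
numbers are only illustrative):

```python
import numpy as np

# Predicted class probabilities from three models for the same input.
p = np.array([[0.7, 0.3],
              [0.6, 0.4],
              [0.9, 0.1]])

arithmetic = p.mean(axis=0)                     # ≈ [0.733, 0.267]
geometric = np.prod(p, axis=0) ** (1 / len(p))  # geometric mean per class...
geometric /= geometric.sum()                    # ...renormalized: ≈ [0.760, 0.240]

print(arithmetic, geometric)  # the geometric mean punishes near-zero probabilities harder
```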

### Dropout[^dropout-paper]

<!-- TODO: Finish understanding this -->

`Dropout` is a technique ***where each `hidden-unit`
has a probability of 0.5 of being `omitted`
at `training-time`
(not considered during computation)***; however, all the
`sampled NNs` share the same `weights`.

Whenever we `train` with this technique, we take an
`NN` with a given architecture. Then we ***sample
some `NNs` by omitting each `hidden-unit` with
p = 0.5***.

This is equivalent to ***`training` (potentially)*** $2^n$
***`NNs` `sampled` from the original***. However, at
`inference` and `test` time, ***we `average` the
results by multiplying each `weight` by p, thus 0.5,
using only the original `NN`***

Since the number of `sampled NNs` is very high, and very
often ***higher than the number of `datapoints`, each `NN`
is usually `trained` on only 1 `datapoint`, if `trained`
at all***.
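
A minimal sketch of one `hidden-layer` with `dropout` (the
`ReLU` `activation` here is only an illustrative choice):
at `training-time` each `unit` is omitted with p = 0.5,
while at `test-time` all the `units` stay active and their
outputs are scaled by the keep probability, which is
equivalent to multiplying the outgoing `weights` by p:

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5

def hidden_layer(x, W, b, train=True):
    """One hidden layer with dropout."""
    h = np.maximum(0.0, x @ W + b)            # ReLU activations
    if train:
        keep = rng.random(h.shape) >= p_drop  # sample one of the 2**n sub-networks
        return h * keep
    return h * (1.0 - p_drop)                 # test time: scale instead of sampling
```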

<!-- TODO: Add PDF 9 pg 20 to 23 -->

<!-- Footnotes -->

[^anelli-generalization]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9

[^anelli-generalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 2

[^anelli-generalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3

[^anelli-generalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3

[^anelli-generalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 11 to 16

[^anelli-generalization-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 14 - 15

[^bayesian-nn]: [Bayesian Neural Networks | C.S. Toronto | 3rd May 2025](https://www.cs.toronto.edu/~duvenaud/distill_bayes_net/public/)

[^toronto-noise]: [C.S. Toronto | Lecture 9 pg. 16 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)

[^bayesian-toronto]: [C.S. Toronto | Lecture 9 pg. 19 to 27 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)

[^gaussian-noise-squared-variance]: [Scalar multiplying standard deviation | Math StackExchange | 3rd May 2025](https://math.stackexchange.com/a/2793257)

[^wikipedia-ensemble]: [Ensemble Averaging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Ensemble_averaging_(machine_learning))

[^wikipedia-boosting]: [Boosting | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Boosting_(machine_learning))

[^wikipedia-bagging]: [Bagging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Bootstrap_aggregating)

[^wikipedia-geometric]: [Geometric Mean | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Geometric_mean)

[^dropout-paper]: [Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov | 3rd May 2025](https://jmlr.csail.mit.edu/papers/volume15/srivastava14a.old/srivastava14a.pdf)