# Improve Generalization[^anelli-generalization]

## Problems of Datasets[^anelli-generalization-1]

While `datasets` try to mimic reality as closely as possible, we only deal with an
***approximation of the real probability distribution***.

This means that our `samples` are affected by `sampling errors`, thus
***our `model` needs to be robust against these `errors`***

## Preventing Overfitting[^anelli-generalization-2]

- Get more `data`
- Take a `model` that has a ***[`Capacity`](#capacity-of-a-nn) big enough to fit
  `regularities`, but not `spurious regularities`***
- Take the `average` of ***many `models`***
  - [`Ensemble Averaging`](#ensemble-averaging)[^wikipedia-ensemble]
  - [`Boosting`](#boosting)[^wikipedia-boosting]
  - [`Bagging`](#bagging)[^wikipedia-bagging]
- `Bayesian`[^bayesian-nn][^bayesian-toronto] approach: \
  Take the same `NN`, but use different `weights`

## Capacity of a `NN`[^anelli-generalization-3]

We have many methods to ***limit the `capacity` of a `NN`***

### Architecture

- ***Limit the number of `layers` and the number of `units` per `layer`***
- Control `meta-parameters` using `Cross-Validation`

### Early Stopping

Start with ***small `weights`*** and ***stop before `overfitting`***

This is very useful when we don't want to `re-train` our `model` to `fine-tune` `meta-parameters`.

> [!NOTE]
>
> Here the capacity is ***usually*** reduced because
> ***usually*** with ***small `weights`***
> [`activation functions`](./../3-Activation-Functions/INDEX.md) are in their ***`linear` range***.
>
> In simpler terms, whenever we are in a `linear range`,
> it's almost like all of these `layers` are squashed
> into a ***single big `linear layer`***, thus ***the
> capacity is reduced***

### Weight Decay

***Penalize large `weights` with `L2` or `L1` penalties***, keeping them ***small***
unless they ***have a large error gradient***

### Adding Noise

***Add noise to `weights` or `activities`***.

Whenever we add `Gaussian noise`, its `variance` is ***amplified by the squared
`weights`***[^gaussian-noise-squared-variance]. Thus the `noise`, by making our result
`noisy` as well, in a sense acts like a `regularizer`[^toronto-noise]

## Ensemble Averaging[^anelli-generalization-4]

Since in our `models` we ***need to trade `Bias` for `Variance`***, we may prefer to have
***many `models` with a low `bias` and a higher `variance`***.

Now, both `bias` and `variance` can be related to the ***learning capacity*** of our `model`:

- ***higher `bias`, lower `capacity`***
- ***higher `variance`, higher `capacity`***

> [!WARNING]
>
> Remember that a ***high `capacity` makes a single
> `model` prone to `overfitting`***, however
> this is ***desired here***.

In this way, since we have ***many `models`, the `variance` will be averaged out***,
reducing it. However, this method is ***more powerful when the individual `models` disagree***

> [!WARNING]
>
> Some single `model` or `predictor` among all of them will be
> better than the ***combined one*** on some
> `data points`, but this is expected: the combination is an
> average result, thus lower than the best.
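To make the averaging concrete, here is a minimal sketch, assuming a handful of toy 3-class predictors that each output probabilities perturbed in their own way (the predictors, class counts and noise scale are made up for illustration, not the lecture's models). It combines them with both a plain `average` and a `geometric mean` of the probabilities, the two options listed again under *Averaging `models`* below.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noisy_predictor(noise_scale):
    """Toy 3-class predictor: the true distribution plus its own fixed perturbation."""
    offset = rng.normal(0, noise_scale, size=3)      # per-model disagreement

    def predict():
        true_probs = np.array([0.7, 0.2, 0.1])       # the "regularity" to capture
        logits = np.log(true_probs) + offset
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()                        # softmax back to probabilities

    return predict

models = [make_noisy_predictor(noise_scale=0.8) for _ in range(10)]
probs = np.stack([m() for m in models])               # shape: (n_models, n_classes)

arithmetic = probs.mean(axis=0)                       # average the probabilities
geometric = np.exp(np.log(probs).mean(axis=0))        # geometric mean ...
geometric /= geometric.sum()                          # ... renormalised

print("one model      :", np.round(probs[0], 3))
print("arithmetic mean:", np.round(arithmetic, 3))
print("geometric mean :", np.round(geometric, 3))
```

Since each predictor is perturbed independently, the combined estimate has a lower `variance` than a single one, which is exactly the effect described above.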
### Making Predictors Disagree[^anelli-generalization-5]

- Rely on the `algorithm` to optimize into a `local minimum`
- Use different `models` that may not be `NNs`
  - (e.g. a `Classification Tree` with a `NN`)
  - (e.g. change the `NN` architecture or `parameters`)
- Change their `training data` via [`boosting`](#boosting) or [`bagging`](#bagging)

#### Boosting

This technique is based on ***`training` many `low-capacity` `models` on ALL of the `training-set`***

> [!NOTE]
>
> - It worked very well on MNIST alongside `NNs`
> - It allows ***each `model` to excel on a small subset
>   of `points`***

#### Bagging

This technique is based on ***`training` some `models` on different `subsets` of the `training-set`***

> [!NOTE]
>
> - ***Too expensive for `NNs`*** in many cases
> - Mainly used to train many `decision-trees` for
>   `random-forests`

### Averaging `models`

- ***`average` the probabilities***
- ***take the `geometric mean`[^wikipedia-geometric] of the probabilities***
- ***use [`dropout`](#dropout)***

### Dropout[^dropout-paper]

`Dropout` is a technique ***where each `hidden-unit` has a probability of 0.5 of being
`omitted` at `training-time` (not considered during computation)***, while all the
sampled `NNs` share the same `weights`.

Whenever we `train` with this technique, we take a `NN` with a given architecture. Then we
***sample some `NNs` by omitting each `hidden-unit` with p = 0.5***. This is equivalent to
***`training` (potentially)*** $2^n$ ***`NNs` `sampled` from the original*** (where $n$ is
the number of `hidden-units`).

However, at `inference` and `test` time, ***we `average` the results by using only the
original `NN`, with each `weight` multiplied by p, thus 0.5***

Since the number of `sampled NNs` is very high, and very often ***higher than the number of
`datapoints`, each `NN` is usually `trained` on only 1 `datapoint`, if `trained` at all***.

[^anelli-generalization]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9
[^anelli-generalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 2
[^anelli-generalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3
[^anelli-generalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3
[^anelli-generalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 11 to 16
[^anelli-generalization-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 14 to 15
[^bayesian-nn]: [Bayesian Neural Networks | C.S. Toronto | 3rd May 2025](https://www.cs.toronto.edu/~duvenaud/distill_bayes_net/public/)
[^toronto-noise]: [C.S. Toronto | Lecture 9 pg. 16 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)
[^bayesian-toronto]: [C.S. Toronto | Lecture 9 pg. 19 to 27 | 3rd May 2025](https://www.cs.toronto.edu/~hinton/coursera/lecture9/lec9.pdf)
[^gaussian-noise-squared-variance]: [Scalar multiplying standard deviation | Math StackExchange | 3rd May 2025](https://math.stackexchange.com/a/2793257)
[^wikipedia-ensemble]: [Ensemble Averaging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Ensemble_averaging_(machine_learning))
[^wikipedia-boosting]: [Boosting | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Boosting_(machine_learning))
[^wikipedia-bagging]: [Bagging | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Bootstrap_aggregating)
[^wikipedia-geometric]: [Geometric Mean | Wikipedia | 3rd May 2025](https://en.wikipedia.org/wiki/Geometric_mean)
[^dropout-paper]: [Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov | 3rd May 2025](https://jmlr.csail.mit.edu/papers/volume15/srivastava14a.old/srivastava14a.pdf)
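Below is a minimal sketch of the `dropout` scheme described above, assuming a tiny two-layer network in plain `numpy` (the layer sizes, weights and `ReLU` activation are arbitrary choices for illustration): at `training-time` each `hidden-unit` is kept with p = 0.5, while at `test-time` the full network is used with the `weights` going out of the `hidden-units` scaled by p.

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5                                    # each hidden-unit is kept with p = 0.5

# Hypothetical layer sizes and weights, for illustration only.
W1 = rng.normal(0.0, 0.1, size=(4, 8))          # input  -> hidden
W2 = rng.normal(0.0, 0.1, size=(8, 3))          # hidden -> output
x = rng.normal(size=(1, 4))                     # a single datapoint

def forward_train(x):
    """Training time: sample one of the 2^n thinned networks."""
    h = np.maximum(0.0, x @ W1)                 # ReLU hidden activations
    mask = rng.random(h.shape) < p_keep         # Bernoulli(p) mask, one entry per unit
    return (h * mask) @ W2                      # omitted units contribute nothing

def forward_test(x):
    """Test time: keep every unit, but scale its outgoing weights by p."""
    h = np.maximum(0.0, x @ W1)
    return h @ (W2 * p_keep)                    # approximates averaging all thinned nets

print("train (one sampled NN):", forward_train(x))
print("test  (weight scaling):", forward_test(x))
```

Every call to `forward_train` samples a different thinned `NN`, but all of them reuse the same `W1` and `W2`, which is exactly the `weight` sharing mentioned above.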