
Improve Generalization1

Problems of Datasets2

While datasets try to mimic reality as closely as possible, we only ever deal with an approximation of the real probability distribution.

This means that our samples are affected by sampling error, so our model needs to be robust against it.

Preventing Overfitting3

  • Get more data
  • Take a model whose Capacity is big enough to fit the true regularities, but not the spurious ones
  • Take the average of many models
  • Bayesian78 approach:
    keep the same NN architecture, but average the predictions obtained with many different weight settings

Capacity of a NN9

We have many methods to limit capacity in a NN

Architecture

  • limit the number of layers and the number of units per layer

  • Control meta-parameters using Cross-Validation (see the sketch below)
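
As a minimal sketch of that second point (scikit-learn, its MLPClassifier, the toy dataset and the candidate layer sizes are all assumptions, not part of the course material), cross-validation can pick the architecture for us:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Hypothetical toy dataset, just to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Treat the architecture (hidden-layer sizes) as a meta-parameter and
# choose it by 5-fold cross-validation.
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(5,), (20,), (50,), (20, 20)]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # architecture with the best cross-validated score
```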

Early Stopping

Start with small weights and stop training before the model starts to overfit.

This is very useful when we don't want to re-train our model to fine-tune meta-parameters.

Note

Here the capacity is reduced because, with small weights, the activation functions operate in their linear range.

In simpler terms, while the units stay in their linear range, all the layers behave almost like a single big linear layer, so the capacity is reduced.
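
To make the recipe concrete, here is a minimal NumPy sketch of patience-based early stopping on a hypothetical toy regression task (all sizes and hyper-parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task with more parameters than the training set can pin down,
# so the model will eventually start fitting noise.
X = rng.normal(size=(100, 30))
w_true = rng.normal(size=30)
y = X @ w_true + 0.5 * rng.normal(size=100)
X_tr, y_tr, X_va, y_va = X[:50], y[:50], X[50:], y[50:]

w = np.zeros(30)                       # start with small (here zero) weights
lr, patience = 0.01, 20
best_val, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(5000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)    # gradient of the training MSE
    w -= lr * grad
    val = np.mean((X_va @ w - y_va) ** 2)                 # validation error after this step
    if val < best_val:
        best_val, best_w, bad_epochs = val, w.copy(), 0   # validation improved: keep these weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                        # no improvement for a while: stop early
            break

w = best_w                                                # roll back to the best validation point
```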

Weight Decay

Penalize large weights with L2 or L1 penalties, keeping them small unless they carry a large error gradient.
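
As a reminder of the standard formulation (the usual textbook form, not copied from the slides; λ is the penalty strength):

$$
E_{L_2}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\sum_i w_i^2
\qquad\qquad
E_{L_1}(\mathbf{w}) = E(\mathbf{w}) + \lambda \sum_i |w_i|
$$

In the L2 case the gradient becomes $\frac{\partial E}{\partial w_i} + \lambda w_i$, so at a minimum $w_i = -\frac{1}{\lambda}\frac{\partial E}{\partial w_i}$: a weight ends up large only if it carries a large error derivative.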

Adding Noise

Add noise to the weights or the activities.

When we add Gaussian noise to an input, its variance gets scaled by the squared weight10 as it passes through the unit.

Thus the noise, by making the output noisy as well, in a sense acts like a regularizer11.
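
A quick sketch of why, for a single linear unit with zero-mean Gaussian noise added to its inputs (this mirrors the cited Toronto lecture):

$$
y = \sum_i w_i\,(x_i + \varepsilon_i), \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_i^2)
\;\Rightarrow\;
\mathbb{E}\big[(y - t)^2\big] = \Big(\sum_i w_i x_i - t\Big)^2 + \sum_i w_i^2 \sigma_i^2
$$

The extra term penalizes large weights exactly like an L2 penalty with coefficient $\sigma_i^2$.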

Ensemble Averaging12

Since in our models we need to trade Bias against Variance, we can prefer to have many models with low bias and higher variance.

Now, both the bias and the variance can be related to the learning capacity of our model:

  • higher bias, lower capacity
  • higher variance, higher capacity

Warning

Remember that a high capacity makes each single model prone to overfitting; however, this is desired here.

In this way, since we have many models, the variance gets averaged out across them, reducing it.

However, this method is more powerful when the individual models disagree.

Warning

On some data points a single model or predictor will be better than the combined one, but this is expected: the combination is an average result, so it falls below the best individual model there.
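
As a sanity check of the variance argument, under the idealized assumption that each of the N predictors makes an error $e_k = f_k(x) - t$ with the same variance $\sigma^2$ and that these errors are uncorrelated:

$$
\operatorname{Var}\!\left(\frac{1}{N}\sum_{k=1}^{N} e_k\right) = \frac{\sigma^2}{N}
$$

so the error of the averaged predictor has a variance N times smaller than that of a single model. Correlated errors (models that agree) shrink this benefit, which is why disagreement matters.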

Making Predictors Disagree13

  • Rely on the learning algorithm converging to different local minima
  • Use different models that may not be NNs
    • (e.g. a Classification Tree alongside a NN)
    • (e.g. change the NN architecture or parameters)
  • Change their training data via boosting or bagging

Boosting

This technique is based on training many low-capacity models on the WHOLE training set.

Note

  • It worked very well on MNIST alongside NNs
  • It allows each model to excel on a small subset of points
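
A minimal boosting sketch, assuming scikit-learn is available (not part of the course material): AdaBoost's default base learner is a depth-1 decision stump, i.e. a very low-capacity model, and every round re-weights the full training set so that later models focus on the points the earlier ones got wrong.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical toy dataset, just to make the sketch runnable.
X, y = make_classification(n_samples=500, random_state=0)

# 50 boosting rounds, each fitting a low-capacity stump on the (re-weighted) whole training set.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X, y)
print(boosted.score(X, y))
```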

Bagging

This technique is based on training several models, each on a different subset of the training set.

Note

  • Too expensive for NNs in many cases
  • Mainly used to train many decision-trees for random-forests
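
A minimal hand-rolled bagging sketch (NumPy and scikit-learn decision trees are assumed); a random forest would additionally randomize the features considered at each split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy dataset, just to make the sketch runnable.
X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap: sample with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Average the predicted class probabilities over all trees, then pick the argmax.
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
y_pred = avg_proba.argmax(axis=1)
print((y_pred == y).mean())
```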

Averaging models

  • average the probabilities
  • geometric mean14 of the probabilities
  • dropout
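
A small sketch of the first two rules, with made-up probabilities of shape (n_models, n_classes):

```python
import numpy as np

# Hypothetical class probabilities predicted by three models for one input.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.4, 0.1],
                  [0.6, 0.1, 0.3]])

arithmetic = probs.mean(axis=0)                    # plain average of the probabilities

geometric = np.exp(np.log(probs).mean(axis=0))     # geometric mean per class...
geometric /= geometric.sum()                       # ...renormalized so it sums to 1

print(arithmetic, geometric)
```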

Dropout15

Dropout is a technique where each hidden-unit has a probability of 0.5 of being omitted at training-time (it is not considered during the computation); however, all the sampled NNs share the same weights.

Whenever we train with this technique, we take a NN with a given architecture, then sample thinned NNs by omitting each hidden-unit with p = 0.5.

This is equivalent to training (potentially) 2^n NNs sampled from the original. At inference and test time, however, we approximate the average of their results by using only the original NN with each weight multiplied by p, thus 0.5.

Since the number of sampled NNs is very large, and very often larger than the number of datapoints, each sampled NN is usually trained on only one datapoint, if it is trained at all.
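
A minimal NumPy sketch of the mechanics (the two-layer network, the ReLU and the sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5                       # probability of omitting a hidden unit, as in the notes

# Hypothetical small network: one hidden layer (W1) and one output (W2).
x  = rng.normal(size=(10,))
W1 = rng.normal(scale=0.1, size=(20, 10))
W2 = rng.normal(scale=0.1, size=(1, 20))

def relu(z):
    return np.maximum(z, 0.0)

# Training-time forward pass: sample a thinned network by zeroing hidden units.
h = relu(W1 @ x)
mask = rng.random(h.shape) >= p_drop          # True = the unit is kept
y_train = W2 @ (h * mask)

# Test-time forward pass: use the full network, scaling by the keep probability
# (equivalent to multiplying the outgoing weights by 1 - p_drop = 0.5).
y_test = (W2 * (1.0 - p_drop)) @ relu(W1 @ x)
```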


  1. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 ↩︎

  2. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 2 ↩︎

  3. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3 ↩︎

  4. Ensemble Averaging | Wikipedia | 3rd May 2025 ↩︎

  5. Boosting | Wikipedia | 3rd May 2025 ↩︎

  6. Bagging | Wikipedia | 3rd May 2025 ↩︎

  7. Bayesian Neural Networks | C.S. Toronto | 3rd May 2025 ↩︎

  8. C.S. Toronto | Lecture 9 pg. 19 to 27 | 3rd May 2025 ↩︎

  9. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3 ↩︎

  10. Scalar multiplying standard deviation | Math StackExchange | 3rd May 2025 ↩︎

  11. C.S. Toronto | Lecture 9 pg. 16 | 3rd May 2025 ↩︎

  12. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 11 to 16 ↩︎

  13. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 14 - 15 ↩︎

  14. Geometric Mean | Wikipedia | 3rd May 2025 ↩︎

  15. Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov | 3rd May 2025 ↩︎