
Improve Generalization1

Problems of Datasets2

While datasets try to mimic reality as closely as possible, we only ever deal with an approximation of the real probability distribution.

This means that our samples are affected by sampling error, so our model needs to be robust against it.

Preventing Overfitting3

  • Get more data
  • Take a model whose Capacity is big enough to fit the true regularities, but not the spurious ones
  • Take the average of many models
  • Bayesian78 approach:
    keep the same NN architecture, but average the predictions obtained with many different weight settings

Capacity of a NN9

We have many methods to limit capacity in a NN

Architecture

  • limit the number of layers and the number of units per layer

  • Control meta-parameters using Cross-Validation (see the sketch below)
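
As a minimal sketch of that second point (scikit-learn, its MLPClassifier, the toy dataset and the candidate layer sizes are all assumptions, not part of the course material), cross-validation can pick the architecture for us:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Hypothetical toy dataset, just to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Treat the architecture (hidden-layer sizes) as a meta-parameter and
# choose it by 5-fold cross-validation.
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(5,), (20,), (50,), (20, 20)]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # architecture with the best cross-validated score
```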

Early Stopping

Start with small weights and stop training before the model starts to overfit.

This is very useful when we don't want to re-train our model to fine-tune meta-parameters.

Note

Here the capacity is reduced because, with small weights, the activation functions operate in their linear range.

In simpler terms, while the units stay in their linear range, all the layers behave almost like a single big linear layer, so the capacity is reduced.
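
To make the recipe concrete, here is a minimal NumPy sketch of patience-based early stopping on a hypothetical toy regression task (all sizes and hyper-parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task with more parameters than the training set can pin down,
# so the model will eventually start fitting noise.
X = rng.normal(size=(100, 30))
w_true = rng.normal(size=30)
y = X @ w_true + 0.5 * rng.normal(size=100)
X_tr, y_tr, X_va, y_va = X[:50], y[:50], X[50:], y[50:]

w = np.zeros(30)                       # start with small (here zero) weights
lr, patience = 0.01, 20
best_val, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(5000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)    # gradient of the training MSE
    w -= lr * grad
    val = np.mean((X_va @ w - y_va) ** 2)                 # validation error after this step
    if val < best_val:
        best_val, best_w, bad_epochs = val, w.copy(), 0   # validation improved: keep these weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                        # no improvement for a while: stop early
            break

w = best_w                                                # roll back to the best validation point
```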

Weight Decay

Penalize large weights with L2 or L1 penalties, keeping them small unless they carry a large error gradient.
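
As a reminder of the standard formulation (the usual textbook form, not copied from the slides; λ is the penalty strength):

$$
E_{L_2}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\sum_i w_i^2
\qquad\qquad
E_{L_1}(\mathbf{w}) = E(\mathbf{w}) + \lambda \sum_i |w_i|
$$

In the L2 case the gradient becomes $\frac{\partial E}{\partial w_i} + \lambda w_i$, so at a minimum $w_i = -\frac{1}{\lambda}\frac{\partial E}{\partial w_i}$: a weight ends up large only if it carries a large error derivative.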

Adding Noise

Add noise to the weights or the activities.

When we add Gaussian noise to an input, its variance gets scaled by the squared weight10 as it passes through the unit.

Thus the noise, by making the output noisy as well, in a sense acts like a regularizer11.
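
A quick sketch of why, for a single linear unit with zero-mean Gaussian noise added to its inputs (this mirrors the cited Toronto lecture):

$$
y = \sum_i w_i\,(x_i + \varepsilon_i), \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_i^2)
\;\Rightarrow\;
\mathbb{E}\big[(y - t)^2\big] = \Big(\sum_i w_i x_i - t\Big)^2 + \sum_i w_i^2 \sigma_i^2
$$

The extra term penalizes large weights exactly like an L2 penalty with coefficient $\sigma_i^2$.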

Ensemble Averaging12

Since in our models we need to trade Bias against Variance, we can prefer to have many models with low bias and higher variance.

Now, both the bias and the variance can be related to the learning capacity of our model:

  • higher bias, lower capacity
  • higher variance, higher capacity

Warning

Remember that a high capacity makes each single model prone to overfitting; however, this is desired here.

In this way, since we have many models, the variance gets averaged out across them, reducing it.

However, this method is more powerful when the individual models disagree.

Warning

On some data points a single model or predictor will be better than the combined one, but this is expected: the combination is an average result, so it falls below the best individual model there.
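
As a sanity check of the variance argument, under the idealized assumption that each of the N predictors makes an error $e_k = f_k(x) - t$ with the same variance $\sigma^2$ and that these errors are uncorrelated:

$$
\operatorname{Var}\!\left(\frac{1}{N}\sum_{k=1}^{N} e_k\right) = \frac{\sigma^2}{N}
$$

so the error of the averaged predictor has a variance N times smaller than that of a single model. Correlated errors (models that agree) shrink this benefit, which is why disagreement matters.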

Making Predictors Disagree13

  • Rely on the learning algorithm converging to different local minima
  • Use different models that may not be NNs
    • (e.g. a Classification Tree alongside a NN)
    • (e.g. change the NN architecture or parameters)
  • Change their training data via boosting or bagging

Boosting

This technique is based on training many low-capacity models on the WHOLE training set.

Note

  • It worked very well on MNIST alongside NNs
  • It allows each model to excel on a small subset of points
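
A minimal boosting sketch, assuming scikit-learn is available (not part of the course material): AdaBoost's default base learner is a depth-1 decision stump, i.e. a very low-capacity model, and every round re-weights the full training set so that later models focus on the points the earlier ones got wrong.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical toy dataset, just to make the sketch runnable.
X, y = make_classification(n_samples=500, random_state=0)

# 50 boosting rounds, each fitting a low-capacity stump on the (re-weighted) whole training set.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X, y)
print(boosted.score(X, y))
```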

Bagging

This technique is based on training several models, each on a different subset of the training set.

Note

  • Too expensive for NNs in many cases
  • Mainly used to train many decision-trees for random-forests
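
A minimal hand-rolled bagging sketch (NumPy and scikit-learn decision trees are assumed); a random forest would additionally randomize the features considered at each split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy dataset, just to make the sketch runnable.
X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap: sample with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Average the predicted class probabilities over all trees, then pick the argmax.
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
y_pred = avg_proba.argmax(axis=1)
print((y_pred == y).mean())
```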

Averaging models

  • average the probabilities
  • geometric mean14 of the probabilities
  • dropout
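
A small sketch of the first two rules, with made-up probabilities of shape (n_models, n_classes):

```python
import numpy as np

# Hypothetical class probabilities predicted by three models for one input.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.4, 0.1],
                  [0.6, 0.1, 0.3]])

arithmetic = probs.mean(axis=0)                    # plain average of the probabilities

geometric = np.exp(np.log(probs).mean(axis=0))     # geometric mean per class...
geometric /= geometric.sum()                       # ...renormalized so it sums to 1

print(arithmetic, geometric)
```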

Dropout15

Dropout is a technique where each hidden-unit has a probability of 0.5 of being omitted at training-time (it is not considered during the computation); however, all the sampled NNs share the same weights.

Whenever we train with this technique, we take a NN with a given architecture, then sample thinned NNs by omitting each hidden-unit with p = 0.5.

This is equivalent to training (potentially) 2^n NNs sampled from the original. At inference and test time, however, we approximate the average of their results by using only the original NN with each weight multiplied by p, thus 0.5.

Since the number of sampled NNs is very large, and very often larger than the number of datapoints, each sampled NN is usually trained on only one datapoint, if it is trained at all.
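
A minimal NumPy sketch of the mechanics (the two-layer network, the ReLU and the sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5                       # probability of omitting a hidden unit, as in the notes

# Hypothetical small network: one hidden layer (W1) and one output (W2).
x  = rng.normal(size=(10,))
W1 = rng.normal(scale=0.1, size=(20, 10))
W2 = rng.normal(scale=0.1, size=(1, 20))

def relu(z):
    return np.maximum(z, 0.0)

# Training-time forward pass: sample a thinned network by zeroing hidden units.
h = relu(W1 @ x)
mask = rng.random(h.shape) >= p_drop          # True = the unit is kept
y_train = W2 @ (h * mask)

# Test-time forward pass: use the full network, scaling by the keep probability
# (equivalent to multiplying the outgoing weights by 1 - p_drop = 0.5).
y_test = (W2 * (1.0 - p_drop)) @ relu(W1 @ x)
```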


  1. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 ↩︎

  2. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 2 ↩︎

  3. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3 ↩︎

  4. Ensemble Averaging | Wikipedia | 3rd May 2025 ↩︎

  5. Boosting | Wikipedia | 3rd May 2025 ↩︎

  6. Bagging | Wikipedia | 3rd May 2025 ↩︎

  7. Bayesian Neural Networks | C.S. Toronto | 3rd May 2025 ↩︎

  8. C.S. Toronto | Lecture 9 pg. 19 to 27 | 3rd May 2025 ↩︎

  9. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3 ↩︎

  10. Scalar multiplying standard deviation | Math StackExchange | 3rd May 2025 ↩︎

  11. C.S. Toronto | Lecture 9 pg. 16 | 3rd May 2025 ↩︎

  12. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 11 to 16 ↩︎

  13. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 14 - 15 ↩︎

  14. Geometric Mean | Wikipedia | 3rd May 2025 ↩︎

  15. Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov | 3rd May 2025 ↩︎