Improve Generalization1
Problems of Datasets2
While datasets try to mimic reality as closely as
possible, we only ever deal with an approximation of the
real probability distribution.
This means that our samples are affected by
sampling errors, so our model needs to be
robust against these errors.
Preventing Overfitting3
- Get more data
- Take a model with a capacity big enough to fit the regularities, but not the spurious regularities
- Take the average of many models
- Bayesian78 approach: take the same NN, but use different weights
Capacity of a NN9
We have many methods to limit capacity in a NN
Architecture
- Limit the number of layers and the number of units per layer
- Control the meta-parameters using Cross-Validation (see the sketch below)
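As a rough illustration of the second point, here is a minimal sketch of picking one meta-parameter (the number of hidden units) by k-fold cross-validation. It assumes scikit-learn and random toy data; the model and the candidate values are illustrative, not taken from the notes.

```python
# Hypothetical example: choose the number of hidden units by 5-fold CV.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels

for n_units in (4, 16, 64, 256):
    model = MLPClassifier(hidden_layer_sizes=(n_units,), max_iter=500)
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold accuracy
    print(n_units, round(scores.mean(), 3))
```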
Early Stopping
Start with small weights and stop before
overfitting.
This is very useful when we don't want to re-train our
model to fine-tune the meta-parameters.
Note
Here the capacity is reduced because, with small weights, the activation functions are in their linear range. In simpler terms, whenever we are in a linear range it is almost as if all these layers were squashed into a single big linear layer, so the capacity is reduced.
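A minimal sketch of the stopping rule, under the assumption that we have some training step and a validation loss to monitor (both passed in as functions, since the notes don't fix any API):

```python
# Sketch of early stopping: keep the weights that gave the best validation
# loss and stop once it has not improved for `patience` epochs.
import copy

def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            patience=10, max_epochs=1000):
    """`train_one_epoch(model)` and `validation_loss(model)` are supplied by
    the caller; `model` is assumed to expose a `weights` attribute."""
    best_loss = float("inf")
    best_weights = None
    since_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss < best_loss:
            best_loss, best_weights = loss, copy.deepcopy(model.weights)
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                break                      # validation loss stopped improving
    if best_weights is not None:
        model.weights = best_weights       # roll back to the best checkpoint
    return model
```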
Weight Decay
Penalize large weights with an L2 or L1
penalty, which keeps them
small unless they produce a large error
gradient.
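A minimal NumPy sketch of L2 weight decay on a toy linear model (the data, learning rate and λ are made up for illustration): the penalty λ‖w‖² contributes 2λw to the gradient, shrinking every weight unless the data gradient pushes back harder.

```python
# Gradient descent on MSE + L2 penalty, toy data only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(5)
lr, lam = 0.05, 0.1
for _ in range(500):
    grad_data = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE term
    grad_penalty = 2 * lam * w                   # gradient of lam * ||w||^2
    w -= lr * (grad_data + grad_penalty)

print(w)   # weights stay small unless the data insists otherwise
```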
Adding Noise
Add noise to the weights or the activities.
Whenever we add Gaussian noise to an input, its
variance is amplified by the squared
weight10 before it reaches the next unit.
Thus the noise, by making our result noisy as well,
in a sense acts like a regularizer11.
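For a single linear unit with squared error this can be made precise (a standard derivation, spelled out here rather than in the notes): adding independent noise $\varepsilon_i \sim \mathcal{N}(0, \sigma_i^2)$ to each input $x_i$ gives

$$
y^{\text{noisy}} = \sum_i w_i (x_i + \varepsilon_i) = y + \sum_i w_i \varepsilon_i,
\qquad
\mathbb{E}\big[(y^{\text{noisy}} - t)^2\big] = (y - t)^2 + \sum_i w_i^2 \sigma_i^2 ,
$$

so minimizing the expected squared error adds the term $\sum_i w_i^2 \sigma_i^2$, which is exactly an L2 penalty on the weights.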
Ensemble Averaging12
Since in our models we need to trade bias for
variance, we may prefer to have many models
with a low bias and a higher variance.
Both the bias and the variance can be related
to the learning capacity of our model:
- higher bias, lower capacity
- higher variance, higher capacity
Warning
Remember that a high capacity makes the single model prone to overfitting; however, this is desired here.
In this way, since we have many models, the
variance is averaged out and therefore reduced.
However, this method is more powerful when the individual
models disagree.
Warning
A single model or predictor among all of them will be better than the combined one on some data points, but this is expected: the combination is an average result, thus lower than the best.
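A toy simulation of the variance argument (made-up numbers; it assumes the models' errors are independent, which is exactly why disagreement matters):

```python
# Averaging several high-variance, low-bias predictors reduces the variance
# of the combined prediction.
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_models, n_trials = 10, 10_000

# each "model" predicts the true value plus independent zero-mean noise
predictions = true_value + rng.normal(scale=1.0, size=(n_trials, n_models))

single = predictions[:, 0]           # one model alone
ensemble = predictions.mean(axis=1)  # average of the 10 models

print(single.var())    # ~1.0
print(ensemble.var())  # ~0.1  (variance divided by the number of models)
```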
Making Predictors Disagree13
- Rely on the algorithm ending up in a different local minimum
- Use different models that may not be NNs
  - e.g. a Classification Tree together with a NN
  - e.g. a different NN architecture or different parameters
- Change their training data via boosting or bagging
Boosting
This technique is based on training many
low-capacity models on the entire training-set,
where each new model gives more weight to the points
that the previous models got wrong.
Note
- It worked very well on MNIST together with NNs
- It allows each model to excel on a small subset of points
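The classic instance of this idea is AdaBoost; here is a minimal sketch with scikit-learn (assuming a recent version, and toy data rather than MNIST):

```python
# Many low-capacity models (depth-1 "stumps") boosted on the full training
# set; each new stump re-weights the points misclassified so far.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)            # very low capacity
booster = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)
print(booster.fit(X, y).score(X, y))                   # training accuracy
```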
Bagging
This technique is based on training several models on
different subsets of the training-set
(typically bootstrap samples, drawn with replacement).
Note
- Too expensive for NNs in many cases
- Mainly used to train many decision-trees for random-forests
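A minimal hand-rolled sketch of bagging (toy data and scikit-learn trees; the tree depth and the number of bootstrap rounds are arbitrary choices):

```python
# Bagging by hand: bootstrap resampling + one small decision tree per sample.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# combine the 25 trees by majority vote
votes = np.stack([t.predict(X) for t in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((bagged_pred == y).mean())
```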
Averaging models
Dropout15
Dropout is a technique where each hidden-unit
has a probability of 0.5 of being omitted
at training-time
(not considered during the computation); however, all the sampled NNs
share the same weights.
Whenever we train with this technique, we take a
NN with a given architecture and then sample
some NNs from it by omitting each hidden-unit with
p = 0.5.
This is equivalent to training (potentially) 2^n
NNs sampled from the original. However, at
inference and test time we average their
results by using only the original NN and
multiplying each weight by p, i.e. 0.5.
Since the number of sampled NNs is very high, and very
often higher than the number of datapoints, each sampled NN is usually
trained on only 1 datapoint, if it is trained at all.
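A minimal NumPy sketch of a single dropout layer (toy shapes and a ReLU hidden layer are my assumptions; p is the keep probability, 0.5 as above):

```python
# Train time: keep each hidden unit with probability p (sampling one of the
# 2^n sub-networks). Test time: keep everything and scale the weights by p.
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # keep probability
W1 = rng.normal(size=(20, 50))            # input -> hidden weights
W2 = rng.normal(size=(50, 1))             # hidden -> output weights
x = rng.normal(size=(1, 20))

def forward(x, train):
    h = np.maximum(0, x @ W1)             # ReLU hidden activities
    if train:
        mask = rng.random(h.shape) < p    # drop each unit independently
        return (h * mask) @ W2
    return h @ (W2 * p)                   # averaged "mean network"

print(forward(x, train=True))             # one sampled sub-network
print(forward(x, train=False))            # the original NN, weights scaled by p
```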
- Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 ↩︎
- Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 2 ↩︎
- Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3 ↩︎
- Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 3 ↩︎
- Scalar multiplying standard deviation | Math StackExchange | 3rd May 2025 ↩︎
- Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 11 to 16 ↩︎
- Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 9 pg. 14 - 15 ↩︎
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov | 3rd May 2025 ↩︎