Normalization
Batch Normalization [1]
The idea behind this method is that the distribution of each layer's inputs changes across layers during training. This effect is called (internal) covariate shift, and it was believed to come from choosing bad hyperparameters.
The authors wanted to reduce this effect; however, it was later shown that the method does not actually achieve this goal.
What this layer does in reality is smooth the optimization landscape, making the gradients more predictable [2]. As a consequence, it keeps most activations away from the saturating regions, preventing exploding and vanishing gradients, and makes the network more robust to the choice of hyperparameters [3][4].
Benefits in Detail
- We can use larger learning rates: the gradient is unaffected by a scalar rescaling of the weights, since for any scalar a the following holds (see the numerical check after this list):
\begin{aligned}
\frac{d \, BN((a\vec{W})\vec{u})}{d\, \vec{u}} &= \frac{d \, BN(\vec{W}\vec{u})}{d\, \vec{u}} \\
\frac{d \, BN((a\vec{W})\vec{u})}{d\, (a\vec{W})} &= \frac{1}{a} \cdot \frac{d \, BN(\vec{W}\vec{u})}{d\, \vec{W}}
\end{aligned}
- Training times are reduced, since we can use higher learning rates
- We don't need further regularization or dropout, since BatchNormalization no longer produces deterministic values for a given training example [5]; this is because the mean and std-deviation are computed over batches [6]
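Below is a minimal numerical check of the scale-invariance property above, written as a sketch in PyTorch; the helper bn, the batch shapes and the scalar a = 3.0 are arbitrary choices made for the example, not part of the original paper.

```python
import torch

torch.manual_seed(0)
a = 3.0  # arbitrary positive scalar applied to the weights

def bn(z, eps=1e-12):
    # plain batch normalization over the mini-batch, one feature at a time
    return (z - z.mean(dim=0)) / (z.var(dim=0, unbiased=False) + eps).sqrt()

u = torch.randn(256, 10, requires_grad=True)   # mini-batch of layer inputs
W = torch.randn(10, 5, requires_grad=True)     # layer weights

# 1) the gradient w.r.t. the input u is unchanged by the scaling
g_u_scaled, = torch.autograd.grad(bn(u @ (a * W)).sum(), u)
g_u_plain,  = torch.autograd.grad(bn(u @ W).sum(), u)
print(torch.allclose(g_u_scaled, g_u_plain, atol=1e-5))  # True

# 2) the gradient w.r.t. the scaled weights is 1/a times the original one
aW = (a * W).detach().requires_grad_(True)
g_aW, = torch.autograd.grad(bn(u @ aW).sum(), aW)
g_W,  = torch.autograd.grad(bn(u @ W).sum(), W)
print(torch.allclose(g_aW, g_W / a, atol=1e-5))          # True
```

Both checks rely on the fact that BN(a z) = BN(z) for any positive a, so rescaling the weights only rescales their gradient.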
Algorithm in Detail
The actual implementation of BatchNormalization takes a few shortcuts.
First of all, it normalizes each feature independently, instead of normalizing the "layer inputs and outputs jointly" [7]:
\hat{x}^{(k)} = \frac{
x^{(k)} - E[x^{(k)}]
}{
\sqrt{ Var[x^{(k)}]}
}
Warning
This works even if the features are not decorrelated.
Then it applies a learnable affine transformation per feature, which can restore what the input originally represented [7]:
\begin{aligned}
y^{(k)} &= \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} \\
k &\triangleq \text{feature}
\end{aligned}
Here both \gamma^{(k)} and \beta^{(k)} are learned
parameters, and by setting:
\begin{aligned}
\gamma^{(k)} &= \sqrt{ Var[x^{(k)}]} \\
\beta^{(k)} &= E[x^{(k)}]
\end{aligned}
we would restore the original activations [7]. However, in SGD, or more generally in mini-batch learning, we do not have access to the full distribution, so we estimate both the mean and the std-dev from the mini-batch.
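As an illustration, here is a minimal sketch of this forward pass in PyTorch; the function name batch_norm_forward, the batch shape and the eps value are assumptions made for the example, not the reference implementation.

```python
import torch

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, K) mini-batch with K features; statistics come from the batch itself
    mu = x.mean(dim=0)                        # E[x^(k)]
    var = x.var(dim=0, unbiased=False)        # Var[x^(k)]
    x_hat = (x - mu) / torch.sqrt(var + eps)  # normalized feature
    return gamma * x_hat + beta               # y^(k) = gamma^(k) x_hat^(k) + beta^(k)

# sanity check: gamma = sqrt(Var[x]) and beta = E[x] restore the input (up to eps)
x = torch.randn(128, 4) * 3.0 + 1.0
y = batch_norm_forward(x, gamma=x.std(dim=0, unbiased=False), beta=x.mean(dim=0))
print(torch.allclose(y, x, atol=1e-3))        # True
```

The check at the end reproduces the remark above: choosing \gamma^{(k)} = \sqrt{Var[x^{(k)}]} and \beta^{(k)} = E[x^{(k)}] gives back the original activations (up to eps).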
Normalization operations
Once we add a normalization layer, the scheme becomes the following:
(Figure: Generic Normalization Scheme)
Generally speaking, taking inspiration from the Batch-Normalization algorithm, we can derive a more general scheme:
\vec{y} = \frac{a}{\sigma} (\vec{x} - \mu) + b
Here a and b are learnable, while \sigma and \mu are fixed (not learned, but computed from the data).
Note
This formula does not reverse the normalization, because:
- a and b are learned
- \sigma and \mu come from the actual distribution we have
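A minimal sketch of this generic scheme as a per-feature PyTorch module; the class name GenericNorm and the use of batch statistics for \mu and \sigma are assumptions made for the example.

```python
import torch
import torch.nn as nn

class GenericNorm(nn.Module):
    """Sketch of y = (a / sigma) * (x - mu) + b with learnable a and b."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.a = nn.Parameter(torch.ones(num_features))   # learnable scale
        self.b = nn.Parameter(torch.zeros(num_features))  # learnable shift
        self.eps = eps

    def forward(self, x):                                  # x: (N, num_features)
        mu = x.mean(dim=0)                                 # from the data, not learned
        sigma = x.var(dim=0, unbiased=False).sqrt()        # from the data, not learned
        return self.a / (sigma + self.eps) * (x - mu) + self.b
```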
Types of Normalizations
We are going to consider a dataset of images with height H, width W, C channels, and N data points.
Batch Norm
Normalizes each channel separately, over all data points and all pixel positions
Layer Norm
Normalizes everything (all channels and all pixels) within a single data point
Instance Norm
Normalizes a single channel within a single data point
Group Norm [8]
Normalizes a group of channels within a single data point
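The only difference between the four variants is which axes the statistics \mu and \sigma are computed over. A minimal sketch for an (N, C, H, W) tensor; the helper normalize and the group count G = 2 are assumptions made for the example.

```python
import torch

N, C, H, W = 8, 6, 4, 4                      # toy batch of images
x = torch.randn(N, C, H, W)

def normalize(x, dims, eps=1e-5):
    # mu and sigma come from the data itself, over the given dimensions
    mu = x.mean(dim=dims, keepdim=True)
    sigma = x.var(dim=dims, keepdim=True, unbiased=False).sqrt()
    return (x - mu) / (sigma + eps)

# Batch Norm:    per channel, over all data points and all pixels
bn = normalize(x, dims=(0, 2, 3))
# Layer Norm:    everything within a single data point
ln = normalize(x, dims=(1, 2, 3))
# Instance Norm: a single channel within a single data point
inorm = normalize(x, dims=(2, 3))
# Group Norm:    a group of channels within a single data point
G = 2                                        # number of channel groups (assumed)
gn = normalize(x.view(N, G, C // G, H, W), dims=(2, 3, 4)).view(N, C, H, W)
```

In practice PyTorch already ships these as nn.BatchNorm2d, nn.LayerNorm, nn.InstanceNorm2d and nn.GroupNorm; the sketch only shows which axes the statistics are taken over (the learnable scale and shift are omitted).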
Practical Considerations
1. How Does Batch Normalization Help Optimization? | arXiv:1805.11604v5
2. How Does Batch Normalization Help Optimization? | Paragraph 3.1 | arXiv:1805.11604v5
3. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 4
4. Batch Normalization Paper | Paragraph 3.4 pg. 5 | arXiv:1502.03167v3
5. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 6
6. Batch Normalization Paper | Paragraph 3 pg. 3 | arXiv:1502.03167v3
7. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 20