Normalization
Batch Normalization [1]
The idea behind this method is that the distribution of each layer's inputs changes across layers during training. This effect is called (internal) covariate shift, and it was believed to come from choosing bad hyperparameters.
The authors wanted to reduce this effect; however, it was later shown that the method does not actually achieve this goal.
What this layer does in reality is smooth the optimization landscape, making the gradients more predictable [2]. As a consequence, it keeps most activations away from the saturating regions, preventing exploding and vanishing gradients, and makes the network more robust to the choice of hyperparameters [3][4].
Benefits in Detail
- We can use larger learning rates: the gradient is unaffected by a scalar rescaling of the weights, since for any scalar a the following holds (see the numerical check after this list):
\begin{aligned}
\frac{d \, BN((a\vec{W})\vec{u})}{d\, \vec{u}} &= \frac{d \, BN(\vec{W}\vec{u})}{d\, \vec{u}} \\
\frac{d \, BN((a\vec{W})\vec{u})}{d\, (a\vec{W})} &= \frac{1}{a} \cdot \frac{d \, BN(\vec{W}\vec{u})}{d\, \vec{W}}
\end{aligned}
- Training times are reduced, since we can use higher learning rates
- We don't need further regularization or dropout, since BatchNormalization no longer produces deterministic values for a given training example [5]; this is because the mean and std-deviation are computed over batches [6]
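Below is a minimal numerical check of the scale-invariance property above, written as a sketch in PyTorch; the helper bn, the batch shapes and the scalar a = 3.0 are arbitrary choices made for the example, not part of the original paper.

```python
import torch

torch.manual_seed(0)
a = 3.0  # arbitrary positive scalar applied to the weights

def bn(z, eps=1e-12):
    # plain batch normalization over the mini-batch, one feature at a time
    return (z - z.mean(dim=0)) / (z.var(dim=0, unbiased=False) + eps).sqrt()

u = torch.randn(256, 10, requires_grad=True)   # mini-batch of layer inputs
W = torch.randn(10, 5, requires_grad=True)     # layer weights

# 1) the gradient w.r.t. the input u is unchanged by the scaling
g_u_scaled, = torch.autograd.grad(bn(u @ (a * W)).sum(), u)
g_u_plain,  = torch.autograd.grad(bn(u @ W).sum(), u)
print(torch.allclose(g_u_scaled, g_u_plain, atol=1e-5))  # True

# 2) the gradient w.r.t. the scaled weights is 1/a times the original one
aW = (a * W).detach().requires_grad_(True)
g_aW, = torch.autograd.grad(bn(u @ aW).sum(), aW)
g_W,  = torch.autograd.grad(bn(u @ W).sum(), W)
print(torch.allclose(g_aW, g_W / a, atol=1e-5))          # True
```

Both checks rely on the fact that BN(a z) = BN(z) for any positive a, so rescaling the weights only rescales their gradient.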
Algorithm in Detail
The actual implementation of BatchNormalization takes a few shortcuts.
First of all, it normalizes each feature independently, instead of normalizing the "layer inputs and outputs jointly" [7]:
\hat{x}^{(k)} = \frac{
x^{(k)} - E[x^{(k)}]
}{
\sqrt{ Var[x^{(k)}]}
}
Warning
This works even if the features are not decorrelated.
Then it applies a learnable affine transformation per feature, which can restore what the input originally represented [7]:
\begin{aligned}
y^{(k)} &= \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} \\
k &\triangleq \text{feature}
\end{aligned}
Here both \gamma^{(k)} and \beta^{(k)} are learned
parameters, and by setting:
\begin{aligned}
\gamma^{(k)} &= \sqrt{ Var[x^{(k)}]} \\
\beta^{(k)} &= E[x^{(k)}]
\end{aligned}
we would restore the original activations [7]. However, in SGD, or more generally in mini-batch learning, we do not have access to the full distribution, so we estimate both the mean and the std-dev from the mini-batch.
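As an illustration, here is a minimal sketch of this forward pass in PyTorch; the function name batch_norm_forward, the batch shape and the eps value are assumptions made for the example, not the reference implementation.

```python
import torch

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, K) mini-batch with K features; statistics come from the batch itself
    mu = x.mean(dim=0)                        # E[x^(k)]
    var = x.var(dim=0, unbiased=False)        # Var[x^(k)]
    x_hat = (x - mu) / torch.sqrt(var + eps)  # normalized feature
    return gamma * x_hat + beta               # y^(k) = gamma^(k) x_hat^(k) + beta^(k)

# sanity check: gamma = sqrt(Var[x]) and beta = E[x] restore the input (up to eps)
x = torch.randn(128, 4) * 3.0 + 1.0
y = batch_norm_forward(x, gamma=x.std(dim=0, unbiased=False), beta=x.mean(dim=0))
print(torch.allclose(y, x, atol=1e-3))        # True
```

The check at the end reproduces the remark above: choosing \gamma^{(k)} = \sqrt{Var[x^{(k)}]} and \beta^{(k)} = E[x^{(k)}] gives back the original activations (up to eps).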
Normalization operations
Once we add a normalization layer, the scheme becomes the following:
(Figure: Generic Normalization Scheme)
Generally speaking, taking inspiration from the Batch-Normalization algorithm, we can derive a more general scheme:
\vec{y} = \frac{a}{\sigma} (\vec{x} - \mu) + b
Here a and b are learnable, while \sigma and \mu are fixed (not learned, but computed from the data).
Note
This formula does not reverse the normalization, because:
- a and b are learned
- \sigma and \mu come from the actual distribution we have
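A minimal sketch of this generic scheme as a per-feature PyTorch module; the class name GenericNorm and the use of batch statistics for \mu and \sigma are assumptions made for the example.

```python
import torch
import torch.nn as nn

class GenericNorm(nn.Module):
    """Sketch of y = (a / sigma) * (x - mu) + b with learnable a and b."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.a = nn.Parameter(torch.ones(num_features))   # learnable scale
        self.b = nn.Parameter(torch.zeros(num_features))  # learnable shift
        self.eps = eps

    def forward(self, x):                                  # x: (N, num_features)
        mu = x.mean(dim=0)                                 # from the data, not learned
        sigma = x.var(dim=0, unbiased=False).sqrt()        # from the data, not learned
        return self.a / (sigma + self.eps) * (x - mu) + self.b
```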
Types of Normalizations
We are going to consider a dataset of images with height H, width W, C channels, and N data points.
Batch Norm
Normalizes each channel separately, over all data points and all pixel positions
Layer Norm
Normalizes everything (all channels and all pixels) within a single data point
Instance Norm
Normalizes a single channel within a single data point
Group Norm [8]
Normalizes a group of channels within a single data point
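The only difference between the four variants is which axes the statistics \mu and \sigma are computed over. A minimal sketch for an (N, C, H, W) tensor; the helper normalize and the group count G = 2 are assumptions made for the example.

```python
import torch

N, C, H, W = 8, 6, 4, 4                      # toy batch of images
x = torch.randn(N, C, H, W)

def normalize(x, dims, eps=1e-5):
    # mu and sigma come from the data itself, over the given dimensions
    mu = x.mean(dim=dims, keepdim=True)
    sigma = x.var(dim=dims, keepdim=True, unbiased=False).sqrt()
    return (x - mu) / (sigma + eps)

# Batch Norm:    per channel, over all data points and all pixels
bn = normalize(x, dims=(0, 2, 3))
# Layer Norm:    everything within a single data point
ln = normalize(x, dims=(1, 2, 3))
# Instance Norm: a single channel within a single data point
inorm = normalize(x, dims=(2, 3))
# Group Norm:    a group of channels within a single data point
G = 2                                        # number of channel groups (assumed)
gn = normalize(x.view(N, G, C // G, H, W), dims=(2, 3, 4)).view(N, C, H, W)
```

In practice PyTorch already ships these as nn.BatchNorm2d, nn.LayerNorm, nn.InstanceNorm2d and nn.GroupNorm; the sketch only shows which axes the statistics are taken over (the learnable scale and shift are omitted).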
Practical Considerations
1. How Does Batch Normalization Help Optimization? | arXiv:1805.11604v5
2. How Does Batch Normalization Help Optimization? | Paragraph 3.1 | arXiv:1805.11604v5
3. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 4
4. Batch Normalization Paper | Paragraph 3.4 pg. 5 | arXiv:1502.03167v3
5. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 6
6. Batch Normalization Paper | Paragraph 3 pg. 3 | arXiv:1502.03167v3
7. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 20