
Normalization

Batch Normalization1

The idea behind this method is the fact that the distributions of activations change across layers. This effect is called internal covariate shift, and it is believed to come from choosing bad hyperparameters.

What the authors wanted to do was to reduce this effect; however, Batch Normalization was later shown not to achieve this goal.

What this layer does in reality is smooth the optimization landscape, making the gradient more predictable2. As a consequence, it keeps most activations away from the saturating regions, preventing exploding and vanishing gradients, and it makes the network more robust to hyperparameters34.

Benefits in detail

  • We can use larger learning rates, because the gradient through the layer is unaffected by the scale of the weights (larger weights even receive smaller gradients): if we multiply the weights by a scalar a, the following holds (see the sketch after this list):
    
    \begin{aligned}
        \frac{
            d \, BN((a\vec{W})\vec{u})
        }{
            d\, \vec{u}
        } &= \frac{
            d \, BN(\vec{W}\vec{u})
        }{
            d\, \vec{u}
        } \\
    
        \frac{
            d \, BN((a\vec{W})\vec{u})
        }{
            d\, a \vec{W}
        } &= \frac{1}{a} \cdot \frac{
            d \, BN(\vec{W}\vec{u})
        }{
            d\, \vec{W}
        }
    \end{aligned}
    
  • Training times are reduced as we can use higher learning rates
  • We need less additional regularization or dropout, since Batch Normalization no longer produces deterministic values for a given training example5: the mean and standard deviation are computed over the mini-batch6, so an example's output depends on which other examples land in the same batch.
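
As a quick numerical check of the scale-invariance claim in the first bullet, the following NumPy sketch (illustrative names, with a bare per-feature standardization standing in for the full BN layer) shows that normalizing (aW)u and Wu gives the same output, so the gradient with respect to u cannot depend on a:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))      # weight matrix
U = rng.normal(size=(8, 32))     # a mini-batch of 32 input vectors u, stored as columns
a = 7.3                          # arbitrary scalar multiplier

def standardize(z, eps=1e-5):
    # The normalization step of BN: standardize each feature (row)
    # over the mini-batch dimension (columns).
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

# Identical up to the tiny eps term, hence d BN((aW)u)/du = d BN(Wu)/du
print(np.allclose(standardize((a * W) @ U), standardize(W @ U), atol=1e-4))  # True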

Algorithm in Detail

The actual implementation of Batch Normalization cheats in a couple of places.

First of all, it normalizes each feature independently instead of normalizing the "layer inputs and outputs jointly"7:


\hat{x}^{(k)} = \frac{
    x^{(k)} - E[x^{(k)}]
}{
    \sqrt{ Var[x^{(k)}]}
}

Warning

This works even if these features are not decorrelated

Then it applies an affine transformation that is able to restore what the input originally represented7:


\begin{aligned}
    y^{(k)} &= \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} \\

    k &\triangleq \text{feature}
\end{aligned}

Here both \gamma^{(k)} and \beta^{(k)} are learned parameters, and by setting:


\begin{aligned}
    \gamma^{(k)} &= \sqrt{ Var[x^{(k)}]} \\
    \beta^{(k)} &= E[x^{(k)}]
\end{aligned}

We would restore the original activations7. However, in SGD, or more generally in mini-batch learning, we do not have access to all values of the distribution, so we estimate both the mean and the standard deviation from the mini-batch.
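
A minimal sketch of the training-time forward pass described above (NumPy, illustrative names; eps is the usual small constant, omitted in the formulas above, that keeps the division well defined):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x:     (N, K) mini-batch with K features
    # gamma: (K,)   learned scale, one value per feature k
    # beta:  (K,)   learned shift, one value per feature k
    mean = x.mean(axis=0)                      # E[x^(k)], estimated on the mini-batch
    var = x.var(axis=0)                        # Var[x^(k)], estimated on the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # per-feature standardization
    return gamma * x_hat + beta                # y^(k) = gamma^(k) * x_hat^(k) + beta^(k)

At inference time the mini-batch statistics are typically replaced by running averages collected during training, so the layer becomes deterministic again.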

Normalization operations

Once we add a normalization layer, the scheme becomes:

Generic Normalization Scheme

Generally speaking, taking inspiration from the Batch-Normalization algorithm, we can derive a more robust one:


 \vec{y} = \frac{
    a
 }{
    \sigma
 } (\vec{x} - \mu) + b

Here a and b are learnable while \sigma and \mu are fixed.

Note

This formula does not reverse the normalization as both a and b are learned

\sigma and \mu come from the actual distribution we have.
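
As a sketch, once \mu and \sigma have been computed from whatever slice of the data the chosen normalization uses, the generic scheme is a one-liner (NumPy, illustrative names):

import numpy as np

def generic_norm(x, a, b, mu, sigma):
    # a and b are learnable; mu and sigma are fixed statistics of the available distribution
    return (a / sigma) * (x - mu) + b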

Types of Normalizations

We are going to consider a dataset of images with height H, width W, C channels, and N data points; the sketch at the end of this section shows which axes each variant computes its statistics over.

Batch Norm

Normalizes a single channel, over all pixels and all data points

Layer Norm

Normalizes all channels and pixels within a single data point

Instance norm

Normalizes a single channel within a single data point

Group Norm8

Normalizes a group of channels within a single data point
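
The four variants differ only in which axes of the (N, C, H, W) tensor the statistics are computed over. A hedged NumPy sketch (shapes, the group size G, and the helper name are illustrative; the learnable a, b terms are omitted):

import numpy as np

def standardize(x, axes, eps=1e-5):
    # Compute mean and variance over the given axes and share them across those axes
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(8, 16, 32, 32))    # (N, C, H, W)

batch_norm    = standardize(x, axes=(0, 2, 3))   # per channel, over all data points and pixels
layer_norm    = standardize(x, axes=(1, 2, 3))   # everything within a single data point
instance_norm = standardize(x, axes=(2, 3))      # a single channel of a single data point

# Group Norm: split the channels into G groups and normalize each group per data point
G = 4
group_norm = standardize(x.reshape(8, G, 16 // G, 32, 32), axes=(2, 3, 4)).reshape(x.shape)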

Practical Considerations