# Normalization

## Batch Normalization[^batch-normalization-paper]

The idea behind this method is the fact that ***the distribution of each layer's inputs changes across `Layers` during training***. This effect is called ***(internal) covariate shift***, and it is believed to come from ***choosing bad `hyperparameters`***.

What the authors ***wanted to do*** was to ***reduce this effect***; however, later work ***showed that the method does not actually achieve this goal***.

What this layer does in reality is ***smooth the optimization landscape, making the gradients more predictable***[^non-whitening-paper], and as a consequence it ***keeps most activations away from the saturating regions, preventing exploding and vanishing gradients, and makes the network more robust to `hyperparameters`***[^non-whitening-paper-1][^anelli-batch-normalization-1]

### Benefits in detail

- We can use ***larger `learning-rates`***, as the ***gradient is unaffected by the scale of the parameters***: if we apply a ***scalar multiplier*** $a$, the following holds:

  $$
  \begin{aligned}
  \frac{ d \, BN((a\vec{W})\vec{u}) }{ d\, \vec{u} } &= \frac{ d \, BN(\vec{W}\vec{u}) }{ d\, \vec{u} } \\
  \frac{ d \, BN((a\vec{W})\vec{u}) }{ d\, (a\vec{W}) } &= \frac{1}{a} \cdot \frac{ d \, BN(\vec{W}\vec{u}) }{ d\, \vec{W} }
  \end{aligned}
  $$

- ***Training times are reduced***, precisely because we can use higher `learning-rates`
- ***We need little to no additional regularization such as `Dropout`***, as with `BatchNormalization` the network ***no longer produces deterministic values for a given training example***[^batch-normalization-paper-1]. This is because ***`mean` and `std-deviation` are computed over batches***[^anelli-batch-normalization-2], so the output for one example depends on the other examples in its batch.

### Algorithm in Detail

The actual implementation of `BatchNormalization` is ***cheating*** in some places. First of all, ***it normalizes each `feature` independently*** instead of whitening the ***"layer inputs and outputs jointly"***[^batch-normalization-paper-2]:

$$
\hat{x}^{(k)} = \frac{ x^{(k)} - E[x^{(k)}] }{ \sqrt{ Var[x^{(k)}] } }
$$

> [!WARNING]
> This ***works even if these features
> are not decorrelated***

Then it ***applies a scale-and-shift transformation able to represent the `identity transform`, so that it can restore what the `input` originally represented***[^batch-normalization-paper-2]:

$$
\begin{aligned}
y^{(k)} &= \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} \\
k &\triangleq \text{feature}
\end{aligned}
$$

Here both $\gamma^{(k)}$ and $\beta^{(k)}$ are learned parameters, and by setting:

$$
\begin{aligned}
\gamma^{(k)} &= \sqrt{ Var[x^{(k)}] } \\
\beta^{(k)} &= E[x^{(k)}]
\end{aligned}
$$

we would restore the original activations[^batch-normalization-paper-2].

However, in `SGD`, or more generally in `Mini-Batch` learning, ***we do not have access to all values of the distribution***, so we ***estimate both `mean` and `std-dev` from the `mini-batch`***.

## Normalization operations

Once we add a `normalization layer`, the per-layer scheme becomes: `linear transform` → `normalization` → `activation`.

### Generic Normalization Scheme

Generally speaking, ***taking inspiration from the [`Batch-Normalization`](#batch-normalization) algorithm, we can derive a more general one***:

$$
\vec{y} = \frac{ a }{ \sigma } (\vec{x} - \mu) + b
$$

Here $a$ and $b$ are ***learnable***, while $\sigma$ and $\mu$ are ***fixed***.

> [!NOTE]
> This formula does not reverse the normalization, as
> both $a$ and $b$ are ***learned***.
>
> $\sigma$ and $\mu$ ***come from the actual
> distribution we have***
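To make the scheme concrete, here is a minimal NumPy sketch of the training-time transform. The function names and the `(N, D)` batch shape are illustrative assumptions, not taken from the paper; the small `eps` added to the variance for numerical stability does follow the paper's Algorithm 1.

```python
import numpy as np

def normalize(x, mu, sigma, a, b):
    """Generic scheme: y = a / sigma * (x - mu) + b."""
    return a / sigma * (x - mu) + b

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalization as a special case of the generic scheme:
    mu and sigma are estimated per feature over the mini-batch,
    while gamma / beta play the role of the learned a / b."""
    mu = x.mean(axis=0)    # E[x^(k)] over the batch, one value per feature
    var = x.var(axis=0)    # Var[x^(k)] over the batch
    return normalize(x, mu, np.sqrt(var + eps), gamma, beta)

# Usage: a mini-batch of N = 32 examples with D = 4 features
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)     # learned parameters (initial values)
y = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))      # roughly 0 and 1 per feature
```

Setting `gamma = np.sqrt(var + eps)` and `beta = mu` would recover the original activations, exactly as noted above.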
### Types of Normalizations

We are going to consider a `dataset` of images with `Height` $H$, `Width` $W$, `Channels` $C$ and `number-of-points` $N$.

#### Batch Norm

Normalizes each `Channel` independently, over all `pixels` of all `data-points`

#### Layer Norm

Normalizes `everything` over a ***single*** `data-point`

#### Instance Norm

Normalizes a ***single `Channel`*** over a ***single*** `data-point`

#### Group Norm[^anelli-batch-normalization-3]

Normalizes a ***group of `Channels`*** over a ***single*** `data-point` (see the comparison sketch below)

## Practical Considerations

[^batch-normalization-paper]: [Batch Normalization Paper | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
[^non-whitening-paper]: [How Does Batch Normalization Help Optimization? | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)
[^non-whitening-paper-1]: [How Does Batch Normalization Help Optimization? | Paragraph 3.1 | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)
[^anelli-batch-normalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 4
[^batch-normalization-paper-1]: [Batch Normalization Paper | Paragraph 3.4 pg. 5 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
[^batch-normalization-paper-2]: [Batch Normalization Paper | Paragraph 3 pg. 3 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
[^anelli-batch-normalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 6
[^anelli-batch-normalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 20
[^anelli-batch-normalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 21
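To make the four variants above concrete, here is a small NumPy sketch in which they differ only in the axes the statistics are computed over. The `(N, C, H, W)` layout and the `group_size` value are illustrative assumptions, and the learnable $a$, $b$ of the generic scheme are omitted for brevity.

```python
import numpy as np

def normalize_over(x, axes, eps=1e-5):
    """Standardize x using mean/std computed over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    sigma = np.sqrt(x.var(axis=axes, keepdims=True) + eps)
    return (x - mu) / sigma

N, C, H, W = 8, 6, 4, 4
x = np.random.default_rng(0).normal(size=(N, C, H, W))

batch_norm    = normalize_over(x, axes=(0, 2, 3))  # per Channel, over all data-points and pixels
layer_norm    = normalize_over(x, axes=(1, 2, 3))  # everything within a single data-point
instance_norm = normalize_over(x, axes=(2, 3))     # a single Channel of a single data-point

# Group Norm: split the C Channels into groups and normalize each group
# within a single data-point (group_size is an illustrative choice).
group_size = 3
g = x.reshape(N, C // group_size, group_size, H, W)
group_norm = normalize_over(g, axes=(2, 3, 4)).reshape(N, C, H, W)
```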