From ac20c47e5a9a53a7364f1d111e956658cf7a0102 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Tue, 22 Apr 2025 12:25:17 +0200
Subject: [PATCH] Added Normalization

---
 Chapters/6-Normalization-Layers/INDEX.md | 202 +++++++++++++++++++++++
 1 file changed, 202 insertions(+)
 create mode 100644 Chapters/6-Normalization-Layers/INDEX.md

diff --git a/Chapters/6-Normalization-Layers/INDEX.md b/Chapters/6-Normalization-Layers/INDEX.md
new file mode 100644
index 0000000..4a39ff7
--- /dev/null
+++ b/Chapters/6-Normalization-Layers/INDEX.md
@@ -0,0 +1,202 @@
+# Normalization
+
+
+
+## Batch Normalization[^batch-normalization-paper]
+
+The idea behind this method is the fact that
+***activation distributions change across `Layers`***
+while the network trains. This effect is called
+***internal covariate shift***, and it is what makes
+training so ***sensitive to badly chosen
+`hyperparameters`***.
+
+What the authors ***wanted to do*** was to ***reduce this
+effect***; however, the method was later
+***shown not to achieve this goal***.
+
+What this layer does in reality is ***smooth the
+optimization landscape, making the gradient more
+predictable***[^non-whitening-paper]. As a
+consequence, it ***keeps most activations away from the
+saturating regions, preventing exploding and vanishing
+gradients, and makes the network more robust to
+`hyperparameters`***[^non-whitening-paper-1][^anelli-batch-normalization-1].
+
+### Benefits in detail
+
+- We can use ***larger `learning-rates`***, because
+  ***back-propagation through the layer is unaffected by
+  the scale of its weights***: if we multiply the weights
+  by a ***scalar*** $a$, the following holds:
+  $$
+  \begin{aligned}
+    \frac{
+      \partial \, BN((a\vec{W})\vec{u})
+    }{
+      \partial\, \vec{u}
+    } &= \frac{
+      \partial \, BN(\vec{W}\vec{u})
+    }{
+      \partial\, \vec{u}
+    } \\
+    \frac{
+      \partial \, BN((a\vec{W})\vec{u})
+    }{
+      \partial\, (a\vec{W})
+    } &= \frac{1}{a} \cdot \frac{
+      \partial \, BN(\vec{W}\vec{u})
+    }{
+      \partial\, \vec{W}
+    }
+  \end{aligned}
+  $$
+- ***Training times are reduced*** precisely because we
+  can use these higher `learning rates`.
+- ***We need less regularization (e.g. `Dropout`)***,
+  as `BatchNormalization` ***no longer produces
+  deterministic values for a given
+  training example***[^batch-normalization-paper-1].
+  This is because ***`mean` and `std-deviation` are
+  computed over batches***[^anelli-batch-normalization-2],
+  so the output for an example depends on the other
+  examples it is batched with.
+
+### Algorithm in Detail
+
+The actual implementation of `BatchNormalization` is
+***cheating*** in some places, compared to a full
+whitening of the activations.
+
+First of all,
+***it normalizes each `feature` independently***
+instead of normalizing the
+***"layer inputs and outputs jointly"***[^batch-normalization-paper-2]:
+
+$$
+\hat{x}^{(k)} = \frac{
+  x^{(k)} - E[x^{(k)}]
+}{
+  \sqrt{ Var[x^{(k)}]}
+}
+$$
+
+> [!WARNING]
+> This ***works even if these features
+> are not decorrelated***
+
+Then it ***applies a learned scale-and-shift
+transformation, which can restore what the `input`
+originally meant*** (i.e. the layer is able to represent
+the `identity transformation`)[^batch-normalization-paper-2]:
+
+$$
+\begin{aligned}
+  y^{(k)} &= \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} \\
+  k &\triangleq \text{feature}
+\end{aligned}
+$$
+
+Here both $\gamma^{(k)}$ and $\beta^{(k)}$ are learned
+parameters, and by setting:
+
+$$
+\begin{aligned}
+  \gamma^{(k)} &= \sqrt{ Var[x^{(k)}]} \\
+  \beta^{(k)} &= E[x^{(k)}]
+\end{aligned}
+$$
+
+we would restore the original activations[^batch-normalization-paper-2].
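+
+Before moving on, this is a minimal NumPy sketch of the
+transform described so far (the function name
+`batch_norm_forward` and the toy shapes are illustrative,
+not taken from the paper; a small `eps` is added for
+numerical stability). It estimates the statistics from the
+mini-batch it receives, which is exactly the simplification
+discussed next:
+
+```python
+import numpy as np
+
+def batch_norm_forward(x, gamma, beta, eps=1e-5):
+    """Training-time Batch Normalization for a mini-batch x of shape (N, D).
+
+    Each of the D features is normalized independently with the
+    statistics of the current mini-batch, then scaled and shifted by
+    the learned per-feature parameters gamma and beta.
+    """
+    mean = x.mean(axis=0)                    # E[x^(k)], one value per feature
+    var = x.var(axis=0)                      # Var[x^(k)], one value per feature
+    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized activations
+    return gamma * x_hat + beta              # y^(k) = gamma * x_hat^(k) + beta
+
+# Toy usage: a mini-batch of 4 examples with 3 features each
+x = np.random.randn(4, 3) * 10.0 + 5.0
+gamma, beta = np.ones(3), np.zeros(3)
+y = batch_norm_forward(x, gamma, beta)
+print(y.mean(axis=0), y.std(axis=0))  # roughly 0 and 1 per feature
+```
+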
+However, in `SGD`,
+or more generally in `Mini-Batch` learning, ***we do not
+have access to all values of the distribution***,
+so we ***estimate both `mean` and `std-dev` from the
+`mini-batch`***, exactly as the sketch above does.
+
+
+
+## Normalization operations
+
+Once we add a `normalization layer`, the processing
+scheme becomes the following:
+
+
+
+### Generic Normalization Scheme
+
+Generally speaking, ***taking inspiration from
+the [`Batch-Normalization`](#batch-normalization)
+algorithm, we can derive a more general one***:
+
+$$
+  \vec{y} = \frac{
+    a
+  }{
+    \sigma
+  } (\vec{x} - \mu) + b
+$$
+
+Here $a$ and $b$ are ***learnable***, while $\sigma$ and
+$\mu$ are ***fixed***.
+
+> [!NOTE]
+> This formula does not reverse the normalization, as
+> both $a$ and $b$ are ***learned***.
+>
+> $\sigma$ and $\mu$ ***come from the actual
+> distribution we have***
+
+### Types of Normalizations
+
+We are going to consider a `dataset` of images with
+`Height` $H$, `Width` $W$, `Channels` $C$ and
+`number-of-points` $N$. A sketch contrasting the four
+variants is given after the footnotes.
+
+
+
+#### Batch Norm
+
+Normalizes each `Channel` separately, over all
+`data-points` and all `pixel` positions
+
+#### Layer Norm
+
+Normalizes `everything` ($C$, $H$, $W$) over a
+***single*** `data-point`
+
+#### Instance Norm
+
+Normalizes ***only*** a ***single `Channel`*** of a
+***single `data-point`***
+
+#### Group Norm[^anelli-batch-normalization-3]
+
+Normalizes over ***multiple `Channels`*** (a `group`) of a
+***single `data-point`***
+
+
+
+## Practical Considerations
+
+
+
+[^batch-normalization-paper]: [Batch Normalization Paper | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
+
+[^non-whitening-paper]: [How Does Batch Normalization Help Optimization? | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)
+
+[^non-whitening-paper-1]: [How Does Batch Normalization Help Optimization? | Paragraph 3.1 | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)
+
+[^anelli-batch-normalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 4
+
+[^batch-normalization-paper-1]: [Batch Normalization Paper | Paragraph 3.4 pg. 5 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
+
+[^batch-normalization-paper-2]: [Batch Normalization Paper | Paragraph 3 pg. 3 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
+
+[^anelli-batch-normalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 6
+
+[^anelli-batch-normalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 20
+
+[^anelli-batch-normalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 21
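+
+To make the difference between these variants concrete,
+here is a minimal NumPy sketch (the helper `normalize`,
+the `eps` constant and the toy tensor shapes are
+illustrative and not taken from the cited material; the
+learnable $a$ and $b$ parameters are omitted). Each
+variant is just the generic scheme applied over a
+different set of axes of an $(N, C, H, W)$ tensor:
+
+```python
+import numpy as np
+
+def normalize(x, axes, eps=1e-5):
+    """Standardize x to zero mean and unit std over the given axes."""
+    mu = x.mean(axis=axes, keepdims=True)
+    sigma = x.std(axis=axes, keepdims=True)
+    return (x - mu) / (sigma + eps)  # eps avoids division by zero
+
+# A toy batch of N=8 images, C=6 channels, H=W=4 pixels
+x = np.random.randn(8, 6, 4, 4)
+
+# Batch Norm: statistics per Channel, over all data-points and pixels (N, H, W)
+bn = normalize(x, axes=(0, 2, 3))
+
+# Layer Norm: statistics over everything in a single data-point (C, H, W)
+ln = normalize(x, axes=(1, 2, 3))
+
+# Instance Norm: statistics over a single Channel of a single data-point (H, W)
+inst = normalize(x, axes=(2, 3))
+
+# Group Norm: statistics over a group of Channels of a single data-point
+groups = 2                                   # 2 groups of 3 channels each
+xg = x.reshape(8, groups, 6 // groups, 4, 4)
+gn = normalize(xg, axes=(2, 3, 4)).reshape(x.shape)
+```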