From ac20c47e5a9a53a7364f1d111e956658cf7a0102 Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Tue, 22 Apr 2025 12:25:17 +0200
Subject: [PATCH] Added Normalization

---
 Chapters/6-Normalization-Layers/INDEX.md | 202 +++++++++++++++++++++++
 1 file changed, 202 insertions(+)
 create mode 100644 Chapters/6-Normalization-Layers/INDEX.md

diff --git a/Chapters/6-Normalization-Layers/INDEX.md b/Chapters/6-Normalization-Layers/INDEX.md
new file mode 100644
index 0000000..4a39ff7
--- /dev/null
+++ b/Chapters/6-Normalization-Layers/INDEX.md
@@ -0,0 +1,202 @@
+# Normalization
+
+
+
+## Batch Normalization[^batch-normalization-paper]
+
+The idea behind this method is the fact that
+***activation distributions change across `Layers`***
+while the network trains. This effect is called
+***internal covariate shift***, and it is what makes
+training so ***sensitive to badly chosen
+`hyperparameters`***.
+
+What the authors ***wanted to do*** was to ***reduce this
+effect***; however, the method was later
+***shown not to achieve this goal***.
+
+What this layer does in reality is ***smooth the
+optimization landscape, making the gradient more
+predictable***[^non-whitening-paper]. As a
+consequence, it ***keeps most activations away from the
+saturating regions, preventing exploding and vanishing
+gradients, and makes the network more robust to
+`hyperparameters`***[^non-whitening-paper-1][^anelli-batch-normalization-1].
+
+### Benefits in detail
+
+- We can use ***larger `learning-rates`***, because
+  ***back-propagation through the layer is unaffected by
+  the scale of its weights***: if we multiply the weights
+  by a ***scalar*** $a$, the following holds:
+  $$
+  \begin{aligned}
+    \frac{
+      \partial \, BN((a\vec{W})\vec{u})
+    }{
+      \partial\, \vec{u}
+    } &= \frac{
+      \partial \, BN(\vec{W}\vec{u})
+    }{
+      \partial\, \vec{u}
+    } \\
+    \frac{
+      \partial \, BN((a\vec{W})\vec{u})
+    }{
+      \partial\, (a\vec{W})
+    } &= \frac{1}{a} \cdot \frac{
+      \partial \, BN(\vec{W}\vec{u})
+    }{
+      \partial\, \vec{W}
+    }
+  \end{aligned}
+  $$
+- ***Training times are reduced*** precisely because we
+  can use these higher `learning rates`.
+- ***We need less regularization (e.g. `Dropout`)***,
+  as `BatchNormalization` ***no longer produces
+  deterministic values for a given
+  training example***[^batch-normalization-paper-1].
+  This is because ***`mean` and `std-deviation` are
+  computed over batches***[^anelli-batch-normalization-2],
+  so the output for an example depends on the other
+  examples it is batched with.
+
+### Algorithm in Detail
+
+The actual implementation of `BatchNormalization` is
+***cheating*** in some places, compared to a full
+whitening of the activations.
+
+First of all,
+***it normalizes each `feature` independently***
+instead of normalizing the
+***"layer inputs and outputs jointly"***[^batch-normalization-paper-2]:
+
+$$
+\hat{x}^{(k)} = \frac{
+  x^{(k)} - E[x^{(k)}]
+}{
+  \sqrt{ Var[x^{(k)}]}
+}
+$$
+
+> [!WARNING]
+> This ***works even if these features
+> are not decorrelated***
+
+Then it ***applies a learned scale-and-shift
+transformation, which can restore what the `input`
+originally meant*** (i.e. the layer is able to represent
+the `identity transformation`)[^batch-normalization-paper-2]:
+
+$$
+\begin{aligned}
+  y^{(k)} &= \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} \\
+  k &\triangleq \text{feature}
+\end{aligned}
+$$
+
+Here both $\gamma^{(k)}$ and $\beta^{(k)}$ are learned
+parameters, and by setting:
+
+$$
+\begin{aligned}
+  \gamma^{(k)} &= \sqrt{ Var[x^{(k)}]} \\
+  \beta^{(k)} &= E[x^{(k)}]
+\end{aligned}
+$$
+
+we would restore the original activations[^batch-normalization-paper-2].
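+
+Before moving on, this is a minimal NumPy sketch of the
+transform described so far (the function name
+`batch_norm_forward` and the toy shapes are illustrative,
+not taken from the paper; a small `eps` is added for
+numerical stability). It estimates the statistics from the
+mini-batch it receives, which is exactly the simplification
+discussed next:
+
+```python
+import numpy as np
+
+def batch_norm_forward(x, gamma, beta, eps=1e-5):
+    """Training-time Batch Normalization for a mini-batch x of shape (N, D).
+
+    Each of the D features is normalized independently with the
+    statistics of the current mini-batch, then scaled and shifted by
+    the learned per-feature parameters gamma and beta.
+    """
+    mean = x.mean(axis=0)                    # E[x^(k)], one value per feature
+    var = x.var(axis=0)                      # Var[x^(k)], one value per feature
+    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized activations
+    return gamma * x_hat + beta              # y^(k) = gamma * x_hat^(k) + beta
+
+# Toy usage: a mini-batch of 4 examples with 3 features each
+x = np.random.randn(4, 3) * 10.0 + 5.0
+gamma, beta = np.ones(3), np.zeros(3)
+y = batch_norm_forward(x, gamma, beta)
+print(y.mean(axis=0), y.std(axis=0))  # roughly 0 and 1 per feature
+```
+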
+However, in `SGD`,
+or more generally in `Mini-Batch` learning, ***we do not
+have access to all values of the distribution***,
+so we ***estimate both `mean` and `std-dev` from the
+`mini-batch`***, exactly as the sketch above does.
+
+
+
+## Normalization operations
+
+Once we add a `normalization layer`, the processing
+scheme becomes the following:
+
+
+
+### Generic Normalization Scheme
+
+Generally speaking, ***taking inspiration from
+the [`Batch-Normalization`](#batch-normalization)
+algorithm, we can derive a more general one***:
+
+$$
+  \vec{y} = \frac{
+    a
+  }{
+    \sigma
+  } (\vec{x} - \mu) + b
+$$
+
+Here $a$ and $b$ are ***learnable***, while $\sigma$ and
+$\mu$ are ***fixed***.
+
+> [!NOTE]
+> This formula does not reverse the normalization, as
+> both $a$ and $b$ are ***learned***.
+>
+> $\sigma$ and $\mu$ ***come from the actual
+> distribution we have***
+
+### Types of Normalizations
+
+We are going to consider a `dataset` of images with
+`Height` $H$, `Width` $W$, `Channels` $C$ and
+`number-of-points` $N$. A sketch contrasting the four
+variants is given after the footnotes.
+
+
+
+#### Batch Norm
+
+Normalizes each `Channel` separately, over all
+`data-points` and all `pixel` positions
+
+#### Layer Norm
+
+Normalizes `everything` ($C$, $H$, $W$) over a
+***single*** `data-point`
+
+#### Instance Norm
+
+Normalizes ***only*** a ***single `Channel`*** of a
+***single `data-point`***
+
+#### Group Norm[^anelli-batch-normalization-3]
+
+Normalizes over ***multiple `Channels`*** (a `group`) of a
+***single `data-point`***
+
+
+
+## Practical Considerations
+
+
+
+[^batch-normalization-paper]: [Batch Normalization Paper | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
+
+[^non-whitening-paper]: [How Does Batch Normalization Help Optimization? | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)
+
+[^non-whitening-paper-1]: [How Does Batch Normalization Help Optimization? | Paragraph 3.1 | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)
+
+[^anelli-batch-normalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 4
+
+[^batch-normalization-paper-1]: [Batch Normalization Paper | Paragraph 3.4 pg. 5 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
+
+[^batch-normalization-paper-2]: [Batch Normalization Paper | Paragraph 3 pg. 3 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
+
+[^anelli-batch-normalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 6
+
+[^anelli-batch-normalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 20
+
+[^anelli-batch-normalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 21
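+
+To make the difference between these variants concrete,
+here is a minimal NumPy sketch (the helper `normalize`,
+the `eps` constant and the toy tensor shapes are
+illustrative and not taken from the cited material; the
+learnable $a$ and $b$ parameters are omitted). Each
+variant is just the generic scheme applied over a
+different set of axes of an $(N, C, H, W)$ tensor:
+
+```python
+import numpy as np
+
+def normalize(x, axes, eps=1e-5):
+    """Standardize x to zero mean and unit std over the given axes."""
+    mu = x.mean(axis=axes, keepdims=True)
+    sigma = x.std(axis=axes, keepdims=True)
+    return (x - mu) / (sigma + eps)  # eps avoids division by zero
+
+# A toy batch of N=8 images, C=6 channels, H=W=4 pixels
+x = np.random.randn(8, 6, 4, 4)
+
+# Batch Norm: statistics per Channel, over all data-points and pixels (N, H, W)
+bn = normalize(x, axes=(0, 2, 3))
+
+# Layer Norm: statistics over everything in a single data-point (C, H, W)
+ln = normalize(x, axes=(1, 2, 3))
+
+# Instance Norm: statistics over a single Channel of a single data-point (H, W)
+inst = normalize(x, axes=(2, 3))
+
+# Group Norm: statistics over a group of Channels of a single data-point
+groups = 2                                   # 2 groups of 3 channels each
+xg = x.reshape(8, groups, 6 // groups, 4, 4)
+gn = normalize(xg, axes=(2, 3, 4)).reshape(x.shape)
+```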