# Normalization

## Batch Normalization[^batch-normalization-paper]

The idea behind this method is the fact that ***the distribution of each layer's inputs changes across `Layers` during training***. This effect is called ***(internal) covariate shift***, and it is believed to come from ***choosing bad `hyperparameters`***.

What the authors ***wanted to do*** was to ***reduce this effect***; however, later work ***showed that the method does not actually achieve this goal***.

What this layer does in reality is ***smooth the optimization landscape, making the gradients more predictable***[^non-whitening-paper], and as a consequence it ***keeps most activations away from the saturating regions, preventing exploding and vanishing gradients, and makes the network more robust to `hyperparameters`***[^non-whitening-paper-1][^anelli-batch-normalization-1]

### Benefits in detail

- We can use ***larger `learning-rates`***, as the ***gradient is unaffected by the scale of the parameters***: if we apply a ***scalar multiplier*** $a$, the following holds:

  $$
  \begin{aligned}
  \frac{ d \, BN((a\vec{W})\vec{u}) }{ d\, \vec{u} } &= \frac{ d \, BN(\vec{W}\vec{u}) }{ d\, \vec{u} } \\
  \frac{ d \, BN((a\vec{W})\vec{u}) }{ d\, (a\vec{W}) } &= \frac{1}{a} \cdot \frac{ d \, BN(\vec{W}\vec{u}) }{ d\, \vec{W} }
  \end{aligned}
  $$

- ***Training times are reduced***, precisely because we can use higher `learning-rates`
- ***We need little to no additional regularization such as `Dropout`***, as with `BatchNormalization` the network ***no longer produces deterministic values for a given training example***[^batch-normalization-paper-1]. This is because ***`mean` and `std-deviation` are computed over batches***[^anelli-batch-normalization-2], so the output for one example depends on the other examples in its batch.

### Algorithm in Detail

The actual implementation of `BatchNormalization` is ***cheating*** in some places. First of all, ***it normalizes each `feature` independently*** instead of whitening the ***"layer inputs and outputs jointly"***[^batch-normalization-paper-2]:

$$
\hat{x}^{(k)} = \frac{ x^{(k)} - E[x^{(k)}] }{ \sqrt{ Var[x^{(k)}] } }
$$

> [!WARNING]
> This ***works even if these features
> are not decorrelated***

Then it ***applies a scale-and-shift transformation able to represent the `identity transform`, so that it can restore what the `input` originally represented***[^batch-normalization-paper-2]:

$$
\begin{aligned}
y^{(k)} &= \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} \\
k &\triangleq \text{feature}
\end{aligned}
$$

Here both $\gamma^{(k)}$ and $\beta^{(k)}$ are learned parameters, and by setting:

$$
\begin{aligned}
\gamma^{(k)} &= \sqrt{ Var[x^{(k)}] } \\
\beta^{(k)} &= E[x^{(k)}]
\end{aligned}
$$

we would restore the original activations[^batch-normalization-paper-2].

However, in `SGD`, or more generally in `Mini-Batch` learning, ***we do not have access to all values of the distribution***, so we ***estimate both `mean` and `std-dev` from the `mini-batch`***.

## Normalization operations

Once we add a `normalization layer`, the per-layer scheme becomes: `linear transform` → `normalization` → `activation`.

### Generic Normalization Scheme

Generally speaking, ***taking inspiration from the [`Batch-Normalization`](#batch-normalization) algorithm, we can derive a more general one***:

$$
\vec{y} = \frac{ a }{ \sigma } (\vec{x} - \mu) + b
$$

Here $a$ and $b$ are ***learnable***, while $\sigma$ and $\mu$ are ***fixed***.

> [!NOTE]
> This formula does not reverse the normalization, as
> both $a$ and $b$ are ***learned***.
>
> $\sigma$ and $\mu$ ***come from the actual
> distribution we have***
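To make the scheme concrete, here is a minimal NumPy sketch of the training-time transform. The function names and the `(N, D)` batch shape are illustrative assumptions, not taken from the paper; the small `eps` added to the variance for numerical stability does follow the paper's Algorithm 1.

```python
import numpy as np

def normalize(x, mu, sigma, a, b):
    """Generic scheme: y = a / sigma * (x - mu) + b."""
    return a / sigma * (x - mu) + b

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalization as a special case of the generic scheme:
    mu and sigma are estimated per feature over the mini-batch,
    while gamma / beta play the role of the learned a / b."""
    mu = x.mean(axis=0)    # E[x^(k)] over the batch, one value per feature
    var = x.var(axis=0)    # Var[x^(k)] over the batch
    return normalize(x, mu, np.sqrt(var + eps), gamma, beta)

# Usage: a mini-batch of N = 32 examples with D = 4 features
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)     # learned parameters (initial values)
y = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))      # roughly 0 and 1 per feature
```

Setting `gamma = np.sqrt(var + eps)` and `beta = mu` would recover the original activations, exactly as noted above.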
### Types of Normalizations

We are going to consider a `dataset` of images with `Height` $H$, `Width` $W$, `Channels` $C$ and `number-of-points` $N$.

#### Batch Norm

Normalizes each `Channel` independently, over all `pixels` of all `data-points`

#### Layer Norm

Normalizes `everything` over a ***single*** `data-point`

#### Instance Norm

Normalizes a ***single `Channel`*** over a ***single*** `data-point`

#### Group Norm[^anelli-batch-normalization-3]

Normalizes a ***group of `Channels`*** over a ***single*** `data-point` (see the comparison sketch below)

## Practical Considerations

[^batch-normalization-paper]: [Batch Normalization Paper | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
[^non-whitening-paper]: [How Does Batch Normalization Help Optimization? | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)
[^non-whitening-paper-1]: [How Does Batch Normalization Help Optimization? | Paragraph 3.1 | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)
[^anelli-batch-normalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 4
[^batch-normalization-paper-1]: [Batch Normalization Paper | Paragraph 3.4 pg. 5 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
[^batch-normalization-paper-2]: [Batch Normalization Paper | Paragraph 3 pg. 3 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
[^anelli-batch-normalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 6
[^anelli-batch-normalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 20
[^anelli-batch-normalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 21
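To make the four variants above concrete, here is a small NumPy sketch in which they differ only in the axes the statistics are computed over. The `(N, C, H, W)` layout and the `group_size` value are illustrative assumptions, and the learnable $a$, $b$ of the generic scheme are omitted for brevity.

```python
import numpy as np

def normalize_over(x, axes, eps=1e-5):
    """Standardize x using mean/std computed over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    sigma = np.sqrt(x.var(axis=axes, keepdims=True) + eps)
    return (x - mu) / sigma

N, C, H, W = 8, 6, 4, 4
x = np.random.default_rng(0).normal(size=(N, C, H, W))

batch_norm    = normalize_over(x, axes=(0, 2, 3))  # per Channel, over all data-points and pixels
layer_norm    = normalize_over(x, axes=(1, 2, 3))  # everything within a single data-point
instance_norm = normalize_over(x, axes=(2, 3))     # a single Channel of a single data-point

# Group Norm: split the C Channels into groups and normalize each group
# within a single data-point (group_size is an illustrative choice).
group_size = 3
g = x.reshape(N, C // group_size, group_size, H, W)
group_norm = normalize_over(g, axes=(2, 3, 4)).reshape(N, C, H, W)
```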