# Normalization
<!-- TODO: Add naïve method does not work PDF 6 -->
## Batch Normalization[^batch-normalization-paper]
The idea behind this method is that
***distributions change across `Layers`***. This
effect is called ***internal covariate shift***; the
paper attributes it to the ***`parameters` of earlier
`Layers` changing during training***, which in turn
forces a careful choice of `hyperparameters`.
What the authors ***wanted to do*** was to ***reduce this
effect***; however, the method was
<u>***later shown not to achieve this goal***</u>.
What this layer actually does is ***smooth the
optimization landscape, making the gradient more
predictable***[^non-whitening-paper]; as a
consequence, it ***keeps most activations away from
the saturating regions, preventing exploding and vanishing
gradients, and makes the network more robust to
`hyperparameters`***[^non-whitening-paper-1][^anelli-batch-normalization-1].
### Benefits in detail
- We can use ***larger `learning-rates`***, because the
***gradient is unaffected by the scale of the weights***:
if we multiply the weights by a ***scalar*** $a$,
the following holds (a small numerical check follows this list):
$$
\begin{aligned}
\frac{
d \, BN((a\vec{W})\vec{u})
}{
d\, \vec{u}
} &= \frac{
d \, BN(\vec{W}\vec{u})
}{
d\, \vec{u}
} \\
\frac{
d \, BN((a\vec{W})\vec{u})
}{
d\, a \vec{W}
} &= \frac{1}{a} \cdot \frac{
d \, BN(\vec{W}\vec{u})
}{
d\, \vec{W}
}
\end{aligned}
$$
- ***Training times are reduced as we can use higher
`learning-rates`***
- ***We don't need further regularization or dropout***,
as with `BatchNormalization` ***the values produced
for a training example are no longer
deterministic***[^batch-normalization-paper-1].
This is because ***`mean` and `std-deviation` are
computed over mini-batches***[^anelli-batch-normalization-2], so the output for an
example depends on the other examples in its batch.
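
A minimal NumPy check of the scale-invariance claim in the first bullet (the toy batch `u`, weight matrix `W`, and scalar `a` are made up for illustration): since `BN` divides each feature by its standard deviation over the batch, scaling the weights by $a$ leaves the output, and hence the gradient with respect to $\vec{u}$, unchanged.
```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Normalize each feature (column) over the mini-batch (rows)."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(32, 8))   # toy mini-batch: 32 examples, 8 inputs
W = rng.normal(size=(8, 4))    # toy weight matrix: 8 inputs -> 4 features
a = 10.0                       # arbitrary scalar multiplier on the weights

# BN((aW)u) == BN(Wu): the scale of the weights is washed out by the normalization
print(np.allclose(batch_norm(u @ (a * W)), batch_norm(u @ W)))  # True
```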
### Algorithm in Detail
The actual implementation of `BatchNormalization`
***cuts a few corners***.
First of all,
***it normalizes each `feature` independently***
instead of whitening the
***"layer inputs and outputs jointly"***[^batch-normalization-paper-2]:
$$
\hat{x}^{(k)} = \frac{
x^{(k)} - E[x^{(k)}]
}{
\sqrt{ Var[x^{(k)}]}
}
$$
> [!WARNING]
> This ***works even if the features
> are not decorrelated***.

Then it ***applies a learned scale-and-shift that can
restore what the `input` originally represented*** (in
particular, it can recover the `identity transformation`)[^batch-normalization-paper-2]:
$$
\begin{aligned}
y^{(k)} &= \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} \\
k &\triangleq \text{feature}
\end{aligned}
$$
Here both $\gamma^{(k)}$ and $\beta^{(k)}$ are learned
parameters, and by setting:
$$
\begin{aligned}
\gamma^{(k)} &= \sqrt{ Var[x^{(k)}]} \\
\beta^{(k)} &= E[x^{(k)}]
\end{aligned}
$$
we would restore the original activations[^batch-normalization-paper-2]. However, in `SGD`,
or more generally in `Mini-Batch` learning, ***we do not
have access to the whole distribution***,
so we ***estimate both `mean` and `std-dev` from the
`mini-batch`***.
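
A minimal NumPy sketch of the training-time transform described above, assuming inputs of shape `(batch, features)`; `gamma` and `beta` stand in for $\gamma^{(k)}$ and $\beta^{(k)}$ and would normally be updated by the optimizer.
```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-time BatchNorm: statistics are taken per feature over the mini-batch."""
    mean = x.mean(axis=0)                    # E[x^{(k)}] over the mini-batch
    var = x.var(axis=0)                      # Var[x^{(k)}] over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta              # learned scale and shift

x = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=(16, 4))
gamma, beta = np.ones(4), np.zeros(4)        # learnable parameters at typical initial values
y = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0 and 1 per feature
```
Setting `gamma` to the per-feature `std-dev` and `beta` to the per-feature `mean` would recover the original activations, as noted above.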
<!-- TODO: Insert image of algorithm from paper -->
## Normalization operations
Once we add a `normalization layer`, the scheme becomes:
<!-- TODO: Insert image or memrmaid of PDF 6 pg. 14 -->
### Generic Normalization Scheme
Generally speaking, ***taking inspiration from
the [`Batch-Normalization`](#batch-normalization)
algorithm, we can derive a more general one***:
$$
\vec{y} = \frac{
a
}{
\sigma
} (\vec{x} - \mu) + b
$$
Here $a$ and $b$ are ***learnable*** while $\sigma$ and
$\mu$ are ***fixed***.
> [!NOTE]
> This formula does not simply reverse the normalization,
> as both $a$ and $b$ are ***learned***.
>
> $\sigma$ and $\mu$ ***come from the actual
> distribution we observe***.
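
A minimal sketch of this generic scheme (the function name and toy data are illustrative): `mu` and `sigma` are taken from the slice of data a particular scheme looks at, while `a` and `b` are the learnable scale and shift.
```python
import numpy as np

def generic_norm(x, mu, sigma, a, b):
    """y = a / sigma * (x - mu) + b, with mu/sigma fixed and a/b learnable."""
    return a / sigma * (x - mu) + b

x = np.random.default_rng(2).normal(loc=5.0, scale=3.0, size=1000)
mu, sigma = x.mean(), x.std()   # statistics from the distribution we actually observe
a, b = 1.0, 0.0                 # learnable parameters, shown here at typical initial values
y = generic_norm(x, mu, sigma, a, b)
print(round(y.mean(), 3), round(y.std(), 3))  # roughly 0.0 and 1.0
```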
### Types of Normalizations
We are going to consider a `dataset` of $N$ images
(`number-of-points`), each with `Height` $H$,
`Width` $W$, and `Channels` $C$.
<!-- TODO: Add images -->
#### Batch Norm
Normalizes a ***single `Channel`*** over ***all
`data-points`*** (and all `pixels`)
#### Layer Norm
Normalizes `everything` within a ***single*** `data-point`
#### Instance Norm
Normalizes ***only*** a ***single `Channel`*** within
a ***single `data-point`***
#### Group Norm[^anelli-batch-normalization-3]
Normalizes ***multiple `Channels`*** (a group) within a
***single `data-point`***
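
A sketch contrasting the four schemes on a toy tensor of shape $(N, C, H, W)$ (the sizes and the group count `G` are illustrative): the only thing that changes between them is the set of axes over which `mean` and `std-dev` are computed.
```python
import numpy as np

rng = np.random.default_rng(3)
N, C, H, W = 8, 6, 4, 4
x = rng.normal(size=(N, C, H, W))

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the std computed over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    std = x.std(axis=axes, keepdims=True)
    return (x - mean) / (std + eps)

batch_norm    = normalize(x, axes=(0, 2, 3))   # per Channel, over all data-points and pixels
layer_norm    = normalize(x, axes=(1, 2, 3))   # everything within a single data-point
instance_norm = normalize(x, axes=(2, 3))      # a single Channel of a single data-point
G = 2                                          # number of Channel groups (illustrative)
group_norm = normalize(
    x.reshape(N, G, C // G, H, W), axes=(2, 3, 4)
).reshape(N, C, H, W)                          # groups of Channels, per data-point
```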
<!-- TODO: see Aaron Defazio -->
## Practical Considerations
<!-- TODO: read PDF 6 pg. 22 - 23 -->
### Why normalization helps[^anelli-batch-normalization-4]
- Normalization smoothes the `objective function`
allowing the use of ***larger `learning-rates`***
- The batch estimates of `mean` and `std-dev` are ***noisy***,
and this noise acts as a mild regularizer, which
helps ***generalization*** in some cases
- Reduces sensitivity to `weight` initialization
<!-- Footnotes -->
[^batch-normalization-paper]: [Batch Normalization Paper | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
[^non-whitening-paper]: [How Does Batch Normalization Help Optimization? | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)
[^non-whitening-paper-1]: [How Does Batch Normalization Help Optimization? | Paragraph 3.1 | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)
[^anelli-batch-normalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 4
[^batch-normalization-paper-1]: [Batch Normalization Paper | Paragraph 3.4 pg. 5 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
[^batch-normalization-paper-2]: [Batch Normalization Paper | Paragraph 3 pg. 3 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)
[^anelli-batch-normalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 6
[^anelli-batch-normalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 20
[^anelli-batch-normalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 21