# Normalization

<!-- TODO: Add naïve method does not work PDF 6 -->

## Batch Normalization[^batch-normalization-paper]

The idea behind this method is the fact that
***distributions change across `Layers`***. This
effect is called ***internal covariate shift*** and it
is believed to come from ***choosing bad
`hyperparameters`***.

What the authors ***wanted to do*** was to ***reduce this
effect***; however, it was later
<u>***shown that this is not what the method actually achieves***</u>.

What this layer does in reality is ***smooth the
optimization landscape, making the gradient more
predictable***[^non-whitening-paper], and as a
consequence it ***keeps most activations away from the
saturating regions, preventing exploding and vanishing
gradients, and makes the network more robust to
`hyperparameters`***[^non-whitening-paper-1][^anelli-batch-normalization-1].

### Benefits in detail

- We can use ***larger `learning-rates`***, because
  ***back-propagation through the layer is unaffected by
  the scale of its weights***: if we multiply the weights
  by a ***scalar*** $a$, the following holds (see the
  numerical check after this list):

$$
\begin{aligned}
\frac{d \, BN((a\vec{W})\vec{u})}{d \, \vec{u}} &= \frac{d \, BN(\vec{W}\vec{u})}{d \, \vec{u}} \\
\frac{d \, BN((a\vec{W})\vec{u})}{d \, (a\vec{W})} &= \frac{1}{a} \cdot \frac{d \, BN(\vec{W}\vec{u})}{d \, \vec{W}}
\end{aligned}
$$

- ***Training times are reduced as we can use higher
  `learning-rates`***
- ***We don't need further regularization or dropout***
  as `BatchNormalization` ***no longer produces
  deterministic values for a given training
  example***[^batch-normalization-paper-1].
  This is because ***`mean` and `std-deviation` are
  computed over mini-batches***[^anelli-batch-normalization-2].
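
As a quick numerical check of the scale-invariance above, here is a minimal
sketch assuming NumPy (the names `bn`, `loss` and `num_grad_u` are purely
illustrative, not from the paper): it estimates the gradient of the loss with
respect to the layer input by finite differences and verifies that rescaling
the weights by $a$ leaves it unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def bn(z, eps=1e-5):
    # Normalize each feature (column) over the mini-batch dimension.
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def loss(u, w, v):
    # Arbitrary smooth scalar function of the batch-normalized pre-activations.
    return np.sum(v * bn(u @ w.T))

def num_grad_u(u, w, v, h=1e-6):
    # Central finite differences of the loss with respect to the input u.
    g = np.zeros_like(u)
    for idx in np.ndindex(u.shape):
        up, um = u.copy(), u.copy()
        up[idx] += h
        um[idx] -= h
        g[idx] = (loss(up, w, v) - loss(um, w, v)) / (2 * h)
    return g

u = rng.normal(size=(8, 5))  # mini-batch of 8 inputs with 5 features
w = rng.normal(size=(3, 5))  # weights of a layer with 3 units
v = rng.normal(size=(8, 3))  # fixed coefficients defining the scalar loss
a = 10.0                     # scalar multiplier on the weights

print(np.allclose(num_grad_u(u, w, v), num_grad_u(u, a * w, v), atol=1e-4))  # True
```

Rescaling the weights therefore does not change how the error propagates to
earlier layers, which is what makes larger `learning-rates` safe.
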
### Algorithm in Detail

The actual implementation of `BatchNormalization` is
***cheating*** in some places.

First of all,
***it normalizes each `feature` independently***
instead of whitening the
***"layer inputs and outputs jointly"***[^batch-normalization-paper-2]:

$$
\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}
$$

> [!WARNING]
> This ***works even if these features
> are not decorrelated***

Then it ***applies a learned `scale-and-shift
transformation` that can restore what the `input`
originally meant*** (in particular, it is able to
represent the `identity transform`)[^batch-normalization-paper-2]:

$$
\begin{aligned}
y^{(k)} &= \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} \\
k &\triangleq \text{feature}
\end{aligned}
$$

Here both $\gamma^{(k)}$ and $\beta^{(k)}$ are learned
parameters, and by setting:

$$
\begin{aligned}
\gamma^{(k)} &= \sqrt{Var[x^{(k)}]} \\
\beta^{(k)} &= E[x^{(k)}]
\end{aligned}
$$

we would restore the original activations[^batch-normalization-paper-2]. However, in `SGD`,
or more generally in `Mini-Batch` learning, ***we do not
have access to the whole distribution***, so we
***estimate both `mean` and `std-dev` from the current
`mini-batch`***.
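
Below is a minimal NumPy sketch of the training-time forward pass just
described (the function name `batch_norm_forward` and the small `eps`
constant are my own additions, not taken from the paper): the per-feature
statistics come from the current `mini-batch`, and the learned $\gamma^{(k)}$
and $\beta^{(k)}$ then rescale and shift each feature.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BatchNorm for a mini-batch x of shape (N, features)."""
    mean = x.mean(axis=0)                    # E[x^(k)], estimated on the mini-batch
    var = x.var(axis=0)                      # Var[x^(k)], estimated on the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each feature independently
    return gamma * x_hat + beta              # learned per-feature scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))  # mini-batch of 32 samples, 4 features

y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature

# Setting gamma = sqrt(Var[x]) and beta = E[x] recovers the original activations:
restored = batch_norm_forward(x, gamma=np.sqrt(x.var(axis=0)), beta=x.mean(axis=0))
print(np.allclose(restored, x, atol=1e-3))              # True (up to the eps term)
```
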
<!-- TODO: Insert image of algorithm from paper -->
## Normalization operations

Once we add a `normalization layer`, the scheme becomes:

<!-- TODO: Insert image or mermaid of PDF 6 pg. 14 -->

### Generic Normalization Scheme

Generally speaking, ***taking inspiration from
the [`Batch-Normalization`](#batch-normalization)
algorithm, we can derive a more general scheme***:

$$
\vec{y} = \frac{a}{\sigma} (\vec{x} - \mu) + b
$$

Here $a$ and $b$ are ***learnable*** while $\sigma$ and
$\mu$ are ***fixed***.

> [!NOTE]
> This formula does not reverse the normalization, as
> both $a$ and $b$ are ***learned***
>
> $\sigma$ and $\mu$ ***come from the actual
> distribution we have***
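
A rough sketch of this scheme (the function `generic_norm` and the `eps`
term are my own illustrative choices): $\mu$ and $\sigma$ are re-measured
from the incoming data on every forward pass, while $a$ and $b$ are the
parameters that would receive gradients during training.

```python
import numpy as np

def generic_norm(x, a, b, eps=1e-5):
    mu = x.mean()      # fixed: measured from the data, never updated by the optimizer
    sigma = x.std()    # fixed: measured from the data, never updated by the optimizer
    return a / (sigma + eps) * (x - mu) + b  # a and b are the learnable parameters

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=1_000)
y = generic_norm(x, a=1.0, b=0.0)             # identity-like initialization of a and b
print(round(y.mean(), 3), round(y.std(), 3))  # ~0.0 and ~1.0
```
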
### Types of Normalizations

We are going to consider a `dataset` of images of
`Height` $H$, `Width` $W$, `Channels` $C$ and
`number-of-points` $N$

<!-- TODO: Add images -->

#### Batch Norm

Normalizes each `Channel` separately, over all `pixels`
of all `data-points`

#### Layer Norm

Normalizes `everything` over a ***single*** `data-point`

#### Instance Norm

It normalizes ***only*** a ***single `Channel`*** over
a ***single `data-point`***

#### Group Norm[^anelli-batch-normalization-3]

Normalizes over ***multiple `Channels`*** on a
***single `data-point`***
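
To make the four variants concrete, here is a hedged sketch showing which
axes each one averages over for a batch shaped $(N, C, H, W)$; the axis
layout and the group size `G = 2` are my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 8, 6, 4, 4                  # data-points, Channels, Height, Width
x = rng.normal(size=(N, C, H, W))

def norm_over(t, axes, eps=1e-5):
    # Subtract the mean and divide by the std-dev computed over the given axes.
    mu = t.mean(axis=axes, keepdims=True)
    sigma = t.std(axis=axes, keepdims=True)
    return (t - mu) / (sigma + eps)

batch_norm    = norm_over(x, axes=(0, 2, 3))  # per Channel, over all data-points
layer_norm    = norm_over(x, axes=(1, 2, 3))  # everything, per data-point
instance_norm = norm_over(x, axes=(2, 3))     # one Channel of one data-point

# Group Norm: split the Channels into G groups and normalize each group
# within a single data-point.
G = 2
group_norm = norm_over(x.reshape(N, G, C // G, H, W), axes=(2, 3, 4)).reshape(N, C, H, W)
```
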
<!-- TODO: see Aaron Defazio -->
## Practical Considerations

<!-- TODO: read PDF 6 pg. 22 - 23 -->

### Why normalization helps[^anelli-batch-normalization-4]

- Normalization smooths the `objective function`,
  allowing the use of ***larger `learning-rates`***
- The `mean` and `std-dev` are ***noisy***, and this
  helps ***generalization*** in some cases
- It reduces sensitivity to `weight` initialization

<!-- Footnotes -->

[^batch-normalization-paper]: [Batch Normalization Paper | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)

[^non-whitening-paper]: [How Does Batch Normalization Help Optimization? | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)

[^non-whitening-paper-1]: [How Does Batch Normalization Help Optimization? | Paragraph 3.1 | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)

[^anelli-batch-normalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 4

[^batch-normalization-paper-1]: [Batch Normalization Paper | Paragraph 3.4 pg. 5 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)

[^batch-normalization-paper-2]: [Batch Normalization Paper | Paragraph 3 pg. 3 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)

[^anelli-batch-normalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 6

[^anelli-batch-normalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 20

[^anelli-batch-normalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 21