# Normalization

<!-- TODO: Add naïve method does not work PDF 6 -->

## Batch Normalization[^batch-normalization-paper]

The idea behind this method is the observation that
***distributions change across `Layers`***. This
effect is called ***covariate shift***, and it
is believed to come from ***choosing bad
`hyperparameters`***

What the authors ***wanted to do*** was to ***reduce this
effect***; however, the method was later
<u>***shown not to achieve this goal***</u>.

What this layer really does is ***smooth the
optimization function, making the gradient more
predictable***[^non-whitening-paper], and as a
consequence, it ***keeps most activations away from
the saturating regions, preventing exploding and vanishing
gradients, and makes the network more robust to
`hyperparameters`***[^non-whitening-paper-1][^anelli-batch-normalization-1]

### Benefits in detail

- We can use ***larger `learning-rates`***, as the
***gradient is unaffected by them***: if we
apply a ***scalar multiplier*** $a$ to the weights, the
following holds (see the numerical sketch after this list):
$$
\begin{aligned}
\frac{
d \, BN((a\vec{W})\vec{u})
}{
d\, \vec{u}
} &= \frac{
d \, BN(\vec{W}\vec{u})
}{
d\, \vec{u}
} \\
\frac{
d \, BN((a\vec{W})\vec{u})
}{
d\, a \vec{W}
} &= \frac{1}{a} \cdot \frac{
d \, BN(\vec{W}\vec{u})
}{
d\, \vec{W}
}
\end{aligned}
$$
- ***Training times are reduced as we can use higher
`learning rates`***
- ***We don't need further regularization or dropout***,
as `BatchNormalization` ***no longer produces
deterministic values for a
training example***[^batch-normalization-paper-1].
This is because ***`mean` and `std-deviation` are
computed over batches***[^anelli-batch-normalization-2].

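The scale invariance above is easy to check numerically. The following is a minimal sketch (not from the course material; the `batch_norm` helper, batch size, and $a = 7.3$ are illustrative assumptions): since $BN((a\vec{W})\vec{u}) = BN(\vec{W}\vec{u})$ for $a > 0$, the gradient with respect to $\vec{u}$ is unchanged, while the gradient with respect to the weights is scaled by $1/a$.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Normalize each feature (column) over the mini-batch dimension."""
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
U = rng.normal(size=(32, 10))  # mini-batch of 32 inputs with 10 features
W = rng.normal(size=(10, 5))   # layer weights
a = 7.3                        # positive scalar applied to the weights

# BN((aW)u) == BN(Wu): the scale of the weights is absorbed by the batch
# statistics, so anything downstream (and its gradient w.r.t. u) is identical.
print(np.allclose(batch_norm(U @ (a * W)), batch_norm(U @ W), atol=1e-4))  # True
```
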
### Algorithm in Detail

The actual implementation of `BatchNormalization`
***cheats*** in a couple of places.

First of all,
***it normalizes each `feature` independently***
instead of normalizing the
***"layer inputs and outputs jointly"***[^batch-normalization-paper-2]:

$$
\hat{x}^{(k)} = \frac{
x^{(k)} - E[x^{(k)}]
}{
\sqrt{ Var[x^{(k)}]}
}
$$

> [!WARNING]
> This ***works even if these features
> are not decorrelated***

Then it ***applies a learned linear transformation that
can restore what the `input` originally meant*** (recovering
the `identity transformation` when needed)[^batch-normalization-paper-2]:

$$
\begin{aligned}
y^{(k)} &= \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} \\
k &\triangleq \text{feature}
\end{aligned}
$$

Here both $\gamma^{(k)}$ and $\beta^{(k)}$ are learned
parameters, and by setting:

$$
\begin{aligned}
\gamma^{(k)} &= \sqrt{ Var[x^{(k)}]} \\
\beta^{(k)} &= E[x^{(k)}]
\end{aligned}
$$

we would restore the original activations[^batch-normalization-paper-2]. However, in `SGD`,
or more generally in `Mini-Batch` learning, ***we do not
have access to all values of the distribution***,
so we ***estimate both `mean` and `std-dev` from the
`mini-batch`***.

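As a rough sketch of the resulting training-time forward pass (my own illustration, not the paper's reference code; the names `gamma`, `beta`, and the small constant `eps` for numerical stability are the only pieces not spelled out in the formulas above):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BatchNorm for a mini-batch x of shape (batch, features)."""
    mean = x.mean(axis=0)                    # E[x^(k)], estimated on the mini-batch
    var = x.var(axis=0)                      # Var[x^(k)], estimated on the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each feature independently
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
x = 3.0 * rng.normal(size=(64, 4)) + 5.0  # mini-batch with non-zero mean, non-unit variance
gamma, beta = np.ones(4), np.zeros(4)

y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
# Setting gamma = sqrt(Var[x]) and beta = E[x] would give back (approximately) x itself.
```

At inference time, the paper swaps the `mini-batch` statistics for estimates of the population `mean` and `variance` (e.g. moving averages collected during training)[^batch-normalization-paper].
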
<!-- TODO: Insert image of algorithm from paper -->

## Normalization operations

Once we add a `normalization layer`, the scheme becomes:

<!-- TODO: Insert image or mermaid of PDF 6 pg. 14 -->

### Generic Normalization Scheme

Generally speaking, ***taking inspiration from
the [`Batch-Normalization`](#batch-normalization)
algorithm, we can derive a more general one***:

$$
\vec{y} = \frac{
a
}{
\sigma
} (\vec{x} - \mu) + b
$$

Here $a$ and $b$ are ***learnable*** while $\sigma$ and
$\mu$ are ***fixed***.

> [!NOTE]
> This formula does not reverse the normalization, as
> both $a$ and $b$ are ***learned***
>
> $\sigma$ and $\mu$ ***come from the actual
> distribution we have***

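A minimal sketch of this generic scheme (my own illustration; the `axes` argument and function name are assumptions): the dimensions over which $\mu$ and $\sigma$ are computed are exactly what distinguishes the normalization types listed next.

```python
import numpy as np

def generic_norm(x, a, b, axes, eps=1e-5):
    """y = a / sigma * (x - mu) + b, with mu and sigma computed over `axes`."""
    mu = x.mean(axis=axes, keepdims=True)                  # fixed: estimated from the data
    sigma = np.sqrt(x.var(axis=axes, keepdims=True) + eps) # fixed: estimated from the data
    return a / sigma * (x - mu) + b                        # a and b are the learnable parameters

# Images stored as an (N, C, H, W) tensor.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3, 16, 16))
a, b = np.ones((1, 3, 1, 1)), np.zeros((1, 3, 1, 1))  # one scale/shift per channel

y = generic_norm(x, a, b, axes=(0, 2, 3))  # Batch-Norm-style statistics, one per channel
print(y.shape)                             # (8, 3, 16, 16)
```
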
### Types of Normalizations

We are going to consider a `dataset` of images with
`Height` $H$, `Width` $W$, `Channels` $C$ and
`number-of-points` $N$

<!-- TODO: Add images -->

#### Batch Norm

Normalizes a single `Channel` for each `pixel` over all
`data-points`

#### Layer Norm

Normalizes `everything` over a ***single*** `data-point`

#### Instance norm

It normalizes ***only*** a ***single `Channel`*** over
a ***single `data-point`***

#### Group Norm[^anelli-batch-normalization-3]

Normalizes over ***multiple `Channels`*** on a
***single `data-point`***

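For reference, PyTorch ships layers for all four variants; the following is a usage sketch under the $(N, C, H, W)$ layout above (the tensor sizes and the choice of 2 groups are arbitrary, and the comments describe PyTorch's definition of each layer):

```python
import torch
import torch.nn as nn

N, C, H, W = 8, 6, 16, 16
x = torch.randn(N, C, H, W)

batch_norm = nn.BatchNorm2d(C)          # statistics per channel, over N, H, W
layer_norm = nn.LayerNorm([C, H, W])    # statistics per data-point, over C, H, W
instance_norm = nn.InstanceNorm2d(C)    # statistics per (data-point, channel), over H, W
group_norm = nn.GroupNorm(2, C)         # statistics per (data-point, group of channels)

for layer in (batch_norm, layer_norm, instance_norm, group_norm):
    print(type(layer).__name__, layer(x).shape)  # the (N, C, H, W) shape is preserved
```
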
<!-- TODO: see Aaron Defazio -->

## Practical Considerations

<!-- TODO: read PDF 6 pg. 22 - 23 -->

### Why normalization helps[^anelli-batch-normalization-4]

- Normalization smoothes the `objective function`,
allowing the use of ***larger `learning-rates`***
- The `mean` and `std-dev` are ***noisy***, and this
helps ***generalization*** in some cases
- Reduces sensitivity to `weight` initialization

<!-- Footnotes -->

[^batch-normalization-paper]: [Batch Normalization Paper | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)

[^non-whitening-paper]: [How Does Batch Normalization Help Optimization? | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)

[^non-whitening-paper-1]: [How Does Batch Normalization Help Optimization? | Paragraph 3.1 | arXiv:1805.11604v5](https://arxiv.org/pdf/1805.11604)

[^anelli-batch-normalization-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 4

[^batch-normalization-paper-1]: [Batch Normalization Paper | Paragraph 3.4 pg. 5 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)

[^batch-normalization-paper-2]: [Batch Normalization Paper | Paragraph 3 pg. 3 | arXiv:1502.03167v3](https://arxiv.org/pdf/1502.03167)

[^anelli-batch-normalization-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 6

[^anelli-batch-normalization-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 20

[^anelli-batch-normalization-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 6 pg. 21