# Convolutional Networks[^anelli-convolutional-networks]

<!-- TODO: Add Images -->

> [!WARNING]
> We apply this concept ***mainly*** to `images`
Usually, for `images`, `fcnn` (short for **f**ully
**c**onnected **n**eural **n**etworks) are not suitable,
as `images` have a ***large number of `inputs`*** that is
***highly dimensional*** (e.g. a `32x32` `RGB` picture
corresponds to 3072 input values)[^anelli-convolutional-networks-1]

Combine this with the fact that ***nowadays pictures
have (at least) `1920x1080` pixels***. This makes `FCnn`
***prone to overfitting***[^anelli-convolutional-networks-1]
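To get a feel for the numbers, here is a quick back-of-the-envelope sketch
(the hidden-layer width of 1000 is an arbitrary assumption, not from the source):

```python
def fc_params(n_inputs: int, n_hidden: int) -> int:
    """Weights + biases of a single fully connected layer."""
    return n_inputs * n_hidden + n_hidden

cifar_inputs = 32 * 32 * 3        # 3072 input values for a 32x32 RGB picture
full_hd_inputs = 1920 * 1080 * 3  # ~6.2 million input values for a Full-HD RGB picture

print(fc_params(cifar_inputs, 1000))    # 3_073_000       (~3 million parameters in one layer)
print(fc_params(full_hd_inputs, 1000))  # 6_220_801_000   (~6 billion parameters in one layer)
```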
> [!NOTE]
>
> - From here on `depth` is the **3rd dimension of the
>   activation volume**
> - `FCnn` are just ***traditional `NeuralNetworks`***
>
## ConvNet

The basic network we can achieve with a
`convolutional-layer` is a `ConvNet`.

<!-- TODO: Insert mermaid or image -->

It is composed of (a minimal sketch of this stack is
shown after the list):

<!-- TODO: Add links -->

1. `input` (picture)
2. [`Convolutional Layer`](#convolutional-layer)
3. [`ReLU`](./../3-Activation-Functions/INDEX.md#relu)
4. [`Pooling layer`](#pooling-layer)
5. `FCnn` (normal `NeuralNetwork`)
6. `output` (class tags)

<!-- TODO: Add PDF 7 pg 7-8 -->
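A minimal sketch of this stack, assuming PyTorch and illustrative sizes
(`32x32` RGB input, 16 filters, 10 classes; none of these values come from the source):

```python
import torch
from torch import nn

convnet = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),                                                            # activation
    nn.MaxPool2d(kernel_size=2),                                          # pooling layer: 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                                          # FCnn producing the class scores
)

scores = convnet(torch.randn(1, 3, 32, 32))  # scores.shape == (1, 10)
```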
## Building Blocks

### Convolutional Layer

`Convolutional Layers` are `layers` that ***reduce the
size of the computational load*** by creating
`activation maps` ***computed starting from a `subset` of
all the available `data`***

#### Local Connectivity

To achieve this, we introduce the concept of
`local connectivity`: ***each `output` is
linked to a `volume` smaller than the original one
in `width` and `height`***
(the `depth` is always fully connected)

<!-- TODO: Add image -->
#### Filters (aka Kernels)

These are the ***work-horse*** of the whole `layer`.
A filter is a ***small window that contains weights***
and produces the `outputs`.



We have a ***number of `filters` equal to the `depth` of
the `output`***.
This means that ***each `output-value` at
the same `depth` has been generated by the same `filter`***, and as such,
***the `output-volume` shares `weights`
within each single `depth` slice***.

All `filters` share the same `height` and `width`, and
each has a `depth` equal to that of the `input`; their
`output` is usually called an `activation-map`.

> [!WARNING]
> Don't forget about biases: one for each `kernel`
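A naive sketch of what one `convolutional-layer` computes (stride 1, no padding),
just to make local connectivity, full-depth filters, weight sharing, and the
per-kernel bias explicit; the shapes are illustrative and the loops deliberately unoptimized:

```python
import numpy as np

def conv_forward(x, filters, biases):
    """x: (in_depth, H, W); filters: (n_filters, in_depth, f, f); biases: (n_filters,).
    Returns the activation maps, shape (n_filters, H - f + 1, W - f + 1)."""
    n_filters, _, f, _ = filters.shape
    _, h, w = x.shape
    out = np.zeros((n_filters, h - f + 1, w - f + 1))
    for d in range(n_filters):                        # one activation map per filter
        for i in range(h - f + 1):
            for j in range(w - f + 1):
                window = x[:, i:i + f, j:j + f]       # local in width/height, full depth
                out[d, i, j] = np.sum(window * filters[d]) + biases[d]  # same weights at every position
    return out

maps = conv_forward(np.random.rand(3, 32, 32), np.random.rand(16, 3, 5, 5), np.zeros(16))
print(maps.shape)  # (16, 28, 28): output depth 16 = number of filters
```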
> [!NOTE]
> Usually what the first `activation-maps` *learn* are
> oriented edges, opposing colors, etc...

Another parameter for `filters` is the `stride`, which
is basically the number of "hops" made between one
convolution and the next.

The formula to determine the `output` size for any side
is:

$$
out_{side\_len} = \frac{in_{side\_len} - filter_{side\_len}}{stride} + 1
$$

Whenever the `stride` makes $out_{side\_len}$ ***not
an integer value, we add $0$ `padding`***
to correct this.
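As a quick numeric check, here is a tiny helper (the $2P$ term for zero
`padding` is the usual extension of the formula above; the sizes are illustrative):

```python
def conv_output_side(in_side, filter_side, stride, padding=0):
    """Output side length for one spatial dimension."""
    span = in_side - filter_side + 2 * padding
    if span % stride != 0:
        raise ValueError("stride does not fit evenly: add zero padding")
    return span // stride + 1

print(conv_output_side(32, 5, 1))             # 28
print(conv_output_side(32, 3, 1, padding=1))  # 32: padding of 1 preserves the size
print(conv_output_side(7, 3, 2))              # 3:  stride 2 roughly halves the side
```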
> [!NOTE]
>
> To avoid downsizing, it is not uncommon to apply a
> $0$ padding of size 1 (per dimension) before applying
> a `filter` with `stride` equal to 1
>
> However, for a ***fast downsizing*** we can increase
> the `stride`

> [!CAUTION]
> Don't shrink too fast, it doesn't bring good results
### Pooling Layer[^pooling-layer-wikipedia]

It ***downsamples the image without resorting to
`learnable-parameters`***

<!-- TODO: Insert image -->

There are many `algorithms` to implement this `layer`, such as:

#### Max Pooling

Takes the max element in the `window`

#### Average Pooling

Takes the average of the elements in the `window`

#### Mixed Pooling

A linear combination of [Max Pooling](#max-pooling) and [Average
Pooling](#average-pooling)

> [!NOTE]
> This list is **NOT EXHAUSTIVE**, please refer to
> [this article](https://en.wikipedia.org/wiki/Pooling_layer)
> to know more.

This `layer` ***introduces space invariance***
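A minimal sketch of [Max Pooling](#max-pooling) with a `2x2` window and stride 2
(the usual setting, assumed here; the input shape is illustrative):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling, stride 2, on a (depth, H, W) volume with even H and W."""
    d, h, w = x.shape
    # Give every 2x2 window its own pair of axes, then take the max over them.
    return x.reshape(d, h // 2, 2, w // 2, 2).max(axis=(2, 4))

x = np.arange(16, dtype=float).reshape(1, 4, 4)
print(max_pool_2x2(x))  # shape (1, 2, 2): the max of each 2x2 window
```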
## Receptive Fields[^youtube-video-receptive-fields]

At the end of our convolutions we may want our output to have been influenced by all
pixels in our picture.

The number of pixels that influenced our output is called the receptive field, and it increases
by $k - 1$ each time we do a convolution, where $k$ is the kernel size. This is
because each kernel produces an output derived from several inputs, and is therefore influenced by more
pixels.

However, this means that before being able to have an output influenced by all pixels, we need to
go very deep.

To mitigate this, we can downsample by striding. This means that upper layers will collect
information from more pixels, even though more sparsely, and thus we reach a full-picture
receptive field with fewer layers.
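A small sketch of this bookkeeping (the recursion is the standard one; the kernel
sizes and strides below are made up):

```python
def receptive_field(layers):
    """Receptive field of the last output, given (kernel_size, stride) per conv layer."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer adds k - 1, scaled by the accumulated stride
        jump *= stride
    return rf

print(receptive_field([(3, 1)] * 4))      # 9: four 3x3 convs at stride 1 grow the field by 2 each
print(receptive_field([(3, 2), (3, 2)]))  # 7: striding makes the field grow much faster per layer
```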
## Tips[^anelli-convolutional-networks-2]

- `1x1` `filters` make sense. ***They allow us
  to reduce the `depth` of the next `volume`***
  (see the sketch after this list)
- ***The trend goes towards increasing the `depth` and
  having smaller `filters`***
- ***The trend is to remove
  [`pooling-layers`](#pooling-layer) and use only
  [`convolutional-layers`](#convolutional-layer)***
- ***Common settings for
  [`convolutional-layers`](#convolutional-layer) are:***
  - number of filters: $K = 2^{a}$[^anelli-convolutional-networks-3]
  - tuple of `filter-size` $F$, `stride` $S$,
    `0-padding` $P$:
    - (3, 1, 1)
    - (5, 1, 2)
    - (5, 2, *whatever fits*)
    - (1, 1, 0)
  - See ResNet/GoogLeNet
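A sketch of the `1x1` trick, assuming PyTorch and made-up channel counts:

```python
import torch
from torch import nn

# A 1x1 convolution keeps width and height but shrinks the depth 256 -> 64.
squeeze = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)

volume = torch.randn(1, 256, 56, 56)
print(squeeze(volume).shape)  # torch.Size([1, 64, 56, 56])
```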
<!-- Footnotes -->
[^anelli-convolutional-networks]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7

[^anelli-convolutional-networks-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 2

[^pooling-layer-wikipedia]: [Pooling Layer | Wikipedia | 22nd April 2025](https://en.wikipedia.org/wiki/Pooling_layer)

[^anelli-convolutional-networks-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 85

[^anelli-convolutional-networks-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 70

[^youtube-video-receptive-fields]: [CNN Receptive Fields | YouTube | 23rd October 2025](https://www.youtube.com/watch?v=ip2HYPC_T9Q)