# Convolutional Networks[^anelli-convolutional-networks]

<!-- TODO: Add Images -->

> [!WARNING]
> We apply this concept ***mainly*** to `images`
Usually, for `images`, `fcnn` (short for **f**ully
**c**onnected **n**eural **n**etworks) are not suitable,
as `images` have a ***large number of `inputs`*** that is
***highly dimensional*** (e.g. a `32x32` `RGB` picture
corresponds to 3072 input values)[^anelli-convolutional-networks-1]

Combine this with the fact that ***nowadays pictures
have (at least) `1920x1080` pixels***. This makes `FCnn`
***prone to overfitting***[^anelli-convolutional-networks-1]
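To get a feel for the numbers, here is a quick back-of-the-envelope sketch
(the hidden-layer width of 1000 is an arbitrary assumption, not from the source):

```python
def fc_params(n_inputs: int, n_hidden: int) -> int:
    """Weights + biases of a single fully connected layer."""
    return n_inputs * n_hidden + n_hidden

cifar_inputs = 32 * 32 * 3        # 3072 input values for a 32x32 RGB picture
full_hd_inputs = 1920 * 1080 * 3  # ~6.2 million input values for a Full-HD RGB picture

print(fc_params(cifar_inputs, 1000))    # 3_073_000       (~3 million parameters in one layer)
print(fc_params(full_hd_inputs, 1000))  # 6_220_801_000   (~6 billion parameters in one layer)
```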
> [!NOTE]
>
> - From here on `depth` is the **3rd dimension of the
>   activation volume**
> - `FCnn` are just ***traditional `NeuralNetworks`***
>
## ConvNet

The basic network we can achieve with a
`convolutional-layer` is a `ConvNet`.

<!-- TODO: Insert mermaid or image -->

It is composed of (a minimal sketch of this stack is
shown after the list):

<!-- TODO: Add links -->

1. `input` (picture)
2. [`Convolutional Layer`](#convolutional-layer)
3. [`ReLU`](./../3-Activation-Functions/INDEX.md#relu)
4. [`Pooling layer`](#pooling-layer)
5. `FCnn` (normal `NeuralNetwork`)
6. `output` (class tags)

<!-- TODO: Add PDF 7 pg 7-8 -->
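A minimal sketch of this stack, assuming PyTorch and illustrative sizes
(`32x32` RGB input, 16 filters, 10 classes; none of these values come from the source):

```python
import torch
from torch import nn

convnet = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),                                                            # activation
    nn.MaxPool2d(kernel_size=2),                                          # pooling layer: 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                                          # FCnn producing the class scores
)

scores = convnet(torch.randn(1, 3, 32, 32))  # scores.shape == (1, 10)
```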
## Building Blocks

### Convolutional Layer

`Convolutional Layers` are `layers` that ***reduce the
size of the computational load*** by creating
`activation maps` ***computed starting from a `subset` of
all the available `data`***

#### Local Connectivity

To achieve this, we introduce the concept of
`local connectivity`: ***each `output` is
linked to a `volume` smaller than the original one
in `width` and `height`***
(the `depth` is always fully connected)

<!-- TODO: Add image -->
#### Filters (aka Kernels)

These are the ***work-horse*** of the whole `layer`.
A filter is a ***small window that contains weights***
and produces the `outputs`.



We have a ***number of `filters` equal to the `depth` of
the `output`***.
This means that ***each `output-value` at
the same `depth` has been generated by the same `filter`***, and as such,
***the `output-volume` shares `weights`
within each single `depth` slice***.

All `filters` share the same `height` and `width`, and
each has a `depth` equal to that of the `input`; their
`output` is usually called an `activation-map`.

> [!WARNING]
> Don't forget about biases: one for each `kernel`
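A naive sketch of what one `convolutional-layer` computes (stride 1, no padding),
just to make local connectivity, full-depth filters, weight sharing, and the
per-kernel bias explicit; the shapes are illustrative and the loops deliberately unoptimized:

```python
import numpy as np

def conv_forward(x, filters, biases):
    """x: (in_depth, H, W); filters: (n_filters, in_depth, f, f); biases: (n_filters,).
    Returns the activation maps, shape (n_filters, H - f + 1, W - f + 1)."""
    n_filters, _, f, _ = filters.shape
    _, h, w = x.shape
    out = np.zeros((n_filters, h - f + 1, w - f + 1))
    for d in range(n_filters):                        # one activation map per filter
        for i in range(h - f + 1):
            for j in range(w - f + 1):
                window = x[:, i:i + f, j:j + f]       # local in width/height, full depth
                out[d, i, j] = np.sum(window * filters[d]) + biases[d]  # same weights at every position
    return out

maps = conv_forward(np.random.rand(3, 32, 32), np.random.rand(16, 3, 5, 5), np.zeros(16))
print(maps.shape)  # (16, 28, 28): output depth 16 = number of filters
```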
> [!NOTE]
> Usually what the first `activation-maps` *learn* are
> oriented edges, opposing colors, etc...

Another parameter for `filters` is the `stride`, which
is basically the number of "hops" made between one
convolution and the next.

The formula to determine the `output` size for any side
is:

$$
out_{side\_len} = \frac{in_{side\_len} - filter_{side\_len}}{stride} + 1
$$

Whenever the `stride` makes $out_{side\_len}$ ***not
an integer value, we add $0$ `padding`***
to correct this.
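As a quick numeric check, here is a tiny helper (the $2P$ term for zero
`padding` is the usual extension of the formula above; the sizes are illustrative):

```python
def conv_output_side(in_side, filter_side, stride, padding=0):
    """Output side length for one spatial dimension."""
    span = in_side - filter_side + 2 * padding
    if span % stride != 0:
        raise ValueError("stride does not fit evenly: add zero padding")
    return span // stride + 1

print(conv_output_side(32, 5, 1))             # 28
print(conv_output_side(32, 3, 1, padding=1))  # 32: padding of 1 preserves the size
print(conv_output_side(7, 3, 2))              # 3:  stride 2 roughly halves the side
```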
> [!NOTE]
>
> To avoid downsizing, it is not uncommon to apply a
> $0$ padding of size 1 (per dimension) before applying
> a `filter` with `stride` equal to 1
>
> However, for a ***fast downsizing*** we can increase
> the `stride`

> [!CAUTION]
> Don't shrink too fast, it doesn't bring good results
### Pooling Layer[^pooling-layer-wikipedia]

It ***downsamples the image without resorting to
`learnable-parameters`***

<!-- TODO: Insert image -->

There are many `algorithms` to implement this `layer`, such as:

#### Max Pooling

Takes the max element in the `window`

#### Average Pooling

Takes the average of the elements in the `window`

#### Mixed Pooling

A linear combination of [Max Pooling](#max-pooling) and [Average
Pooling](#average-pooling)

> [!NOTE]
> This list is **NOT EXHAUSTIVE**, please refer to
> [this article](https://en.wikipedia.org/wiki/Pooling_layer)
> to know more.

This `layer` ***introduces space invariance***
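A minimal sketch of [Max Pooling](#max-pooling) with a `2x2` window and stride 2
(the usual setting, assumed here; the input shape is illustrative):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling, stride 2, on a (depth, H, W) volume with even H and W."""
    d, h, w = x.shape
    # Give every 2x2 window its own pair of axes, then take the max over them.
    return x.reshape(d, h // 2, 2, w // 2, 2).max(axis=(2, 4))

x = np.arange(16, dtype=float).reshape(1, 4, 4)
print(max_pool_2x2(x))  # shape (1, 2, 2): the max of each 2x2 window
```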
## Receptive Fields[^youtube-video-receptive-fields]

At the end of our convolutions we may want our output to have been influenced by all
pixels in our picture.

The number of pixels that influenced our output is called the receptive field, and it increases
by $k - 1$ each time we do a convolution, where $k$ is the kernel size. This is
because each kernel produces an output derived from several inputs, and is therefore influenced by more
pixels.

However, this means that before being able to have an output influenced by all pixels, we need to
go very deep.

To mitigate this, we can downsample by striding. This means that upper layers will collect
information from more pixels, even though more sparsely, and thus we reach a full-picture
receptive field with fewer layers.
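A small sketch of this bookkeeping (the recursion is the standard one; the kernel
sizes and strides below are made up):

```python
def receptive_field(layers):
    """Receptive field of the last output, given (kernel_size, stride) per conv layer."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer adds k - 1, scaled by the accumulated stride
        jump *= stride
    return rf

print(receptive_field([(3, 1)] * 4))      # 9: four 3x3 convs at stride 1 grow the field by 2 each
print(receptive_field([(3, 2), (3, 2)]))  # 7: striding makes the field grow much faster per layer
```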
## Tips[^anelli-convolutional-networks-2]

- `1x1` `filters` make sense. ***They allow us
  to reduce the `depth` of the next `volume`***
  (see the sketch after this list)
- ***The trend goes towards increasing the `depth` and
  having smaller `filters`***
- ***The trend is to remove
  [`pooling-layers`](#pooling-layer) and use only
  [`convolutional-layers`](#convolutional-layer)***
- ***Common settings for
  [`convolutional-layers`](#convolutional-layer) are:***
  - number of filters: $K = 2^{a}$[^anelli-convolutional-networks-3]
  - tuple of `filter-size` $F$, `stride` $S$,
    `0-padding` $P$:
    - (3, 1, 1)
    - (5, 1, 2)
    - (5, 2, *whatever fits*)
    - (1, 1, 0)
  - See ResNet/GoogLeNet
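A sketch of the `1x1` trick, assuming PyTorch and made-up channel counts:

```python
import torch
from torch import nn

# A 1x1 convolution keeps width and height but shrinks the depth 256 -> 64.
squeeze = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)

volume = torch.randn(1, 256, 56, 56)
print(squeeze(volume).shape)  # torch.Size([1, 64, 56, 56])
```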
<!-- Footnotes -->
[^anelli-convolutional-networks]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7

[^anelli-convolutional-networks-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 2

[^pooling-layer-wikipedia]: [Pooling Layer | Wikipedia | 22nd April 2025](https://en.wikipedia.org/wiki/Pooling_layer)

[^anelli-convolutional-networks-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 85

[^anelli-convolutional-networks-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 70

[^youtube-video-receptive-fields]: [CNN Receptive Fields | YouTube | 23rd October 2025](https://www.youtube.com/watch?v=ip2HYPC_T9Q)