# Convolutional Networks[^anelli-convolutional-networks]

> [!WARNING]
> We apply this concept ***mainly*** to `images`

Usually, for `images`, `FCnn` (short for **f**ully **c**onnected **n**eural **n**etworks) are not suitable, as `images` have a ***large number of `inputs`*** and are ***highly dimensional*** (e.g. a `32x32` `RGB` picture already amounts to 3072 input values)[^anelli-convolutional-networks-1]. Combine this with the fact that ***nowadays pictures have at least `1920x1080` pixels***. This makes `FCnn` ***prone to overfitting***[^anelli-convolutional-networks-1]

> [!NOTE]
>
> - From here on, `depth` is the **3rd dimension of the
>   activation volume**
> - `FCnn` are just ***traditional `NeuralNetworks`***

## ConvNet

The basic network we can achieve with a `convolutional-layer` is a `ConvNet`. It is composed of:

1. `input` (picture)
2. [`Convolutional Layer`](#convolutional-layer)
3. [`ReLU`](./../3-Activation-Functions/INDEX.md#relu)
4. [`Pooling layer`](#pooling-layer)
5. `FCnn` (normal `NeuralNetwork`)
6. `output` (class tags)

## Building Blocks

### Convolutional Layer

`Convolutional Layers` are `layers` that ***reduce the computational load*** by creating `activation maps` ***computed starting from a `subset` of all the available `data`***

#### Local Connectivity

To achieve this, we introduce the concept of `local connectivity`: ***each `output` is linked to a `volume` smaller than the original one in `width` and `height`*** (the `depth` is always fully connected)

#### Filters (aka Kernels)

These are the ***work-horse*** of the whole `layer`. A `filter` is a ***small window of weights*** that produces the `outputs`.

![Filter acting on an RGB picture that is 9x9](./pngs/convolution.png)

We have a ***number of `filters` equal to the `depth` of the `output`***. This means that ***each `output-value` at the same `depth` has been generated by the same `filter`***, and as such, ***any `volume` shares `weights` within a single `depth` slice***. All `filters` share the same `height` and `width` and have a `depth` equal to the `input`'s; each `filter`'s `output` is usually called an `activation-map`.

> [!WARNING]
> Don't forget about biases, one for each `kernel`

> [!NOTE]
> Usually what the first `activation-maps` *learn* are
> oriented edges, opposing colors, etc.

Another parameter for `filters` is the `stride`, which is basically the number of "hops" made from one convolution to the next. The formula to determine the `output` size for any side, given a `0-padding` of size $P$ per border, is:

$$
out_{side\_len} =
\frac{
in_{side\_len} - filter_{side\_len} + 2P
}{
stride
} + 1
$$

Whenever the `stride` makes $out_{side\_len}$ ***not an integer value, we add $0$ `padding`*** to correct this.
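To make the arithmetic concrete, here is a minimal, dependency-free sketch of the formula above (the helper name `conv_output_size` is ours, not from the course material):

```python
def conv_output_size(in_len: int, filter_len: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Side length of the activation map, per the formula above."""
    numerator = in_len - filter_len + 2 * padding
    if numerator % stride != 0:
        # Mirrors the note above: a stride that doesn't fit needs extra 0-padding
        raise ValueError("non-integer output size: adjust the 0-padding")
    return numerator // stride + 1

# A 32x32 input with a 5x5 filter, stride 1, no padding -> 28x28
assert conv_output_size(32, 5) == 28
# The (3, 1, 1) setting from the Tips section preserves the side length
assert conv_output_size(32, 3, stride=1, padding=1) == 32
```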
> [!NOTE]
>
> To avoid downsizing, it is not uncommon to apply a
> $0$ padding of size 1 (per dimension) before applying
> a `filter` with `stride` equal to 1
>
> However, for a ***fast downsizing*** we can increase
> the `stride`

> [!CAUTION]
> Don't shrink too fast, it doesn't bring good results

### Pooling Layer[^pooling-layer-wikipedia]

It ***downsamples the image without resorting to `learnable-parameters`***

There are many `algorithms` to implement this `layer`, such as:

#### Max Pooling

Takes the maximum element in the `window`

#### Average Pooling

Takes the average of the elements in the `window`

#### Mixed Pooling

A linear combination of [Max Pooling](#max-pooling) and [Average Pooling](#average-pooling)

> [!NOTE]
> This list is **NOT EXHAUSTIVE**, please refer to
> [this article](https://en.wikipedia.org/wiki/Pooling_layer)
> to know more.

This `layer` ***introduces spatial (translation) invariance***

## Receptive Fields[^youtube-video-receptive-fields]

At the end of our convolutions we may want our `output` to have been influenced by all the pixels in our picture. The number of pixels that influenced an `output` is called its receptive field, and it grows with each convolution by $k - 1$, where $k$ is the `kernel` size. This is because each `kernel` produces an `output` derived from several `inputs`, and is thus influenced by more pixels. However, this means that before getting an `output` influenced by all the pixels, we need to go very deep. To mitigate this, we can downsample by striding: upper `layers` then collect information from pixels that are further apart (even though more sparsely), so deep `layers` cover the whole picture much sooner. A worked sketch of this growth follows the Tips section below.

## Tips[^anelli-convolutional-networks-2]

- `1x1` `filters` make sense. ***They allow us to reduce the `depth` of the next `volume`***
- ***The trend goes towards increasing the `depth` and having smaller `filters`***
- ***The trend is to remove [`pooling-layers`](#pooling-layer) and use only [`convolutional-layers`](#convolutional-layer)***
- ***Common settings for [`convolutional-layers`](#convolutional-layer) are:***
  - number of filters: $K = 2^{a}$ [^anelli-convolutional-networks-3]
  - tuple of `filter-size` $F$, `stride` $S$, `0-padding` $P$:
    - (3, 1, 1)
    - (5, 1, 2)
    - (5, 2, *whatever fits*)
    - (1, 1, 0)
  - See ResNet/GoogLeNet

[^anelli-convolutional-networks]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7
[^anelli-convolutional-networks-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 2
[^pooling-layer-wikipedia]: [Pooling Layer | Wikipedia | 22nd April 2025](https://en.wikipedia.org/wiki/Pooling_layer)
[^anelli-convolutional-networks-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 85
[^anelli-convolutional-networks-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 70
[^youtube-video-receptive-fields]: [CNN Receptive Fields | YouTube | 23rd October 2025](https://www.youtube.com/watch?v=ip2HYPC_T9Q)
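As a companion to the [Receptive Fields](#receptive-fields) section, here is a minimal, dependency-free sketch (the helper name `receptive_field` is ours) that accumulates the field layer by layer: each convolution adds $(k - 1)$ times the product of all earlier `strides`, which is why striding makes the field grow much faster.

```python
def receptive_field(layers) -> int:
    """Receptive field (in input pixels) of one output unit after
    a stack of (kernel_size, stride) convolutions."""
    rf = 1    # a single input pixel "sees" only itself
    jump = 1  # distance, in input pixels, between adjacent units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Three 3x3 convolutions, stride 1: the field grows by k - 1 = 2 per layer
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# The same kernels with stride 2: downsampling makes the field grow much faster
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # 15
```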