# Convolutional Networks[^anelli-convolutional-networks]

> [!WARNING]
> We apply this concept ***mainly*** to `images`

Usually, for `images`, `FCnn` (short for **f**ully **c**onnected **n**eural **n**etworks) are not suitable, as `images` have a ***large number of `inputs`*** and are ***highly dimensional*** (e.g. a `32x32` `RGB` picture already amounts to 3072 input values)[^anelli-convolutional-networks-1]. Combine this with the fact that ***nowadays pictures have at least `1920x1080` pixels***. This makes `FCnn` ***prone to overfitting***[^anelli-convolutional-networks-1]

> [!NOTE]
>
> - From here on, `depth` is the **3rd dimension of the
>   activation volume**
> - `FCnn` are just ***traditional `NeuralNetworks`***

## ConvNet

The basic network we can achieve with a `convolutional-layer` is a `ConvNet`. It is composed of:

1. `input` (picture)
2. [`Convolutional Layer`](#convolutional-layer)
3. [`ReLU`](./../3-Activation-Functions/INDEX.md#relu)
4. [`Pooling layer`](#pooling-layer)
5. `FCnn` (normal `NeuralNetwork`)
6. `output` (class tags)

## Building Blocks

### Convolutional Layer

`Convolutional Layers` are `layers` that ***reduce the computational load*** by creating `activation maps` ***computed starting from a `subset` of all the available `data`***

#### Local Connectivity

To achieve this, we introduce the concept of `local connectivity`: ***each `output` is linked to a `volume` smaller than the original one in `width` and `height`*** (the `depth` is always fully connected)

#### Filters (aka Kernels)

These are the ***work-horse*** of the whole `layer`. A `filter` is a ***small window of weights*** that produces the `outputs`.

![Filter acting on an RGB picture that is 9x9](./pngs/convolution.png)

We have a ***number of `filters` equal to the `depth` of the `output`***. This means that ***each `output-value` at the same `depth` has been generated by the same `filter`***, and as such, ***any `volume` shares `weights` within a single `depth` slice***. All `filters` share the same `height` and `width` and have a `depth` equal to the `input`'s; each `filter`'s `output` is usually called an `activation-map`.

> [!WARNING]
> Don't forget about biases, one for each `kernel`

> [!NOTE]
> Usually what the first `activation-maps` *learn* are
> oriented edges, opposing colors, etc.

Another parameter for `filters` is the `stride`, which is basically the number of "hops" made from one convolution to the next. The formula to determine the `output` size for any side, given a `0-padding` of size $P$ per border, is:

$$
out_{side\_len} =
\frac{
in_{side\_len} - filter_{side\_len} + 2P
}{
stride
} + 1
$$

Whenever the `stride` makes $out_{side\_len}$ ***not an integer value, we add $0$ `padding`*** to correct this.
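To make the arithmetic concrete, here is a minimal, dependency-free sketch of the formula above (the helper name `conv_output_size` is ours, not from the course material):

```python
def conv_output_size(in_len: int, filter_len: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Side length of the activation map, per the formula above."""
    numerator = in_len - filter_len + 2 * padding
    if numerator % stride != 0:
        # Mirrors the note above: a stride that doesn't fit needs extra 0-padding
        raise ValueError("non-integer output size: adjust the 0-padding")
    return numerator // stride + 1

# A 32x32 input with a 5x5 filter, stride 1, no padding -> 28x28
assert conv_output_size(32, 5) == 28
# The (3, 1, 1) setting from the Tips section preserves the side length
assert conv_output_size(32, 3, stride=1, padding=1) == 32
```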
> [!NOTE]
>
> To avoid downsizing, it is not uncommon to apply a
> $0$ padding of size 1 (per dimension) before applying
> a `filter` with `stride` equal to 1
>
> However, for a ***fast downsizing*** we can increase
> the `stride`

> [!CAUTION]
> Don't shrink too fast, it doesn't bring good results

### Pooling Layer[^pooling-layer-wikipedia]

It ***downsamples the image without resorting to `learnable-parameters`***

There are many `algorithms` to implement this `layer`, such as:

#### Max Pooling

Takes the maximum element in the `window`

#### Average Pooling

Takes the average of the elements in the `window`

#### Mixed Pooling

A linear combination of [Max Pooling](#max-pooling) and [Average Pooling](#average-pooling)

> [!NOTE]
> This list is **NOT EXHAUSTIVE**, please refer to
> [this article](https://en.wikipedia.org/wiki/Pooling_layer)
> to know more.

This `layer` ***introduces spatial (translation) invariance***

## Receptive Fields[^youtube-video-receptive-fields]

At the end of our convolutions we may want our `output` to have been influenced by all the pixels in our picture. The number of pixels that influenced an `output` is called its receptive field, and it grows with each convolution by $k - 1$, where $k$ is the `kernel` size. This is because each `kernel` produces an `output` derived from several `inputs`, and is thus influenced by more pixels. However, this means that before getting an `output` influenced by all the pixels, we need to go very deep. To mitigate this, we can downsample by striding: upper `layers` then collect information from pixels that are further apart (even though more sparsely), so deep `layers` cover the whole picture much sooner. A worked sketch of this growth follows the Tips section below.

## Tips[^anelli-convolutional-networks-2]

- `1x1` `filters` make sense. ***They allow us to reduce the `depth` of the next `volume`***
- ***The trend goes towards increasing the `depth` and having smaller `filters`***
- ***The trend is to remove [`pooling-layers`](#pooling-layer) and use only [`convolutional-layers`](#convolutional-layer)***
- ***Common settings for [`convolutional-layers`](#convolutional-layer) are:***
  - number of filters: $K = 2^{a}$ [^anelli-convolutional-networks-3]
  - tuple of `filter-size` $F$, `stride` $S$, `0-padding` $P$:
    - (3, 1, 1)
    - (5, 1, 2)
    - (5, 2, *whatever fits*)
    - (1, 1, 0)
  - See ResNet/GoogLeNet

[^anelli-convolutional-networks]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7
[^anelli-convolutional-networks-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 2
[^pooling-layer-wikipedia]: [Pooling Layer | Wikipedia | 22nd April 2025](https://en.wikipedia.org/wiki/Pooling_layer)
[^anelli-convolutional-networks-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 85
[^anelli-convolutional-networks-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 70
[^youtube-video-receptive-fields]: [CNN Receptive Fields | YouTube | 23rd October 2025](https://www.youtube.com/watch?v=ip2HYPC_T9Q)
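As a companion to the [Receptive Fields](#receptive-fields) section, here is a minimal, dependency-free sketch (the helper name `receptive_field` is ours) that accumulates the field layer by layer: each convolution adds $(k - 1)$ times the product of all earlier `strides`, which is why striding makes the field grow much faster.

```python
def receptive_field(layers) -> int:
    """Receptive field (in input pixels) of one output unit after
    a stack of (kernel_size, stride) convolutions."""
    rf = 1    # a single input pixel "sees" only itself
    jump = 1  # distance, in input pixels, between adjacent units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Three 3x3 convolutions, stride 1: the field grows by k - 1 = 2 per layer
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# The same kernels with stride 2: downsampling makes the field grow much faster
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # 15
```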