Added receptive fields section and fixed some info

Christian Risi 2025-10-23 17:55:09 +02:00
parent fc7cefb93e
commit d23d847c2e


@ -5,14 +5,14 @@
> [!WARNING]
> We apply this concept ***mainly*** to `images`
Usually, for `images`, `fcnn` (short for **f**ully
**c**onnected **n**eural **n**etworks) are not suitable,
as `images` have a ***large number of `inputs`***, i.e. the
`input` is ***highly dimensional*** (e.g. a `32x32` `RGB` picture
corresponds to 3072 input values)[^anelli-convolutional-networks-1]
Combine this with the fact that ***nowadays pictures
have at least `1920x1080` pixels***. This makes `FCnn`
***prone to overfitting***[^anelli-convolutional-networks-1]
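A quick back-of-the-envelope check of those numbers (the hidden layer of `1000` units is just an illustrative assumption, not from the material):
$$
32 \cdot 32 \cdot 3 = 3072
\qquad\qquad
1920 \cdot 1080 \cdot 3 \cdot 1000 \approx 6.2 \times 10^{9}
$$
So even a single fully connected layer on a Full-HD `RGB` picture would already need billions of `weights`.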
> [!NOTE]
@ -61,13 +61,13 @@ concerning the `width` and `height`***
<!-- TODO: Add image -->
#### Filters (aka Kernels)
These are the ***work-horses*** of the whole `layer`.
A `filter` is a ***small window that contains weights***
and produces the `outputs`.
![Filter acting on an RGB picture that is 9x9](./pngs/convolution.png)
We have a ***number of `filters` equal to the `depth` of
the `output`***.
@ -80,6 +80,9 @@ Each `filter` shares the same `height` and `width` and
has a `depth` equal to that of the `input`, and its
`output` is usually called an `activation-map`.
> [!WARNING]
> Don't forget about biases, one for each `kernel`
> [!NOTE]
> Usually what the first `activation-maps` *learn* are
> oriented edges, opposing colors, etc.
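To make the shapes concrete, here is a minimal `NumPy` sketch of one `convolutional layer` (the `9x9` `RGB` input, the `4` `filters` of size `3x3` and the `stride` of `1` are illustrative assumptions, not from the material):
```python
import numpy as np

# Illustrative sizes: a 9x9 RGB input, 4 filters of size 3x3, stride 1, no padding
in_side, in_depth = 9, 3
num_filters, k, stride = 4, 3, 1

image = np.random.rand(in_side, in_side, in_depth)
# Each filter spans the full input depth; there is one bias per filter (kernel)
filters = np.random.rand(num_filters, k, k, in_depth)
biases = np.random.rand(num_filters)

out_side = (in_side - k) // stride + 1  # output side length
activation_maps = np.zeros((out_side, out_side, num_filters))  # depth = number of filters

for f in range(num_filters):
    for i in range(out_side):
        for j in range(out_side):
            # Small window of the input, with the same depth as the input
            window = image[i * stride:i * stride + k, j * stride:j * stride + k, :]
            activation_maps[i, j, f] = np.sum(window * filters[f]) + biases[f]

print(activation_maps.shape)  # (7, 7, 4): one activation-map per filter
```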
@ -95,8 +98,8 @@ $$
out_{side\_len} = \frac{
in_{side\_len} - filter_{side\_len}
}{
stride
} + 1
$$
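For instance, with an `input` side of `9`, a `filter` side of `3` and a `stride` of `2` (numbers picked only for illustration):
$$
out_{side\_len} = \frac{9 - 3}{2} + 1 = 4
$$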
Whenever the `stride` makes $out_{side\_len}$ ***not
@ -144,6 +147,23 @@ Pooling](#average-pooling)
This `layer` ***introduces space invariance***
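As a minimal illustration of that invariance (the `4x4` array and the `2x2` max-pooling window below are my own assumptions):
```python
import numpy as np

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2 on a square 2-D array
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 7, 0]])
b = np.roll(a, shift=1, axis=1)  # the same pattern shifted one pixel to the right

print(max_pool_2x2(a))  # [[9 0]
                        #  [0 7]]
print(max_pool_2x2(b))  # identical: this 1-pixel shift stays inside each 2x2 window
```
The invariance only holds for shifts small enough to stay within a pooling window, which is exactly the *local* spatial invariance this `layer` provides.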
## Receptive Fields[^youtube-video-receptive-fields]
At the end of our stack of convolutions we may want our output to have been influenced by all
pixels in our picture.
The number of pixels that influenced an output value is called its receptive field, and it grows
by $k - 1$ each time we apply a convolution, where $k$ is the kernel size. This is because each
kernel produces an output that derives from several inputs, and is therefore influenced by more
pixels.
However, this means that we need to go very deep before an output is influenced by all
pixels.
To mitigate this, we can downsample by striding. Upper layers then gather information from
pixels that are farther apart (even if more sparsely sampled), so deeper layers cover much
more of the picture with far fewer convolutions, as the sketch below shows.
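A small sketch of that bookkeeping (the stacks of `3x3` convolutions below are just example configurations):
```python
def receptive_field(layers):
    # Receptive field of one output pixel after a stack of (kernel_size, stride) conv layers
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent outputs of the current layer
    for k, stride in layers:
        rf += (k - 1) * jump
        jump *= stride
    return rf

# Three 3x3 convolutions with stride 1: each layer only adds k - 1 = 2 pixels
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7

# The same kernels with stride 2: every later layer grows the receptive field faster
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # 15
```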
## Tips[^anelli-convolutional-networks-2]
- `1x1` `filters` make sense. ***They allow us
@ -176,3 +196,5 @@ This `layer` ***introduces space invariance***
[^anelli-convolutional-networks-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 85
[^anelli-convolutional-networks-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 7 pg. 70
[^youtube-video-receptive-fields]: [CNN Receptive Fields | YouTube | 23rd October 2025](https://www.youtube.com/watch?v=ip2HYPC_T9Q)