Added images and revised notes

Christian Risi 2025-10-25 18:23:43 +02:00
parent 307a3f7c5d
commit 99607d5882


@@ -1,5 +1,15 @@
# Recurrent Networks | RNNs[^anelli-RNNs]
## Why would we want Recurrent Networks?
To deal with **sequence**-related jobs of **arbitrary length**.
In fact, they can deal with inputs of varying length while being fast and memory efficient.
While **autoregressive models** always need to analyse all past inputs
(or a window of the most recent ones) at each computation,
`RNNs` don't, making them a great tool when the situation permits it.
<!-- TODO: add images -->
## A bit of History[^anelli-RNNs-1]
@@ -33,17 +43,23 @@ such as the ***dimension of the "past" window***.
The idea behind [`RNNs`](#rnns) is to add ***memory***
as a `hidden-state`. This helps the `model` to
***"remember"*** things for a "long time", but since it
is ***noisy***, the best we can do is
to ***infer its probability distribution***, doable only
for:
- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Model`](https://en.wikipedia.org/wiki/Hidden_Markov_model)
While these models are `stochastic`, technically the **a posteriori probability distribution**
(of the `hidden-state` given the observations) is **deterministic**.
Since we can think of the [`RNNs`](#rnns) `hidden-state` as the equivalent of this **a posteriori probability distribution**, they are **`deterministic`**
<!-- TODO: check this:
, plus they are ***`non-linear`*** and their
***`hidden-state` is `distributed`***[^anelli-RNNs-3]
-->
### Neurons with Memory[^anelli-RNNs-4]
@@ -83,7 +99,11 @@ $$
> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]
## Different RNNs configurations
![RNNs different configurations](./pngs/rnns-configurations.png)
### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]
- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
@@ -91,27 +111,40 @@ $$
`units` for ***each `timestep`*** (which is the most
natural way to model sequential data)
In other words, it depends on how you need to model the data of your sequence.
### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]
- Specify ***desired final activity*** for ***all***
  `units`
- Specify ***desired final activity*** for ***all***
  `units` for the ***last few `steps`***
  - This is good to learn `attractors`
  - Makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of
  `units`***
  - The other `units` will be either `inputs` or
    `hidden-states`, as ***we fixed these***
In other words, it depends on which kind of output you need to produce.
For example, a sentiment analysis job would need just one output, while
a `seq2seq` job would require a full sequence.
## Transforming `Data` for [`RNNs`](#rnns)
Since `RNNs` need vectors, the simplest way to transform inputs into vectors is either
to **`1-hot`** encode **whole words**, or to split them into **tokens** and then
**`1-hot`** encode those.
While this may be enough, there's a better way: during the embedding phase we transform each
`1-hot` encoded vector into a learned vector of fixed size (usually of smaller dimension).
To better understand this, imagine a vocabulary of 16K words or tokens: we would have
16K dimensions for each vector, which is massive. Instead, by embedding it into 256 dimensions
we save on both time and space complexity.
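As a minimal sketch of the difference (the 16K and 256 values are just the illustrative numbers above, and the embedding matrix here is random rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 16_000   # illustrative vocabulary of 16K tokens
embed_dim = 256       # illustrative embedding dimension

token_id = 42         # hypothetical token index produced by some tokenizer

# 1-hot encoding: a 16K-dimensional vector with a single 1
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Learned embedding: a lookup into a trainable (vocab_size, embed_dim) matrix;
# multiplying the 1-hot vector by this matrix just selects the matching row
embedding_matrix = rng.normal(scale=0.01, size=(vocab_size, embed_dim))
embedded = embedding_matrix[token_id]           # shape: (256,)

assert np.allclose(embedded, one_hot @ embedding_matrix)
print(one_hot.shape, embedded.shape)            # (16000,) (256,)
```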
## RNNs Training
Since [`RNNs`](#rnns) can be considered a `deep-layered`
`NN`, we first ***train the model
@@ -126,15 +159,32 @@ derivatives along `time-steps`
The thing is that it is ***difficult to `train`
[`RNNs`](#rnns)*** on
***long-range dependencies*** because the
***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]
### Mitigating training problems in RNNs
In order to mitigate these gradient problems, which impair our network's ability
to gain valuable information over long-term dependencies, we have these
solutions:
- **LSTM**:\
Make the model out of little modules crafted to keep values for a long time
- **Hessian Free Optimizers**:\
Use optimizers that can detect directions with a tiny gradient but an even smaller curvature
- **Echo State Networks**[^unipi-esn]:\
The idea is to use a **`sparsely connected large untrained network`** to keep track of
inputs for a long time, before they are eventually forgotten, and to have a
**`trained readout network`** that converts the **`echo`** output into something usable
- **Good Initialization and Momentum**:\
Same thing as before, but we learn all connections using momentum
> [!WARNING]
>
> `long-range dependencies` are more difficult to learn than `short-range` ones
> because of the gradient problem.
## Gated Cells
These are `neurons` that can be controlled to make
them `learn` or `forget` chosen pieces of information
@@ -144,13 +194,15 @@ them `learn` or `forget` chosen pieces of information
> With ***chosen*** we mean choosing from the
> `hyperspace`, so it's not really precise.
### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]
This `cell` has a ***separate signal***, namely the
`cell-state`,
***which controls the `gates` of this `cell` and is always
initialized to `1`***.
![LSTM cell](./pngs/lstm-cell.png)
> [!NOTE]
>
> $W$ will be weights associated with $\vec{x}$ and
@@ -162,9 +214,11 @@ initialized to `1`***.
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***
![detailed LSTM cell](./pngs/lstm-cell-detailed.png)
<!-- TODO: revise formulas -->
#### Forget Gate | Keep Gate
This `gate` ***controls the `cell-state`***:
@@ -178,7 +232,7 @@ The closer the result of $\sigma$ is to $0$, the more
the `cell-state` will forget that value, and the opposite
for values closer to $1$.
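As a small made-up numeric example: if for one component $\sigma$ outputs $0.1$ and the previous `cell-state` value is $2.0$, only $0.1 \cdot 2.0 = 0.2$ survives the elementwise product, so $90\%$ of that value is forgotten; with an output of $0.9$, $1.8$ would be kept instead.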
#### Input Gate | Write Gate
***controls how much of the `input` gets into the
`cell-state`***
@@ -203,7 +257,7 @@ the importance given to that info.
> [`input-gate`](#input-gate--write-gate) are 2 phases
> of the `update-phase`.
#### Output Gate | Read Gate
***Controls how much of the
`hidden-state` is forwarded***
@@ -225,7 +279,7 @@ Here the `backpropagation` of the ***gradient*** is way
simpler for the `cell-states` as they ***require only
elementwise multiplications***
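As a minimal `numpy` sketch of a single LSTM step in the standard formulation (weight names, shapes and the toy sequence are illustrative and not taken from these notes; only the `cell-state` initialized to `1` follows the text above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time-step; W, U, b stack the parameters of the
    forget (f), input (i), candidate (g) and output (o) gates."""
    z = W @ x + U @ h_prev + b               # shape: (4 * hidden,)
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g                   # elementwise cell-state update
    h = o * np.tanh(c)                       # gated hidden-state / output
    return h, c

hidden, inputs = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden, inputs)) * 0.1
U = rng.normal(size=(4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.ones(hidden)     # cell-state initialized to 1, as stated above
for x in rng.normal(size=(5, inputs)):       # a toy sequence of 5 steps
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)                      # (8,) (8,)
```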
### GRU[^anelli-RNNs-10][^GRU-wikipedia]
It is another type of [`gated-cell`](#gated-cells), but,
unlike [`LSTM-cells`](#long-short-term-memory--lstm),
@@ -233,13 +287,15 @@ unlike [`LSTM-cells`](#long-short-term-memory--lstm),
the `hidden-state`***, while keeping
***similar performances to [`LSTM`](#long-short-term-memory--lstm)***.
![GRU cell](./pngs/gru-cell.png)
> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$
![detailed GRU cell](./pngs/gru-cell-detailed.png)
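A minimal `numpy` sketch of one GRU step, under one common convention for combining the gates (weight names, shapes and the toy sequence are illustrative; the exact notation is in the formulas of the following subsections):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time-step: update gate z, reset gate r, no output gate."""
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate hidden-state
    return (1.0 - z) * h_prev + z * h_tilde        # new hidden-state

hidden, inputs = 8, 4
rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.normal(size=(hidden, inputs)) * 0.1 for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(hidden, hidden)) * 0.1 for _ in range(3))

h = np.zeros(hidden)                               # h_0 = 0, as in the note above
for x in rng.normal(size=(5, inputs)):             # a toy sequence of 5 steps
    h = gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh)
print(h.shape)                                     # (8,)
```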
#### Update Gate
This `gate` unifies [`forget gate`](#forget-gate--keep-gate) and [`input gate`](#input-gate--write-gate)
@@ -254,7 +310,7 @@
\end{aligned}
$$
#### Reset Gate
This is what breaks the `information` flow from the
previous `hidden-state`.
@@ -267,7 +323,7 @@ $$
\end{aligned}
$$
#### New `hidden-state`
$$
\begin{aligned}
@@ -308,13 +364,15 @@ understanding***
<!-- TODO: Finish this part -->
### References
- [ai-master.gitbooks.io](https://ai-master.gitbooks.io/recurrent-neural-network/content/reference.html)
- [stanford.edu - CS224d-Lecture8](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf)
- [deepimagesent](http://cs.stanford.edu/people/karpathy/deepimagesent/)
- [introduction-to-rnns](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
- [implementing-a-language-model-rnn-with-python-numpy-and-theano](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/)
- [rnn-effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Understanding-LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
<!-- TODO: PDF 8 pg. 24 -->
@@ -352,3 +410,5 @@ understanding***
[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136
[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)
[^unipi-esn]: [UniPI | ESN | 25th October 2025](https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/aa2/rnn4-esn.pdf)