Added images and revised notes

Christian Risi 2025-10-25 18:23:43 +02:00
parent 307a3f7c5d
commit 99607d5882


# Recurrent Networks | RNNs[^anelli-RNNs]
## Why would we want Recurrent Networks?
To deal with **sequence**-related jobs of **arbitrary length**.
In fact, they can handle inputs of varying length while being fast and memory efficient.
While **autoregressive models** always need to analyse all past inputs
(or a window of the most recent ones) at each computation,
`RNNs` don't, making them a great tool when the situation permits it.
<!-- TODO: add images -->
## A bit of History[^anelli-RNNs-1]
The idea behind [`RNNs`](#rnns) is to add ***memory***
as a `hidden-state`. This helps the `model` to
***"remember"*** things for "long time", but it
is ***noisy***, and as such, the best we can do is
***"remember"*** things for "long time", but since it
is ***noisy***, the best we can do is
to ***infer its probability distribution***, doable only
for:
- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Model`](https://en.wikipedia.org/wiki/Hidden_Markov_model)
While these models are `stochastic`, technically their **a posteriori probability distribution** is
**deterministic**.
Since we can think of the [`RNNs`](#rnns) `hidden-state` as equivalent to an **a posteriori probability distribution**, they are **`deterministic`**.
<!-- TODO: check this:
, plus they are ***`non-linear`*** and their
***`hidden-state` is `distributed`***[^anelli-RNNs-3]
-->
### Neurons with Memory[^anelli-RNNs-4]
> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]
## Different RNNs configurations
![RNNs different configurations](./pngs/rnns-configurations.png)
### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]
- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
- Specify the `states` of the same ***subset*** of
`units` for ***each `timestep`*** (which is the most
natural way to model sequential data)
In other words, it depends on how you need to model your sequential data.
### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]
- Specify ***desired final activity*** for ***all***
`units`
- Specify ***desired final activity*** for ***all***
`units` for the ***last few `steps`***
  - This is good to learn `attractors`
  - Makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of
`units`***
  - The other `units` will be either `inputs` or
  `hidden-states`, as ***we fixed these***
In other words, it depends on which kind of output needs to be produced.
For example, a sentiment analysis task would need just one output, while
a `seq2seq` job would require a full output sequence.
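A tiny NumPy sketch of the difference (the shapes are made-up): the same sequence of hidden states serves both cases, what changes is which of them we read out:

```python
import numpy as np

T, hidden_size = 6, 4                             # illustrative sequence length and hidden size
hidden_states = np.random.randn(T, hidden_size)   # one hidden state per timestep

# Many-to-one (e.g. sentiment analysis): read out only the last hidden state.
sentiment_features = hidden_states[-1]            # shape: (4,)

# Many-to-many (e.g. seq2seq): keep one hidden state per output token.
seq2seq_features = hidden_states                  # shape: (6, 4)

print(sentiment_features.shape, seq2seq_features.shape)
```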
## Transforming `Data` for [`RNNs`](#rnns)
Since `RNNs` need vectors, the ideal way to transform inputs into vectors is either
**`1-hot`** encoding **whole words**, or transforming them into **tokens** and then
**`1-hot`** encoding those.
While this may be enough, there's a better way: during the embedding phase we transform each
`1-hot` encoded vector into a learned vector of fixed size (usually of smaller dimension).
To better understand this, imagine a vocabulary of either 16K words or tokens: we would have
16K dimensions for each vector, which is massive. Instead, by embedding it into 256 dimensions,
we save both time and space.
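A small NumPy sketch of the two options (the 16K / 256 sizes are the example numbers from the paragraph above; the embedding matrix would normally be a trained parameter, here it is just random):

```python
import numpy as np

vocab_size, embed_dim = 16_000, 256

# Option 1: 1-hot encoding, one huge sparse vector per token.
token_id = 42
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0                    # shape: (16000,)

# Option 2: learned embedding, a small dense vector per token.
embedding_matrix = np.random.randn(vocab_size, embed_dim) * 0.01
embedded = embedding_matrix[token_id]      # shape: (256,)

# The embedding lookup is equivalent to one_hot @ embedding_matrix,
# but without ever materialising the 16000-dimensional vector.
assert np.allclose(embedded, one_hot @ embedding_matrix)
```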
## RNNs Training
Since [`RNNs`](#rnns) can be considered a `deep-layered`
`NN`, we first ***run the model forward*** along the
whole sequence and then ***backpropagate*** the error
derivatives along `time-steps` (Backpropagation Through Time).
The problem is that it is ***difficult to `train`
[`RNNs`](#rnns)*** on
***long-range dependencies*** because the
***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]
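A toy illustration of why this happens: during backpropagation through time the gradient is repeatedly scaled by (roughly) the same recurrent factor, so its norm shrinks or grows exponentially with the number of timesteps (the `0.9` / `1.1` factors below are made-up numbers, not taken from the notes):

```python
def gradient_norm_after(steps: int, recurrent_factor: float) -> float:
    """Norm of a unit gradient after being pushed back `steps` timesteps,
    when each step multiplies it by `recurrent_factor` (the dominant
    singular value of the recurrent Jacobian)."""
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_factor
    return grad

for factor in (0.9, 1.1):
    print(f"factor={factor}: after 100 steps -> {gradient_norm_after(100, factor):.3e}")
# factor=0.9: after 100 steps -> 2.656e-05   (vanishing)
# factor=1.1: after 100 steps -> 1.378e+04   (exploding)
```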
### Mitigating training problems in RNNs
In order to mitigate these gradient problems, which impair our network's ability
to extract valuable information from long-term dependencies, we have these
solutions:
- **LSTM**:\
Build the model out of little modules crafted to keep values for a long time
- **Hessian-Free Optimizers**:\
Use optimizers that can detect and follow directions with a tiny gradient but an even smaller curvature
- **Echo State Networks**[^unipi-esn] (see the sketch after this list):\
The idea is to use a **`sparsely connected large untrained network`** to keep track of the
inputs for a long time (eventually they are forgotten), and a
**`trained readout network`** that converts the **`echo`** output into something usable
- **Good Initialization and Momentum**:\
Same idea as Echo State Networks, but we then learn all the connections using momentum
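A minimal Echo State Network sketch in NumPy to make the "untrained reservoir + trained readout" idea concrete (sizes, sparsity and the spectral-radius value are made-up for illustration; only the readout `W_out` is trained, here with a closed-form ridge regression):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 1, 200, 500

# Fixed, sparsely connected reservoir: these weights are random and never trained.
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res)) * (rng.random((n_res, n_res)) < 0.1)
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1 ("echo" property)

# Toy task: predict the next value of a sine wave.
u = np.sin(np.linspace(0, 20 * np.pi, T + 1))
inputs, targets = u[:-1], u[1:]

# Run the reservoir and collect its states (the "echo" of the input history).
h = np.zeros(n_res)
states = np.zeros((T, n_res))
for t in range(T):
    h = np.tanh(W_in @ inputs[t:t + 1] + W @ h)
    states[t] = h

# Trained readout: ridge regression from reservoir states to targets.
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_res), states.T @ targets)
print("train MSE:", np.mean((states @ W_out - targets) ** 2))
```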
> [!WARNING]
>
> `long-range dependencies` are more difficult to learn than `short-range` ones
> because of the gradient problem.
## Gated Cells
These are `neurons` that can be controlled to make
them `learn` or `forget` chosen pieces of information
> With ***chosen*** we mean choosing from the
> `hyperspace`, so it's not really precise.
### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]
This `cell` has a ***separate signal***, namely the
`cell-state`,
***which controls the `gates` of this `cell` and is always
initialized to `1`***.
![LSTM cell](./pngs/lstm-cell.png)
> [!NOTE]
>
> $W$ will be the weights associated with $\vec{x}$ and
> $U$ those associated with the `hidden-state` $\vec{h}$
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***
![detailed LSTM cell](./pngs/lstm-cell-detailed.png)
<!-- TODO: revise formulas -->
#### Forget Gate | Keep Gate
This `gate` ***controls the `cell-state`***:
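A standard formulation, assuming the usual LSTM convention with $W_f$, $U_f$ and $b_f$ as this gate's parameters:

$$
\begin{aligned}
f_t &= \sigma\left(W_f \vec{x}_t + U_f \vec{h}_{t-1} + b_f\right)
\end{aligned}
$$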
The closer the result of $\sigma$ is to $0$, the more
the `cell-state` will forget that value, and the opposite
for values closer to $1$.
#### Input Gate | Write Gate
***controls how much of the `input` gets into the
`cell-state`***
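A standard formulation, assuming the usual convention ($\tilde{c}_t$ is the candidate value to be written into the `cell-state`):

$$
\begin{aligned}
i_t &= \sigma\left(W_i \vec{x}_t + U_i \vec{h}_{t-1} + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c \vec{x}_t + U_c \vec{h}_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\end{aligned}
$$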
The closer the result of $\sigma$ is to $1$, the higher
the importance given to that info.
> The [`forget-gate`](#forget-gate--keep-gate) and the
> [`input-gate`](#input-gate--write-gate) are 2 phases
> of the `update-phase`.
#### Output Gate | Read Gate
***Controls how much of the
`hidden-state` is forwarded***
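A standard formulation, assuming the same convention as above:

$$
\begin{aligned}
o_t &= \sigma\left(W_o \vec{x}_t + U_o \vec{h}_{t-1} + b_o\right) \\
\vec{h}_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
$$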
Here the `backpropagation` of the ***gradient*** is way
simpler for the `cell-states` as they ***require only
elementwise multiplications***
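To tie the three gates together, a minimal NumPy sketch of a single LSTM step under the standard formulation sketched above (parameter names and sizes are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM timestep: gates -> cell-state update -> new hidden-state."""
    W, U, b = params["W"], params["U"], params["b"]   # gate parameters, stacked 4 times
    z = W @ x + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)      # forget, input and output gates
    c = f * c_prev + i * np.tanh(g)                   # cell-state update (keep + write)
    h = o * np.tanh(c)                                # output gate reads the cell-state
    return h, c

H, D = 8, 3                                           # hidden and input sizes (illustrative)
rng = np.random.default_rng(0)
params = {"W": rng.standard_normal((4 * H, D)) * 0.1,
          "U": rng.standard_normal((4 * H, H)) * 0.1,
          "b": np.zeros(4 * H)}
x, h, c = rng.standard_normal(D), np.zeros(H), np.zeros(H)
h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)                               # (8,) (8,)
```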
### GRU[^anelli-RNNs-10][^GRU-wikipedia]
It is another type of [`gated-cell`](#gated-cells), but,
unlike [`LSTM-cells`](#long-short-term-memory--lstm),
***it has no separate `cell-state` and works only on
the `hidden-state`***, while keeping
***similar performances to [`LSTM`](#long-short-term-memory--lstm)***.
![GRU cell](./pngs/gru-cell.png)
> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$
![detailed GRU cell](./pngs/gru-cell-detailed.png)
#### Update Gate
This `gate` unifies [`forget gate`](#forget-gate--keep-gate) and [`input gate`](#input-gate--write-gate)
$$
\begin{aligned}
z_t &= \sigma\left(W_z \vec{x}_t + U_z \vec{h}_{t-1} + b_z\right)
\end{aligned}
$$
#### Reset Gate
This is what breaks the `information` flow from the
previous `hidden-state`.
$$
\begin{aligned}
r_t &= \sigma\left(W_r \vec{x}_t + U_r \vec{h}_{t-1} + b_r\right)
\end{aligned}
$$
#### New `hidden-state`
$$
\begin{aligned}
\tilde{h}_t &= \tanh\left(W_h \vec{x}_t + U_h \left(r_t \odot \vec{h}_{t-1}\right) + b_h\right) \\
\vec{h}_t &= \left(1 - z_t\right) \odot \vec{h}_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$
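Similarly to the LSTM sketch above, a minimal NumPy sketch of one GRU step under this formulation (illustrative parameter names and sizes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU timestep: update gate, reset gate, candidate state, interpolation."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])             # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])             # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde                           # new hidden-state

H, D = 8, 3                                                           # illustrative sizes
rng = np.random.default_rng(0)
p = {}
for g in "zrh":
    p["W" + g] = rng.standard_normal((H, D)) * 0.1
    p["U" + g] = rng.standard_normal((H, H)) * 0.1
    p["b" + g] = np.zeros(H)
h = np.zeros(H)                                                       # h_0 = 0, as noted above
h = gru_step(rng.standard_normal(D), h, p)
print(h.shape)                                                        # (8,)
```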
<!-- TODO: Finish this part -->
### References
- [ai-master.gitbooks.io](https://ai-master.gitbooks.io/recurrent-neural-network/content/reference.html)
- [stanford.edu - CS224d-Lecture8](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf)
- [deepimagesent](http://cs.stanford.edu/people/karpathy/deepimagesent/)
- [introduction-to-rnns](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
- [implementing-a-language-model-rnn-with-python-numpy-and-theano](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/)
- [rnn-effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Understanding-LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
<!-- TODO: PDF 8 pg. 24 -->
[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136
[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)
[^unipi-esn]: [UniPI | ESN | 25th October 2025](https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/aa2/rnn4-esn.pdf)