From 99607d5882ab539de03bdb00f5659d093a52fd0f Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Sat, 25 Oct 2025 18:23:43 +0200
Subject: [PATCH] Added images and revised notes

---
 Chapters/8-Recurrent-Networks/INDEX.md | 132 ++++++++++++++++++-------
 1 file changed, 96 insertions(+), 36 deletions(-)

diff --git a/Chapters/8-Recurrent-Networks/INDEX.md b/Chapters/8-Recurrent-Networks/INDEX.md
index 2b3c8fd..86ecfb0 100644
--- a/Chapters/8-Recurrent-Networks/INDEX.md
+++ b/Chapters/8-Recurrent-Networks/INDEX.md
@@ -1,5 +1,15 @@
# Recurrent Networks | RNNs[^anelli-RNNs]

+## Why would we want Recurrent Networks?
+
+To deal with **sequence**-related jobs of **arbitrary length**.
+
+In fact, they can deal with inputs of varying length while being fast and memory efficient.
+
+While **autoregressive models** always need to analyse all past inputs
+(or a window of the most recent ones) at each computation,
+`RNNs` don't, making them a great tool when the situation permits it.
+

## A bit of History[^anelli-RNNs-1]

@@ -33,17 +43,23 @@
such as the ***dimension of the "past" window***.

The idea behind [`RNNs`](#rnns) is to add
***memory*** as a `hidden-state`. This helps the `model` to
-***"remember"*** things for "long time", but it
-is ***noisy***, and as such, the best we can do is
+***"remember"*** things for a "long time", but since it
+is ***noisy***, the best we can do is
to ***infer its probability distribution***, doable only for:

- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Model`](https://en.wikipedia.org/wiki/Hidden_Markov_model)

-While these models are `stochastic`,
-***[`RNNs`](#rnns) are `deterministic`***, plus they are ***`non-linear`*** and their
+While these models are `stochastic`, technically their **a posteriori probability distribution** is
+**deterministic**.
+
+Since we can think of an [`RNN's`](#rnns) `hidden state` as the equivalent of such an **a posteriori probability distribution**, [`RNNs`](#rnns) are **`deterministic`**.
+

### Neurons with Memory[^anelli-RNNs-4]

@@ -83,7 +99,11 @@
$$

> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]

-#### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]
+## Different RNNs configurations
+
+![RNNs different configurations](./pngs/rnns-configurations.png)
+
+### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]

- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
- Specify `initial-states` of a ***subset*** of
  `units` for ***each `timestep`*** (which is the most
  natural way to model sequential data)

-#### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]
+In other words, it depends on how you need to model the data in your sequence.
+
+### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]

- Specify ***desired final activity*** for ***all*** `units`
- Specify ***desired final activity*** for ***all***
  `units` for the ***last few `steps`***
-  - This is good to learn `attractors`
-  - Makes it easy to add ***extra error derivatives***
+    - This is good to learn `attractors`
+    - Makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of `units`***
-  - The other `units` will be either `inputs` or
+    - The other `units` will be either `inputs` or
    `hidden-states`, as ***we fixed these***

-#### Transforming `Data` to be used in [`RNNs`](#rnns)
+In other words, it depends on which kind of output you need to produce:
-- One-hot encoding: Here each `token` is a $1$ over
-    the `input` array
-- Learned embeddings: Here each `token` is a `point`
-    of a ***learned hyperspace***
+for example, a sentiment analysis job would need just one output, while
+a `seq2seq` job would require a full sequence.
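+
+To make this concrete, below is a minimal NumPy sketch of a vanilla `RNN` cell
+unrolled over a sequence (the function name, sizes and weight names are illustrative
+assumptions, not something fixed by these notes). Returning only the last `hidden-state`
+matches a "one output per sequence" job such as sentiment analysis, while returning
+every `hidden-state` matches a `seq2seq`-style job:
+
+```python
+import numpy as np
+
+def rnn_forward(x_seq, W, U, b, h0=None, return_sequence=False):
+    """Unroll a vanilla RNN: h_t = tanh(W x_t + U h_{t-1} + b)."""
+    h = np.zeros(U.shape[0]) if h0 is None else h0   # initial hidden state
+    states = []
+    for x_t in x_seq:                      # works for any sequence length
+        h = np.tanh(W @ x_t + U @ h + b)   # same weights reused at every timestep
+        states.append(h)
+    return np.stack(states) if return_sequence else h
+
+# Illustrative sizes: sequence of 5 steps, 8 input features, 16 hidden units
+rng = np.random.default_rng(0)
+W = rng.normal(size=(16, 8))
+U = rng.normal(size=(16, 16))
+b = np.zeros(16)
+x_seq = rng.normal(size=(5, 8))
+
+last_h = rnn_forward(x_seq, W, U, b)                       # "many-to-one" output
+all_h = rnn_forward(x_seq, W, U, b, return_sequence=True)  # "many-to-many" output
+print(last_h.shape, all_h.shape)  # (16,) (5, 16)
+```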

-### Backpropagation
+## Transforming `Data` for [`RNNs`](#rnns)
+
+Since `RNNs` need vectors as input, the simplest way to transform text is to either
+**`1-hot`** encode **whole words**, or split the text into **tokens** and then
+**`1-hot`** encode those.
+
+While this may be enough, a better way is to transform each
+**`1-hot`** encoded vector into a learned vector of fixed (and usually much smaller) size during the embedding phase.
+
+To better understand this, imagine a vocabulary of 16K words or tokens: we would have
+16K dimensions for each vector, which is massive. Instead, by embedding it into 256 dimensions
+we save both time and space.
+
+## RNNs Training

Since [`RNNs`](#rnns) can be considered a
`deep-layered` `NN`, then we firstly ***train the model

@@ -126,15 +159,32 @@
derivatives along `time-steps`

The thing is that it is ***difficult to `train`
[`RNNs`](#rnns)*** on
-***long-range dependencies*** because either the
-***gradient will `vanish` or `explode`***[^anelli-RNNs-8]
+***long-range dependencies*** because the
+***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]
+
+### Mitigating training problems in RNNs
+
+In order to mitigate these gradient problems, which impair the network's ability
+to extract valuable information from long-term dependencies, we have these
+solutions:
+
+- **LSTM**:\
+  Make the model out of little modules crafted to keep values for a long time
+- **Hessian Free Optimizers**:\
+  Use optimizers that can follow directions with a tiny gradient but an even smaller curvature
+- **Echo State Networks**[^unipi-esn]:\
+  The idea is to use a **`large, sparsely connected, untrained network`** to keep track of
+  inputs for a long time (until they eventually fade), and a
+  **`trained readout network`** that converts the **`echo`** output into something usable
+- **Good Initialization and Momentum**:\
+  Same idea as before, but all connections are learned, using momentum

> [!WARNING]
>
-> `long-range dependencies` tend to have a smaller
-> impact on the system than `short-range` ones
+> `long-range dependencies` are more difficult to learn than `short-range` ones
+> because of the gradient problem.

-### Gated Cells
+## Gated Cells

These are `neurons` that can be controlled to make
them `learn` or `forget` chosen pieces of information

> [!NOTE]
>
> With ***chosen*** we mean choosing from the
> `hyperspace`, so it's not really precise.

-#### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]
+### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]

This `cell` has a ***separate signal***, namely
the `cell-state`, ***which controls the `gates` of
this `cell`, always
initialized to `1`***.

+![LSTM cell](./pngs/lstm-cell.png)
+
> [!NOTE]
>
> $W$ will be the weights associated with $\vec{x}$ and
> $U$ those associated with the `hidden-state`
> $\vec{h}$, while $b$ is the `bias`
>
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***

+![detailed LSTM cell](./pngs/lstm-cell-detailed.png)
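+
+The `gates` described in the next subsections can be combined into a single update step.
+Below is a minimal NumPy sketch of one `LSTM` step (parameter names are illustrative
+assumptions following the $W$, $U$, $b$ notation above, not a reference implementation):
+
+```python
+import numpy as np
+
+def sigmoid(x):
+    return 1.0 / (1.0 + np.exp(-x))
+
+def lstm_step(x, h_prev, c_prev, p):
+    """One LSTM step; W* act on the input x, U* on the hidden state, b* are biases."""
+    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])        # forget / keep gate
+    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])        # input / write gate
+    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])        # output / read gate
+    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])  # candidate values
+    c = f * c_prev + i * c_tilde   # cell-state update: only elementwise (Hadamard) ops
+    h = o * np.tanh(c)             # how much of the cell-state is read out
+    return h, c
+```
+
+Note how the `cell-state` $c$ is only touched by elementwise operations, which is what
+makes its gradient path simpler than the one through the `hidden-state`.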

-##### Forget Gate | Keep Gate
+#### Forget Gate | Keep Gate

This `gate` ***controls the `cell-state`***:

@@ -178,7 +232,7 @@
The closer the result of $\sigma$ is to $0$, the
more the `cell-state` will forget that value, and
the opposite for values closer to $1$.

-##### Input Gate | Write Gate
+#### Input Gate | Write Gate

***controls how much of the `input` gets into the
`cell-state`***

@@ -203,7 +257,7 @@
the importance given to that info.

> [!NOTE]
>
> The [`forget-gate`](#forget-gate--keep-gate) and
> [`input-gate`](#input-gate--write-gate) are 2 phases
> of the `update-phase`.

-##### Output Gate | Read Gate
+#### Output Gate | Read Gate

***Controls how much of the `hidden-state` is
forwarded***

@@ -225,7 +279,7 @@
Here the `backpropagation` of the ***gradient*** is
way simpler for the `cell-states` as they ***require
only elementwise multiplications***

-#### GRU[^anelli-RNNs-10][^GRU-wikipedia]
+### GRU[^anelli-RNNs-10][^GRU-wikipedia]

It is another type of [`gated-cell`](#gated-cells), but,
on the contrary of [`LSTM-cells`](#long-short-term-memory--lstm),
@@ -233,13 +287,15 @@
the `hidden-state`***, while keeping ***similar
performances to [`LSTM`](#long-short-term-memory--lstm)***.

+![GRU cell](./pngs/gru-cell.png)
+
> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$

+![detailed GRU cell](./pngs/gru-cell-detailed.png)

-##### Update Gate
+#### Update Gate

This `gate` unifies the [`forget gate`](#forget-gate--keep-gate)
and [`input gate`](#input-gate--write-gate)
of [`LSTM-cells`](#long-short-term-memory--lstm)
into a single `gate`.

$$
\begin{aligned}
@@ -254,7 +310,7 @@
\end{aligned}
$$

-##### Reset Gate
+#### Reset Gate

This is what breaks the `information` flow from the
previous `hidden-state`.

$$
\begin{aligned}
@@ -267,7 +323,7 @@
\end{aligned}
$$

-##### New `hidden-state`
+#### New `hidden-state`

$$
\begin{aligned}
@@ -308,13 +364,15 @@
understanding***



-#### Pros
+### References

-#### Cons
-
-- ***hard to train***
-
-#### Quirks
+- [ai-master.gitbooks.io](https://ai-master.gitbooks.io/recurrent-neural-network/content/reference.html)
+- [stanford.edu - CS224d-Lecture8](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf)
+- [deepimagesent](http://cs.stanford.edu/people/karpathy/deepimagesent/)
+- [introduction-to-rnns](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
+- [implementing-a-language-model-rnn-with-python-numpy-and-theano](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/)
+- [rnn-effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
+- [Understanding-LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)



@@ -352,3 +410,5 @@
[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136

[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)
+
+[^unipi-esn]: [UniPI | ESN | 25th October 2025](https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/aa2/rnn4-esn.pdf)
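+
+As a companion to the `LSTM` sketch above, here is a minimal NumPy sketch of a single
+`GRU` step, following the standard formulation with `update` and `reset` gates and no
+`output-gate` (parameter names are illustrative assumptions; note that conventions differ
+on whether $z_t$ or $1 - z_t$ weighs the previous `hidden-state`):
+
+```python
+import numpy as np
+
+def sigmoid(x):
+    return 1.0 / (1.0 + np.exp(-x))
+
+def gru_step(x, h_prev, p):
+    """One GRU step; W* act on the input x, U* on the hidden state, b* are biases."""
+    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])              # update gate
+    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])              # reset gate
+    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
+    return (1.0 - z) * h_prev + z * h_tilde                            # new hidden state
+```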