From 99607d5882ab539de03bdb00f5659d093a52fd0f Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Sat, 25 Oct 2025 18:23:43 +0200
Subject: [PATCH] Added images and revised notes

---
 Chapters/8-Recurrent-Networks/INDEX.md | 132 ++++++++++++++++++-------
 1 file changed, 96 insertions(+), 36 deletions(-)

diff --git a/Chapters/8-Recurrent-Networks/INDEX.md b/Chapters/8-Recurrent-Networks/INDEX.md
index 2b3c8fd..86ecfb0 100644
--- a/Chapters/8-Recurrent-Networks/INDEX.md
+++ b/Chapters/8-Recurrent-Networks/INDEX.md
@@ -1,5 +1,15 @@
# Recurrent Networks | RNNs[^anelli-RNNs]

+## Why would we want Recurrent Networks?
+
+To deal with **sequence**-related jobs of **arbitrary length**.
+
+In fact, they can deal with inputs of varying length while being fast and memory efficient.
+
+While **autoregressive models** always need to analyse all past inputs
+(or a window of the most recent ones) at each computation,
+`RNNs` don't, making them a great tool when the situation permits it.
+

## A bit of History[^anelli-RNNs-1]

@@ -33,17 +43,23 @@
such as the ***dimension of the "past" window***.

The idea behind [`RNNs`](#rnns) is to add
***memory*** as a `hidden-state`. This helps the `model` to
-***"remember"*** things for "long time", but it
-is ***noisy***, and as such, the best we can do is
+***"remember"*** things for a "long time", but since it
+is ***noisy***, the best we can do is
to ***infer its probability distribution***, doable only for:

- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Model`](https://en.wikipedia.org/wiki/Hidden_Markov_model)

-While these models are `stochastic`,
-***[`RNNs`](#rnns) are `deterministic`***, plus they are ***`non-linear`*** and their
+While these models are `stochastic`, technically their **a posteriori probability distribution** is
+**deterministic**.
+
+Since we can think of an [`RNN's`](#rnns) `hidden state` as the equivalent of such an **a posteriori probability distribution**, [`RNNs`](#rnns) are **`deterministic`**.
+

### Neurons with Memory[^anelli-RNNs-4]

@@ -83,7 +99,11 @@
$$

> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]

-#### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]
+## Different RNNs configurations
+
+![RNNs different configurations](./pngs/rnns-configurations.png)
+
+### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]

- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
- Specify `initial-states` of a ***subset*** of
  `units` for ***each `timestep`*** (which is the most
  natural way to model sequential data)

-#### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]
+In other words, it depends on how you need to model the data in your sequence.
+
+### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]

- Specify ***desired final activity*** for ***all*** `units`
- Specify ***desired final activity*** for ***all***
  `units` for the ***last few `steps`***
-  - This is good to learn `attractors`
-  - Makes it easy to add ***extra error derivatives***
+    - This is good to learn `attractors`
+    - Makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of `units`***
-  - The other `units` will be either `inputs` or
+    - The other `units` will be either `inputs` or
    `hidden-states`, as ***we fixed these***

-#### Transforming `Data` to be used in [`RNNs`](#rnns)
+In other words, it depends on which kind of output you need to produce:
-- One-hot encoding: Here each `token` is a $1$ over
-    the `input` array
-- Learned embeddings: Here each `token` is a `point`
-    of a ***learned hyperspace***
+for example, a sentiment analysis job would need just one output, while
+a `seq2seq` job would require a full sequence.
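+
+To make this concrete, below is a minimal NumPy sketch of a vanilla `RNN` cell
+unrolled over a sequence (the function name, sizes and weight names are illustrative
+assumptions, not something fixed by these notes). Returning only the last `hidden-state`
+matches a "one output per sequence" job such as sentiment analysis, while returning
+every `hidden-state` matches a `seq2seq`-style job:
+
+```python
+import numpy as np
+
+def rnn_forward(x_seq, W, U, b, h0=None, return_sequence=False):
+    """Unroll a vanilla RNN: h_t = tanh(W x_t + U h_{t-1} + b)."""
+    h = np.zeros(U.shape[0]) if h0 is None else h0   # initial hidden state
+    states = []
+    for x_t in x_seq:                      # works for any sequence length
+        h = np.tanh(W @ x_t + U @ h + b)   # same weights reused at every timestep
+        states.append(h)
+    return np.stack(states) if return_sequence else h
+
+# Illustrative sizes: sequence of 5 steps, 8 input features, 16 hidden units
+rng = np.random.default_rng(0)
+W = rng.normal(size=(16, 8))
+U = rng.normal(size=(16, 16))
+b = np.zeros(16)
+x_seq = rng.normal(size=(5, 8))
+
+last_h = rnn_forward(x_seq, W, U, b)                       # "many-to-one" output
+all_h = rnn_forward(x_seq, W, U, b, return_sequence=True)  # "many-to-many" output
+print(last_h.shape, all_h.shape)  # (16,) (5, 16)
+```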

-### Backpropagation
+## Transforming `Data` for [`RNNs`](#rnns)
+
+Since `RNNs` need vectors as input, the simplest way to transform text is to either
+**`1-hot`** encode **whole words**, or split the text into **tokens** and then
+**`1-hot`** encode those.
+
+While this may be enough, a better way is to transform each
+**`1-hot`** encoded vector into a learned vector of fixed (and usually much smaller) size during the embedding phase.
+
+To better understand this, imagine a vocabulary of 16K words or tokens: we would have
+16K dimensions for each vector, which is massive. Instead, by embedding it into 256 dimensions
+we save both time and space.
+
+## RNNs Training

Since [`RNNs`](#rnns) can be considered a
`deep-layered` `NN`, then we firstly ***train the model

@@ -126,15 +159,32 @@
derivatives along `time-steps`

The thing is that it is ***difficult to `train`
[`RNNs`](#rnns)*** on
-***long-range dependencies*** because either the
-***gradient will `vanish` or `explode`***[^anelli-RNNs-8]
+***long-range dependencies*** because the
+***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]
+
+### Mitigating training problems in RNNs
+
+In order to mitigate these gradient problems, which impair the network's ability
+to extract valuable information from long-term dependencies, we have these
+solutions:
+
+- **LSTM**:\
+  Make the model out of little modules crafted to keep values for a long time
+- **Hessian Free Optimizers**:\
+  Use optimizers that can follow directions with a tiny gradient but an even smaller curvature
+- **Echo State Networks**[^unipi-esn]:\
+  The idea is to use a **`large, sparsely connected, untrained network`** to keep track of
+  inputs for a long time (until they eventually fade), and a
+  **`trained readout network`** that converts the **`echo`** output into something usable
+- **Good Initialization and Momentum**:\
+  Same idea as before, but all connections are learned, using momentum

> [!WARNING]
>
-> `long-range dependencies` tend to have a smaller
-> impact on the system than `short-range` ones
+> `long-range dependencies` are more difficult to learn than `short-range` ones
+> because of the gradient problem.

-### Gated Cells
+## Gated Cells

These are `neurons` that can be controlled to make
them `learn` or `forget` chosen pieces of information

> [!NOTE]
>
> With ***chosen*** we mean choosing from the
> `hyperspace`, so it's not really precise.

-#### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]
+### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]

This `cell` has a ***separate signal***, namely
the `cell-state`, ***which controls the `gates` of
this `cell`, always
initialized to `1`***.

+![LSTM cell](./pngs/lstm-cell.png)
+
> [!NOTE]
>
> $W$ will be the weights associated with $\vec{x}$ and
> $U$ those associated with the `hidden-state`
> $\vec{h}$, while $b$ is the `bias`
>
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***

+![detailed LSTM cell](./pngs/lstm-cell-detailed.png)
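+
+The `gates` described in the next subsections can be combined into a single update step.
+Below is a minimal NumPy sketch of one `LSTM` step (parameter names are illustrative
+assumptions following the $W$, $U$, $b$ notation above, not a reference implementation):
+
+```python
+import numpy as np
+
+def sigmoid(x):
+    return 1.0 / (1.0 + np.exp(-x))
+
+def lstm_step(x, h_prev, c_prev, p):
+    """One LSTM step; W* act on the input x, U* on the hidden state, b* are biases."""
+    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])        # forget / keep gate
+    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])        # input / write gate
+    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])        # output / read gate
+    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])  # candidate values
+    c = f * c_prev + i * c_tilde   # cell-state update: only elementwise (Hadamard) ops
+    h = o * np.tanh(c)             # how much of the cell-state is read out
+    return h, c
+```
+
+Note how the `cell-state` $c$ is only touched by elementwise operations, which is what
+makes its gradient path simpler than the one through the `hidden-state`.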

-##### Forget Gate | Keep Gate
+#### Forget Gate | Keep Gate

This `gate` ***controls the `cell-state`***:

@@ -178,7 +232,7 @@
The closer the result of $\sigma$ is to $0$, the
more the `cell-state` will forget that value, and
the opposite for values closer to $1$.

-##### Input Gate | Write Gate
+#### Input Gate | Write Gate

***controls how much of the `input` gets into the
`cell-state`***

@@ -203,7 +257,7 @@
the importance given to that info.

> [!NOTE]
>
> The [`forget-gate`](#forget-gate--keep-gate) and
> [`input-gate`](#input-gate--write-gate) are 2 phases
> of the `update-phase`.

-##### Output Gate | Read Gate
+#### Output Gate | Read Gate

***Controls how much of the `hidden-state` is
forwarded***

@@ -225,7 +279,7 @@
Here the `backpropagation` of the ***gradient*** is
way simpler for the `cell-states` as they ***require
only elementwise multiplications***

-#### GRU[^anelli-RNNs-10][^GRU-wikipedia]
+### GRU[^anelli-RNNs-10][^GRU-wikipedia]

It is another type of [`gated-cell`](#gated-cells), but,
on the contrary of [`LSTM-cells`](#long-short-term-memory--lstm),
@@ -233,13 +287,15 @@
the `hidden-state`***, while keeping ***similar
performances to [`LSTM`](#long-short-term-memory--lstm)***.

+![GRU cell](./pngs/gru-cell.png)
+
> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$

+![detailed GRU cell](./pngs/gru-cell-detailed.png)

-##### Update Gate
+#### Update Gate

This `gate` unifies the [`forget gate`](#forget-gate--keep-gate)
and [`input gate`](#input-gate--write-gate)
of [`LSTM-cells`](#long-short-term-memory--lstm)
into a single `gate`.

$$
\begin{aligned}
@@ -254,7 +310,7 @@
\end{aligned}
$$

-##### Reset Gate
+#### Reset Gate

This is what breaks the `information` flow from the
previous `hidden-state`.

$$
\begin{aligned}
@@ -267,7 +323,7 @@
\end{aligned}
$$

-##### New `hidden-state`
+#### New `hidden-state`

$$
\begin{aligned}
@@ -308,13 +364,15 @@
understanding***



-#### Pros
+### References

-#### Cons
-
-- ***hard to train***
-
-#### Quirks
+- [ai-master.gitbooks.io](https://ai-master.gitbooks.io/recurrent-neural-network/content/reference.html)
+- [stanford.edu - CS224d-Lecture8](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf)
+- [deepimagesent](http://cs.stanford.edu/people/karpathy/deepimagesent/)
+- [introduction-to-rnns](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
+- [implementing-a-language-model-rnn-with-python-numpy-and-theano](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/)
+- [rnn-effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
+- [Understanding-LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)



@@ -352,3 +410,5 @@
[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136

[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)
+
+[^unipi-esn]: [UniPI | ESN | 25th October 2025](https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/aa2/rnn4-esn.pdf)
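+
+As a companion to the `LSTM` sketch above, here is a minimal NumPy sketch of a single
+`GRU` step, following the standard formulation with `update` and `reset` gates and no
+`output-gate` (parameter names are illustrative assumptions; note that conventions differ
+on whether $z_t$ or $1 - z_t$ weighs the previous `hidden-state`):
+
+```python
+import numpy as np
+
+def sigmoid(x):
+    return 1.0 / (1.0 + np.exp(-x))
+
+def gru_step(x, h_prev, p):
+    """One GRU step; W* act on the input x, U* on the hidden state, b* are biases."""
+    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])              # update gate
+    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])              # reset gate
+    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
+    return (1.0 - z) * h_prev + z * h_tilde                            # new hidden state
+```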