Added images and revised notes
# Recurrent Networks | RNNs[^anelli-RNNs]

## Why would we want Recurrent Networks?

To deal with **sequence**-related jobs of **arbitrary length**.

In fact, they can deal with inputs of varying length while being fast and memory efficient.

While **autoregressive models** always need to analyse all past inputs
(or a window of the most recent ones) at each computation,
`RNNs` don't, which makes them a great tool when the situation permits it.

<!-- TODO: add images -->

## A bit of History[^anelli-RNNs-1]
The idea behind [`RNNs`](#rnns) is to add ***memory***
as a `hidden-state`. This helps the `model` to
***"remember"*** things for a "long time", but since it
is ***noisy***, the best we can do is
to ***infer its probability distribution***, doable only
for:

- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Model`](https://en.wikipedia.org/wiki/Hidden_Markov_model)

While these models are `stochastic`, technically the **a posteriori probability distribution** is
**deterministic**.

Since we can think of the [`RNNs`](#rnns) `hidden-state` as the equivalent of an **a posteriori probability distribution**, they are **`deterministic`**.

<!-- TODO: check this:
, plus they are ***`non-linear`*** and their
***`hidden-state` is `distributed`***[^anelli-RNNs-3]
-->

### Neurons with Memory[^anelli-RNNs-4]

> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]
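
To make the "memory as a `hidden-state`" idea concrete, here is a minimal NumPy sketch of the standard recurrence $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b)$; the names (`rnn_step`, `W_xh`, `W_hh`) and the sizes are ours, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions chosen only for illustration
input_size, hidden_size = 4, 8

# Parameters of a single recurrent "neuron with memory" layer
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the "memory")
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time-step: the new hidden-state depends on the input AND the previous hidden-state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unrolling over a sequence reuses the same weights at every step,
# which is why an RNN can be seen as a (very) deep NN with shared weights.
sequence = rng.normal(size=(5, input_size))   # 5 time-steps
h = np.zeros(hidden_size)                     # initial hidden-state
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h.shape)  # (8,) -- a fixed-size summary of an arbitrary-length sequence
```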
## Different RNNs configurations

![](/assets/Pasted%20image%2020250427104837.png)

### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]

- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
- Specify the `states` of the same subset of
  `units` for ***each `timestep`*** (which is the most
  natural way to model sequential data)

In other words, it depends on how you need to model data according to your sequence.

### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]
- Specify ***desired final activity*** for ***all***
  `units`
- Specify ***desired final activity*** for ***all***
  `units` for the ***last few `steps`***
  - This is good to learn `attractors`
  - Makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of
  `units`***
  - The other `units` will be either `inputs` or
    `hidden-states`, as ***we fixed these***

In other words, it depends on which kind of output needs to be produced.

For example, a sentiment analysis job would need just one output, while
a `seq2seq` job would require a full sequence.
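
A quick sketch of that difference, assuming a generic `rnn_step` like the one above (names and sizes are ours): a many-to-one job reads only the last `hidden-state`, while a seq2seq job keeps one output per time-step.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size, num_classes = 4, 8, 3

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(num_classes, hidden_size))  # readout

def rnn_step(x_t, h_prev):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

sequence = rng.normal(size=(6, input_size))
h = np.zeros(hidden_size)
hidden_states = []
for x_t in sequence:
    h = rnn_step(x_t, h)
    hidden_states.append(h)

# Many-to-one (e.g. sentiment analysis): only the last hidden-state is read out
y_single = W_hy @ hidden_states[-1]                           # shape (3,)

# Many-to-many (e.g. seq2seq tagging): one output per time-step
y_sequence = np.stack([W_hy @ h_t for h_t in hidden_states])  # shape (6, 3)

print(y_single.shape, y_sequence.shape)
```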
## Transforming `Data` for [`RNNs`](#rnns)

Since `RNNs` need vectors, the ideal way to transform inputs into vectors is either
having a **`1-hot`** encoding over **whole words**, or transforming them into **tokens** and then
**`1-hot`** encoding those.

While this may be enough, there is a better way, where we transform each
`1-hot` encoded vector into a learned vector of fixed size (usually of smaller dimension) during the embedding phase.

To better understand this, imagine a vocabulary of 16K words or tokens: we would have
16K dimensions for each vector, which is massive. Instead, by embedding it into 256 dimensions
we can save both time and space complexity.
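
A rough sketch of that saving, using the 16K-token vocabulary and 256-dimensional embedding from the example above (the variable names and the random table are ours, just for illustration):

```python
import numpy as np

vocab_size, embed_dim = 16_000, 256

# 1-hot encoding: each token is a 16,000-dimensional vector with a single 1
def one_hot(token_id: int) -> np.ndarray:
    v = np.zeros(vocab_size)
    v[token_id] = 1.0
    return v

# Learned embedding: a (vocab_size x embed_dim) table trained with the rest of the model;
# looking a token up just selects a row, which equals multiplying the 1-hot vector by the table.
embedding_table = np.random.default_rng(2).normal(scale=0.01, size=(vocab_size, embed_dim))

def embed(token_id: int) -> np.ndarray:
    return embedding_table[token_id]

token_id = 42
assert np.allclose(one_hot(token_id) @ embedding_table, embed(token_id))

print(one_hot(token_id).shape)  # (16000,) -- sparse and huge
print(embed(token_id).shape)    # (256,)   -- dense and compact
```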
## RNNs Training

Since [`RNNs`](#rnns) can be considered a `deep-layered`
`NN`, we firstly ***train the model*** on its unrolled version and then backpropagate the ***error
derivatives along `time-steps`***.

The thing is that it is ***difficult to `train`
[`RNNs`](#rnns)*** on
***long-range dependencies*** because the
***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]
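
A quick way to see why, under the simplifying assumption of a *linear* recurrence: backpropagating through $T$ steps multiplies the gradient by the recurrent matrix $T$ times, so it shrinks or blows up depending on that matrix's largest singular value. The construction below is only illustrative (an orthogonal matrix rescaled by `scale`).

```python
import numpy as np

rng = np.random.default_rng(3)
hidden_size, T = 8, 50

grad = rng.normal(size=hidden_size)  # gradient arriving at the last time-step

for scale in (0.5, 1.5):
    # Recurrent matrix whose singular values are all equal to `scale`
    W_hh = scale * np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))[0]
    g = grad.copy()
    # Backpropagating through T steps of a linear recurrence multiplies by W_hh^T each step
    for _ in range(T):
        g = W_hh.T @ g
    print(f"singular values = {scale}: gradient norm after {T} steps = {np.linalg.norm(g):.3e}")

# Typical output: around 1e-15 for 0.5 (vanishing) and around 1e+9 for 1.5 (exploding)
```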
### Mitigating training problems in RNNs

In order to mitigate these gradient problems, which impair our network's ability
to gain valuable information over long-term dependencies, we have these
solutions:

- **LSTM**:\
  Make the model out of little modules crafted to keep values for a long time
- **Hessian Free Optimizers**:\
  Use optimizers that can see the gradient direction over smaller curvatures
- **Echo State Networks**[^unipi-esn] (see the sketch after this list):\
  The idea is to use a **`sparsely connected large untrained network`** to keep track of
  inputs for a long time (they are eventually forgotten), and have a
  **`trained readout network`** that converts the **`echo`** output into something usable
- **Good Initialization and Momentum**:\
  Same thing as before, but we learn all connections using momentum
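
As an example of the Echo State Network idea above, here is a compact sketch: the reservoir is random and stays untrained, and only the linear readout is fit. The sparsity level, spectral-radius rescaling and ridge-regression readout are standard choices we assume here, not prescriptions from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
in_size, reservoir_size, T = 1, 200, 500

# Large, sparse, *untrained* recurrent reservoir, rescaled so its spectral radius is < 1
# (this is what makes past inputs "echo" for a while and eventually fade away).
W_res = rng.normal(size=(reservoir_size, reservoir_size))
W_res *= rng.random(W_res.shape) < 0.05                   # keep ~5% of the connections
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # spectral radius ~ 0.9
W_in = rng.uniform(-0.5, 0.5, size=(reservoir_size, in_size))

# Toy task: predict the next value of a sine wave
u = np.sin(np.linspace(0, 20 * np.pi, T + 1))
inputs, targets = u[:-1], u[1:]

# Collect reservoir states (the "echo" of the input history)
states = np.zeros((T, reservoir_size))
x = np.zeros(reservoir_size)
for t in range(T):
    x = np.tanh(W_in @ inputs[t:t + 1] + W_res @ x)
    states[t] = x

# Only the readout is trained, here with ridge regression
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(reservoir_size), states.T @ targets)

predictions = states @ W_out
print("train MSE:", np.mean((predictions - targets) ** 2))
```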
> [!WARNING]
>
> `long-range dependencies` are more difficult to learn than `short-range` ones
> because of the gradient problem.

## Gated Cells

These are `neurons` that can be controlled to make
them `learn` or `forget` chosen pieces of information.

> With ***chosen*** we mean choosing from the
> `hyperspace`, so it's not really precise.
### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]

This `cell` has a ***separate signal***, namely the
`cell-state`,
***which controls the `gates` of this `cell` and is always
initialized to `1`***.

![](/assets/Pasted%20image%2020250427105459.png)

> [!NOTE]
>
> $W$ will be the weights associated with $\vec{x}$ and
> $U$ those associated with $\vec{h}_{t-1}$;
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***

![](/assets/Pasted%20image%2020250504172600.png)
<!-- TODO: revise formulas -->

#### Forget Gate | Keep Gate

This `gate` ***controls the `cell-state`***:

The closer the result of $\sigma$ is to $0$, the more
the `cell-state` will forget that value, and the opposite
for values closer to $1$.

#### Input Gate | Write Gate

***Controls how much of the `input` gets into the
`cell-state`***

> [!NOTE]
>
> The [`forget-gate`](#forget-gate--keep-gate) and the
> [`input-gate`](#input-gate--write-gate) are 2 phases
> of the `update-phase`.

#### Output Gate | Read Gate

***Controls how much of the
`hidden-state` is forwarded***

Here the `backpropagation` of the ***gradient*** is way
simpler for the `cell-states`, as they ***require only
elementwise multiplications***
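
To tie the three gates together, here is a minimal NumPy sketch of one LSTM step using the standard formulation ($W$ acting on $\vec{x}_t$, $U$ on $\vec{h}_{t-1}$, elementwise products for the `cell-state`); the bias terms, parameter names and sizes are our assumptions and may differ from the slides' exact formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p is a dict of parameters W_* (input), U_* (recurrent), b_* (bias)."""
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])        # forget/keep gate
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])        # input/write gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])        # output/read gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate values
    c = f * c_prev + i * c_tilde        # cell-state update: only elementwise operations
    h = o * np.tanh(c)                  # how much of the (squashed) cell-state is forwarded
    return h, c

# Tiny usage example with random parameters
rng = np.random.default_rng(5)
input_size, hidden_size = 4, 8
p = {}
for g in ("f", "i", "o", "c"):
    p[f"W_{g}"] = rng.normal(scale=0.1, size=(hidden_size, input_size))
    p[f"U_{g}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    p[f"b_{g}"] = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.ones(hidden_size)   # cell-state initialized to 1, as noted above
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, p)
print(h.shape, c.shape)
```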
### GRU[^anelli-RNNs-10][^GRU-wikipedia]

It is another type of [`gated-cell`](#gated-cells) but,
contrary to [`LSTM-cells`](#long-short-term-memory--lstm),
***it works directly on
the `hidden-state`*** (there is no separate `cell-state`), while keeping
***similar performances to [`LSTM`](#long-short-term-memory--lstm)***.

![](/assets/Pasted%20image%2020250427105725.png)

> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$

![](/assets/Pasted%20image%2020250504183418.png)

#### Update Gate

This `gate` unifies the [`forget gate`](#forget-gate--keep-gate) and the [`input gate`](#input-gate--write-gate):

$$
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right)
\end{aligned}
$$
#### Reset Gate

This is what breaks the `information` flow from the
previous `hidden-state`.

$$
\begin{aligned}
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right)
\end{aligned}
$$
#### New `hidden-state`

$$
\begin{aligned}
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right)\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$
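
Putting the three formulas above together, a minimal NumPy sketch of one GRU step (the variable names and the omission of biases are our choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; there is no separate cell-state, only the hidden-state."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)              # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)              # reset gate
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev))  # candidate hidden-state
    return (1.0 - z) * h_prev + z * h_tilde                      # new hidden-state

rng = np.random.default_rng(6)
input_size, hidden_size = 4, 8
p = {f"{m}_{g}": rng.normal(scale=0.1, size=(hidden_size, input_size if m == "W" else hidden_size))
     for m in ("W", "U") for g in ("z", "r", "h")}

h = np.zeros(hidden_size)   # h_0 = 0, as noted above
for x_t in rng.normal(size=(5, input_size)):
    h = gru_step(x_t, h, p)
print(h.shape)
```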
<!-- TODO: Finish this part -->

### References

- [ai-master.gitbooks.io](https://ai-master.gitbooks.io/recurrent-neural-network/content/reference.html)
- [stanford.edu - CS224d-Lecture8](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf)
- [deepimagesent](http://cs.stanford.edu/people/karpathy/deepimagesent/)
- [introduction-to-rnns](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
- [implementing-a-language-model-rnn-with-python-numpy-and-theano](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/)
- [rnn-effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Understanding-LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

<!-- TODO: PDF 8 pg. 24 -->
[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136

[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)

[^unipi-esn]: [UniPI | ESN | 25th October 2025](https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/aa2/rnn4-esn.pdf)