Added images and revised notes

Christian Risi 2025-10-25 18:23:43 +02:00
parent 307a3f7c5d
commit 99607d5882


@@ -1,5 +1,15 @@
# Recurrent Networks | RNNs[^anelli-RNNs]
## Why would we want Recurrent Networks?
To deal with **sequence**-related jobs of **arbitrary length**.
In fact, they can deal with inputs of varying length while being fast and memory efficient.
While **autoregressive models** always need to analyse all past inputs
(or a window of the most recent ones) at each computation,
`RNNs` don't, making them a great tool when the situation permits it.
<!-- TODO: add images -->
## A bit of History[^anelli-RNNs-1]
@@ -33,17 +43,23 @@ such as the ***dimension of the "past" window***.
The idea behind [`RNNs`](#rnns) is to add ***memory***
as a `hidden-state`. This helps the `model` to
***"remember"*** things for a "long time", but since it
is ***noisy***, the best we can do is
to ***infer its probability distribution***, doable only
for:
- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Model`](https://en.wikipedia.org/wiki/Hidden_Markov_model)
While these models are `stochastic`, technically the **a posteriori probability distribution**
(of the `hidden-state` given the observations) is **deterministic**.
Since we can think of the [`RNNs`](#rnns) `hidden-state` as the equivalent of this **a posteriori probability distribution**, they are **`deterministic`**
<!-- TODO: check this:
, plus they are ***`non-linear`*** and their
***`hidden-state` is `distributed`***[^anelli-RNNs-3]
-->
### Neurons with Memory[^anelli-RNNs-4]
@@ -83,7 +99,11 @@ $$
> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]
## Different RNNs configurations
![RNNs different configurations](./pngs/rnns-configurations.png)
### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]
- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
@@ -91,27 +111,40 @@ $$
`units` for ***each `timestep`*** (which is the most
natural way to model sequential data)
In other words, it depends on how you need to model the data of your sequence.
### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]
- Specify ***desired final activity*** for ***all***
  `units`
- Specify ***desired final activity*** for ***all***
  `units` for the ***last few `steps`***
  - This is good to learn `attractors`
  - Makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of
  `units`***
  - The other `units` will be either `inputs` or
    `hidden-states`, as ***we fixed these***
In other words, it depends on which kind of output you need to produce.
For example, a sentiment analysis job would need just one output, while
a `seq2seq` job would require a full sequence.
## Transforming `Data` for [`RNNs`](#rnns)
Since `RNNs` need vectors, the simplest way to transform inputs into vectors is either
to **`1-hot`** encode **whole words**, or to split them into **tokens** and then
**`1-hot`** encode those.
While this may be enough, there's a better way: during the embedding phase we transform each
`1-hot` encoded vector into a learned vector of fixed size (usually of smaller dimension).
To better understand this, imagine a vocabulary of 16K words or tokens: we would have
16K dimensions for each vector, which is massive. Instead, by embedding it into 256 dimensions
we save on both time and space complexity.
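As a minimal sketch of the difference (the 16K and 256 values are just the illustrative numbers above, and the embedding matrix here is random rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 16_000   # illustrative vocabulary of 16K tokens
embed_dim = 256       # illustrative embedding dimension

token_id = 42         # hypothetical token index produced by some tokenizer

# 1-hot encoding: a 16K-dimensional vector with a single 1
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Learned embedding: a lookup into a trainable (vocab_size, embed_dim) matrix;
# multiplying the 1-hot vector by this matrix just selects the matching row
embedding_matrix = rng.normal(scale=0.01, size=(vocab_size, embed_dim))
embedded = embedding_matrix[token_id]           # shape: (256,)

assert np.allclose(embedded, one_hot @ embedding_matrix)
print(one_hot.shape, embedded.shape)            # (16000,) (256,)
```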
## RNNs Training
Since [`RNNs`](#rnns) can be considered a `deep-layered`
`NN`, we first ***train the model
@@ -126,15 +159,32 @@ derivatives along `time-steps`
The thing is that it is ***difficult to `train`
[`RNNs`](#rnns)*** on
***long-range dependencies*** because the
***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]
### Mitigating training problems in RNNs
In order to mitigate these gradient problems, which impair our network's ability
to gain valuable information over long-term dependencies, we have these
solutions:
- **LSTM**:\
Make the model out of little modules crafted to keep values for a long time
- **Hessian Free Optimizers**:\
Use optimizers that can detect directions with a tiny gradient but an even smaller curvature
- **Echo State Networks**[^unipi-esn]:\
The idea is to use a **`sparsely connected large untrained network`** to keep track of
inputs for a long time, before they are eventually forgotten, and to have a
**`trained readout network`** that converts the **`echo`** output into something usable
- **Good Initialization and Momentum**:\
Same thing as before, but we learn all connections using momentum
> [!WARNING]
>
> `long-range dependencies` are more difficult to learn than `short-range` ones
> because of the gradient problem.
## Gated Cells
These are `neurons` that can be controlled to make
them `learn` or `forget` chosen pieces of information
@@ -144,13 +194,15 @@ them `learn` or `forget` chosen pieces of information
> With ***chosen*** we mean choosing from the
> `hyperspace`, so it's not really precise.
### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]
This `cell` has a ***separate signal***, namely the
`cell-state`,
***which controls the `gates` of this `cell` and is always
initialized to `1`***.
![LSTM cell](./pngs/lstm-cell.png)
> [!NOTE]
>
> $W$ will be weights associated with $\vec{x}$ and
@@ -162,9 +214,11 @@ initialized to `1`***.
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***
![detailed LSTM cell](./pngs/lstm-cell-detailed.png)
<!-- TODO: revise formulas -->
#### Forget Gate | Keep Gate
This `gate` ***controls the `cell-state`***:
@@ -178,7 +232,7 @@ The closer the result of $\sigma$ is to $0$, the more
the `cell-state` will forget that value, and the opposite
for values closer to $1$.
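As a small made-up numeric example: if for one component $\sigma$ outputs $0.1$ and the previous `cell-state` value is $2.0$, only $0.1 \cdot 2.0 = 0.2$ survives the elementwise product, so $90\%$ of that value is forgotten; with an output of $0.9$, $1.8$ would be kept instead.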
#### Input Gate | Write Gate
***controls how much of the `input` gets into the
`cell-state`***
@@ -203,7 +257,7 @@ the importance given to that info.
> [`input-gate`](#input-gate--write-gate) are 2 phases
> of the `update-phase`.
#### Output Gate | Read Gate
***Controls how much of the
`hidden-state` is forwarded***
@@ -225,7 +279,7 @@ Here the `backpropagation` of the ***gradient*** is way
simpler for the `cell-states` as they ***require only
elementwise multiplications***
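As a minimal `numpy` sketch of a single LSTM step in the standard formulation (weight names, shapes and the toy sequence are illustrative and not taken from these notes; only the `cell-state` initialized to `1` follows the text above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time-step; W, U, b stack the parameters of the
    forget (f), input (i), candidate (g) and output (o) gates."""
    z = W @ x + U @ h_prev + b               # shape: (4 * hidden,)
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g                   # elementwise cell-state update
    h = o * np.tanh(c)                       # gated hidden-state / output
    return h, c

hidden, inputs = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden, inputs)) * 0.1
U = rng.normal(size=(4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.ones(hidden)     # cell-state initialized to 1, as stated above
for x in rng.normal(size=(5, inputs)):       # a toy sequence of 5 steps
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)                      # (8,) (8,)
```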
### GRU[^anelli-RNNs-10][^GRU-wikipedia]
It is another type of [`gated-cell`](#gated-cells), but,
unlike [`LSTM-cells`](#long-short-term-memory--lstm),
@@ -233,13 +287,15 @@ unlike [`LSTM-cells`](#long-short-term-memory--lstm),
the `hidden-state`***, while keeping
***similar performances to [`LSTM`](#long-short-term-memory--lstm)***.
![GRU cell](./pngs/gru-cell.png)
> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$
![detailed GRU cell](./pngs/gru-cell-detailed.png)
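A minimal `numpy` sketch of one GRU step, under one common convention for combining the gates (weight names, shapes and the toy sequence are illustrative; the exact notation is in the formulas of the following subsections):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time-step: update gate z, reset gate r, no output gate."""
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate hidden-state
    return (1.0 - z) * h_prev + z * h_tilde        # new hidden-state

hidden, inputs = 8, 4
rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.normal(size=(hidden, inputs)) * 0.1 for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(hidden, hidden)) * 0.1 for _ in range(3))

h = np.zeros(hidden)                               # h_0 = 0, as in the note above
for x in rng.normal(size=(5, inputs)):             # a toy sequence of 5 steps
    h = gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh)
print(h.shape)                                     # (8,)
```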
#### Update Gate
This `gate` unifies [`forget gate`](#forget-gate--keep-gate) and [`input gate`](#input-gate--write-gate)
@@ -254,7 +310,7 @@
\end{aligned}
$$
#### Reset Gate
This is what breaks the `information` flow from the
previous `hidden-state`.
@@ -267,7 +323,7 @@ $$
\end{aligned}
$$
#### New `hidden-state`
$$
\begin{aligned}
@@ -308,13 +364,15 @@ understanding***
<!-- TODO: Finish this part -->
### References
- [ai-master.gitbooks.io](https://ai-master.gitbooks.io/recurrent-neural-network/content/reference.html)
- [stanford.edu - CS224d-Lecture8](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf)
- [deepimagesent](http://cs.stanford.edu/people/karpathy/deepimagesent/)
- [introduction-to-rnns](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
- [implementing-a-language-model-rnn-with-python-numpy-and-theano](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/)
- [rnn-effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Understanding-LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
<!-- TODO: PDF 8 pg. 24 -->
@@ -352,3 +410,5 @@ understanding***
[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136
[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)
[^unipi-esn]: [UniPI | ESN | 25th October 2025](https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/aa2/rnn4-esn.pdf)