# Recurrent Networks | RNNs[^anelli-RNNs]

<!-- TODO: add images -->

## A bit of History[^anelli-RNNs-1]

In order to ***predict the future***, we need
***information about the past***. This is the idea behind
`RNNs` for
***predicting the next item in a `sequence`***.

While this prediction has been attempted with
`memoryless models`, they didn't live up to expectations
and ***had several limitations***,
such as the ***size of the "past" window***.

- [`Autoregressive Models`](https://en.wikipedia.org/wiki/Autoregressive_model)
- [`Feed-Forward Neural Networks`](https://en.wikipedia.org/wiki/Feedforward_neural_network)

### Shortcomings of previous attempts[^anelli-RNNs-1]

- The `context window` was ***small***, so the `model`
  couldn't use ***distant past dependencies***
- Some tried to ***count words***, but that
  ***doesn't preserve meaning***
- Some tried to ***make the `context window` bigger***,
  but this
  ***caused words to be treated differently based
  on their position***, making it
  ***impossible to reuse `weights` for the same words***.

## RNNs[^anelli-RNNs-2]

The idea behind [`RNNs`](#rnns) is to add ***memory***
as a `hidden-state`. This helps the `model` to
***"remember"*** things for a "long time", but it
is ***noisy***, and as such, the best we can do is
to ***infer its probability distribution***, which is
doable only for:

- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Model`](https://en.wikipedia.org/wiki/Hidden_Markov_model)

While these models are `stochastic`,
***[`RNNs`](#rnns) are `deterministic`***; moreover, they are ***`non-linear`*** and their
***`hidden-state` is `distributed`***[^anelli-RNNs-3]

### Neurons with Memory[^anelli-RNNs-4]

While in normal `NNs` we have no ***memory***, ***these
`neurons` have a `hidden-state`,*** $\vec{h}$ ***,
which is <u>fed back</u> to the `neuron` itself***.

<!-- TODO: Add image -->

The formula of this `hidden-state` is:

$$
\vec{h}_t = f_{W}(\vec{x}_t, \vec{h}_{t-1})
$$

In other words, ***the `hidden-state` is produced by
a function parameterized by `weights`*** and
***dependent on the current `inputs` and the previous
step's `hidden-state`***.

For example, let's say we use a $\tanh$
`activation-function`:

$$
\vec{h}_t = \tanh(
    W_{h, h}^T \vec{h}_{t-1} + W_{x, h}^T \vec{x}_{t}
)
$$

And the `output` becomes:

$$
\vec{\bar{y}}_t = W_{h, y}^T \vec{h}_{t}
$$

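As a concrete illustration, here is a minimal NumPy sketch of the $\tanh$ recurrence and `output` above; the dimensions, random `weights`, and toy sequence are assumptions made for the example, not values from the material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, chosen arbitrarily for this sketch
input_dim, hidden_dim, output_dim = 4, 8, 3

# Weights named after the formulas above: W_{h,h}, W_{x,h}, W_{h,y}
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(hidden_dim, output_dim))

def rnn_step(x_t, h_prev):
    """One recurrence step: h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t)."""
    h_t = np.tanh(W_hh.T @ h_prev + W_xh.T @ x_t)
    y_t = W_hy.T @ h_t  # output: y_t = W_hy^T h_t
    return h_t, y_t

# Unroll over a toy 5-step sequence, starting from a zero hidden-state
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, y = rnn_step(x_t, h)
```
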
> [!NOTE]
> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`, one `layer` per `timestep`[^anelli-RNNs-5]

#### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]

- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
- Specify `initial-states` for the same ***subset*** of
  `units` for ***each `timestep`*** (which is the most
  natural way to model sequential data)

#### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]

- Specify the ***desired final activity*** for ***all***
  `units`
- Specify the ***desired final activity*** for ***all***
  `units` for the ***last few `steps`***
  - This is good for learning `attractors`
  - Makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of
  `units`***
  - The other `units` will be either `inputs` or
    `hidden-states`, as ***we fixed these***

#### Transforming `Data` to be used in [`RNNs`](#rnns)

- One-hot encoding: here each `token` is a $1$ in
  the `input` array
- Learned embeddings: here each `token` is a `point`
  in a ***learned hyperspace***

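A small sketch of both encodings, with a hypothetical 3-`token` vocabulary and an untrained 4-dimensional `embedding` table chosen purely for illustration:

```python
import numpy as np

# Hypothetical 3-token vocabulary, purely for illustration
vocab = ["the", "cat", "sat"]
token_id = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    """A 1 at the token's index of the input array, 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[token_id[token]] = 1.0
    return v

# A learned embedding is a trainable lookup table: each token is a
# point in a hyperspace (here 4-dimensional and untrained).
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), 4))

def embed(token):
    return embedding_table[token_id[token]]

print(one_hot("cat"))  # [0. 1. 0.]
print(embed("cat"))    # a dense 4-dimensional point
```
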
### Backpropagation

Since [`RNNs`](#rnns) can be considered a `deep-layered`
`NN`, we first ***run the model forward
over the sequence*** and ***then `backpropagate`***,
keeping the forward activations on a ***stack*** and
accumulating derivatives along the `time-steps`.

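As a sketch of this unroll-then-backpropagate idea, the function below runs the $\tanh$ recurrence forward while stacking the activations, then walks backwards through the `time-steps` accumulating the gradient with respect to $W_{h,h}$; restricting the loss to the last `hidden-state` is a simplification made here for brevity.

```python
import numpy as np

def bptt_dW_hh(xs, W_hh, W_xh, dL_dh_last):
    """Gradient of a loss w.r.t. W_hh for h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t).

    Assumes, for brevity, that the loss depends only on the last
    hidden-state, with gradient dL_dh_last.
    """
    hs = [np.zeros(W_hh.shape[0])]
    for x_t in xs:  # forward over the sequence, stacking activations
        hs.append(np.tanh(W_hh.T @ hs[-1] + W_xh.T @ x_t))

    dW = np.zeros_like(W_hh)
    dh = dL_dh_last
    for t in range(len(xs), 0, -1):    # then backpropagate through time
        dz = dh * (1.0 - hs[t] ** 2)   # tanh'(z) = 1 - tanh(z)^2
        dW += np.outer(hs[t - 1], dz)  # accumulate derivatives per time-step
        dh = W_hh @ dz                 # propagate the gradient to h_{t-1}
    return dW
```
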
> [!CAUTION]
>
> If you have ***big gradients***, remember to `clip`
> them

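A minimal sketch of `norm`-based clipping (the `max_norm` threshold is an arbitrary choice):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient whenever its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```
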
The problem is that it is ***difficult to `train`
[`RNNs`](#rnns)*** on
***long-range dependencies***, because the
***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]

> [!WARNING]
>
> `long-range dependencies` tend to have a smaller
> impact on the system than `short-range` ones

### Gated Cells

These are `neurons` that can be controlled to make
them `learn` or `forget` chosen pieces of information.

> [!CAUTION]
>
> By ***chosen*** we mean chosen from the
> `hyperspace`, so it's not really precise.

#### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]

This `cell` has a ***separate signal***, namely the
`cell-state`,
***which controls the `gates` of this `cell`, always
initialized to `1`***.

> [!NOTE]
>
> $W$ denotes the weights associated with $\vec{x}$ and
> $U$ those associated with $\vec{h}$.
>
> The `cell-state` has the same dimension as the
> `hidden-state`
>
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***

<!-- TODO: Add images -->

##### Forget Gate | Keep Gate

This `gate` ***controls the `cell-state`***:

$$
\hat{c}_{t} = \sigma \left(
    U_f h_{t-1} + W_f x_t + b_f
\right) \odot c_{t-1}
$$

The closer the result of $\sigma$ is to $0$, the more
the `cell-state` will forget that value, and the
opposite for values closer to $1$.

##### Input Gate | Write Gate

***Controls how much of the `input` gets into the
`cell-state`***:

$$
c_{t} = \left(
    \sigma \left(
        U_i h_{t-1} + W_i x_t + b_i
    \right) \odot \tanh \left(
        U_c h_{t-1} + W_c x_t + b_c
    \right)
\right) + \hat{c}_{t}
$$

The results of $\tanh$ are ***new pieces of
`information`***. The higher the $\sigma_i$, the higher
the importance given to that info.

> [!NOTE]
>
> The [`forget gate`](#forget-gate--keep-gate) and the
> [`input-gate`](#input-gate--write-gate) are the 2 phases
> of the `update-phase`.

##### Output Gate | Read Gate

***Controls how much of the
`hidden-state` is forwarded***:

$$
h_{t} = \tanh (c_{t}) \odot \sigma \left(
    U_o h_{t-1} + W_o x_t + b_o
\right)
$$

This produces the ***new `hidden-state`***.
***Notice that
the `info` comes from the `cell-state`,
`gated` by the `input` and the `previous hidden-state`***

---

Here the `backpropagation` of the ***gradient*** is way
simpler for the `cell-states`, as they ***require only
elementwise multiplications***

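Putting the three `gates` together, here is a minimal NumPy sketch of one [`LSTM`](#long-short-term-memory--lstm) step following the equations above; the parameter dictionary `p` (keys like `U_f`, `W_f`, `b_f`) is a layout assumed here for compactness.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names like "U_f", "W_f", "b_f" to arrays."""
    # Forget | Keep gate: how much of c_{t-1} survives
    c_hat = sigmoid(p["U_f"] @ h_prev + p["W_f"] @ x_t + p["b_f"]) * c_prev
    # Input | Write gate: how much new information enters the cell-state
    i = sigmoid(p["U_i"] @ h_prev + p["W_i"] @ x_t + p["b_i"])
    new_info = np.tanh(p["U_c"] @ h_prev + p["W_c"] @ x_t + p["b_c"])
    c_t = i * new_info + c_hat
    # Output | Read gate: how much of the cell-state is forwarded
    o = sigmoid(p["U_o"] @ h_prev + p["W_o"] @ x_t + p["b_o"])
    h_t = np.tanh(c_t) * o
    return h_t, c_t
```
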
#### GRU[^anelli-RNNs-10][^GRU-wikipedia]

It is another type of [`gated-cell`](#gated-cells) but,
unlike [`LSTM-cells`](#long-short-term-memory--lstm),
***it doesn't have a separate `cell-state`, only
the `hidden-state`***, while keeping
***performance similar to [`LSTM`](#long-short-term-memory--lstm)***.

> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$

<!-- TODO: Add images -->

##### Update Gate

This `gate` unifies the [`forget gate`](#forget-gate--keep-gate) and the [`input gate`](#input-gate--write-gate)

$$
\hat{h}_t = \left(
    1 - \sigma \left(
        U_z h_{t-1} + W_z x_{t} + b_z
    \right)
\right) \odot h_{t-1}
$$

##### Reset Gate

This is what breaks the `information` flow from the
previous `hidden-state`.

$$
\bar{h}_t = \sigma \left(
    U_r h_{t-1} + W_r x_{t} + b_r
\right) \odot h_{t-1}
$$

##### New `hidden-state`

$$
h_t = \hat{h}_t + \sigma \left(
    U_z h_{t-1} + W_z x_{t} + b_z
\right) \odot \tanh \left(
    U_h \bar{h}_t + W_h x_t + b_h
\right)
$$

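As with [`LSTM`](#long-short-term-memory--lstm), here is a minimal NumPy sketch of one [`GRU`](#gru) step following the three equations above; the parameter dictionary `p` is again an assumed layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps names like "U_z", "W_z", "b_z" to arrays."""
    # Update gate, shared by the kept part of h_{t-1} and the new part
    z = sigmoid(p["U_z"] @ h_prev + p["W_z"] @ x_t + p["b_z"])
    h_hat = (1.0 - z) * h_prev
    # Reset gate: breaks the information flow from the previous hidden-state
    h_bar = sigmoid(p["U_r"] @ h_prev + p["W_r"] @ x_t + p["b_r"]) * h_prev
    # New hidden-state
    return h_hat + z * np.tanh(p["U_h"] @ h_bar + p["W_h"] @ x_t + p["b_h"])
```
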
> [!TIP]
>
> There's no clear winner between [`GRU`](#gru) and
> [`LSTM`](#long-short-term-memory--lstm), so
> try them both; however, the
> former is ***easier to compute***

### Bi-LSTM[^anelli-RNNs-12][^Bi-LSTM-stackoverflow]

It is a technique in which we combine 2 `LSTM` `networks`,
***one to remember the `past` and one to remember the
`future`*** (the latter reads the sequence backwards).

This type of `network` ***improves context
understanding***

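A minimal sketch of the idea, where `step_fwd` and `step_bwd` stand for any `LSTM` step (e.g., the `lstm_step` sketch above, with its parameters bound); this interface is an assumption made for the example:

```python
import numpy as np

def bilstm(xs, step_fwd, step_bwd, hidden_dim):
    """step_fwd / step_bwd: any LSTM step with signature (x, h, c) -> (h, c)."""
    h_f = c_f = np.zeros(hidden_dim)
    h_b = c_b = np.zeros(hidden_dim)
    fwd, bwd = [], []
    for x_t in xs:                     # one network reads the past
        h_f, c_f = step_fwd(x_t, h_f, c_f)
        fwd.append(h_f)
    for x_t in reversed(xs):           # the other reads the future
        h_b, c_b = step_bwd(x_t, h_b, c_b)
        bwd.append(h_b)
    bwd.reverse()                      # realign with the forward direction
    # Each timestep now carries both past and future context
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```
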
### Applications[^anelli-RNNs-11]

- Music Generation
- Sentiment Classification
- Machine Translation
- Attention Mechanisms

<!-- TODO: research about Attention for RNNs -->

### Pros, Cons and Quirks

<!-- TODO: Finish this part -->

#### Pros

#### Cons

- ***hard to train***

#### Quirks

<!-- TODO: PDF 8 pg. 24 -->

<!-- Footnotes -->

[^anelli-RNNs]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8

[^anelli-RNNs-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 11 to 20

[^anelli-RNNs-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 21 to 22

[^anelli-RNNs-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 23

<!-- TODO: find bounds of topic -->
[^anelli-RNNs-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 25

[^anelli-RNNs-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 43 to 47

[^anelli-RNNs-6]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 50

[^anelli-RNNs-7]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 51

[^anelli-RNNs-8]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 69 to 87

[^anelli-RNNs-9]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 91 to 112

[^LSTM-wikipedia]: [LSTM | Wikipedia | 27th April 2025](https://en.wikipedia.org/wiki/Long_short-term_memory)

[^anelli-RNNs-10]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 113 to 118

[^GRU-wikipedia]: [GRU | Wikipedia | 27th April 2025](https://en.wikipedia.org/wiki/Gated_recurrent_unit)

[^anelli-RNNs-11]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 119 to 126

[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136

[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)