# Recurrent Networks | RNNs[^anelli-RNNs]

## A bit of History[^anelli-RNNs-1]

In order to ***predict the future***, we need ***information about the past***. This is the idea behind `RNNs` for ***predicting the next item in a `sequence`***.

This prediction has also been attempted with `memoryless models`, but they didn't hold up to expectations and ***had several limitations***, such as the ***dimension of the "past" window***:

- [`Autoregressive Models`](https://en.wikipedia.org/wiki/Autoregressive_model)
- [`Feed-Forward Neural Networks`](https://en.wikipedia.org/wiki/Feedforward_neural_network)

### Shortcomings of previous attempts[^anelli-RNNs-1]

- The `context window` was ***small***, thus the `model` couldn't use ***distant past dependencies***
- Some tried to ***count words***, but counts ***don't preserve meaning***
- Some tried to ***make the `context window` bigger***, but this ***caused words to be treated differently based on their position***, making it ***impossible to reuse `weights` for the same words***

## RNNs[^anelli-RNNs-2]

The idea behind [`RNNs`](#rnns) is to add ***memory*** in the form of a `hidden-state`. This helps the `model` ***"remember"*** things for a "long time", but the `hidden-state` is ***noisy***, so the best we can do is ***infer its probability distribution***, which is tractable only for:

- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Models`](https://en.wikipedia.org/wiki/Hidden_Markov_model)

While these models are `stochastic`, ***[`RNNs`](#rnns) are `deterministic`***; moreover, they are ***`non-linear`*** and their ***`hidden-state` is `distributed`***[^anelli-RNNs-3]

### Neurons with Memory[^anelli-RNNs-4]

While in normal `NNs` we have no ***memory***, ***these `neurons` have a `hidden-state`,*** $\vec{h}$***, which is fed back to the `neuron` itself***. The formula of this `hidden-state` is:

$$
\vec{h}_t = f_{W}(\vec{x}_t, \vec{h}_{t-1})
$$

In other words, ***the `hidden-state` is computed by a function parameterized by `weights`*** and ***dependent on the current `inputs` and the previous step's `hidden-state`***.
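A minimal `NumPy` sketch of such a `neuron` with memory (here $f_W$ is assumed to be a $\tanh$ of a weighted sum, matching the example that follows; the weight names and sizes are only illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh):
    """One recurrence step: h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t)."""
    return np.tanh(W_hh.T @ h_prev + W_xh.T @ x_t)

def rnn_output(h_t, W_hy):
    """Linear read-out of the hidden-state: y_t = W_hy^T h_t."""
    return W_hy.T @ h_t

# Toy usage: 4-dimensional inputs, 3-dimensional hidden-state, 2-dimensional outputs
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(4, 3)), rng.normal(size=(3, 3)), rng.normal(size=(3, 2))

h = np.zeros(3)                    # initial hidden-state
for x in rng.normal(size=(5, 4)):  # a sequence of 5 inputs; the same weights are reused at every step
    h = rnn_step(x, h, W_xh, W_hh)
    y = rnn_output(h, W_hy)
```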
For example, let's say we use a $\tanh$ `activation-function`:

$$
\vec{h}_t = \tanh( W_{h, h}^T \vec{h}_{t-1} + W_{x, h}^T \vec{x}_{t} )
$$

And the `output` becomes:

$$
\vec{\bar{y}}_t = W_{h, y}^T \vec{h}_{t}
$$

> [!NOTE]
> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]

#### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]

- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
- Specify `initial-states` for the same ***subset*** of `units` at ***each `timestep`*** (which is the most natural way to model sequential data)

#### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]

- Specify the ***desired final activity*** of ***all*** `units`
- Specify the ***desired final activity*** of ***all*** `units` for the ***last few `steps`***
  - This is good for learning `attractors`
  - It makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of `units`***
  - The other `units` are either `inputs` or `hidden-states`, since ***these are the ones we fixed***

#### Transforming `Data` to be used in [`RNNs`](#rnns)

- One-hot encoding: each `token` is a single $1$ in the `input` array
- Learned embeddings: each `token` is a `point` in a ***learned hyperspace***

### Backpropagation

Since [`RNNs`](#rnns) can be considered a `deep-layered` `NN`, we first ***run the model forward over the sequence*** and ***then `backpropagate`***, keeping track of the ***stack of activations*** and adding the derivatives along the `time-steps`

> [!CAUTION]
>
> If you have ***big gradients***, remember to `clip`
> them

The problem is that it is ***difficult to `train` [`RNNs`](#rnns)*** on ***long-range dependencies***, because the ***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]

> [!WARNING]
>
> `long-range dependencies` tend to have a smaller
> impact on the system than `short-range` ones

### Gated Cells

These are `neurons` that can be controlled to make them `learn` or `forget` chosen pieces of information

> [!CAUTION]
>
> By ***chosen*** we mean chosen from the
> `hyperspace`, so it's not really precise.

#### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]

This `cell` has a ***separate signal***, namely the `cell-state`, ***which is controlled by the `gates` of this `cell` and is initialized to `1`***.

> [!NOTE]
>
> $W$ will be the weights associated with $\vec{x}$ and
> $U$ those associated with $\vec{h}$.
>
> The `cell-state` has the same dimension as the
> `hidden-state`
>
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***

##### Forget Gate | Keep Gate

This `gate` ***controls the `cell-state`***:

$$
\hat{c}_{t} = \sigma \left( U_f h_{t-1} + W_f x_t + b_f \right) \odot c_{t-1}
$$

The closer the result of $\sigma$ is to $0$, the more the `cell-state` will forget that value, and the opposite for values closer to $1$.

##### Input Gate | Write Gate

***Controls how much of the `input` gets into the `cell-state`***

$$
c_{t} = \left( \sigma \left( U_i h_{t-1} + W_i x_t + b_i \right) \odot \tanh \left( U_c h_{t-1} + W_c x_t + b_c \right) \right) + \hat{c}_{t}
$$

The results of $\tanh$ are ***new pieces of `information`***. The higher the result of the `input-gate`'s $\sigma$, the higher the importance given to that info.

> [!NOTE]
>
> The [`forget gate`](#forget-gate--keep-gate) and the
> [`input-gate`](#input-gate--write-gate) are the 2 phases
> of the `update-phase`.
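A minimal `NumPy` sketch of this `update-phase` (the two `gates` above combined; the parameter packing and the sizes in the toy usage are only illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_update_phase(x_t, h_prev, c_prev, params):
    """Forget/keep gate + input/write gate: produces the new cell-state c_t."""
    U_f, W_f, b_f, U_i, W_i, b_i, U_c, W_c, b_c = params

    # Forget | Keep gate: how much of the previous cell-state survives
    c_hat = sigmoid(U_f @ h_prev + W_f @ x_t + b_f) * c_prev

    # Input | Write gate: how much of the new candidate information gets written
    candidate = np.tanh(U_c @ h_prev + W_c @ x_t + b_c)
    return sigmoid(U_i @ h_prev + W_i @ x_t + b_i) * candidate + c_hat

# Toy usage: 4-dimensional inputs, 3-dimensional hidden- and cell-states
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(3, 3), (3, 4), (3,)] * 3]
c = lstm_update_phase(rng.normal(size=4),
                      np.zeros(3),   # previous hidden-state
                      np.ones(3),    # cell-state initialized to 1, as noted above
                      params)
```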
##### Output Gate | Read Gate

***Controls how much of the `cell-state` is forwarded as the new `hidden-state`***

$$
h_{t} = \tanh (c_{t}) \odot \sigma \left( U_o h_{t-1} + W_o x_t + b_o \right)
$$

This produces the ***new `hidden-state`***. ***Notice that the `info` comes from the `cell-state`, `gated` by the `input` and the `previous hidden-state`***

---

Here the `backpropagation` of the ***gradient*** through the `cell-states` is much simpler, as it ***requires only elementwise multiplications***

#### GRU[^anelli-RNNs-10][^GRU-wikipedia]

It is another type of [`gated-cell`](#gated-cells) but, contrary to [`LSTM-cells`](#long-short-term-memory--lstm), ***it doesn't have a separate `cell-state`, only the `hidden-state`***, while keeping ***performances similar to [`LSTM`](#long-short-term-memory--lstm)*** (a code sketch of the full `GRU` step follows the [Pros, Cons and Quirks](#pros-cons-and-quirks) section below).

> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$

##### Update Gate

This `gate` unifies the [`forget gate`](#forget-gate--keep-gate) and the [`input gate`](#input-gate--write-gate)

$$
\hat{h}_t = \left( 1 - \sigma \left( U_z h_{t-1} + W_z x_{t} + b_z \right) \right) \odot h_{t-1}
$$

##### Reset Gate

This is what breaks the `information` flow from the previous `hidden-state`.

$$
\bar{h}_t = \sigma\left( U_r h_{t-1} + W_r x_{t} + b_r \right) \odot h_{t-1}
$$

##### New `hidden-state`

$$
h_t = \hat{h}_t + \sigma \left( U_z h_{t-1} + W_z x_{t} + b_z \right) \odot \tanh \left( U_h \bar{h}_t + W_h x_t + b_h \right)
$$

> [!TIP]
>
> There's no clear winner between [`GRU`](#gru) and
> [`LSTM`](#long-short-term-memory--lstm), so
> try them both; however, the
> former is ***easier to compute***

### Bi-LSTM[^anelli-RNNs-12][^Bi-LSTM-stackoverflow]

It is a technique in which we use 2 `LSTM` `networks`, ***one to remember the `past` and one to remember the `future`***. This type of `network` ***improves context understanding***

### Applications[^anelli-RNNs-11]

- Music Generation
- Sentiment Classification
- Machine Translation
- Attention Mechanisms

### Pros, Cons and Quirks

#### Pros

#### Cons

- ***hard to train***

#### Quirks
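As referenced in the [`GRU`](#gru) section, here is a minimal `NumPy` sketch of one full `GRU` step, combining the three equations above (the parameter packing and the sizes in the toy usage are only illustrative, following the same $U$, $W$, $b$ convention used for the `LSTM`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step: update gate, reset gate, then the new hidden-state."""
    U_z, W_z, b_z, U_r, W_r, b_r, U_h, W_h, b_h = params

    z = sigmoid(U_z @ h_prev + W_z @ x_t + b_z)   # update gate
    r = sigmoid(U_r @ h_prev + W_r @ x_t + b_r)   # reset gate

    h_hat = (1.0 - z) * h_prev                    # part of the old hidden-state that is kept
    h_bar = r * h_prev                            # information flow "broken" by the reset gate
    candidate = np.tanh(U_h @ h_bar + W_h @ x_t + b_h)

    return h_hat + z * candidate                  # new hidden-state h_t

# Toy usage: 4-dimensional inputs, 3-dimensional hidden-state, h_0 = 0
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(3, 3), (3, 4), (3,)] * 3]
h = np.zeros(3)
for x in rng.normal(size=(5, 4)):
    h = gru_step(x, h, params)
```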
[^anelli-RNNs]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8
[^anelli-RNNs-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 11 to 20
[^anelli-RNNs-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 21 to 22
[^anelli-RNNs-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 23
[^anelli-RNNs-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 25
[^anelli-RNNs-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 43 to 47
[^anelli-RNNs-6]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 50
[^anelli-RNNs-7]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 51
[^anelli-RNNs-8]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 69 to 87
[^anelli-RNNs-9]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 91 to 112
[^LSTM-wikipedia]: [LSTM | Wikipedia | 27th April 2025](https://en.wikipedia.org/wiki/Long_short-term_memory)
[^anelli-RNNs-10]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 113 to 118
[^GRU-wikipedia]: [GRU | Wikipedia | 27th April 2025](https://en.wikipedia.org/wiki/Gated_recurrent_unit)
[^anelli-RNNs-11]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 119 to 126
[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136
[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)