diff --git a/Chapters/8-Recurrent-Networks/INDEX.md b/Chapters/8-Recurrent-Networks/INDEX.md
index e69de29..2b3c8fd 100644
--- a/Chapters/8-Recurrent-Networks/INDEX.md
+++ b/Chapters/8-Recurrent-Networks/INDEX.md
@@ -0,0 +1,354 @@
+# Recurrent Networks | RNNs[^anelli-RNNs]
+
+## A bit of History[^anelli-RNNs-1]
+
+In order to ***predict the future***, we need
+***information about the past***. This is the idea behind
+`RNNs` for
+***predicting the next item in a `sequence`***.
+
+Earlier attempts tackled this prediction with
+`memoryless models`, but these didn't hold up to
+expectations and ***had several limitations***,
+such as the ***fixed dimension of the "past" window***:
+
+- [`Autoregressive Models`](https://en.wikipedia.org/wiki/Autoregressive_model)
+- [`Feed-Forward Neural Networks`](https://en.wikipedia.org/wiki/Feedforward_neural_network)
+
+### Shortcomings of previous attempts[^anelli-RNNs-1]
+
+- The `context window` was ***small***, so the `model`
+  couldn't use ***distant past dependencies***
+- Some tried to ***count words***, but counting
+  ***doesn't preserve meaning***
+- Some tried to ***make the `context window` bigger***,
+  but this
+  ***caused words to be treated differently based
+  on their position***, making it
+  ***impossible to reuse `weights` for the same words***.
+
+## RNNs[^anelli-RNNs-2]
+
+The idea behind [`RNNs`](#rnns) is to add ***memory***
+as a `hidden-state`. This helps the `model` to
+***"remember"*** things for a "long time", but the
+`hidden-state` is ***noisy***, so the best we can do is
+to ***infer its probability distribution***, which is
+tractable only for:
+
+- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
+- [`Hidden Markov Models`](https://en.wikipedia.org/wiki/Hidden_Markov_model)
+
+While these models are `stochastic`,
+***[`RNNs`](#rnns) are `deterministic`***; moreover, they are ***`non-linear`*** and their
+***`hidden-state` is `distributed`***[^anelli-RNNs-3]
+
+### Neurons with Memory[^anelli-RNNs-4]
+
+While in normal `NNs` we have no ***memory***, ***these
+`neurons` have a `hidden-state`,*** $\vec{h}$ ***,
+which is fed back to the `neuron` itself***.
+
+The formula of this `hidden-state` is:
+
+$$
+\vec{h}_t = f_{W}(\vec{x}_t, \vec{h}_{t-1})
+$$
+
+In other words, ***the `hidden-state` is produced by
+a function parameterized by `weights`*** that
+***depends on the current `inputs` and on the previous
+step's `hidden-state`***.
+
+For example, let's say we use a $\tanh$
+`activation-function`:
+
+$$
+\vec{h}_t = \tanh(
+    W_{h, h}^T \vec{h}_{t-1} + W_{x, h}^T \vec{x}_{t}
+)
+$$
+
+And the `output` becomes:
+
+$$
+\vec{\bar{y}}_t = W_{h, y}^T \vec{h}_{t}
+$$
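+
+To make the recurrence concrete, here is a minimal
+`NumPy` sketch of this $\tanh$ `cell` and its linear
+readout, unrolled over a toy `sequence` (the sizes,
+random weights, and variable names are illustrative
+assumptions, not taken from the slides):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Toy sizes: 4-dimensional inputs, 3 hidden units, 2 outputs
+input_size, hidden_size, output_size = 4, 3, 2
+W_xh = rng.normal(size=(input_size, hidden_size))   # W_{x,h}
+W_hh = rng.normal(size=(hidden_size, hidden_size))  # W_{h,h}
+W_hy = rng.normal(size=(hidden_size, output_size))  # W_{h,y}
+
+def rnn_step(x_t, h_prev):
+    """One time-step: h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t), y_t = W_hy^T h_t."""
+    h_t = np.tanh(W_hh.T @ h_prev + W_xh.T @ x_t)
+    y_t = W_hy.T @ h_t
+    return h_t, y_t
+
+# Unroll over a toy sequence of 5 inputs, starting from h_0 = 0
+h = np.zeros(hidden_size)
+for x_t in rng.normal(size=(5, input_size)):
+    h, y = rnn_step(x_t, h)
+```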
+
+> [!NOTE]
+> Technically speaking, we could consider
+> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]
+
+#### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]
+
+- Specify `initial-states` of ***all*** `units`
+- Specify `initial-states` for a ***subset*** of `units`
+- Specify `initial-states` for the same ***subset*** of
+  `units` at ***each `timestep`*** (which is the most
+  natural way to model sequential data)
+
+#### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]
+
+- Specify the ***desired final activity*** of ***all***
+  `units`
+- Specify the ***desired final activity*** of ***all***
+  `units` for the ***last few `steps`***
+  - This is good for learning `attractors`
+  - It makes it easy to add ***extra error derivatives***
+- Specify the ***desired activity of a subset of
+  `units`***
+  - The other `units` will be either `inputs` or
+    `hidden-states`, as ***we fixed these***
+
+#### Transforming `Data` to be used in [`RNNs`](#rnns)
+
+- One-hot encoding: each `token` is a $1$ in
+  the `input` array
+- Learned embeddings: each `token` is a `point`
+  of a ***learned hyperspace***
+
+### Backpropagation
+
+Since [`RNNs`](#rnns) can be considered a `deep-layered`
+`NN`, we first ***run the model forward over the
+sequence*** and ***then `backpropagate`***, keeping the
+forward activations on a ***stack*** and accumulating
+derivatives across `time-steps`.
+
+> [!CAUTION]
+>
+> If you have ***big gradients***, remember to `clip`
+> them
+
+The problem is that it is ***difficult to `train`
+[`RNNs`](#rnns)*** on
+***long-range dependencies***, because the
+***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]
+
+> [!WARNING]
+>
+> `long-range dependencies` tend to have a smaller
+> impact on the system than `short-range` ones
+
+### Gated Cells
+
+These are `neurons` that can be controlled to make
+them `learn` or `forget` chosen pieces of information.
+
+> [!CAUTION]
+>
+> By ***chosen*** we mean chosen from the
+> `hyperspace`, so the selection is not really precise.
+
+#### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]
+
+This `cell` carries a ***separate signal***, namely the
+`cell-state`,
+***which is regulated by the `gates` of the `cell` and
+is always initialized to `1`***.
+
+> [!NOTE]
+>
+> $W$ denotes the weights associated with $\vec{x}$ and
+> $U$ those associated with $\vec{h}$.
+>
+> The `cell-state` has the same dimension as the
+> `hidden-state`
+>
+> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
+> ***pointwise product***
+
+##### Forget Gate | Keep Gate
+
+This `gate` ***controls the `cell-state`***:
+
+$$
+\hat{c}_{t} = \sigma \left(
+    U_f h_{t-1} + W_f x_t + b_f
+\right) \odot c_{t-1}
+$$
+
+The closer the result of $\sigma$ is to $0$, the more
+the `cell-state` forgets that value; the closer it is
+to $1$, the more the value is kept.
+
+##### Input Gate | Write Gate
+
+***Controls how much of the `input` gets into the
+`cell-state`***:
+
+$$
+c_{t} = \left(
+    \sigma \left(
+        U_i h_{t-1} + W_i x_t + b_i
+    \right) \odot \tanh \left(
+        U_c h_{t-1} + W_c x_t + b_c
+    \right)
+\right) + \hat{c}_{t}
+$$
+
+The results of $\tanh$ are the ***new pieces of
+`information`***. The higher the corresponding $\sigma$
+value, the higher the importance given to that info.
+
+> [!NOTE]
+>
+> The [`forget gate`](#forget-gate--keep-gate) and the
+> [`input-gate`](#input-gate--write-gate) are two phases
+> of the `update-phase`.
+
+##### Output Gate | Read Gate
+
+***Controls how much of the
+`hidden-state` is forwarded***:
+
+$$
+h_{t} = \tanh (c_{t}) \odot \sigma \left(
+    U_o h_{t-1} + W_o x_t + b_o
+\right)
+$$
+
+This produces the ***new `hidden-state`***.
+***Notice that
+the `info` comes from the `cell-state`,
+`gated` by the `input` and the `previous hidden-state`***.
+
+---
+
+Here the `backpropagation` of the ***gradient*** is much
+simpler for the `cell-states`, as they ***require only
+elementwise multiplications***.
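+
+To make the gates concrete, here is a minimal `NumPy`
+sketch of one [`LSTM`](#long-short-term-memory--lstm)
+step that follows the three gate equations above (the
+toy sizes, random weights, and zero biases are
+illustrative assumptions, not taken from the slides):
+
+```python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+rng = np.random.default_rng(0)
+n_in, n_hid = 4, 3  # toy sizes
+
+def make_gate():
+    # U multiplies h_{t-1}, W multiplies x_t, b is the bias
+    return (rng.normal(size=(n_hid, n_hid)),
+            rng.normal(size=(n_hid, n_in)),
+            np.zeros(n_hid))
+
+U_f, W_f, b_f = make_gate()   # forget / keep gate
+U_i, W_i, b_i = make_gate()   # input / write gate
+U_c, W_c, b_c = make_gate()   # candidate information
+U_o, W_o, b_o = make_gate()   # output / read gate
+
+def lstm_step(x_t, h_prev, c_prev):
+    """One LSTM time-step following the gate equations above."""
+    c_hat = sigmoid(U_f @ h_prev + W_f @ x_t + b_f) * c_prev      # forget/keep gate
+    c_t = sigmoid(U_i @ h_prev + W_i @ x_t + b_i) * np.tanh(
+        U_c @ h_prev + W_c @ x_t + b_c) + c_hat                   # input/write gate
+    h_t = np.tanh(c_t) * sigmoid(U_o @ h_prev + W_o @ x_t + b_o)  # output/read gate
+    return h_t, c_t
+
+h, c = np.zeros(n_hid), np.ones(n_hid)  # cell-state initialized to 1, as noted above
+for x_t in rng.normal(size=(5, n_in)):
+    h, c = lstm_step(x_t, h, c)
+```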
+
+#### GRU[^anelli-RNNs-10][^GRU-wikipedia]
+
+It is another type of [`gated-cell`](#gated-cells) but,
+unlike [`LSTM-cells`](#long-short-term-memory--lstm),
+***it doesn't have a separate `cell-state`, only
+the `hidden-state`***, while keeping
+***performances similar to [`LSTM`](#long-short-term-memory--lstm)***.
+
+> [!NOTE]
+> [`GRU`](#gru) doesn't have any `output-gate` and
+> $h_0 = 0$
+
+##### Update Gate
+
+This `gate` unifies the [`forget gate`](#forget-gate--keep-gate) and the [`input gate`](#input-gate--write-gate):
+
+$$
+\begin{aligned}
+    \hat{h}_t &= \left(
+        1 - \sigma \left(
+            U_z h_{t-1} + W_z x_{t} + b_z
+        \right)
+    \right) \odot h_{t-1}
+\end{aligned}
+$$
+
+##### Reset Gate
+
+This is what breaks the `information` flow from the
+previous `hidden-state`:
+
+$$
+\begin{aligned}
+    \bar{h}_t &= \sigma\left(
+        U_r h_{t-1} + W_r x_{t} + b_r
+    \right) \odot h_{t-1}
+\end{aligned}
+$$
+
+##### New `hidden-state`
+
+$$
+\begin{aligned}
+    h_t &= \hat{h}_t + \sigma \left(
+        U_z h_{t-1} + W_z x_{t} + b_z
+    \right) \odot \tanh \left(
+        U_h \bar{h}_t + W_h x_t + b_h
+    \right)
+\end{aligned}
+$$
+
+> [!TIP]
+>
+> There's no clear winner between [`GRU`](#gru) and
+> [`LSTM`](#long-short-term-memory--lstm), so
+> try them both; however, the
+> former is ***easier to compute***.
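+
+Putting the three equations above together, here is a
+minimal `NumPy` sketch of one [`GRU`](#gru) step (the
+toy sizes, random weights, and zero biases are
+illustrative assumptions, not taken from the slides).
+The `update gate` $\sigma(U_z h_{t-1} + W_z x_t + b_z)$
+is computed once and reused in the two places where it
+appears:
+
+```python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+rng = np.random.default_rng(0)
+n_in, n_hid = 4, 3  # toy sizes
+
+def make_gate():
+    # U multiplies the hidden-state, W multiplies the input, b is the bias
+    return (rng.normal(size=(n_hid, n_hid)),
+            rng.normal(size=(n_hid, n_in)),
+            np.zeros(n_hid))
+
+U_z, W_z, b_z = make_gate()   # update gate
+U_r, W_r, b_r = make_gate()   # reset gate
+U_h, W_h, b_h = make_gate()   # candidate hidden-state
+
+def gru_step(x_t, h_prev):
+    """One GRU time-step following the equations above."""
+    z = sigmoid(U_z @ h_prev + W_z @ x_t + b_z)                    # update gate
+    h_keep = (1.0 - z) * h_prev                                    # \hat{h}_t: kept from the past
+    h_reset = sigmoid(U_r @ h_prev + W_r @ x_t + b_r) * h_prev     # \bar{h}_t: reset gate
+    return h_keep + z * np.tanh(U_h @ h_reset + W_h @ x_t + b_h)   # new hidden-state
+
+h = np.zeros(n_hid)  # h_0 = 0, as noted above
+for x_t in rng.normal(size=(5, n_in)):
+    h = gru_step(x_t, h)
+```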
+
+### Bi-LSTM[^anelli-RNNs-12][^Bi-LSTM-stackoverflow]
+
+It is a technique in which we combine 2 `LSTM` `networks`,
+***one to remember the `past` and one to remember the
+`future`***.
+
+This type of `network` ***improves context
+understanding***.
+
+### Applications[^anelli-RNNs-11]
+
+- Music Generation
+- Sentiment Classification
+- Machine Translation
+- Attention Mechanisms
+
+### Pros, Cons and Quirks
+
+#### Pros
+
+#### Cons
+
+- ***hard to train***
+
+#### Quirks
+
+[^anelli-RNNs]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8
+
+[^anelli-RNNs-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 11 to 20
+
+[^anelli-RNNs-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 21 to 22
+
+[^anelli-RNNs-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 23
+
+[^anelli-RNNs-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 25
+
+[^anelli-RNNs-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 43 to 47
+
+[^anelli-RNNs-6]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 50
+
+[^anelli-RNNs-7]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 51
+
+[^anelli-RNNs-8]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 69 to 87
+
+[^anelli-RNNs-9]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 91 to 112
+
+[^LSTM-wikipedia]: [LSTM | Wikipedia | 27th April 2025](https://en.wikipedia.org/wiki/Long_short-term_memory)
+
+[^anelli-RNNs-10]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 113 to 118
+
+[^GRU-wikipedia]: [GRU | Wikipedia | 27th April 2025](https://en.wikipedia.org/wiki/Gated_recurrent_unit)
+
+[^anelli-RNNs-11]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 119 to 126
+
+[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136
+
+[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)