# Recurrent Networks | RNNs[^anelli-RNNs]

## Why would we want Recurrent Networks?

To deal with **sequence**-related jobs of **arbitrary length**.

In fact, they can deal with inputs of varying length while being fast and memory efficient.

While **autoregressive models** always need to analyse all past inputs (or a window of the most recent ones) at each computation, `RNNs` don't, making them a great tool when the situation permits it.

<!-- TODO: add images -->

## A bit of History[^anelli-RNNs-1]

In order to ***predict the future***, we need ***information about the past***. This is the idea behind `RNNs` for ***predicting the next item in a `sequence`***.

While this kind of prediction has been attempted with `memoryless models`, they didn't live up to expectations and ***had several limitations***, such as the ***size of the "past" window***:

- [`Autoregressive Models`](https://en.wikipedia.org/wiki/Autoregressive_model)
- [`Feed-Forward Neural Networks`](https://en.wikipedia.org/wiki/Feedforward_neural_network)

### Shortcomings of previous attempts[^anelli-RNNs-1]

- The `context window` was ***small***, so the `model` couldn't use ***distant past dependencies***
- Some tried to ***count words***, but that ***doesn't preserve meaning***
- Some tried to ***make the `context window` bigger***, but this ***caused words to be treated differently based on their position***, making it ***impossible to reuse `weights` for the same words***.

## RNNs[^anelli-RNNs-2]

The idea behind [`RNNs`](#rnns) is to add ***memory*** in the form of a `hidden-state`. This helps the `model` ***"remember"*** things for a "long time", but since this state is ***noisy***, the best we can do is to ***infer its probability distribution***, which is tractable only for:

- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Model`](https://en.wikipedia.org/wiki/Hidden_Markov_model)

While these models are `stochastic`, the **a posteriori probability distribution** of their hidden state is technically **deterministic**.

Since we can think of the [`RNNs`](#rnns) `hidden state` as the equivalent of that **a posteriori probability distribution**, `RNNs` are **`deterministic`**.

<!-- TODO: check this:
, plus they are ***`non-linear`*** and their
***`hidden-state` is `distributed`***[^anelli-RNNs-3]
-->

### Neurons with Memory[^anelli-RNNs-4]

While in normal `NNs` we have no ***memory***, ***these `neurons` have a `hidden-state`,*** $\vec{h}$ ***, which is <u>fed back</u> to the `neuron` itself***.

<!-- TODO: Add image -->

The formula of this `hidden-state` is:

$$
\vec{h}_t = f_{W}(\vec{x}_t, \vec{h}_{t-1})
$$

In other words, ***the `hidden-state` is produced by a function parameterised by `weights`*** and ***dependent on the current `input` and the previous step's `hidden-state`***.

For example, let's say we use a $\tanh$ `activation-function`:

$$
\vec{h}_t = \tanh(
    W_{h, h}^T \vec{h}_{t-1} + W_{x, h}^T \vec{x}_{t}
)
$$

And the `output` becomes:

$$
\vec{\bar{y}}_t = W_{h, y}^T \vec{h}_{t}
$$

> [!NOTE]
> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]

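To make the recurrence concrete, here is a minimal NumPy sketch of the two formulas above, assuming a plain $\tanh$ RNN; the dimensions and the weight names (`W_hh`, `W_xh`, `W_hy`) are illustrative, not taken from the course material.

```python
import numpy as np

# A minimal sketch of the vanilla RNN recurrence with a tanh activation.
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 8, 16, 4

W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))   # input  -> hidden
W_hy = rng.normal(scale=0.1, size=(hidden_dim, output_dim))  # hidden -> output

def rnn_step(x_t, h_prev):
    """One time-step: h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t), y_t = W_hy^T h_t."""
    h_t = np.tanh(W_hh.T @ h_prev + W_xh.T @ x_t)
    y_t = W_hy.T @ h_t
    return h_t, y_t

# Unroll over a toy sequence of 5 time-steps, reusing the same weights at every step.
h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h, y = rnn_step(x, h)
```

The fact that the same weights are reused at every time-step is exactly what makes the unrolled network look like a deep `NN`.
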
## Different RNNs configurations



### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]

- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
- Specify `initial-states` for the same ***subset*** of `units` at ***each `timestep`*** (which is the most natural way to model sequential data)

In other words, it depends on how you need to model the data according to your sequence.

### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]

- Specify the ***desired final activity*** for ***all*** `units`
- Specify the ***desired final activity*** for ***all*** `units` for the ***last few `steps`***
  - This is good for learning `attractors`
  - Makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of `units`***
  - The other `units` will be either `inputs` or `hidden-states`, as ***we fixed these***

In other words, it depends on which kind of output you need to produce.

For example, sentiment analysis needs just one output, while a `seq2seq` job requires a full sequence (see the sketch below).

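Here is a hedged sketch of those two supervision patterns, using a tiny stand-in recurrence; all dimensions and weights are illustrative assumptions, not the course's code.

```python
import numpy as np

# Toy recurrence: 8 input features, 16 hidden units, 4 output units.
rng = np.random.default_rng(1)
W_hh = rng.normal(scale=0.1, size=(16, 16))
W_xh = rng.normal(scale=0.1, size=(8, 16))
W_hy = rng.normal(scale=0.1, size=(16, 4))

h = np.zeros(16)
outputs = []
for x in rng.normal(size=(5, 8)):                # 5 time-steps of input
    h = np.tanh(W_hh.T @ h + W_xh.T @ x)
    outputs.append(W_hy.T @ h)

# Many-to-one (e.g. sentiment analysis): supervise only the last output.
sentiment_logits = outputs[-1]

# Many-to-many (e.g. a seq2seq-style job): supervise every time-step.
per_step_logits = np.stack(outputs)              # shape (5, 4), one output per step
```
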
## Transforming `Data` for [`RNNs`](#rnns)

Since `RNNs` need vectors, the usual way to transform inputs into vectors is either to **`1-hot`** encode **whole words**, or to split them into **tokens** and then **`1-hot`** encode those.

While this may be enough, there's a better way: during the embedding phase we transform each `1-hot` encoded vector into a learned vector of fixed size (usually of smaller dimension).

To better understand this, imagine a vocabulary of 16K words or tokens: we would have 16K dimensions for each vector, which is massive. Instead, by embedding it into 256 dimensions we save both time and space.

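A minimal sketch of that embedding step, assuming the hypothetical 16K-entry vocabulary and 256-dimensional embedding mentioned above; the matrix here is random only for illustration, in practice it is learned.

```python
import numpy as np

# Hypothetical sizes from the example above.
vocab_size, embed_dim = 16_000, 256
rng = np.random.default_rng(2)
E = rng.normal(scale=0.01, size=(vocab_size, embed_dim))  # (learned) embedding matrix

token_id = 42
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Multiplying a 1-hot vector by E is just a row lookup, so in practice
# we index the matrix directly instead of materialising the 1-hot vector.
dense_a = one_hot @ E
dense_b = E[token_id]
assert np.allclose(dense_a, dense_b)   # same 256-dim vector, far cheaper to compute
```
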
## RNNs Training

Since [`RNNs`](#rnns) can be considered a `deep-layered` `NN`, we first ***run the model forward over the sequence*** and ***then `backpropagate`***, keeping track of the ***training stack*** and adding up derivatives along the `time-steps`.

> [!CAUTION]
>
> If you have ***big gradients***, remember to `clip`
> them

The problem is that it is ***difficult to `train` [`RNNs`](#rnns)*** on ***long-range dependencies*** because the ***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]

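A minimal sketch of the `clip` step mentioned in the caution above, done by global norm; the threshold is an arbitrary example value.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads

# An artificially "exploded" gradient gets rescaled; a small one passes through untouched.
clipped = clip_gradients([np.full((3, 3), 100.0), np.ones(3)])
```

Clipping only tames ***exploding*** gradients; it does nothing for ***vanishing*** ones, which is what the solutions in the next section address.
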
### Mitigating training problems in RNNs

In order to mitigate these gradient problems, which impair our network's ability to gather valuable information over long-term dependencies, we have these solutions:

- **LSTM**:\
  Build the model out of little modules crafted to keep values for a long time
- **Hessian Free Optimizers**:\
  Use optimizers that can follow the gradient direction even over small curvatures
- **Echo State Networks**[^unipi-esn]:\
  The idea is to use a **`sparsely connected large untrained network`** to keep track of inputs for a long time (while eventually forgetting them), and to have a **`trained readout network`** that converts the **`echo`** output into something usable (see the sketch below)
- **Good Initialization and Momentum**:\
  Same thing as before, but we learn all the connections, using momentum

> [!WARNING]
>
> `long-range dependencies` are more difficult to learn than `short-range` ones
> because of the gradient problem.

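To make the `Echo State Network` idea more concrete, here is a hedged sketch: the reservoir size, sparsity, spectral radius and the sine-wave task are illustrative assumptions; only the linear readout is trained, via ridge regression.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_res = 1, 200

# Large, sparse, *untrained* random reservoir with spectral radius < 1 (the "echo" property).
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W_res = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
W_res *= rng.random((n_res, n_res)) < 0.1              # keep only ~10% of the connections
W_res *= 0.9 / max(abs(np.linalg.eigvals(W_res)))      # rescale so old inputs slowly fade

def run_reservoir(inputs):
    """Collect reservoir states; the reservoir itself is never trained."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ np.atleast_1d(u) + W_res @ x)
        states.append(x)
    return np.array(states)

# Toy task: predict the next sample of a sine wave from the current one.
signal = np.sin(np.linspace(0, 20 * np.pi, 1000))
X, y = run_reservoir(signal[:-1]), signal[1:]

# Train only the readout, in closed form (ridge regression).
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
prediction = X @ W_out
```
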
## Gated Cells

These are `neurons` that can be controlled to make them `learn` or `forget` chosen pieces of information.

> [!CAUTION]
>
> With ***chosen*** we mean chosen from the
> `hyperspace`, so it's not really precise.

### High Level Scheme

The point of a gated `RNN` cell is to be able to modify its internal state.

As shown in the image, this can be implemented by having gates (AND operations) to read, write and keep (remember) pieces of information in the memory.



Even though this is a *high level and simplistic architecture*, it gives a rough idea of how to implement it:

- First of all, instead of using `AND` gates, we can substitute them with an elementwise multiplication.
- Secondly, we can implement an elementwise addition to combine a newly written element with a past one (see the sketch below).

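A minimal sketch of that scheme: the logical keep / write / read gates become elementwise multiplications, and writing becomes an elementwise addition into the memory. The gate values here are hand-picked purely for illustration.

```python
import numpy as np

memory = np.array([0.5, -1.0, 2.0])
candidate = np.array([1.0, 1.0, 1.0])     # new information proposed for writing

keep_gate = np.array([1.0, 0.0, 1.0])     # 1 = remember the old value, 0 = forget it
write_gate = np.array([0.0, 1.0, 0.5])    # how much of the candidate gets written
read_gate = np.array([1.0, 1.0, 0.0])     # how much of the memory is exposed

memory = keep_gate * memory + write_gate * candidate   # keep ⊙ old + write ⊙ new
output = read_gate * memory                            # read ⊙ memory
```
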
### Standard RNN Cell

This is the simplest implementation: all the gate signals are fixed to 1.



### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]

This `cell` has a ***separate signal***, namely the `cell-state`, ***which controls the `gates` of this `cell` and is always initialized to `1`***.



As shown in the image, we can identify a `keep (or forget) gate`, a `write gate` and a `read gate`.

- **Forget Gate**:\
  The previous `read state` ($h_{t-1}$), concatenated with the `input` ($x_{t}$), controls how much of the `previous cell state` keeps being remembered. Since it takes values $\in [0, 1]$, it has been called the **forget gate**
- **Input Gate**:\
  It is controlled by a `sigmoid` with the same inputs as the forget gate, but with different weights. It regulates how much of the `tanh` of those same inputs goes into the cell state.

  `tanh` has an advantage over the `sigmoid` for the candidate value, as it admits values $\in [-1, 1]$

- **Output Gate**:\
  It is controlled by a `sigmoid` with the same inputs as the previous gates, but different weights. It regulates how much of the `tanh` of the `current cell state` goes to the `output`



<!-- TODO: revise formulas -->

> [!NOTE]
>
> $W$ will be the weights associated with $\vec{x}$ and
> $U$ with $\vec{h}$.
>
> The `cell-state` has the same dimension as the
> `hidden-state`
>
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***

#### Forget Gate | Keep Gate

This `gate` ***controls the `cell-state`***:

$$
\hat{c}_{t} = \sigma \left(
    U_f h_{t-1} + W_f x_t + b_f
\right) \odot c_{t-1}
$$

The closer the result of $\sigma$ is to $0$, the more the `cell-state` will forget that value, and the opposite for values closer to $1$.

#### Input Gate | Write Gate

***Controls how much of the `input` gets into the `cell-state`***:

$$
c_{t} = \left(
    \sigma \left(
        U_i h_{t-1} + W_i x_t + b_i
    \right) \odot \tanh \left(
        U_c h_{t-1} + W_c x_t + b_c
    \right)
\right) + \hat{c}_{t}
$$

The results of the $\tanh$ are ***new pieces of `information`***. The higher the $\sigma_i$, the higher the importance given to that info.

> [!NOTE]
>
> The [`forget gate`](#forget-gate--keep-gate) and the
> [`input-gate`](#input-gate--write-gate) are the 2 phases
> of the `update-phase`.

#### Output Gate | Read Gate

***Controls how much of the `hidden-state` is forwarded***:

$$
h_{t} = \tanh (c_{t}) \odot \sigma \left(
    U_o h_{t-1} + W_o x_t + b_o
\right)
$$

This produces the ***new `hidden-state`***. ***Notice that the `info` comes from the `cell-state`, `gated` by the `input` and the `previous-hidden-state`***.

---

Here the `backpropagation` of the ***gradient*** is much simpler for the `cell-states`, as they ***require only elementwise multiplications***.

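Putting the three gates together, here is a hedged NumPy sketch of one LSTM step following the formulas above ($W$ acts on $\vec{x}$, $U$ on $\vec{h}$, $\odot$ is elementwise); the sizes and random weights are placeholders, not the course's parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_hid = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (U, W, b) triple per gate, plus one for the candidate values.
U_f, U_i, U_c, U_o = (rng.normal(scale=0.1, size=(n_hid, n_hid)) for _ in range(4))
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(n_hid, n_in)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(n_hid) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(U_f @ h_prev + W_f @ x_t + b_f)                 # forget / keep gate
    c_hat = f * c_prev                                          # ĉ_t: what survives of c_{t-1}
    i = sigmoid(U_i @ h_prev + W_i @ x_t + b_i)                 # input / write gate
    c_t = c_hat + i * np.tanh(U_c @ h_prev + W_c @ x_t + b_c)   # new cell-state
    o = sigmoid(U_o @ h_prev + W_o @ x_t + b_o)                 # output / read gate
    h_t = o * np.tanh(c_t)                                      # new hidden-state
    return h_t, c_t

h, c = np.zeros(n_hid), np.ones(n_hid)   # cell-state initialized to 1, as stated above
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c)
```
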
### GRU[^anelli-RNNs-10][^GRU-wikipedia]

It is another type of [`gated-cell`](#gated-cells) but, contrary to [`LSTM-cells`](#long-short-term-memory--lstm), ***it doesn't have a separate `cell-state`, only the `hidden-state`***, while keeping ***performance similar to [`LSTM`](#long-short-term-memory--lstm)***.



As shown in the image, we have only 2 gates:

- **Reset Gate**:\
  Tells us **how much of the old information should pass along with the input**. It is controlled by the `old state` and the `input`.
- **Update Gate**:\
  Tells us **how much of the old info will be kept and how much of the new info will be learnt**. It is controlled by a concatenation of the `output of the reset gate` and the `input`, passed through a `tanh`.



> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$

#### Update Gate

This `gate` unifies the [`forget gate`](#forget-gate--keep-gate) and the [`input gate`](#input-gate--write-gate):

$$
\begin{aligned}
\hat{h}_t &= \left(
    1 - \sigma \left(
        U_z h_{t-1} + W_z x_{t} + b_z
    \right)
\right) \odot h_{t-1}
\end{aligned}
$$

#### Reset Gate

This is what breaks the `information` flow from the previous `hidden-state`.

$$
\begin{aligned}
\bar{h}_t &= \sigma\left(
    U_r h_{t-1} + W_r x_{t} + b_r
\right) \odot h_{t-1}
\end{aligned}
$$

#### New `hidden-state`

$$
\begin{aligned}
h_t = \hat{h}_t + \left( \sigma \left(
    U_z h_{t-1} + W_z x_{t} + b_z
\right) \odot \tanh \left(
    U_h \bar{h}_t + W_h x_t + b_h
\right) \right)
\end{aligned}
$$

> [!TIP]
>
> There's no clear winner between [`GRU`](#gru) and
> [`LSTM`](#long-short-term-memory--lstm), so
> try them both; however, the
> former is ***easier to compute***

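Here is a hedged NumPy sketch of one GRU step following the three formulas above ($z$ = update gate, $r$ = reset gate); the sizes and random weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
n_in, n_hid = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

U_z, U_r, U_h = (rng.normal(scale=0.1, size=(n_hid, n_hid)) for _ in range(3))
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(n_hid, n_in)) for _ in range(3))
b_z, b_r, b_h = (np.zeros(n_hid) for _ in range(3))

def gru_step(x_t, h_prev):
    z = sigmoid(U_z @ h_prev + W_z @ x_t + b_z)       # update gate
    r = sigmoid(U_r @ h_prev + W_r @ x_t + b_r)       # reset gate
    h_hat = (1.0 - z) * h_prev                        # ĥ_t: old info that is kept
    h_bar = r * h_prev                                # h̄_t: reset-gated old state
    return h_hat + z * np.tanh(U_h @ h_bar + W_h @ x_t + b_h)   # new hidden-state

h = np.zeros(n_hid)                                   # h_0 = 0, as in the note above
for x in rng.normal(size=(5, n_in)):
    h = gru_step(x, h)
```

Note that the update gate value is computed once and used both to keep old info ($\hat{h}_t$) and to weight the new info, which is part of why the GRU is cheaper to compute than the LSTM.
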
### Bi-LSTM[^anelli-RNNs-12][^Bi-LSTM-stackoverflow]

We implement 2 networks: one computes hidden states over the sequence in order, while the other computes hidden states over the sequence in reverse order.

Then we take the outputs from both networks and compute attention and other operations, such as `softmax` and `linear` ones, to get the output.

In this way we gain info coming from both directions of the sequence (see the sketch below).

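A hedged sketch of the bidirectional idea: run one recurrent pass forward and one backward over the same sequence, then concatenate the per-step hidden states. A plain `tanh` recurrence stands in for the LSTM cell here, and all sizes and weights are illustrative; a real Bi-LSTM uses two independently parameterised LSTM cells.

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_hid = 8, 16

# Separate (illustrative) weights for the forward and backward passes.
fwd = {"U": rng.normal(scale=0.1, size=(n_hid, n_hid)), "W": rng.normal(scale=0.1, size=(n_hid, n_in))}
bwd = {"U": rng.normal(scale=0.1, size=(n_hid, n_hid)), "W": rng.normal(scale=0.1, size=(n_hid, n_in))}

def run_direction(seq, p):
    h, states = np.zeros(n_hid), []
    for x in seq:
        h = np.tanh(p["U"] @ h + p["W"] @ x)
        states.append(h)
    return np.array(states)

sequence = rng.normal(size=(5, n_in))
forward_states = run_direction(sequence, fwd)
backward_states = run_direction(sequence[::-1], bwd)[::-1]   # realign to original time order

# Each time-step now carries context from both directions (shape: 5 x 32),
# ready for the later linear / softmax / attention operations.
bi_states = np.concatenate([forward_states, backward_states], axis=1)
```
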
### Applications[^anelli-RNNs-11]

- Music Generation
- Sentiment Classification
- Machine Translation
- Attention Mechanisms

<!-- TODO: research about Attention for RNNs -->

### Pros, Cons and Quirks

<!-- TODO: Finish this part -->

### References

- [ai-master.gitbooks.io](https://ai-master.gitbooks.io/recurrent-neural-network/content/reference.html)
- [stanford.edu - CS224d-Lecture8](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf)
- [deepimagesent](http://cs.stanford.edu/people/karpathy/deepimagesent/)
- [introduction-to-rnns](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
- [implementing-a-language-model-rnn-with-python-numpy-and-theano](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/)
- [rnn-effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Understanding-LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

<!-- TODO: PDF 8 pg. 24 -->

<!-- Footnotes -->

[^anelli-RNNs]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8

[^anelli-RNNs-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 11 to 20

[^anelli-RNNs-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 21 to 22

[^anelli-RNNs-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 23

<!-- TODO: find bounds of topic -->

[^anelli-RNNs-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 25

[^anelli-RNNs-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 43 to 47

[^anelli-RNNs-6]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 50

[^anelli-RNNs-7]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 51

[^anelli-RNNs-8]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 69 to 87

[^anelli-RNNs-9]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 91 to 112

[^LSTM-wikipedia]: [LSTM | Wikipedia | 27th April 2025](https://en.wikipedia.org/wiki/Long_short-term_memory)

[^anelli-RNNs-10]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 113 to 118

[^GRU-wikipedia]: [GRU | Wikipedia | 27th April 2025](https://en.wikipedia.org/wiki/Gated_recurrent_unit)

[^anelli-RNNs-11]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 119 to 126

[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136

[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)

[^unipi-esn]: [UniPI | ESN | 25th October 2025](https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/aa2/rnn4-esn.pdf)