# Recurrent Networks | RNNs[^anelli-RNNs]

## A bit of History[^anelli-RNNs-1]

In order to ***predict the future***, we need ***information about the past***. This is the idea behind `RNNs` for ***predicting the next item in a `sequence`***.

This prediction has also been attempted with `memoryless models`, but they didn't hold up to expectations and ***had several limitations***, such as the ***dimension of the "past" window***:

- [`Autoregressive Models`](https://en.wikipedia.org/wiki/Autoregressive_model)
- [`Feed-Forward Neural Networks`](https://en.wikipedia.org/wiki/Feedforward_neural_network)

### Shortcomings of previous attempts[^anelli-RNNs-1]

- The `context window` was ***small***, thus the `model` couldn't use ***distant past dependencies***
- Some tried to ***count words***, but counts ***don't preserve meaning***
- Some tried to ***make the `context window` bigger***, but this ***caused words to be treated differently based on their position***, making it ***impossible to reuse `weights` for the same words***

## RNNs[^anelli-RNNs-2]

The idea behind [`RNNs`](#rnns) is to add ***memory*** in the form of a `hidden-state`. This helps the `model` ***"remember"*** things for a "long time", but the `hidden-state` is ***noisy***, so the best we can do is ***infer its probability distribution***, which is tractable only for:

- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Models`](https://en.wikipedia.org/wiki/Hidden_Markov_model)

While these models are `stochastic`, ***[`RNNs`](#rnns) are `deterministic`***; moreover, they are ***`non-linear`*** and their ***`hidden-state` is `distributed`***[^anelli-RNNs-3]

### Neurons with Memory[^anelli-RNNs-4]

While in normal `NNs` we have no ***memory***, ***these `neurons` have a `hidden-state`,*** $\vec{h}$***, which is fed back to the `neuron` itself***. The formula of this `hidden-state` is:

$$
\vec{h}_t = f_{W}(\vec{x}_t, \vec{h}_{t-1})
$$

In other words, ***the `hidden-state` is computed by a function parameterized by `weights`*** and ***dependent on the current `inputs` and the previous step's `hidden-state`***.
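A minimal `NumPy` sketch of such a `neuron` with memory (here $f_W$ is assumed to be a $\tanh$ of a weighted sum, matching the example that follows; the weight names and sizes are only illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh):
    """One recurrence step: h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t)."""
    return np.tanh(W_hh.T @ h_prev + W_xh.T @ x_t)

def rnn_output(h_t, W_hy):
    """Linear read-out of the hidden-state: y_t = W_hy^T h_t."""
    return W_hy.T @ h_t

# Toy usage: 4-dimensional inputs, 3-dimensional hidden-state, 2-dimensional outputs
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(4, 3)), rng.normal(size=(3, 3)), rng.normal(size=(3, 2))

h = np.zeros(3)                    # initial hidden-state
for x in rng.normal(size=(5, 4)):  # a sequence of 5 inputs; the same weights are reused at every step
    h = rnn_step(x, h, W_xh, W_hh)
    y = rnn_output(h, W_hy)
```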
For example, let's say we use a $\tanh$ `activation-function`:

$$
\vec{h}_t = \tanh( W_{h, h}^T \vec{h}_{t-1} + W_{x, h}^T \vec{x}_{t} )
$$

And the `output` becomes:

$$
\vec{\bar{y}}_t = W_{h, y}^T \vec{h}_{t}
$$

> [!NOTE]
> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]

#### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]

- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
- Specify `initial-states` for the same ***subset*** of `units` at ***each `timestep`*** (which is the most natural way to model sequential data)

#### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]

- Specify the ***desired final activity*** of ***all*** `units`
- Specify the ***desired final activity*** of ***all*** `units` for the ***last few `steps`***
  - This is good for learning `attractors`
  - It makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of `units`***
  - The other `units` are either `inputs` or `hidden-states`, since ***these are the ones we fixed***

#### Transforming `Data` to be used in [`RNNs`](#rnns)

- One-hot encoding: each `token` is a single $1$ in the `input` array
- Learned embeddings: each `token` is a `point` in a ***learned hyperspace***

### Backpropagation

Since [`RNNs`](#rnns) can be considered a `deep-layered` `NN`, we first ***run the model forward over the sequence*** and ***then `backpropagate`***, keeping track of the ***stack of activations*** and adding the derivatives along the `time-steps`

> [!CAUTION]
>
> If you have ***big gradients***, remember to `clip`
> them

The problem is that it is ***difficult to `train` [`RNNs`](#rnns)*** on ***long-range dependencies***, because the ***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]

> [!WARNING]
>
> `long-range dependencies` tend to have a smaller
> impact on the system than `short-range` ones

### Gated Cells

These are `neurons` that can be controlled to make them `learn` or `forget` chosen pieces of information

> [!CAUTION]
>
> By ***chosen*** we mean chosen from the
> `hyperspace`, so it's not really precise.

#### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]

This `cell` has a ***separate signal***, namely the `cell-state`, ***which is controlled by the `gates` of this `cell` and is initialized to `1`***.

> [!NOTE]
>
> $W$ will be the weights associated with $\vec{x}$ and
> $U$ those associated with $\vec{h}$.
>
> The `cell-state` has the same dimension as the
> `hidden-state`
>
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***

##### Forget Gate | Keep Gate

This `gate` ***controls the `cell-state`***:

$$
\hat{c}_{t} = \sigma \left( U_f h_{t-1} + W_f x_t + b_f \right) \odot c_{t-1}
$$

The closer the result of $\sigma$ is to $0$, the more the `cell-state` will forget that value, and the opposite for values closer to $1$.

##### Input Gate | Write Gate

***Controls how much of the `input` gets into the `cell-state`***

$$
c_{t} = \left( \sigma \left( U_i h_{t-1} + W_i x_t + b_i \right) \odot \tanh \left( U_c h_{t-1} + W_c x_t + b_c \right) \right) + \hat{c}_{t}
$$

The results of $\tanh$ are ***new pieces of `information`***. The higher the result of the `input-gate`'s $\sigma$, the higher the importance given to that info.

> [!NOTE]
>
> The [`forget gate`](#forget-gate--keep-gate) and the
> [`input-gate`](#input-gate--write-gate) are the 2 phases
> of the `update-phase`.
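A minimal `NumPy` sketch of this `update-phase` (the two `gates` above combined; the parameter packing and the sizes in the toy usage are only illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_update_phase(x_t, h_prev, c_prev, params):
    """Forget/keep gate + input/write gate: produces the new cell-state c_t."""
    U_f, W_f, b_f, U_i, W_i, b_i, U_c, W_c, b_c = params

    # Forget | Keep gate: how much of the previous cell-state survives
    c_hat = sigmoid(U_f @ h_prev + W_f @ x_t + b_f) * c_prev

    # Input | Write gate: how much of the new candidate information gets written
    candidate = np.tanh(U_c @ h_prev + W_c @ x_t + b_c)
    return sigmoid(U_i @ h_prev + W_i @ x_t + b_i) * candidate + c_hat

# Toy usage: 4-dimensional inputs, 3-dimensional hidden- and cell-states
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(3, 3), (3, 4), (3,)] * 3]
c = lstm_update_phase(rng.normal(size=4),
                      np.zeros(3),   # previous hidden-state
                      np.ones(3),    # cell-state initialized to 1, as noted above
                      params)
```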
##### Output Gate | Read Gate

***Controls how much of the `cell-state` is forwarded as the new `hidden-state`***

$$
h_{t} = \tanh (c_{t}) \odot \sigma \left( U_o h_{t-1} + W_o x_t + b_o \right)
$$

This produces the ***new `hidden-state`***. ***Notice that the `info` comes from the `cell-state`, `gated` by the `input` and the `previous hidden-state`***

---

Here the `backpropagation` of the ***gradient*** through the `cell-states` is much simpler, as it ***requires only elementwise multiplications***

#### GRU[^anelli-RNNs-10][^GRU-wikipedia]

It is another type of [`gated-cell`](#gated-cells) but, contrary to [`LSTM-cells`](#long-short-term-memory--lstm), ***it doesn't have a separate `cell-state`, only the `hidden-state`***, while keeping ***performances similar to [`LSTM`](#long-short-term-memory--lstm)*** (a code sketch of the full `GRU` step follows the [Pros, Cons and Quirks](#pros-cons-and-quirks) section below).

> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$

##### Update Gate

This `gate` unifies the [`forget gate`](#forget-gate--keep-gate) and the [`input gate`](#input-gate--write-gate)

$$
\hat{h}_t = \left( 1 - \sigma \left( U_z h_{t-1} + W_z x_{t} + b_z \right) \right) \odot h_{t-1}
$$

##### Reset Gate

This is what breaks the `information` flow from the previous `hidden-state`.

$$
\bar{h}_t = \sigma\left( U_r h_{t-1} + W_r x_{t} + b_r \right) \odot h_{t-1}
$$

##### New `hidden-state`

$$
h_t = \hat{h}_t + \sigma \left( U_z h_{t-1} + W_z x_{t} + b_z \right) \odot \tanh \left( U_h \bar{h}_t + W_h x_t + b_h \right)
$$

> [!TIP]
>
> There's no clear winner between [`GRU`](#gru) and
> [`LSTM`](#long-short-term-memory--lstm), so
> try them both; however, the
> former is ***easier to compute***

### Bi-LSTM[^anelli-RNNs-12][^Bi-LSTM-stackoverflow]

It is a technique in which we use 2 `LSTM` `networks`, ***one to remember the `past` and one to remember the `future`***. This type of `network` ***improves context understanding***

### Applications[^anelli-RNNs-11]

- Music Generation
- Sentiment Classification
- Machine Translation
- Attention Mechanisms

### Pros, Cons and Quirks

#### Pros

#### Cons

- ***hard to train***

#### Quirks
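As referenced in the [`GRU`](#gru) section, here is a minimal `NumPy` sketch of one full `GRU` step, combining the three equations above (the parameter packing and the sizes in the toy usage are only illustrative, following the same $U$, $W$, $b$ convention used for the `LSTM`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step: update gate, reset gate, then the new hidden-state."""
    U_z, W_z, b_z, U_r, W_r, b_r, U_h, W_h, b_h = params

    z = sigmoid(U_z @ h_prev + W_z @ x_t + b_z)   # update gate
    r = sigmoid(U_r @ h_prev + W_r @ x_t + b_r)   # reset gate

    h_hat = (1.0 - z) * h_prev                    # part of the old hidden-state that is kept
    h_bar = r * h_prev                            # information flow "broken" by the reset gate
    candidate = np.tanh(U_h @ h_bar + W_h @ x_t + b_h)

    return h_hat + z * candidate                  # new hidden-state h_t

# Toy usage: 4-dimensional inputs, 3-dimensional hidden-state, h_0 = 0
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(3, 3), (3, 4), (3,)] * 3]
h = np.zeros(3)
for x in rng.normal(size=(5, 4)):
    h = gru_step(x, h, params)
```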
[^anelli-RNNs]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8
[^anelli-RNNs-1]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 11 to 20
[^anelli-RNNs-2]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 21 to 22
[^anelli-RNNs-3]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 23
[^anelli-RNNs-4]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 25
[^anelli-RNNs-5]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 43 to 47
[^anelli-RNNs-6]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 50
[^anelli-RNNs-7]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 51
[^anelli-RNNs-8]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 69 to 87
[^anelli-RNNs-9]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 91 to 112
[^LSTM-wikipedia]: [LSTM | Wikipedia | 27th April 2025](https://en.wikipedia.org/wiki/Long_short-term_memory)
[^anelli-RNNs-10]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 113 to 118
[^GRU-wikipedia]: [GRU | Wikipedia | 27th April 2025](https://en.wikipedia.org/wiki/Gated_recurrent_unit)
[^anelli-RNNs-11]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 119 to 126
[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136
[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)