
Recurrent Networks | RNNs [1]

A bit of History [2]

To predict the future, we need information about the past. This is the idea behind RNNs: use what has been seen so far to predict the next item in a sequence.

There have been attempts to accomplish this prediction with memoryless models, but they fell short of expectations and had several limitations, such as the fixed size of the "past" window.

Shortcomings of previous attempts [2]

  • The context window was small, so the model couldn't exploit distant-past dependencies
  • Some approaches counted words, but counting doesn't preserve meaning
  • Others enlarged the context window, but then the same word was treated differently depending on its position, making it impossible to reuse weights for the same word

RNNs [3]

The idea behind RNNs is to add memory as a hidden-state, which helps the model "remember" things for a "long time". When the hidden-state is noisy, the best we can do is infer its probability distribution, which is tractable only for linear dynamical systems and hidden Markov models.

While those models are stochastic, RNNs are deterministic; moreover, they are non-linear and their hidden-state is distributed [4]

Neurons with Memory [5]

While in normal NNs we have no memory, these neurons have a hidden-state \vec{h} that is fed back into the neuron itself.

The formula of this hidden-state is:


\vec{h}_t = f_{W}(\vec{x}_t, \vec{h}_{t-1})

In other words, the hidden-state is produced by a weight-parameterized function of the current input and the previous step's hidden-state.

For example, let's say we use a \tanh activation-function:


\vec{h}_t = \tanh(
    W_{h, h}^T \vec{h}_{t-1} + W_{x, h}^T \vec{x}_{t}
)

And the output becomes:


\vec{\bar{y}}_t = W_{h, y}^T \vec{h}_{t}
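
A minimal numpy sketch of these two equations (the sizes and initialization are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out = 3, 5, 2                      # hypothetical sizes
W_xh = rng.normal(size=(n_in, n_h)) * 0.1       # W_{x,h}
W_hh = rng.normal(size=(n_h, n_h)) * 0.1        # W_{h,h}
W_hy = rng.normal(size=(n_h, n_out)) * 0.1      # W_{h,y}

def rnn_step(x_t, h_prev):
    """h_t = tanh(W_{h,h}^T h_{t-1} + W_{x,h}^T x_t);  y_t = W_{h,y}^T h_t."""
    h_t = np.tanh(W_hh.T @ h_prev + W_xh.T @ x_t)
    return h_t, W_hy.T @ h_t

h = np.zeros(n_h)                               # initial hidden-state
for x_t in rng.normal(size=(4, n_in)):          # a toy 4-step sequence
    h, y = rnn_step(x_t, h)                     # same weights reused every step
```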

Note

Technically speaking, an RNN unrolled over time can be seen as a very deep NN whose layers share the same weights [6]

Providing initial-states for the hidden-states [7]

  • Specify initial-states of all units
  • Specify initial-states for a subset of units
  • Specify initial-states for the same subset of units for each timestep (which is the most natural way to model sequential data)

Teaching signals for RNNs [8]

  • Specify desired final activity for all units
  • Specify desired activity for all units for the last few steps
    • This is good for learning attractors
    • It makes it easy to add extra error derivatives
  • Specify the desired activity of a subset of units
    • The other units then act as inputs or hidden-states

Transforming Data to be used in RNNs

  • One-hot encoding: each token is a single 1 over the input array
  • Learned embeddings: each token is a point of a learned hyperspace (see the sketch below)
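
A small numpy sketch of the difference (the vocabulary size and dimensions are made up):

```python
import numpy as np

vocab_size, emb_dim = 10, 4
token_id = 3                                   # index of some token

# One-hot encoding: the token is a single 1 over the input array.
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Learned embedding: a trainable lookup table where each row is a
# point of the learned hyperspace (here just randomly initialized).
E = np.random.default_rng(0).normal(size=(vocab_size, emb_dim))
embedding = E[token_id]                        # equivalent to one_hot @ E
```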

Backpropagation Through Time

Since RNNs can be considered deep-layered NNs, we first run the model forward over the whole sequence and then backpropagate, keeping the intermediate activations on a stack and accumulating derivatives across timesteps

Caution

If you have big gradients, remember to clip them
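
A minimal sketch of clipping by global norm (the max_norm value here is an arbitrary choice; tune it per task):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so that their joint L2 norm
    does not exceed max_norm; the direction is preserved, only the
    scale shrinks."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```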

The catch is that it is difficult to train RNNs on long-range dependencies, because the gradient will either vanish or explode [9]

Warning

Long-range dependencies tend to have a smaller impact on the system than short-range ones
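
A tiny numerical illustration, assuming for simplicity a linear RNN h_t = W h_{t-1} (my simplification, not from the source): the backpropagated factor \partial h_T / \partial h_0 is a product of T copies of W, so its norm scales like \rho(W)^T, where \rho(W) is the spectral radius.

```python
import numpy as np

T = 50
for rho in (0.9, 1.1):
    W = rho * np.eye(4)                    # spectral radius exactly rho
    J = np.linalg.matrix_power(W, T)       # dh_T / dh_0 over T steps
    print(rho, np.linalg.norm(J))          # 0.9 -> ~0.01 (vanishes)
                                           # 1.1 -> ~235  (explodes)
```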

Gated Cells

These are neurons whose gates can be controlled so that they learn or forget chosen pieces of information

Caution

By "chosen" we mean selected from the hyperspace, so the selection is not really precise.

Long Short-Term Memory | LSTM [10][11]

This cell has a separate signal, namely the cell-state, which is controlled by the cell's gates and is always initialized to 1.

Note

W denotes the weights associated with \vec{x}, and U those associated with \vec{h}.

The cell-state has the same dimension as the hidden-state.

\odot is the Hadamard product, also called the pointwise (elementwise) product
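
For example, with two 3-dimensional vectors:

\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \odot \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix} = \begin{pmatrix} 1 \cdot 4 \\ 2 \cdot 5 \\ 3 \cdot 6 \end{pmatrix} = \begin{pmatrix} 4 \\ 10 \\ 18 \end{pmatrix}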

Forget Gate | Keep Gate

This gate controls the cell-state:


\hat{c}_{t} = \sigma \left(
    U_f h_{t-1} + W_f x_t + b_f
\right) \odot c_{t-1}

The closer the result of \sigma is to 0, the more the cell-state forgets that value, and the opposite holds for values closer to 1.

Input Gate | Write Gate

Controls how much of the input gets into the cell-state


c_{t} = \left(
    \sigma \left(
        U_i h_{t-1} + W_i x_t + b_i
    \right) \odot \tanh \left(
        U_c h_{t-1} + W_c x_t + b_c
    \right)
\right) + \hat{c}_{t}

The results of \tanh are the new pieces of information; the higher the corresponding \sigma output, the more importance is given to that information.

Note

The forget gate and the input gate are the two phases of the cell-state update.

Output Gate | Read Gate

Controls how much of the cell-state is read out into the new hidden-state


h_{t} = \tanh (c_{t}) \odot \sigma \left(
    U_o h_{t-1} + W_o x_t + b_o
\right)

This produces the new hidden-state. Notice that the information comes from the cell-state, gated by the current input and the previous hidden-state


Here, backpropagating the gradient along the cell-state is much simpler, as it involves only elementwise multiplications
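
Putting the three gates together, a minimal numpy sketch of one LSTM step (the weight names follow the note above; sizes and initialization are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step; P maps names to weights (U_* act on h, W_* on x)."""
    f = sigmoid(P["U_f"] @ h_prev + P["W_f"] @ x_t + P["b_f"])    # forget/keep gate
    i = sigmoid(P["U_i"] @ h_prev + P["W_i"] @ x_t + P["b_i"])    # input/write gate
    g = np.tanh(P["U_c"] @ h_prev + P["W_c"] @ x_t + P["b_c"])    # new information
    o = sigmoid(P["U_o"] @ h_prev + P["W_o"] @ x_t + P["b_o"])    # output/read gate
    c_t = f * c_prev + i * g            # update phase: forget, then write
    h_t = o * np.tanh(c_t)              # read phase: new hidden-state
    return h_t, c_t

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                                  # hypothetical sizes
P = {f"U_{k}": rng.normal(size=(n_h, n_h)) * 0.1 for k in "fico"}
P |= {f"W_{k}": rng.normal(size=(n_h, n_x)) * 0.1 for k in "fico"}
P |= {f"b_{k}": np.zeros(n_h) for k in "fico"}
h, c = np.zeros(n_h), np.ones(n_h)               # cell-state initialized to 1
h, c = lstm_step(rng.normal(size=n_x), h, c, P)
```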

Gated Recurrent Unit | GRU [12][13]

It is another type of gated cell but, unlike LSTM cells, it has no separate cell-state, only the hidden-state, while achieving performance similar to LSTM.

Note

A GRU doesn't have any output gate, and h_0 = 0

Update Gate

This gate unifies the forget gate and the input gate


\begin{aligned}
    \hat{h}_t &= \left(
        1 - \sigma \left(
            U_z h_{t-1} + W_z x_{t} + b_z
        \right)
    \right) \odot h_{t-1}
\end{aligned}

Reset Gate

This is what breaks the information flow from the previous hidden-state.


\begin{aligned}
    \bar{h}_t &= \sigma\left(
            U_r h_{t-1} + W_r x_{t} + b_r
        \right) \odot h_{t-1}
\end{aligned}

New hidden-state

\begin{aligned}
    h_t &= \hat{h}_t + \sigma \left(
        U_z h_{t-1} + W_z x_{t} + b_z
    \right) \odot \tanh \left(
        U_h \bar{h}_t + W_h x_t + b_h
    \right)
\end{aligned}
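
A minimal numpy sketch of one GRU step following these equations (sizes and initialization are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, P):
    """One GRU step; P maps names to weights (U_* act on h, W_* on x)."""
    z = sigmoid(P["U_z"] @ h_prev + P["W_z"] @ x_t + P["b_z"])    # update gate
    r = sigmoid(P["U_r"] @ h_prev + P["W_r"] @ x_t + P["b_r"])    # reset gate
    h_bar = r * h_prev                            # break flow from the old state
    cand = np.tanh(P["U_h"] @ h_bar + P["W_h"] @ x_t + P["b_h"])  # candidate info
    return (1 - z) * h_prev + z * cand            # new hidden-state

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                                  # hypothetical sizes
P = {f"U_{k}": rng.normal(size=(n_h, n_h)) * 0.1 for k in "zrh"}
P |= {f"W_{k}": rng.normal(size=(n_h, n_x)) * 0.1 for k in "zrh"}
P |= {f"b_{k}": np.zeros(n_h) for k in "zrh"}
h = np.zeros(n_h)                                # h_0 = 0, as noted above
h = gru_step(rng.normal(size=n_x), h, P)
```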

Tip

There's no clear winner between GRU and LSTM, so try them both; the GRU, however, is cheaper to compute

Bi-LSTM [14][15]

It is a technique in which we combine two LSTM networks: one processes the sequence forward, remembering the past, and one backward, remembering the future.

This type of network improves context understanding
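
A minimal sketch using PyTorch's bidirectional LSTM (PyTorch and all sizes are my assumption, not from the source):

```python
import torch
import torch.nn as nn

seq_len, batch, n_in, n_hidden = 12, 4, 8, 16    # hypothetical sizes

# Two LSTMs under the hood: one reads the sequence left-to-right
# (past context), the other right-to-left (future context).
bilstm = nn.LSTM(input_size=n_in, hidden_size=n_hidden, bidirectional=True)

x = torch.randn(seq_len, batch, n_in)
out, (h_n, c_n) = bilstm(x)

# The two directions' hidden-states are concatenated per timestep,
# so the feature dimension doubles.
print(out.shape)    # torch.Size([12, 4, 32]) == (seq_len, batch, 2 * n_hidden)
```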

Applications [16]

  • Music Generation
  • Sentiment Classification
  • Machine Translation
  • Attention Mechanisms

Pros, Cons and Quirks

Pros

  • The hidden-state gives the model a memory of past inputs
  • Weights are reused across timesteps, so sequences of arbitrary length can be processed

Cons

  • Hard to train: gradients vanish or explode over long-range dependencies

Quirks


[1] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8

[2] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 11 to 20

[3] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 21 to 22

[4] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 23

[5] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 25

[6] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 43 to 47

[7] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 50

[8] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 51

[9] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 69 to 87

[10] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 91 to 112

[11] LSTM | Wikipedia | 27th April 2025

[12] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 113 to 118

[13] GRU | Wikipedia | 27th April 2025

[14] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136

[15] Bi-LSTM | StackOverflow | 27th April 2025

[16] Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 119 to 126