Recurrent Networks | RNNs1
A bit of History2
In order to predict the future, we need
information about the past. This is the idea behind
RNNs for predicting the next item in a sequence.
While memoryless models have been used to attempt
this prediction, they didn't live up to expectations
and had several limitations, such as the fixed
dimension of the "past" window.
Shortcomings of previous attempts2
- The context window was small, thus the model couldn't use distant-past dependencies
- Some tried to count words, but counting doesn't preserve meaning
- Some tried to make the context window bigger, but this caused words to be treated differently based on their position, making it impossible to reuse weights for the same words
RNNs3
The idea behind RNNs is to add memory
as a hidden-state. This helps the model to
"remember" things for a long time, but the state
is noisy, and as such, the best we can do is
to infer its probability distribution, tractable only
for two model classes: linear dynamical systems and hidden Markov models.
While these models are stochastic,
RNNs are deterministic; moreover, they are non-linear and their
hidden-state is distributed4
Neurons with Memory5
While in normal NNs we have no memory, these
neurons have a hidden-state, \vec{h} ,
which is fed back to the neuron itself.
The formula of this hidden-state is:
\vec{h}_t = f_{W}(\vec{x}_t, \vec{h}_{t-1})
In other words, the hidden-state is produced by
a weight-parameterized function of the
current input and the previous step's
hidden-state.
For example, let's say we use a \tanh
activation-function:
\vec{h}_t = \tanh(
W_{h, h}^T \vec{h}_{t-1} + W_{x, h}^T \vec{x}_{t}
)
And the output becomes:
\vec{\bar{y}}_t = W_{h, y}^T \vec{h}_{t}
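As a concrete sketch of how these two formulas unroll over a sequence (all names and sizes below are illustrative, not from the source), one step could look like:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One RNN step: h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t), y_t = W_hy^T h_t."""
    h_t = np.tanh(W_hh.T @ h_prev + W_xh.T @ x_t)
    y_t = W_hy.T @ h_t
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
W_xh = rng.normal(size=(n_in, n_hid))   # input-to-hidden weights
W_hh = rng.normal(size=(n_hid, n_hid))  # hidden-to-hidden weights
W_hy = rng.normal(size=(n_hid, n_out))  # hidden-to-output weights

h = np.zeros(n_hid)                     # initial hidden-state
for x in rng.normal(size=(5, n_in)):    # a toy sequence of 5 inputs
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy)
```

Note how the same weights are reused at every time-step; only the hidden-state carries information forward.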
Providing initial-states for the hidden-states7
- Specify initial-states of all units
- Specify initial-states for a subset of units
- Specify initial-states for the same subset of units at each timestep (which is the most natural way to model sequential data)
Teaching signals for RNNs8
- Specify desired final activity for all units
- Specify desired final activity for all units for the last few steps
    - This is good for learning attractors
    - It makes it easy to add extra error derivatives
- Specify the desired activity of a subset of units
    - The other units will be either inputs or hidden-states, as we fixed these
Transforming Data to be used in RNNs
- One-hot encoding: each token is a 1 at its index over the input array
- Learned embeddings: each token is a point of a learned hyperspace
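A minimal illustration of the two encodings (toy vocabulary; the embedding matrix is random here, whereas in practice it is trained):

```python
import numpy as np

vocab = ["the", "cat", "sat"]           # toy vocabulary for illustration
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot: a 1 at the token's index over an array of vocabulary size."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Learned embedding: each token is a point of a hyperspace; the matrix E
# would normally be learned, here it is random for the sake of the sketch.
emb_dim = 5
E = np.random.default_rng(0).normal(size=(len(vocab), emb_dim))

def embed(word):
    return E[index[word]]
```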
Backpropagation
Since an unrolled RNN can be considered a deep-layered
NN, we first run the model forward
over the sequence and then backpropagate through time,
keeping the stack of activations and accumulating
derivatives across time-steps
Caution
If you have big gradients, remember to clip them
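A common way to do this is clipping by global norm; a minimal sketch (the threshold 5.0 is an arbitrary illustrative choice):

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their global L2 norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```

Small gradients pass through untouched; oversized ones keep their direction but get their norm capped at the threshold.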
The problem is that it is difficult to train
RNNs on
long-range dependencies, because the
gradient will either vanish or explode9
Warning
Long-range dependencies tend to have a smaller impact on the system than short-range ones
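The effect can be seen numerically: pushing a gradient back through many steps multiplies it repeatedly by the recurrent Jacobian, so its norm shrinks or blows up depending on whether the recurrent weights' spectral radius is below or above 1. A toy linear demo (ignoring the tanh derivative factor, which is at most 1; all numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_norms(scale, steps=50, n=32):
    """Norms of a gradient pushed back through `steps` linear recurrent steps."""
    W = scale * rng.normal(size=(n, n)) / np.sqrt(n)  # spectral radius roughly `scale`
    g = np.ones(n)                                    # gradient arriving at the last step
    norms = []
    for _ in range(steps):
        g = W.T @ g                                   # one step back through time
        norms.append(np.linalg.norm(g))
    return norms

small = backprop_norms(0.5)   # radius < 1: the gradient vanishes
large = backprop_norms(2.0)   # radius > 1: the gradient explodes
```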
Gated Cells
These are neurons that can be controlled to make
them learn or forget chosen pieces of information
Caution
By "chosen" we mean chosen from the
hyperspace, so the selection is not really precise.
Long Short Term Memory | LSTM1011
This cell has a separate signal, namely the
cell-state,
which is controlled by the gates of this cell and is always
initialized to 1.
Note
W will be the weights associated with \vec{x} and U those associated with \vec{h}. The cell-state has the same dimension as the hidden-state.
\odot is the Hadamard product, also called the pointwise product
Forget Gate | Keep Gate
This gate controls the cell-state:
\hat{c}_{t} = \sigma \left(
U_fh_{t-1} + W_fx_t + b_f
\right) \odot c_{t-1}
The closer the result of \sigma is to 0, the more
the cell-state forgets that value; the closer to 1,
the more it is kept.
Input Gate | Write Gate
This gate controls how much of the input gets into the
cell-state
c_{t} = \left(
\sigma \left(
U_ih_{t-1} + W_ix_t + b_i
\right) \odot \tanh \left(
U_ch_{t-1} + W_cx_t + b_c
\right)
\right) + \hat{c}_{t}
The results of \tanh are new pieces of
information. The higher the \sigma_i, the higher
the importance given to that info.
Note
The forget gate and the input gate are the two phases of the update phase.
Output Gate | Read Gate
Controls how much of the
hidden-state is forwarded
h_{t} = \tanh (c_{t}) \odot \sigma \left(
U_oh_{t-1} + W_ox_t + b_o
\right)
This produces the new hidden-state.
Notice that
the information comes from the cell-state,
gated by the current input and the previous hidden-state
Here, backpropagation of the gradient through
the cell-states is much simpler, as it requires
only elementwise multiplications
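Putting the three gates together, one LSTM step following the equations above could be sketched as follows (parameter names mirror the U, W, b of the formulas; dimensions and random initial values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, P):
    """One LSTM step following the gate equations above (P holds U_*, W_*, b_*)."""
    f = sigmoid(P["U_f"] @ h + P["W_f"] @ x + P["b_f"])   # forget / keep gate
    i = sigmoid(P["U_i"] @ h + P["W_i"] @ x + P["b_i"])   # input / write gate
    g = np.tanh(P["U_c"] @ h + P["W_c"] @ x + P["b_c"])   # candidate information
    o = sigmoid(P["U_o"] @ h + P["W_o"] @ x + P["b_o"])   # output / read gate
    c_new = f * c + i * g          # forget phase + write phase (c_t = i⊙g + ĉ_t)
    h_new = np.tanh(c_new) * o     # read phase
    return h_new, c_new

rng = np.random.default_rng(0)
n_x, n_h = 3, 4
P = {f"U_{k}": rng.normal(size=(n_h, n_h)) for k in "fico"}
P.update({f"W_{k}": rng.normal(size=(n_h, n_x)) for k in "fico"})
P.update({f"b_{k}": np.zeros(n_h) for k in "fico"})

h, c = np.zeros(n_h), np.ones(n_h)   # cell-state initialized to 1, as above
h, c = lstm_step(rng.normal(size=n_x), h, c, P)
```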
GRU1213
It is another type of gated cell, but,
unlike LSTM cells,
it doesn't have a separate cell-state, only
the hidden-state, while keeping
performance similar to LSTM.
Note
A GRU doesn't have any output-gate, and h_0 = 0
Update Gate
This gate unifies the forget gate and the input gate
\begin{aligned}
\hat{h}_t &= \left(
1 - \sigma \left(
U_z h_{t-1} + W_z x_{t} + b_z
\right)
\, \right) \odot h_{t-1}
\end{aligned}
Reset Gate
This is what breaks the information flow from the
previous hidden-state.
\begin{aligned}
\bar{h}_t &= \sigma\left(
U_r h_{t-1} + W_r x_{t} + b_r
\right) \odot h_{t-1}
\end{aligned}
New hidden-state
\begin{aligned}
h_t = \hat{h}_t + (\sigma \left(
U_z h_{t-1} + W_z x_{t} + b_z
\right) \odot \tanh \left(
U_h \bar{h}_t + W_h x_t + b_h
\right))
\end{aligned}
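One GRU step following the three equations above could be sketched as follows (parameter names mirror the U, W, b of the formulas; dimensions are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, P):
    """One GRU step following the equations above (P holds U_*, W_*, b_*)."""
    z = sigmoid(P["U_z"] @ h + P["W_z"] @ x + P["b_z"])   # update gate
    r = sigmoid(P["U_r"] @ h + P["W_r"] @ x + P["b_r"])   # reset gate
    h_bar = r * h                                         # reset-gated previous state
    h_tilde = np.tanh(P["U_h"] @ h_bar + P["W_h"] @ x + P["b_h"])
    return (1.0 - z) * h + z * h_tilde                    # new hidden-state

rng = np.random.default_rng(0)
n_x, n_h = 3, 4
P = {f"U_{k}": rng.normal(size=(n_h, n_h)) for k in "zrh"}
P.update({f"W_{k}": rng.normal(size=(n_h, n_x)) for k in "zrh"})
P.update({f"b_{k}": np.zeros(n_h) for k in "zrh"})

h = np.zeros(n_h)                    # h_0 = 0, as noted above
h = gru_step(rng.normal(size=n_x), h, P)
```

Compared to the LSTM equations, there is one signal and three weight triples instead of two signals and four, which is why it is cheaper to compute.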
Tip
There's no clear winner between GRU and LSTM, so try them both; however, the former is easier to compute
Bi-LSTM1415
It is a technique in which we combine 2 LSTM networks:
one processes the sequence forward to remember the past,
and one processes it backward to remember the future.
This type of network improves context
understanding
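A minimal sketch of the bidirectional idea, using plain tanh cells instead of LSTM cells to keep it short (all names and sizes are illustrative): run one network forward, one on the reversed sequence, and concatenate the aligned hidden states.

```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Run a simple tanh RNN over a sequence and collect its hidden states."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x)
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
n_x, n_h, T = 3, 4, 6
xs = rng.normal(size=(T, n_x))

# Two independent recurrent networks: one reads the sequence forward
# (the past), the other reads it reversed (the future).
fwd = run_rnn(xs, rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h)))
bwd = run_rnn(xs[::-1], rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h)))

# Each position sees both directions: concatenate the aligned states.
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd[::-1])]
```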
Applications16
- Music Generation
- Sentiment Classification
- Machine Translation
- Attention Mechanisms
Pros, Cons and Quirks
Pros
Cons
- Hard to train
Quirks
1. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8
2. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 11 to 20
3. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 21 to 22
4. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 23
5. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 25
6. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 43 to 47
7. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 50
8. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 51
9. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 69 to 87
10. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 91 to 112
11. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 113 to 118
12. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136
13. Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 119 to 126