Added images and revised notes
# Recurrent Networks | RNNs[^anelli-RNNs]

## Why would we want Recurrent Networks?

To deal with **sequence**-related jobs of **arbitrary length**.

In fact, they can deal with inputs of varying length while being fast and memory efficient.

While **autoregressive models** always need to analyse all past inputs
(or a window of the most recent ones) at each computation,
`RNNs` don't, which makes them a great tool when the situation permits it.

<!-- TODO: add images -->

## A bit of History[^anelli-RNNs-1]
The idea behind [`RNNs`](#rnns) is to add ***memory***
as a `hidden-state`. This helps the `model` to
***"remember"*** things for a "long time", but since it
is ***noisy***, the best we can do is
to ***infer its probability distribution***, doable only
for:

- [`Linear Dynamical Systems`](https://en.wikipedia.org/wiki/Linear_dynamical_system)
- [`Hidden Markov Model`](https://en.wikipedia.org/wiki/Hidden_Markov_model)

While these models are `stochastic`, technically the **a posteriori probability distribution** is
**deterministic**.

Since we can think of the [`RNNs`](#rnns) `hidden-state` as the equivalent of an **a posteriori probability distribution**, they are **`deterministic`**.

<!-- TODO: check this:
, plus they are ***`non-linear`*** and their
***`hidden-state` is `distributed`***[^anelli-RNNs-3]
-->

### Neurons with Memory[^anelli-RNNs-4]

> Technically speaking, we could consider
> [`RNNs`](#rnns) as deep `NNs`[^anelli-RNNs-5]
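
To make the "memory as a `hidden-state`" idea concrete, here is a minimal NumPy sketch of the standard recurrence $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b)$; the names (`rnn_step`, `W_xh`, `W_hh`) and the sizes are ours, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions chosen only for illustration
input_size, hidden_size = 4, 8

# Parameters of a single recurrent "neuron with memory" layer
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the "memory")
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time-step: the new hidden-state depends on the input AND the previous hidden-state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unrolling over a sequence reuses the same weights at every step,
# which is why an RNN can be seen as a (very) deep NN with shared weights.
sequence = rng.normal(size=(5, input_size))   # 5 time-steps
h = np.zeros(hidden_size)                     # initial hidden-state
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h.shape)  # (8,) -- a fixed-size summary of an arbitrary-length sequence
```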
## Different RNNs configurations

![](/assets/Pasted%20image%2020250427104837.png)

### Providing `initial-states` for the `hidden-states`[^anelli-RNNs-6]

- Specify `initial-states` of ***all*** `units`
- Specify `initial-states` for a ***subset*** of `units`
- Specify the `states` of the same subset of
  `units` for ***each `timestep`*** (which is the most
  natural way to model sequential data)

In other words, it depends on how you need to model data according to your sequence.

### Teaching signals for [`RNNs`](#rnns)[^anelli-RNNs-7]
- Specify ***desired final activity*** for ***all***
  `units`
- Specify ***desired final activity*** for ***all***
  `units` for the ***last few `steps`***
  - This is good to learn `attractors`
  - Makes it easy to add ***extra error derivatives***
- Specify the ***desired activity of a subset of
  `units`***
  - The other `units` will be either `inputs` or
    `hidden-states`, as ***we fixed these***

In other words, it depends on which kind of output needs to be produced.

For example, a sentiment analysis job would need just one output, while
a `seq2seq` job would require a full sequence.
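
A quick sketch of that difference, assuming a generic `rnn_step` like the one above (names and sizes are ours): a many-to-one job reads only the last `hidden-state`, while a seq2seq job keeps one output per time-step.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size, num_classes = 4, 8, 3

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(num_classes, hidden_size))  # readout

def rnn_step(x_t, h_prev):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

sequence = rng.normal(size=(6, input_size))
h = np.zeros(hidden_size)
hidden_states = []
for x_t in sequence:
    h = rnn_step(x_t, h)
    hidden_states.append(h)

# Many-to-one (e.g. sentiment analysis): only the last hidden-state is read out
y_single = W_hy @ hidden_states[-1]                           # shape (3,)

# Many-to-many (e.g. seq2seq tagging): one output per time-step
y_sequence = np.stack([W_hy @ h_t for h_t in hidden_states])  # shape (6, 3)

print(y_single.shape, y_sequence.shape)
```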
## Transforming `Data` for [`RNNs`](#rnns)

Since `RNNs` need vectors, the ideal way to transform inputs into vectors is either
having a **`1-hot`** encoding over **whole words**, or transforming them into **tokens** and then
**`1-hot`** encoding those.

While this may be enough, there is a better way, where we transform each
`1-hot` encoded vector into a learned vector of fixed size (usually of smaller dimension) during the embedding phase.

To better understand this, imagine a vocabulary of 16K words or tokens: we would have
16K dimensions for each vector, which is massive. Instead, by embedding it into 256 dimensions
we can save both time and space complexity.
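
A rough sketch of that saving, using the 16K-token vocabulary and 256-dimensional embedding from the example above (the variable names and the random table are ours, just for illustration):

```python
import numpy as np

vocab_size, embed_dim = 16_000, 256

# 1-hot encoding: each token is a 16,000-dimensional vector with a single 1
def one_hot(token_id: int) -> np.ndarray:
    v = np.zeros(vocab_size)
    v[token_id] = 1.0
    return v

# Learned embedding: a (vocab_size x embed_dim) table trained with the rest of the model;
# looking a token up just selects a row, which equals multiplying the 1-hot vector by the table.
embedding_table = np.random.default_rng(2).normal(scale=0.01, size=(vocab_size, embed_dim))

def embed(token_id: int) -> np.ndarray:
    return embedding_table[token_id]

token_id = 42
assert np.allclose(one_hot(token_id) @ embedding_table, embed(token_id))

print(one_hot(token_id).shape)  # (16000,) -- sparse and huge
print(embed(token_id).shape)    # (256,)   -- dense and compact
```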
## RNNs Training

Since [`RNNs`](#rnns) can be considered a `deep-layered`
`NN`, we firstly ***train the model*** on its unrolled version and then backpropagate the ***error
derivatives along `time-steps`***.

The thing is that it is ***difficult to `train`
[`RNNs`](#rnns)*** on
***long-range dependencies*** because the
***gradient will either `vanish` or `explode`***[^anelli-RNNs-8]
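
A quick way to see why, under the simplifying assumption of a *linear* recurrence: backpropagating through $T$ steps multiplies the gradient by the recurrent matrix $T$ times, so it shrinks or blows up depending on that matrix's largest singular value. The construction below is only illustrative (an orthogonal matrix rescaled by `scale`).

```python
import numpy as np

rng = np.random.default_rng(3)
hidden_size, T = 8, 50

grad = rng.normal(size=hidden_size)  # gradient arriving at the last time-step

for scale in (0.5, 1.5):
    # Recurrent matrix whose singular values are all equal to `scale`
    W_hh = scale * np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))[0]
    g = grad.copy()
    # Backpropagating through T steps of a linear recurrence multiplies by W_hh^T each step
    for _ in range(T):
        g = W_hh.T @ g
    print(f"singular values = {scale}: gradient norm after {T} steps = {np.linalg.norm(g):.3e}")

# Typical output: around 1e-15 for 0.5 (vanishing) and around 1e+9 for 1.5 (exploding)
```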
### Mitigating training problems in RNNs

In order to mitigate these gradient problems, which impair our network's ability
to gain valuable information over long-term dependencies, we have these
solutions:

- **LSTM**:\
  Make the model out of little modules crafted to keep values for a long time
- **Hessian Free Optimizers**:\
  Use optimizers that can see the gradient direction over smaller curvatures
- **Echo State Networks**[^unipi-esn] (see the sketch after this list):\
  The idea is to use a **`sparsely connected large untrained network`** to keep track of
  inputs for a long time (they are eventually forgotten), and have a
  **`trained readout network`** that converts the **`echo`** output into something usable
- **Good Initialization and Momentum**:\
  Same thing as before, but we learn all connections using momentum
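
As an example of the Echo State Network idea above, here is a compact sketch: the reservoir is random and stays untrained, and only the linear readout is fit. The sparsity level, spectral-radius rescaling and ridge-regression readout are standard choices we assume here, not prescriptions from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
in_size, reservoir_size, T = 1, 200, 500

# Large, sparse, *untrained* recurrent reservoir, rescaled so its spectral radius is < 1
# (this is what makes past inputs "echo" for a while and eventually fade away).
W_res = rng.normal(size=(reservoir_size, reservoir_size))
W_res *= rng.random(W_res.shape) < 0.05                   # keep ~5% of the connections
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # spectral radius ~ 0.9
W_in = rng.uniform(-0.5, 0.5, size=(reservoir_size, in_size))

# Toy task: predict the next value of a sine wave
u = np.sin(np.linspace(0, 20 * np.pi, T + 1))
inputs, targets = u[:-1], u[1:]

# Collect reservoir states (the "echo" of the input history)
states = np.zeros((T, reservoir_size))
x = np.zeros(reservoir_size)
for t in range(T):
    x = np.tanh(W_in @ inputs[t:t + 1] + W_res @ x)
    states[t] = x

# Only the readout is trained, here with ridge regression
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(reservoir_size), states.T @ targets)

predictions = states @ W_out
print("train MSE:", np.mean((predictions - targets) ** 2))
```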
> [!WARNING]
>
> `long-range dependencies` are more difficult to learn than `short-range` ones
> because of the gradient problem.

## Gated Cells

These are `neurons` that can be controlled to make
them `learn` or `forget` chosen pieces of information.

> With ***chosen*** we mean choosing from the
> `hyperspace`, so it's not really precise.
### Long Short Term Memory | LSTM[^anelli-RNNs-9][^LSTM-wikipedia]

This `cell` has a ***separate signal***, namely the
`cell-state`,
***which controls the `gates` of this `cell` and is always
initialized to `1`***.

![](/assets/Pasted%20image%2020250427105459.png)

> [!NOTE]
>
> $W$ will be the weights associated with $\vec{x}$ and
> $U$ those associated with $\vec{h}_{t-1}$;
> $\odot$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)), also called the
> ***pointwise product***

![](/assets/Pasted%20image%2020250504172600.png)
<!-- TODO: revise formulas -->

#### Forget Gate | Keep Gate

This `gate` ***controls the `cell-state`***:

The closer the result of $\sigma$ is to $0$, the more
the `cell-state` will forget that value, and the opposite
for values closer to $1$.

#### Input Gate | Write Gate

***Controls how much of the `input` gets into the
`cell-state`***

> [!NOTE]
>
> The [`forget-gate`](#forget-gate--keep-gate) and the
> [`input-gate`](#input-gate--write-gate) are 2 phases
> of the `update-phase`.

#### Output Gate | Read Gate

***Controls how much of the
`hidden-state` is forwarded***

Here the `backpropagation` of the ***gradient*** is way
simpler for the `cell-states`, as they ***require only
elementwise multiplications***
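
To tie the three gates together, here is a minimal NumPy sketch of one LSTM step using the standard formulation ($W$ acting on $\vec{x}_t$, $U$ on $\vec{h}_{t-1}$, elementwise products for the `cell-state`); the bias terms, parameter names and sizes are our assumptions and may differ from the slides' exact formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p is a dict of parameters W_* (input), U_* (recurrent), b_* (bias)."""
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])        # forget/keep gate
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])        # input/write gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])        # output/read gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate values
    c = f * c_prev + i * c_tilde        # cell-state update: only elementwise operations
    h = o * np.tanh(c)                  # how much of the (squashed) cell-state is forwarded
    return h, c

# Tiny usage example with random parameters
rng = np.random.default_rng(5)
input_size, hidden_size = 4, 8
p = {}
for g in ("f", "i", "o", "c"):
    p[f"W_{g}"] = rng.normal(scale=0.1, size=(hidden_size, input_size))
    p[f"U_{g}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    p[f"b_{g}"] = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.ones(hidden_size)   # cell-state initialized to 1, as noted above
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, p)
print(h.shape, c.shape)
```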
### GRU[^anelli-RNNs-10][^GRU-wikipedia]

It is another type of [`gated-cell`](#gated-cells) but,
contrary to [`LSTM-cells`](#long-short-term-memory--lstm),
***it works directly on
the `hidden-state`*** (there is no separate `cell-state`), while keeping
***similar performances to [`LSTM`](#long-short-term-memory--lstm)***.

![](/assets/Pasted%20image%2020250427105725.png)

> [!NOTE]
> [`GRU`](#gru) doesn't have any `output-gate` and
> $h_0 = 0$

![](/assets/Pasted%20image%2020250504183418.png)

#### Update Gate

This `gate` unifies the [`forget gate`](#forget-gate--keep-gate) and the [`input gate`](#input-gate--write-gate):

$$
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right)
\end{aligned}
$$
#### Reset Gate

This is what breaks the `information` flow from the
previous `hidden-state`.

$$
\begin{aligned}
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right)
\end{aligned}
$$
#### New `hidden-state`

$$
\begin{aligned}
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right)\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$
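
Putting the three formulas above together, a minimal NumPy sketch of one GRU step (the variable names and the omission of biases are our choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; there is no separate cell-state, only the hidden-state."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)              # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)              # reset gate
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev))  # candidate hidden-state
    return (1.0 - z) * h_prev + z * h_tilde                      # new hidden-state

rng = np.random.default_rng(6)
input_size, hidden_size = 4, 8
p = {f"{m}_{g}": rng.normal(scale=0.1, size=(hidden_size, input_size if m == "W" else hidden_size))
     for m in ("W", "U") for g in ("z", "r", "h")}

h = np.zeros(hidden_size)   # h_0 = 0, as noted above
for x_t in rng.normal(size=(5, input_size)):
    h = gru_step(x_t, h, p)
print(h.shape)
```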
<!-- TODO: Finish this part -->

### References

- [ai-master.gitbooks.io](https://ai-master.gitbooks.io/recurrent-neural-network/content/reference.html)
- [stanford.edu - CS224d-Lecture8](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf)
- [deepimagesent](http://cs.stanford.edu/people/karpathy/deepimagesent/)
- [introduction-to-rnns](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
- [implementing-a-language-model-rnn-with-python-numpy-and-theano](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/)
- [rnn-effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Understanding-LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

<!-- TODO: PDF 8 pg. 24 -->
[^anelli-RNNs-12]: Vito Walter Anelli | Deep Learning Material 2024/2025 | PDF 8 pg. 127 to 136

[^Bi-LSTM-stackoverflow]: [Bi-LSTM | StackOverflow | 27th April 2025](https://stackoverflow.com/questions/43035827/whats-the-difference-between-a-bidirectional-lstm-and-an-lstm)

[^unipi-esn]: [UniPI | ESN | 25th October 2025](https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/aa2/rnn4-esn.pdf)