Added Transformers

chris-admin 2025-09-08 17:54:01 +02:00
parent f6960c109b
commit 191dc0ff12
5 changed files with 3534 additions and 0 deletions

# Transformers
## Block Components
The idea is that each Transformer model is built from the **same number** of stacked [`Encoders`](#encoder) and
[`Decoders`](#decoder).
![image](./Images/PNGs/transformer-high-level.png)
> [!NOTE]
> Input and output are vectors of **fixed size** with padding
Before feeding our input, we split it into words and embed each one into a vector of fixed size. The padded sequence
length is also fixed and depends on the length of the longest sentence in our training set.
### Embedder
While this is not a real component per se, this is the first phase before even reaching
the first `encoder` and `decoder`.
Here we transform each word of the input into an ***embedding*** and add a vector to account for its
position. This positional encoding can either be learnt or can follow these formulas (a code sketch follows them):
- Even dimension index ($2i$):
$$
\text{positional\_encoding}_{(pos,\ 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{\text{model\_depth}}}}\right)
$$
- Odd dimension index ($2i + 1$):
$$
\text{positional\_encoding}_{(pos,\ 2i + 1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{\text{model\_depth}}}}\right)
$$
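
As a concrete illustration, here is a minimal NumPy sketch of these two formulas; `max_len` and `model_depth` are illustrative parameters, not values prescribed by the notes.

```python
import numpy as np

def positional_encoding(max_len: int, model_depth: int) -> np.ndarray:
    """Return a (max_len, model_depth) matrix of sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]                # positions 0 .. max_len-1
    i = np.arange(model_depth // 2)[None, :]         # index of each (sin, cos) pair
    angle = pos / 10000 ** (2 * i / model_depth)     # pos / 10000^(2i / model_depth)
    encoding = np.zeros((max_len, model_depth))
    encoding[:, 0::2] = np.sin(angle)                # even dimensions use sin
    encoding[:, 1::2] = np.cos(angle)                # odd dimensions use cos
    return encoding

# Example: encodings for a 10-token (padded) sentence with model_depth 512
pe = positional_encoding(max_len=10, model_depth=512)
```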
### Encoder
> [!CAUTION]
> Weights are not shared between `encoders` or `decoders`
Each phase happens for each token. In other words, if our padded input is 512 tokens long, 512 vectors flow in
parallel through the `Self Attention` and the `Feed Forward NN` **per `encoder`**
![Image](./Images/PNGs/encoder.png)
#### Encoder Self Attention
> [!WARNING]
> This step is the most expensive one as it involves many computations
Self Attention is a step in which each ***token*** gets knowledge of the other ***tokens*** in the sentence.
During this step, we produce 3 vectors per token that are **usually smaller**, for example of size 64 instead of 512:
- **Queries** $\rightarrow q_{i}$
- **Keys** $\rightarrow k_{i}$
- **Values** $\rightarrow v_{i}$
We use these vectors to compute a **score** that tells us **how much to focus on other parts of the sentence
while encoding a token**.
In order to compute the final encoding, we do the following for each word $i$ being encoded:
- Compute score for each word $j$ : $\text{score}_{j} = q_{i} \cdot k_{j}$
- Divide each score by the square root of the size of these *helping vectors*:
$\text{score}_{j} = \frac{\text{score}_{j}}{\sqrt{\text{size}}}$
- Compute softmax of all scores
- Multiply each softmaxed score by its value vector: $\text{score}_{j} = \text{score}_{j} \cdot v_{j}$
- Sum them all: $\text{encoding}_{i} = \sum_{j}^{N} \text{score}_{j}$
> [!NOTE]
> These steps are actually done with matrices, not in this sequential way, as in the sketch below
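
A minimal NumPy sketch of those steps in matrix form for a single head; the weight matrices here are random stand-ins for learnt parameters, and the sizes are the example ones from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (tokens, embed_size); w_q, w_k, w_v: (embed_size, helper_size)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # q_i . k_j / sqrt(helper size)
    weights = softmax(scores, axis=-1)               # softmax over each row of scores
    return weights @ v                               # weighted sum of the value vectors

# Toy example: 4 tokens, embed size 512, helper size 64, random weights
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))
w_q, w_k, w_v = (rng.normal(size=(512, 64)) for _ in range(3))
encodings = self_attention(x, w_q, w_k, w_v)         # shape (4, 64)
```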
##### Multi-Headed Attention
Instead of doing the Attention operation once, we do it several times, each time with different matrices to produce
our *helping vectors*.
This produces N encodings for each ***token***, or N matrices of encodings.
The trick here is to **concatenate all the encoding matrices** and **learn an additional weight matrix** that will
**combine them**
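
Continuing the previous sketch (it reuses `self_attention`, `x` and `rng` from the block above), this shows the concatenate-and-combine trick; the number of heads and the sizes are arbitrary choices for illustration.

```python
def multi_head_attention(x, heads, w_combine):
    """heads: list of (w_q, w_k, w_v) tuples; w_combine: the learnt combining matrix."""
    per_head = [self_attention(x, w_q, w_k, w_v) for (w_q, w_k, w_v) in heads]
    concatenated = np.concatenate(per_head, axis=-1)   # (tokens, n_heads * helper_size)
    return concatenated @ w_combine                    # combined back to embed size

# 8 heads of size 64 combined back into a 512-wide encoding per token
heads = [tuple(rng.normal(size=(512, 64)) for _ in range(3)) for _ in range(8)]
w_combine = rng.normal(size=(8 * 64, 512))
combined = multi_head_attention(x, heads, w_combine)   # shape (4, 512)
```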
#### Residuals
In order not to lose information along the path, after each `Feed Forward` and `Self-Attention`
we add each ***sublayer***'s `inputs` to its `outputs` and then apply a `Layer Normalization`
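
A small sketch of this add-and-normalize pattern around a generic sublayer; `layer_norm` here is a bare-bones version without the usual learnable gain and bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection: add the sublayer's input to its output, then normalize."""
    return layer_norm(x + sublayer(x))
```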
#### Encoder Feed Forward NN
> [!TIP]
> This step is mostly **parallel** as there is no dependency between *neighbouring vectors*
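
A sketch of that per-position independence, assuming the common two-linear-layers-with-a-ReLU form of the feed forward network; the shapes are illustrative.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """x: (tokens, embed_size). Each row is transformed independently of the others."""
    hidden = np.maximum(0, x @ w1 + b1)   # first linear layer followed by a ReLU
    return hidden @ w2 + b2               # project back to embed_size

# Illustrative shapes: 512 -> 2048 -> 512
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(512, 2048)), np.zeros(2048)
w2, b2 = rng.normal(size=(2048, 512)), np.zeros(512)
```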
### Decoder
> [!NOTE]
> The decoding phase is slower than the encoding one, as it is sequential, producing one token per iteration.
> However, it can be sped up by producing several tokens at once
After the **last `Encoder`** has produced its `output`, the $K$ and $V$ vectors derived from it are then used by
**all `Decoders`** during their encoder-decoder attention step, meaning they are **shared** among all `Decoders`.
The `Decoder` steps are then **repeated until we get an `<eos>` token**, which tells the decoder to stop.
#### Decoder Self Attention
It's almost the same as in the `encoding` phase, though here, since we have no future `outputs`, we can only take
previous ***tokens*** into account, by setting the scores of future ones to `-inf` before the softmax.
Moreover, in the `encoder-decoder attention` sublayer, the `Keys` and `Values` Mappings come from the `encoder`
phase, while the `Queries` Mapping is learnt here.
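
A sketch of the masking trick, reusing the `softmax` helper from the encoder self-attention sketch: scores of future positions are set to `-inf` before the softmax, so they receive zero weight.

```python
def masked_self_attention(x, w_q, w_k, w_v):
    """Like self_attention above, but each token only attends to itself and earlier ones."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)               # hide future tokens
    return softmax(scores, axis=-1) @ v
```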
#### Final Steps
##### Linear Layer
Produces a vector of ***logits***, one for each ***known word*** in the vocabulary.
##### Softmax Layer
We then pass these ***logits*** through a `SoftMax` to get probabilities and, usually, take the highest one.
If we implement ***Temperature***, though, we can sample `tokens` that are less probable, trading some predictability
for results that feel more natural.
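
A self-contained sketch of temperature applied to the logits before the softmax; the `temperature` values in the comments are illustrative.

```python
import numpy as np

def sample_token(logits, temperature=1.0, seed=None):
    """temperature < 1 sharpens the distribution (safer picks); > 1 flattens it (more variety)."""
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())       # softmax, shifted for numerical stability
    probs /= probs.sum()
    return np.random.default_rng(seed).choice(len(probs), p=probs)

# A temperature close to 0 approaches greedy decoding (always the highest logit)
next_token = sample_token([2.0, 1.0, 0.1], temperature=0.8, seed=0)
```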
## Training a Transformer
<!-- TODO: See PDF 12 pg. 58 to 65 -->
## Known Transformers
### BERT (Bidirectional Encoder Representations from Transformers)
Unlike other `Transformers`, it uses only `Encoder` blocks.
It can be used as a classifier and can be fine-tuned.
The pre-training happens by **masking** part of the input and **predicting** the **masked words** (a sketch follows
the list):
- 15% of the total words in the input are masked
  - 80% of them will become a `[MASK]` token
  - 10% will become random words
  - 10% will remain unchanged
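
A sketch of that masking recipe on a plain token list; the `[MASK]` string and the toy `vocab` are stand-ins for the real tokenizer vocabulary.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original word at selected positions."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:            # ~15% of positions are selected
            labels[i] = token                   # the model must predict this word
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"            # 80%: replace with the [MASK] token
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)   # 10%: replace with a random word
            # remaining 10%: leave the original word in place
    return masked, labels

masked, labels = mask_tokens("the cat sat on the mat".split(), vocab=["dog", "ran", "hat"])
```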
#### BERT tasks
- **Classification**
- **Fine Tuning**
- **2-sentence tasks**
  - **Are they paraphrases?**
  - **Does one sentence follow from the other one?**
- **Feature Extraction**: allows us to extract features (contextual embeddings) to use in our own model
### GPT-2
Unlike other `Transformers`, it uses only `Decoder` blocks.
Since it has no `encoders`, `GPT-2` takes its `outputs` and appends them to the original `input`. This is called
**autoregression**.
This, however, limits how `GPT-2` can learn context from the `input`, because of `masking`: each token can only
attend to previous ones.
During `evaluation`, `GPT-2` does not recompute `V`, `K` and `Q` for previous tokens, but keeps their previously
computed values.
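
A minimal sketch of that caching idea (often called a `KV cache`); the class, shapes and weights are toy stand-ins, not GPT-2's actual implementation.

```python
import numpy as np

class CachedAttention:
    """Keep K and V from earlier steps so only the newest token is projected each iteration."""
    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys, self.values = [], []

    def step(self, x_new):
        """x_new: (1, embed_size) embedding of the newest token only."""
        q = x_new @ self.w_q
        self.keys.append(x_new @ self.w_k)        # previous K and V are reused as-is
        self.values.append(x_new @ self.w_v)
        k, v = np.vstack(self.keys), np.vstack(self.values)
        scores = q @ k.T / np.sqrt(k.shape[-1])   # the new token attends to all cached ones
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ v                        # shape (1, helper_size)
```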
