From 65288793ce7ae27cc7e52d0b10df9857e2bac00b Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Sun, 2 Nov 2025 16:49:01 +0100
Subject: [PATCH] Revised notes

---
 Chapters/12-Transformers/INDEX.md | 431 +++++++++++++++++++-----------
 1 file changed, 272 insertions(+), 159 deletions(-)

diff --git a/Chapters/12-Transformers/INDEX.md b/Chapters/12-Transformers/INDEX.md
index e6d75af..75bd064 100644
--- a/Chapters/12-Transformers/INDEX.md
+++ b/Chapters/12-Transformers/INDEX.md
@@ -1,188 +1,301 @@
 # Transformers
 
-## Block Components
+Transformers are very similar to [`RNNs`](./../8-Recurrent-Networks/INDEX.md)
+in terms of usage (machine translation, text generation, sequence to
+sequence, sentiment analysis, word prediction, ...),
+but differ in how they process data.
 
-The idea is that each Transformer block is made of the **same number** of [`Encoders`](#encoder) and
-[`Decoders`](#decoder)
+While `RNNs` have a recurrent part that processes the input
+sequentially, `Transformers` process it all at once, which makes them easier
+to parallelize and **effectively faster despite being quadratically
+complex** in the sequence length.
 
-![image](./Images/PNGs/transformer-high-level.png)
+However, this comes at the cost of not having an *infinite context*:
+Transformers usually have no
+memory[^infinite-transformer-context][^transformer-xl],
+meaning that they need to resort to tricks such as **autoregression** or
+fixed context windows.
+
+## Basic Technologies
+
+### Positional Encoding
+
+Since a Transformer processes all the words of a sentence at once, the words
+would lose their positional information, making them less informative.
+
+A positional encoding adds this information back to the word embedding.
+
+There are several ways to add such an encoding; among them we find:
+
+- **Learnt encoding**:\
+  Use another network to learn how to add a positional encoding
+  to the word embedding
+- **Sinusoidal encoding[^attention-is-all-you-need]**:\
+  This comes from ***"Attention Is All You Need"***[^attention-is-all-you-need]
+  and it's a fixed function that alternately adds the sine and the cosine
+  to the word embeddings (a small sketch follows this list)
+  $$
+  \begin{aligned}
+    PE_{(pos, 2i)} &= \sin{\left(
+      \frac{
+        pos
+      }{
+        10000^{2i/d_{model}}
+      }
+    \right)}
+    \\
+    PE_{(pos, 2i+1)} &= \cos{\left(
+      \frac{
+        pos
+      }{
+        10000^{2i/d_{model}}
+      }
+    \right)}
+  \end{aligned}
+  $$
+- **RoPE[^rope-paper][^hugging-face-pe]**:\
+  This method uses the same sinusoidal functions, but instead of adding them
+  it uses them to rotate (multiply) the vectors. The idea is that a rotation
+  doesn't change a vector's magnitude and, hopefully, doesn't change its
+  latent meaning either.
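+
+A minimal NumPy sketch of the sinusoidal encoding above; the function and
+parameter names (`max_len`, `d_model`) are illustrative assumptions, not part
+of any specific library:
+
+```python
+import numpy as np
+
+def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
+    """Return a (max_len, d_model) matrix of fixed positional encodings."""
+    positions = np.arange(max_len)[:, np.newaxis]                      # (max_len, 1)
+    # 10000^(2i / d_model), one value per (sin, cos) pair of dimensions
+    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
+    pe = np.zeros((max_len, d_model))
+    pe[:, 0::2] = np.sin(positions / div_term)   # even indices: sine
+    pe[:, 1::2] = np.cos(positions / div_term)   # odd indices: cosine
+    return pe
+
+# usage: add the encoding to a (seq_len, d_model) matrix of word embeddings
+embeddings = np.random.randn(10, 512)
+embeddings = embeddings + sinusoidal_positional_encoding(10, 512)
+```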
+
+### Feed Forward
+
+This is just a couple of linear layers (with a non-linearity in between): the
+first one expands the dimensionality of the embedding (usually by 4 times),
+following the intuition of Cover's Theorem, and the second one shrinks it
+back to the original embedding size.
+
+### Self Attention
+
+This layer employs 3 matrices per attention head, which compute the
+Query, Key and Value vectors for each word embedding.
+
+#### Steps
+
+- Compute the $Q, K, V$ matrices for each head
+
+$$
+\begin{aligned}
+    Q_{i} &= X \times W_{Qi} \in \R^{S \times H}\\
+    K_{i} &= X \times W_{Ki} \in \R^{S \times H}\\
+    V_{i} &= X \times W_{Vi} \in \R^{S \times H}
+\end{aligned}
+$$
+
+- Compute the head value
+
+$$
+\begin{aligned}
+    Head_i = softmax\left(
+        \frac{
+            Q_{i} \times K_{i}^{T}
+        }{
+            \sqrt{H}
+        }
+    \right) \times V_{i}
+    \in \R^{S \times H}
+\end{aligned}
+$$
+
+- Concatenate all heads and multiply by a learnt matrix
+
+$$
+\begin{aligned}
+    Heads &= concat(Head_1, \dots, Head_n) \in \R^{S \times (n \cdot H)} \\
+    Out &= Heads \times W_{Heads} \in \R^{S \times Em}
+\end{aligned}
+$$
 
 > [!NOTE]
-> Input and output are vectors of **fixed size** with padding
+> Legend for each notation:
+>
+> - $X$: matrix of the input word embeddings, of shape $S \times Em$
+> - $Em$: embedding size
+> - $H$: head dimension
+> - $S$: sentence length (number of tokens)
+> - $i$: head index
 
-Before feeding our input, we split and embed each word into a fixed vector size. This size depends on the length of
-longest sentence in our training set
+> [!TIP]
+> $H$ is usually smaller than $Em$ (it makes computation faster and more
+> memory efficient), however this is not required.
+>
+> Here we have shown several separate operations; however, instead of making
+> many small tensor multiplications, it's better to perform a single larger
+> one (computationally more efficient) and then split its result into
+> its components (see the sketch below).
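+
+A minimal NumPy sketch of the steps above. For readability it loops over the
+heads, while a real implementation would batch them into one multiplication
+as the tip suggests; all names and sizes here are illustrative assumptions:
+
+```python
+import numpy as np
+
+def softmax(x, axis=-1):
+    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
+    e = np.exp(x)
+    return e / e.sum(axis=axis, keepdims=True)
+
+def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
+    """X: (S, Em); W_Q, W_K, W_V: (n_heads, Em, H); W_O: (n_heads * H, Em)."""
+    heads = []
+    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
+        Q, K, V = X @ W_q, X @ W_k, X @ W_v                 # (S, H) each
+        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (S, S) attention weights
+        heads.append(weights @ V)                           # (S, H) head value
+    return np.concatenate(heads, axis=-1) @ W_O             # (S, Em) output
+
+# toy usage: S = 6 tokens, Em = 16, n = 4 heads of size H = 4
+S, Em, n, H = 6, 16, 4, 4
+rng = np.random.default_rng(0)
+X = rng.normal(size=(S, Em))
+out = multi_head_self_attention(
+    X,
+    rng.normal(size=(n, Em, H)), rng.normal(size=(n, Em, H)),
+    rng.normal(size=(n, Em, H)), rng.normal(size=(n * H, Em)),
+)
+print(out.shape)  # (6, 16)
+```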
+
+### Cross-Attention
+
+It's the same as Self Attention, however $Q_{i}$ is computed from what comes
+out of the decoder, while $K_i$ and $V_i$ are computed from the output of the
+last encoder:
+
+$$
+\begin{aligned}
+    Q_{i} &= X_{dec} \times W_{Qi} \in \R^{S_{dec} \times H}\\
+    K_{i} &= X_{enc} \times W_{Ki} \in \R^{S_{enc} \times H}\\
+    V_{i} &= X_{enc} \times W_{Vi} \in \R^{S_{enc} \times H}
+\end{aligned}
+$$
+
+### Masking
+
+To make sure that a decoder doesn't attend to future information for past
+words, we use masks that prevent information from leaking to parts of the
+network that shouldn't see it.
+
+We usually implement 4 kinds of masks (a small sketch of how they can be
+built follows the list):
+
+![masks](./pngs/masks.png)
+
+- **Padding Mask**:\
+  This mask is useful to avoid computing attention over padding tokens
+- **Full Attention**:\
+  This mask is useful in encoders. It makes attention bidirectional, letting
+  words on the right add information to words on the left and vice-versa.
+- **Causal Attention**:\
+  This mask is useful in decoders. It prevents words on the right from
+  leaking information into words on their left. In other words, it prevents
+  future words from affecting the meaning of past ones.
+- **Prefix Attention**:\
+  This mask is useful for some decoder tasks. It allows some words to
+  add information over the past ones. These words, however, are not generated
+  by the decoder, but are part of its initial input.
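+
+A minimal NumPy sketch of how the padding, causal and prefix masks can be
+built as boolean matrices; the function names and the `-inf` trick are
+illustrative assumptions:
+
+```python
+import numpy as np
+
+def padding_mask(lengths, max_len):
+    """True where a position holds a real token, False where it is padding."""
+    return np.arange(max_len)[None, :] < np.array(lengths)[:, None]  # (batch, max_len)
+
+def causal_mask(seq_len):
+    """True where position i is allowed to attend to position j (j <= i)."""
+    return np.tril(np.ones((seq_len, seq_len), dtype=bool))
+
+def prefix_mask(seq_len, prefix_len):
+    """Causal mask, except the first `prefix_len` positions attend bidirectionally."""
+    mask = causal_mask(seq_len)
+    mask[:prefix_len, :prefix_len] = True
+    return mask
+
+# masked-out scores are set to -inf before the softmax, so they get ~0 weight
+scores = np.random.randn(5, 5)
+scores = np.where(causal_mask(5), scores, -np.inf)
+```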
+
+## Basic Blocks
 
 ### Embedder
 
-While this is not a real component per se, this is the first phase before even coming
-to the first `encoder` and `decoder`.
+This layer is responsible for transforming the input (usually tokens) into
+word embeddings, following these steps:
 
-Here we transform each word of the input into an ***embedding*** and add a vector to account for
-position. This positional encoding can either be learnt or can follow this formula:
-
-- Even size:
-
-$$
-\text{positional\_encoding}_{
-    (position, 2\text{size})
-} = \sin\left(
-    \frac{
-        pos
-    }{
-        10000^{
-            \frac{
-                2\text{size}
-            }{
-                \text{model\_depth}
-            }
-        }
-    }
-    \right)
-$$
-
-- Odd size:
-
-$$
-\text{positional\_encoding}_{
-    (position, 2\text{size} + 1)
-} = \cos\left(
-    \frac{
-        pos
-    }{
-        10000^{
-            \frac{
-                2\text{size}
-            }{
-                \text{model\_depth}
-            }
-        }
-    }
-    \right)
-$$
+- one-hot encoding
+- matrix multiplication to get the desired embedding
+- inclusion of positional info
 
 ### Encoder
 
-> [!CAUTION]
-> Weights are not shared between `encoders` or `decoders`
+It extracts meaning from the embedded vectors, looking both at their right
+and left context, through the following sublayers:
 
-Each phase happens for each word. In other words, if our embed size is 512, we have 512 `Self Attentions` and
-512 `Feed Forward NN` **per `encoder`**
+- Self Attention
+- Residual Connection
+- Layer Normalization
+- Feed Forward
+- Residual Connection
+- Layer Normalization (sometimes normalization is done before the self
+  attention instead)
 
-![Image](./Images/PNGs/encoder.png)
+![encoder picture](./pngs/encoder.png)
 
-#### Encoder Self Attention
-
-> [!WARNING]
-> This step is the most expensive one as it involves many computations
-
-Self Attention is a step in which each ***token*** gets the knowledge of previous ones.
-
-During this step, we produce 3 vectors that are **usually smaller**, for example 64 instead of 512:
-
-- **Queries** $\rightarrow q_{i}$
-- **Keys** $\rightarrow k_{i}$
-- **Values** $\rightarrow v_{i}$
-
-We use these values to compute a **score** that will tell us **how much to focus on certain parts of the sentence
-while encoding a token**
-
-In order to compute the final encoding we do these for each encoding word $i$:
-
-- Compute score for each word $j$ : $\text{score}_{j} = q_{i} \cdot k_{j}$
-- Divide each score by the square root of the size of these *helping vectors*:
-  $\text{score}_{j} = \frac{\text{score}_{j}}{\sqrt{\text{size}}}$
-- Compute softmax of all scores
-- Multiply softmax each score per its value: $\text{score}_{j} = \text{score}_{j} \cdot v_{j}$
-- Sum them all: $\text{encoding}_{i} = \sum_{j}^{N} \text{score}_{j}$
-
-> [!NOTE]
-> These steps will be done with matrices, not in this sequential way
-
-##### Multi-Headed Attention
-
-Instead of doing the Attention operation once, we do it more times, by having differente matrices to produce
-our *helping vectors*.
-
-This produces N encodings for each ***token***, or N matrices of encodings.
-
-The trick here is to **concatenate all encoding matrices** and **learn a new weight matrix** that will
-**combine them**
-
-#### Residuals
-
-In order no to lose some information along the path, after each `Feed Forward` and `Self-Attention`
-we add inputs to each ***sublayer*** `outputs` and we do a `Layer Normalization`
-
-#### Encoder Feed Forward NN
-
-> [!TIP]
-> This step is mostly **parallel** as there's no dependency between *neighbour vectors*
+Usually it is used to condition all decoders; however, if connected to a
+`De-Embedding` block or other layers, it can be used stand-alone to
+generate outputs.
 
 ### Decoder
 
+It takes meaning from the output embedded vectors, usually left to right, and
+conditions them on the last encoder's output.
+
+- Self Attention
+- Residual Connection
+- Layer Normalization
+- Cross Attention
+- Residual Connection
+- Layer Normalization
+- Feed Forward
+- Residual Connection
+- Layer Normalization (sometimes normalization is done before the self
+  attention instead)
+
+![decoder picture](./pngs/decoder.png)
+
+Usually this block is used to generate outputs autoregressively, meaning
+that at inference time we only take $out_{k}$ as the actual output and append
+it as $in_{k+1}$.
+
+> [!WARNING]
+> During training, we feed it the whole expected sequence shifted
+> by a `start` token and predict the whole sequence at once.
+>
+> So, **it isn't trained autoregressively**
+
+### De-Embedding
+
+Before having a result, we de-embed the outputs, bringing them back to a
+known format.
+
+Usually, for text generation, this turns the whole problem into a
+classification one, as we need to predict the right token among all
+available ones.
+
+Usually, for text, it is implemented like this (a sketch of the resulting
+inference loop is shown below):
+
+- Linear layer -> Go back to token space dimensions
+- Softmax -> Transform into probabilities
+- Argmax -> Take the most probable one
+
+However, depending on the objectives, this is subject to change.
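+
+A minimal sketch of greedy autoregressive decoding with such a de-embedding
+head. The `decoder` callable, `W_deembed`, `start_id` and `eos_id` are
+illustrative assumptions; a real implementation would also cache past
+activations instead of re-running the decoder on the whole prefix:
+
+```python
+import numpy as np
+
+def softmax(x):
+    x = x - x.max(axis=-1, keepdims=True)
+    return np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True)
+
+def greedy_decode(decoder, W_deembed, start_id, eos_id, max_len=50):
+    """decoder: maps a list of token ids to a (len, Em) matrix of output embeddings.
+    W_deembed: (Em, vocab_size) linear layer back to token space."""
+    tokens = [start_id]
+    for _ in range(max_len):
+        out_embeddings = decoder(tokens)          # run the decoder on everything so far
+        logits = out_embeddings[-1] @ W_deembed   # linear layer: back to token space
+        probs = softmax(logits)                   # softmax: turn logits into probabilities
+        next_id = int(np.argmax(probs))           # argmax: take the most probable token
+        tokens.append(next_id)                    # append out_k as in_{k+1}
+        if next_id == eos_id:                     # stop once the end token is produced
+            break
+    return tokens
+```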
+
+## Basic Architectures
+
+### Full Transformer (aka Encoder-Decoder, Encoder-Conditioned)
+
+This architecture is very powerful, but ***"with great power comes a great
+energy bill"***. While it has been successfully used in models like `T5`[^t5],
+it comes with additional complexity, both on the coding side and the
+computational one.
+
+![full architecture picture](./pngs/full-transformer.png)
+
+This is the basic architecture proposed in ***"Attention Is All You Need"***
+[^attention-is-all-you-need], but nowadays it has been superseded by
+decoder-only architectures for tasks such as text generation.
+
+### Encoder Only
+
+This architecture employs only encoders at its base. A model using
+this architecture is `BERT`[^bert], which is capable of tasks such as Masked
+Language Modelling, Sentiment Analysis, Feature Extraction and general
+classification (as for e-mails).
+
+![encoder only picture](./pngs/encoder-only.png)
+
+However, this architecture comes at the cost of not being good at text and
+sequence generation, but has the advantage of being able to process
+everything in one step.
+
+### Decoder Only
+
+This architecture employs only decoders, which are modified to get rid of
+cross-attention. Usually this is at the base of modern LLMs such as
+`GPT`[^gpt-2]. This architecture is capable of generating text, summaries and
+even music (converted into MIDI format).
+
+![decoder only picture](./pngs/decoder-only.png)
+
+However, this architecture needs time to generate, due to its autoregressive
+nature.
+
+## Curiosities
+
 > [!NOTE]
-> The decoding phase is slower than the encoding one, as it is sequential, producing a token for each iteration.
-> However it can be sped up by producing several tokerns at once
+> We call $Q$ the `Query`, $K$ the `Key` and $V$ the `Value`. Their names
+> come from an interpretation of how they interact, as if we were searching
+> a database: the `Query` is matched against the `Key`s, and the retrieved
+> item is the `Value`.
+>
+> However this is only an analogy and not the actual process.
 
-After the **last `Encoder`** has produced its `output`, $K$ and $V$ vectors, these are then used by
-**all `Decoders`** during their self attention step, meaning they are **shared** among all `Decoders`.
+
+[^infinite-transformer-context]: [Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention](https://arxiv.org/pdf/2404.07143)
-All `Decoders` steps are then **repeated until we get a `` token** which will tell the decoder to stop.
+[^transformer-xl]: [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860)
-#### Decoder Self Attention
+[^attention-is-all-you-need]: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)
-It's almost the same as in the `encoding` phase, though here, since we have no future `outputs`, we can only take into
-account only previous ***tokens***, by setting future ones to `-inf`.
-Moreover, here the `Key` and `Values` Mappings come from the `encoder` pase, while the
-`Queue` Mapping is learnt here.
+[^rope-paper]: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864)
-#### Final Steps
+[^hugging-face-pe]: [Hugging Face | Positional Encoding | 2nd November 2025](https://huggingface.co/blog/designing-positional-encoding)
-##### Linear Layer
+[^t5]: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683)
-Produces a vector of ***logits***, one per each ***known words***.
+[^bert]: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)
-##### Softmax Layer
-
-We then score these ***logits*** over a `SoftMax` to get probabilities. We then take the highest one, usually.
-
-If we implement ***Temperature***, though, we can take some `tokens` that are less probable, but having less predictability and
-have some results that feel more natural.
-
-## Training a Transformer
-
-
-
-## Known Transformers
-
-### BERT (Bidirectional Encoder Representations from Transformers)
-
-Differently from other `Transformers`, it uses only `Encoder` blocks.
-
-It can be used as a classifier and can be fine tuned.
-
-The fine tuning happens by **masking** input and **predict** the **masked word**:
-
-- 15% of total words in input are masked
-  - 80% will become a `[masked]` token
-  - 10% will become random words
-  - 10% will remain unchanged
-
-#### Bert tasks
-
-- **Classification**
-- **Fine Tuning**
-- **2 sentences tasks**
-  - **Are they paraphrases?**
-  - **Does one sentence follow from this other one?**
-- **Feature Extraction**: "Allows us to extract feature to use in our model
-
-### GPT-2
-
-Differently from other `Transformers`, it uses only `Decoder` blocks.
-
-Since it has no `encoders`, `GPT-2` takes `outputs` and append them to the original `input`. This is called **autoregression**.
-This, however, limits `GPT-2` on how to learn context on `input` because of `masking`.
-
-During `evaluation`, `GPT-2` does not recompute `V`, `K` and `Q` for previous tokens, but hold on their previosu values.
+[^gpt-2]: [Release Strategies and the Social Impacts of Language Models | GPT 2](https://arxiv.org/pdf/1908.09203)
\ No newline at end of file