diff --git a/Chapters/1-Basic-Architecture/INDEX.md b/Chapters/1-Basic-Architecture/INDEX.md
index 41a42a0..6661e84 100644
--- a/Chapters/1-Basic-Architecture/INDEX.md
+++ b/Chapters/1-Basic-Architecture/INDEX.md
@@ -1,15 +1,127 @@
-# Index
+# Basic Architecture
-$g()$ is any ***Non-Linear Function***
+> [!NOTE]
+> Here $g(\vec{x})$ is any
+> [activation function](./../3-Activation-Functions/INDEX.md)
-## Basic Architecture
+## Multiplicative Modules
-### Multiplicative Modules
+These modules let us combine outputs from other networks to modify
+a behaviour.
+
+### Sigma-Pi Unit
+
+> [!NOTE]
+> This module takes its name from its sum ($\sum$ - sigma) and multiplication
+> ($\prod$ - pi) operations
+
+This module multiplies the input by the output of another network:
+
+$$
+\begin{aligned}
+  W &= \vec{z} \times U &
+  \vec{z} \in \R^{1 \times b}, \,\, U \in \R^{b \times c \times d}\\
+  \vec{y} &= \vec{x} \times W &
+  \vec{x} \in \R^{1 \times c}, \,\, W \in \R^{c \times d}
+\end{aligned}
+$$
+
+This is equivalent to:
+
+$$
+\begin{aligned}
+  w_{i,j} &= \sum_{h = 1}^{b} z_h u_{h,i,j} \\
+  y_{j} &= \sum_{i = 1}^{c} x_i w_{i,j} = \sum_{h, i} x_i z_h u_{h,i,j}
+\end{aligned}
+$$
+
+As described in this chapter[^stanford-sigma-pi] hosted by Stanford
+University, `sigma-pi` units can be represented as follows:
+
+![stanford university sigma-pi](./pngs/stanford-sigma-pi.png)
+
+Assuming $a_b$ and $a_d$ are elements of $\vec{a}_1$, and $a_c$ and $a_e$ are
+elements of $\vec{a}_2$, this becomes
+
+$$
+\hat{y}_i = \sum_{j} w_{i,j} \prod_{k \in \{1, 2\}} a_{j, k}
+$$
+
+In other words, you can mix outputs coming from other networks via
+element-wise products and then combine the results through weights as usual.
+
+### Mixture of Experts
+
+If you have different networks trained for the same objective, you can
+multiply their outputs by a weight vector coming from another, controlling
+network.
+
+The controller network's objective is to give a score to each expert
+based on which one is the most *"experienced"* in that context. The more
+*"experienced"* an expert, the higher its influence over the output.
+
+$$
+\begin{aligned}
+  \vec{w} &= \text{softmax}\left(
+    \vec{z}
+  \right)
+  \\
+  \hat{y} &= \sum_{j} \text{expert\_out}_j \cdot w_j
+\end{aligned}
+$$
+
+> [!NOTE]
+> While we used a [`softmax`](./../3-Activation-Functions/INDEX.md#softmax),
+> this can be replaced by a `softmin` or any other scoring function.
+
+### Switch Like
+
+> [!NOTE]
+> I call them switch-like because if we set $z_i = 1$ for one element
+> of $\vec{z}$ and all the others to 0, the result is $\hat{y} = \vec{x}_i$
+
+We can use another network to produce a signal that mixes the outputs of
+other networks through a matmul:
+
+$$
+\begin{aligned}
+  X &= \text{concat}(\vec{x}_1, \dots, \vec{x}_n) \\
+  \hat{y} &= \vec{z} \times X = \sum_{i=1}^{n} z_i \cdot \vec{x}_i
+\end{aligned}
+$$
+
+Even though it's difficult to see here, this means that each $z_i$ is a
+mixing weight for the corresponding output vector $\vec{x}_i$.
+
+### Parameter Transformation
+
+This is when we use the output of a **fixed function** as the weights of our
+network:
+
+$$
+\begin{aligned}
+  W &= f(\vec{z}) \\
+  \hat{y} &= g(\vec{x}W)
+\end{aligned}
+$$
+
+#### Weight Sharing
+
+This is a special case of parameter transformation where
+
+$$
+f(\vec{x}) = \begin{bmatrix}
+  x_1 & x_1 & \dots & x_n & x_n
+\end{bmatrix}
+$$
+
+or similar, replicating elements of $\vec{x}$ across its output.
+
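+The sigma-pi contraction above is easy to check numerically. The following is
+a minimal NumPy sketch (sizes and variable names are mine, not from the text)
+that computes the output both as the two-step product $W = \vec{z} \times U$,
+$\vec{y} = \vec{x} \times W$ and as the single sum
+$\sum_{h, i} x_i z_h u_{h,i,j}$, and checks that the two agree.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+b, c, d = 3, 4, 5                  # sizes of z, x and the output y
+z = rng.normal(size=b)             # output of the other network
+x = rng.normal(size=c)             # input of this module
+U = rng.normal(size=(b, c, d))     # third-order weight tensor
+
+# Two-step form: contract U with z first, then apply the resulting matrix to x
+W = np.einsum("h,hij->ij", z, U)   # w_ij = sum_h z_h * u_hij
+y_two_step = x @ W                 # y_j  = sum_i x_i * w_ij
+
+# One-shot form: y_j = sum_{h,i} x_i * z_h * u_hij
+y_direct = np.einsum("i,h,hij->j", x, z, U)
+
+assert np.allclose(y_two_step, y_direct)
+print(y_direct.shape)              # (5,)
+```
+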
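+As a small illustration of the mixture-of-experts weighting, here is a sketch
+with three toy experts and a stand-in controller. The controller here is just
+a fixed random linear map producing one raw score per expert (an assumption
+made for brevity; a real controller would itself be a trained network).
+
+```python
+import numpy as np
+
+def softmax(v):
+    e = np.exp(v - v.max())
+    return e / e.sum()
+
+rng = np.random.default_rng(1)
+x = rng.normal(size=8)                           # shared input
+
+# Toy experts: any functions mapping the input to an output vector
+experts = [np.tanh, lambda v: np.maximum(v, 0.0), lambda v: -v]
+
+# Stand-in controller: a fixed linear map giving one raw score per expert
+controller = rng.normal(size=(8, len(experts)))
+w = softmax(x @ controller)                      # mixing weights, sum to 1
+
+expert_out = np.stack([f(x) for f in experts])   # shape (n_experts, 8)
+y_hat = w @ expert_out                           # sum_j w_j * expert_out_j
+print(w, y_hat.shape)
+```
+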
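+The switch-like behaviour noted above takes only a couple of lines to verify:
+with the $n$ output vectors stacked as the rows of $X$, a one-hot $\vec{z}$
+simply selects one of them (again a hypothetical NumPy sketch, not code from
+the text).
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(2)
+n, d = 4, 6
+X = rng.normal(size=(n, d))       # outputs of n networks, stacked as rows
+
+# Generic mixing: y_hat = z @ X = sum_i z_i * x_i
+z = rng.normal(size=n)
+y_mixed = z @ X
+
+# "Switch" behaviour: a one-hot z picks out a single row of X
+z_switch = np.zeros(n)
+z_switch[2] = 1.0
+assert np.allclose(z_switch @ X, X[2])
+print(y_mixed.shape)              # (6,)
+```
+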
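+Finally, a toy sketch of parameter transformation with the weight-sharing $f$
+shown above: the layer appears to use six weights, but they are generated from
+only three free parameters. The function names and sizes are placeholders
+chosen for this example.
+
+```python
+import numpy as np
+
+def g(v):                          # any activation function, here tanh
+    return np.tanh(v)
+
+def f(z):                          # fixed, parameter-free transformation:
+    return np.repeat(z, 2)         # [z1, z1, z2, z2, ...] -> weight sharing
+
+rng = np.random.default_rng(3)
+z = rng.normal(size=3)             # the actual (free) parameters
+W = f(z)                           # six effective weights from three values
+x = rng.normal(size=6)
+y_hat = g(x @ W)                   # scalar output in this toy case
+print(W, y_hat)
+```
+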
 [^simga-pi]: [University of Pretoria | sigma-pi | pg. 2](https://repository.up.ac.za/bitstream/handle/2263/29715/03chapter3.pdf?sequence=4#:~:text=A%20pi%2Dsigma%20network%20\(PSN,of%20sums%20of%20input%20components.)
-[^simga-pi-2]:[]
 [^product-unit]: doi: 10.13053/CyS-20-2-2218
 [^mixture-of-experts]: [Wikipedia | 1st April 2025](https://en.wikipedia.org/wiki/Mixture_of_experts)
+
+[^stanford-sigma-pi]: [D. E. Rumelhart, G. E. Hinton, J. L. McClelland | A General Framework for Parallel Distributed Processing | Ch. 2 pg. 73](https://stanford.edu/~jlmcc/papers/PDP/Chapter2.pdf)