# Basic Architecture

> [!NOTE]
> Here $g(\vec{x})$ is any
> [activation function](./../3-Activation-Functions/INDEX.md).

## Multiplicative Modules

These modules let us combine the outputs of other networks to modify
a network's behaviour.

### Sigma-Pi Unit

> [!NOTE]
> This module takes its name from its sum ($\sum$, sigma) and multiplication
> ($\prod$, pi) operations.

This module multiplies the input by the output of another network:

$$
\begin{aligned}
W &= \vec{z} \times U &
\vec{z} \in \R^{1 \times b}, \,\, U \in \R^{b \times c \times d}\\
\vec{y} &= \vec{x} \times W &
\vec{x} \in \R^{1 \times c}, \,\, W \in \R^{c \times d}
\end{aligned}
$$

This is equivalent to:

$$
\begin{aligned}
w_{i,j} &= \sum_{h = 1}^{b} z_h u_{h,i,j} \\
y_{j} &= \sum_{i = 1}^{c} x_i w_{i,j} = \sum_{h, i}x_i z_h u_{h,i,j}
\end{aligned}
$$

Following the representation of `sigma-pi` units in this paper[^stanford-sigma-pi]
(hosted by Stanford University), and assuming $a_b$ and $a_d$ are elements of
$\vec{a}_1$ while $a_c$ and $a_e$ are elements of $\vec{a}_2$, this becomes:

$$
\hat{y}_i = \sum_{j} w_{i,j} \prod_{k \in \{1, 2\}} a_{j, k}
$$

In other words, you can mix the outputs of other networks via
element-wise products and then combine the results with weights as usual.

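The two-step computation $W = \vec{z} \times U$, $\vec{y} = \vec{x} \times W$
from this section can be sketched in plain Python. This is only a minimal
illustration: the function name `sigma_pi`, the nested-list layout of `U`
(shape $b \times c \times d$) and the toy values are assumptions of mine, not
something taken from the cited paper.

```python
def sigma_pi(x, z, U):
    """Compute y[j] = sum over h, i of x[i] * z[h] * U[h][i][j].

    x: list of c floats (the input)
    z: list of b floats (the output of the other network)
    U: nested list with shape b x c x d (assumed layout)
    """
    b, c, d = len(U), len(U[0]), len(U[0][0])
    assert len(z) == b and len(x) == c
    # Step 1: W = z x U, i.e. W[i][j] = sum_h z[h] * U[h][i][j]
    W = [[sum(z[h] * U[h][i][j] for h in range(b)) for j in range(d)]
         for i in range(c)]
    # Step 2: y = x x W, i.e. y[j] = sum_i x[i] * W[i][j]
    return [sum(x[i] * W[i][j] for i in range(c)) for j in range(d)]


# Toy shapes: b = 2, c = 2, d = 3 (arbitrary values)
U = [
    [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]],
]
x = [1.0, 2.0]
z = [0.5, 2.0]
print(sigma_pi(x, z, U))  # [4.5, 3.0, 5.0]
```

Here $\vec{z}$ first builds the weight matrix $W$, which is then applied to
$\vec{x}$ like in a normal linear layer.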
### Mixture of Experts

If you have different networks trained for the same objective, you can
multiply their outputs by a weight vector coming from another, controlling
network.

The controller network's objective is to give each expert a score
based on how *"experienced"* it is in the current context. The more
*"experienced"* an expert, the higher its influence over the output.

$$
\begin{aligned}
\vec{w} &= \text{softmax}\left(
\vec{z}
\right)
\\
\hat{y} &= \sum_{j} \text{expert\_out}_j \cdot w_j
\end{aligned}
$$

> [!NOTE]
> While we used a [`softmax`](./../3-Activation-Functions/INDEX.md#softmax),
> this can be replaced by a `softmin` or any other scoring function.

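A minimal sketch of the gating formula above, using only the standard library.
The names `softmax`, `mixture_of_experts`, `expert_outputs` and `gate_scores`
are hypothetical, and the experts are stand-in vectors rather than real
networks.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(expert_outputs, gate_scores):
    """Return sum_j w[j] * expert_outputs[j], with w = softmax(gate_scores).

    expert_outputs: list of n vectors (lists of equal length)
    gate_scores: list of n raw scores z from the controller network
    """
    assert len(expert_outputs) == len(gate_scores)
    w = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [sum(w[j] * expert_outputs[j][k] for j in range(len(w)))
            for k in range(dim)]


# Two stand-in experts; equal scores give a plain average.
experts = [[1.0, 0.0], [3.0, 2.0]]
print(mixture_of_experts(experts, [0.0, 0.0]))  # [2.0, 1.0]
```

With a much larger score for one expert, the output moves towards that
expert's output, which is the "more experienced, more influence" behaviour
described above.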
### Switch Like

> [!NOTE]
> I call these switch-like because if we set one element $z_i$
> of $\vec{z}$ to 1 and all the others to 0, the result is $\hat{y} = \vec{x}_i$.

We can use another network to produce a signal that mixes the outputs of other
networks through a matmul, where the vectors $\vec{x}_i$ are stacked as the rows of $X$:

$$
\begin{aligned}
X &= \text{concat}(\vec{x}_1, \dots, \vec{x}_n) \\
\hat{y} &= \vec{z} \times X = \sum_{i=1}^{n} z_i \cdot \vec{x}_i
\end{aligned}
$$

Although it is not obvious from the matrix form, each $z_i$ is the
mixing weight for the corresponding output vector $\vec{x}_i$.

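To make the switch behaviour concrete, here is a small sketch in plain Python.
The helper name `mix` and the example vectors are made up for illustration; the
rows of `xs` play the role of the stacked outputs $\vec{x}_1, \dots, \vec{x}_n$.

```python
def mix(z, xs):
    """Return sum_i z[i] * xs[i], i.e. z x X with the xs stacked as rows of X."""
    assert len(z) == len(xs)
    dim = len(xs[0])
    return [sum(z[i] * xs[i][k] for i in range(len(xs))) for k in range(dim)]


xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mix([0.0, 1.0, 0.0], xs))  # one-hot z selects the second vector: [3.0, 4.0]
print(mix([0.5, 0.5, 0.0], xs))  # soft z blends the first two: [2.0, 3.0]
```

A one-hot $\vec{z}$ acts as a hard switch, while any other $\vec{z}$ gives a
weighted blend of the stacked vectors.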
### Parameter Transformation

This is when we use the output of a **fixed function** as the weights of our
network:

$$
\begin{aligned}
W &= f(\vec{z}) \\
\hat{y} &= g(\vec{x}W)
\end{aligned}
$$

#### Weight Sharing

This is a special case of parameter transformation, where

$$
f(\vec{z}) = \begin{bmatrix}
z_1, & z_1, & \dots, & z_n, & z_n
\end{bmatrix}
$$

or similar, replicating elements of $\vec{z}$ across its output.

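As a rough sketch of weight sharing seen as a parameter transformation, the
fixed function $f$ below repeats every parameter twice and the result is used
as the weight vector of a single output unit. The names `share` and `layer`,
the choice of `math.tanh` as $g$ and the single-output simplification are all
assumptions made for this example.

```python
import math


def share(z, copies=2):
    """Fixed function f: repeat every parameter `copies` times."""
    return [p for p in z for _ in range(copies)]


def layer(x, z, g=math.tanh):
    """Compute g(x . f(z)) for a single output unit."""
    w = share(z)
    assert len(w) == len(x)
    return g(sum(xi * wi for xi, wi in zip(x, w)))


z = [0.5, -1.0]
print(share(z))  # [0.5, 0.5, -1.0, -1.0]
print(layer([1.0, 1.0, 1.0, 1.0], z))  # tanh(-1.0)
```

Only the two values in `z` are free parameters here, even though the unit sees
four inputs. When training, each shared parameter receives the sum of the
gradients of all its copies.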
<!--

With these modules we can modify our ***traditional*** ways of building ***neural networks***
and implement ***switch-like*** functions

### Professor's one

Basically here we want a ***way to modify `weights` with `inputs`***.

Here $\vec{z}$ and $\vec{x}$ are both `inputs`

$$
\begin{aligned}
y_{i} &= \sum_{j} w_{i,j} x_{j} \\
w_{i,j} &= \sum_{k} u_{i,j,k} z_{k} \rightarrow \\
\rightarrow y_{i} &= \sum_{j,k} u_{i,j,k} z_{k} x_{j}
\end{aligned}
$$

As we can see here, $z_{k}$, together with $u$, modifies how $x_{j}$ contributes to the output.

#### Quadratic Layer

This layer expands data by computing all the **pairwise (quadratic) products** of its entries

$$
\begin{aligned}
\vec{v} &= [a_1, a_2, a_3] \\
\text{quad\_layer}(\vec{v}) &= [ a_1 \cdot a_1, a_1 \cdot a_2, a_1 \cdot a_3, \dots , a_3 \cdot a_3 ]
\end{aligned}
$$

#### Product Unit[^product-unit]

$$
o_k = \sum_{j=1}^{m} v_{k,j} \cdot \left( \prod_{i=1}^{n} x_{i}^{w_{j,i}}\right) + v_{k,0}
$$

#### Sigma-Pi Unit[^simga-pi][^simga-pi-2]

This *layer* is basically a product of `input` terms **times**
a `weight`, instead of the `matrix multiplication` of a `linear-layer`.

Moreover, this is ***not necessarily*** `fully-connected`

$$
o_k = g\left( \sum_{q \in \text{conjunct}} w_{q} \prod_{k=1}^{N} z_{q,k} \right)
$$

### Attention Modules

They define a way for our `model` to work out what is ***more important***

#### Softmax

We use this function to output the ***importance*** of a certain
value over all the others.

$$
\begin{aligned}
\sigma(\vec{x})_{j} &= \frac{e^{x_{j}}}{\sum_{k=0}^{N} e^{x_{k}}} \;\; \forall j \in \{0, \dots, N\} \\
\sigma(\vec{x})_{j} &\in [0, 1] \;\; \forall j \in \{0, \dots, N\}
\end{aligned}
$$

## Mixture of Experts[^mixture-of-experts]

What happens if we have several `models` and we want to combine their outputs?

Basically we have a set of `weights` over our `outputs` before the `output-layer`.
Both the **experts** and the **gating-function** need to be `trained`.

> [!TIP]
>
> Since we are talking about `weights` and `importance`, it is probably better to use an [attention-model](#attention-modules) here

## Parameter Transformation

It is basically when the `weights` are the `output` of a ***function***

Since they are controlled by some other `parameters`, we need to ***learn***
those instead

### Weights Sharing

Here we ***copy*** our weights over more ***basic components***.
Since we have ***more than one value*** for our `original weights`, we need to ***sum*** those.

> [!TIP]
>
> This is used to find ***motifs*** on an `input`
-->

<!-- Footnotes -->

[^simga-pi]: [University of Pretoria | sigma-pi | pg. 2](https://repository.up.ac.za/bitstream/handle/2263/29715/03chapter3.pdf?sequence=4#:~:text=A%20pi%2Dsigma%20network%20\(PSN,of%20sums%20of%20input%20components.)

[^product-unit]: doi: 10.13053/CyS-20-2-2218

[^mixture-of-experts]: [Wikipedia | 1st April 2025](https://en.wikipedia.org/wiki/Mixture_of_experts)

[^stanford-sigma-pi]: [D. E. Rumelhart, G. E. Hinton, J. L. McClelland | A General Framework for Parallel Distributed Processing | Ch. 2 pg. 73](https://stanford.edu/~jlmcc/papers/PDP/Chapter2.pdf)