# Graph ML

## Graph Introduction
- **Nodes**: Pieces of Information
- **Edges**: Relationship between nodes
  - **Mutual**
  - **One-Sided**
- **Directionality**
  - **Directed**: We care about the order of connections
    - **Unidirectional**
    - **Bidirectional**
  - **Undirected**: We don't care about the order of connections

Now, we can have attributes over:
- **nodes**
  - identity
  - number of neighbours
- **edges**
  - identity
  - weight
- **master nodes** (a collection of nodes and edges)
  - number of nodes
  - longest path

For example, images may be represented as a graph where each non-border pixel is
a vertex connected to its 8 surrounding pixels.

The information at each vertex is a 3- (or 4-) dimensional vector (think of RGB and RGBA).
### Adjacency Matrix

Take a picture and build a matrix in $\{0, 1\}^{(h \cdot w) \times (h \cdot w)}$: we put a 1 if two
nodes are connected (share an edge), and a 0 if they do not.

> [!NOTE]
> For a $300 \times 250$ image our matrix would be $\{0, 1\}^{(250 \cdot 300) \times (250 \cdot 300)}$

The placement of 1s and 0s follows these rules:

- **Row element** has a connection **towards** the **Column element**
- **Column element** has a connection **coming** from the **Row element**

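
As a small illustration of this convention, here is a minimal NumPy sketch (the graph, node count and edges are made up for the example) that fills an adjacency matrix from a list of directed edges:

```python
import numpy as np

# Hypothetical directed graph with 4 nodes and a few edges (source -> target)
n_nodes = 4
edges = [(0, 1), (0, 2), (2, 3)]

A = np.zeros((n_nodes, n_nodes), dtype=int)
for src, dst in edges:
    A[src, dst] = 1  # row `src` has a connection towards column `dst`

print(A)
# [[0 1 1 0]
#  [0 0 0 0]
#  [0 0 0 1]
#  [0 0 0 0]]
```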
### Tasks

#### Graph-Level

We want to predict a property of the whole graph.

#### Node-Level

We want to predict a property of each node, such as its class.

#### Edge-Level

We want to predict relationships between nodes, such as whether they share an edge, or the value of the edge they share.

For this task we may start with a fully connected graph and then prune edges, as predictions go on, to end up with a
sparse graph.
### Challenges of dealing with graphs

While graphs are very powerful at representing structures in a compact and
natural way, they come with several challenges.

**The number of nodes may vary wildly from graph to graph**, making it
difficult to handle different graphs with the same machinery.

**Sometimes nodes have no meaningful order**, so we need to treat
different orderings in the same way.

**Graphs can be very large**, and thus take a lot of space. However, they are
usually sparse in nature, which makes it possible to compress their
representation.
## Representing Graphs

### Adjacency List

We store info about:

- **Nodes**: list of values; entry $Node_k$ is the value of the $k$-th node
- **Edges**: list of values; entry $Edge_k$ is the value of the $k$-th edge
- **Adjacency list** (`adj_list`): list of tuples of node indices; entry $Tuple_k$
  holds the nodes involved in the $k$-th edge
- **Graph**: value of the whole graph

```python
from typing import Any

nodes: list[Any] = [
    "fork", "spaghetti", "knife", "spoon", "soup"
]

edges: list[Any] = [
    "needed to eat", "cutlery", "food",
    "cutlery", "cutlery", "needed to eat"
]

# adj_list[k] holds the pair of node indices joined by the k-th edge
adj_list: list[tuple[int, int]] = [
    (0, 1), (0, 2), (1, 4),
    (0, 3), (2, 3), (3, 4)
]

graph: Any = "dining table"
```

If we find that some parts of the graph are disconnected, we can simply avoid storing and computing those parts.

> [!CAUTION]
> Even though in this example we used single values, edges, nodes and graphs may
> be made of tensors or structured data

## Graph Neural Networks (GNNs)

In the simplest form we take a **graph-in**, **graph-out** approach with separate `MLPs`
([Multi Layer Perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron),
aka Neural Network) for vertices, edges and master
nodes, applied **one at a time** over each element.

$$
\begin{aligned}
V_{i + 1} &= MLP_{V_{i}}(V_{i}) \\
E_{i + 1} &= MLP_{E_{i}}(E_{i}) \\
U_{i + 1} &= MLP_{U_{i}}(U_{i}) \\
\end{aligned}
$$

This also means that the output is a graph, and we need further refinement
to get other kinds of outputs. For example, we can apply a classifier to each node
embedding to get a class for that node.

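
A minimal sketch of this idea, with made-up shapes, random weights and a hypothetical two-layer `mlp` helper; the point is only that each element type gets its own MLP, applied row by row to the stacked embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # A tiny two-layer perceptron applied independently to each row
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

# Hypothetical embeddings: 5 nodes, 6 edges, 1 master node, dimension 4
V = rng.normal(size=(5, 4))
E = rng.normal(size=(6, 4))
U = rng.normal(size=(1, 4))

# One separate MLP per element type (weights initialised at random here)
params = {name: (rng.normal(size=(4, 8)), np.zeros(8),
                 rng.normal(size=(8, 4)), np.zeros(4))
          for name in ("V", "E", "U")}

V_next = mlp(V, *params["V"])
E_next = mlp(E, *params["E"])
U_next = mlp(U, *params["U"])
print(V_next.shape, E_next.shape, U_next.shape)  # (5, 4) (6, 4) (1, 4)
```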
### Pooling

This is a step that allows us to take info from a graph element type different
from the one we need. For example, we might want info coming from
edges, but bring it over to vertices.

```python
import numpy as np

# Pooling: sum the embeddings of a collection of items
def pool(items: np.ndarray) -> np.ndarray:
    # Accumulator: a 1 x D vector,
    # where D is the embedding dimension
    pooled_value = np.zeros((1, items.shape[1]))

    for item in items:
        pooled_value += item

    return pooled_value
```

However, to make use of parallel computation, we can also think of doing it this way:

```python
import numpy as np

# Parallel pooling: the same sum expressed as a single vectorised operation
def parallel_pool(items: np.ndarray) -> np.ndarray:
    # items is an N x D matrix:
    # each row is an embedding

    # Sum over rows
    return items.sum(axis=0)
```

This, at its core, is useful when we lack properties for a portion of the data,
for example edges or the hypernode, and we use data coming from other parts to
enable us to do computations.

### Message Passing

However, if we already have some info, we can use [`pooling`](#pooling) to augment
it, taking into account the connectivity of the graph by pooling **only
adjacent info of the same type**.

This means that at each step, a node receives info from its neighbours in the
same fashion. After $step_k$ our node will have received partial information
from nodes located $k$ steps away.

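
A minimal sketch of one message-passing step over an adjacency list (the graph, the dimensions and the choice of plain summation are all illustrative assumptions):

```python
import numpy as np

# Hypothetical undirected graph stored as an adjacency list
adj_list = [(0, 1), (0, 2), (1, 2), (2, 3)]
n_nodes, dim = 4, 3
h = np.arange(n_nodes * dim, dtype=float).reshape(n_nodes, dim)  # node embeddings

def message_passing_step(h, adj_list):
    # Each node adds the embeddings of its neighbours to its own (sum pooling)
    out = h.copy()
    for u, v in adj_list:
        out[u] += h[v]
        out[v] += h[u]
    return out

h1 = message_passing_step(h, adj_list)   # info from 1 hop away
h2 = message_passing_step(h1, adj_list)  # info from up to 2 hops away
print(h2)
```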
### Weaving

If we combine the previous techniques together, we can merge info coming from
many parts of the graph. However, if the embeddings do not have the same dimension,
we need a linear layer to make the dimensions match.

As we are going to use a linear layer, this means that we are using a **learned**
representation, **which must be trained**.

> [!CAUTION]
> Because graphs may be very sparsely connected and the longest path may exceed
> our number of layers, to bypass this, we weave info over the hypernode as well.
>
> The graph node will be used as if it were a node connected to all nodes, giving
> each node an overview of all node information

### Special Layers

<!-- TODO: Read PDF 14 Anelli pg 47 to 52 -->

## Polynomial Filters

Each polynomial filter is order equivariant (see below).

### Graph Laplacian

Let's fix an order over the nodes of a graph, where $A$ is the adjacency matrix:

$$
D_{v,v} = \sum_{u} A_{v,u}
$$

In other words, $D_{v, v}$ is the number of nodes connected to the node $v$. In
fact, notice that we are summing over the rows of the **Adjacency Matrix** $A$.

The [**Graph Laplacian**](https://en.wikipedia.org/wiki/Laplacian_matrix)
$L$ of the graph will be

$$
L = D - A
$$

As we can see, $L$ keeps all elements of $D$ untouched, as no node is
connected to itself in the adjacency matrix, and it contains all elements of the
adjacency matrix with opposite sign (which means all non-positive).

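
A quick numerical check of these definitions, using the same three-node graph that appears in the example further below (node 0 connected to nodes 1 and 2):

```python
import numpy as np

# Undirected graph: node 0 connected to nodes 1 and 2
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])

D = np.diag(A.sum(axis=1))  # degree matrix: D[v, v] = number of neighbours of v
L = D - A                   # graph Laplacian

print(L)
# [[ 2 -1 -1]
#  [-1  1  0]
#  [-1  0  1]]
```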
### Convolution through Polynomials of the Laplacian

<!-- TODO: Study Laplacian Filters -->

Let's construct some polynomials by using the Laplacian matrix:

$$
p_{\vec{w}}(L) = w_{0}I_{n} + w_{1}L^{1} + \dots + w_{d}L^{d} = \sum_{i=0}^{d} w_{i}L^{i}
$$

Here each $w_i$ is an entry of the weight vector $\vec{w} = [w_0, \dots, w_d] \in \R^{d+1}$.
We can then define the convolution as $p_{\vec{w}}(L) \times \vec{x}$, where
**<ins>$\vec{x}$ is the matrix of all stacked vertex embeddings</ins>**

$$
\begin{aligned}
\vec{x} &\in \R^{l\times d}\\
\vec{x}' &= p_{\vec{w}}(L) \vec{x}
\end{aligned}
$$

To explain the ***convolutional effect***, let's see what happens for a polynomial
with $\vec{w}_0 = [w_0, 0, \dots, 0] \in \R^{d+1}$:

$$
\vec{x}' = p_{\vec{w}_0}(L)\, \vec{x} = w_0I_{l}\, \vec{x}
$$

Here each item of $\vec{x}'$ is just a scaled version of itself. Now consider a
polynomial with $\vec{w}_1 = [0, w_1, \dots, 0] \in \R^{d+1}$:

$$
\vec{x}' = p_{\vec{w}_1}(L)\, \vec{x} = w_1L\, \vec{x}
$$

Here each vertex merges info coming from its connected vertices, plus a contribution
from itself (scaled by its number of connections).

> [!TIP]
> As this may not have been very clear from these formulas, let's make a practical
> example, where vertex 0 is connected to both vertices 1 and 2:
>
> $$
> \begin{aligned}
> \vec{x} &= \begin{bmatrix}
> \vec{x}_0 \in \R^{1 \times d} \\
> \vec{x}_1 \in \R^{1 \times d} \\
> \vec{x}_2 \in \R^{1 \times d} \\
> \end{bmatrix}
> \\
> L &= \begin{bmatrix}
> 2 & -1 & -1 \\
> -1 & 1 & 0 \\
> -1 & 0 & 1
> \end{bmatrix}
> \\
> \vec{x}' &= w_1 L \vec{x} = \begin{bmatrix}
> 2w_1 & -w_1 & -w_1 \\
> -w_1 & w_1 & 0 \\
> -w_1 & 0 & w_1
> \end{bmatrix}
> \times
> \begin{bmatrix}
> \vec{x}_0 \\
> \vec{x}_1 \\
> \vec{x}_2
> \end{bmatrix}
> \\
> &= \begin{bmatrix}
> 2w_1\vec{x}_0 - w_1\vec{x}_1 - w_1 \vec{x}_2 \in \R^{1 \times d} \\
> -w_1\vec{x}_0 + w_1\vec{x}_1 + 0\vec{x}_2 \in \R^{1 \times d} \\
> -w_1\vec{x}_0 + 0\vec{x}_1 + w_1\vec{x}_2 \in \R^{1 \times d} \\
> \end{bmatrix}
> \end{aligned}
> $$
>
> For simplicity we wrote the vertices as vectors rather than decomposing them into their
> elements. As we can notice, here we combined adjacent vertices.

Now let's see what happens to the Laplacian matrix for the power of 2:

$$
\begin{aligned}
L &= \begin{bmatrix}
l_{0, 0} & l_{0, 1} & l_{0, 2} \\
l_{1, 0} & l_{1, 1} & l_{1, 2} \\
l_{2, 0} & l_{2, 1} & l_{2, 2} \\
\end{bmatrix}
\\
L \times L &=
\begin{bmatrix}
l_{0, 0} & l_{0, 1} & l_{0, 2} \\
l_{1, 0} & l_{1, 1} & l_{1, 2} \\
l_{2, 0} & l_{2, 1} & l_{2, 2} \\
\end{bmatrix}
\times
\begin{bmatrix}
l_{0, 0} & l_{0, 1} & l_{0, 2} \\
l_{1, 0} & l_{1, 1} & l_{1, 2} \\
l_{2, 0} & l_{2, 1} & l_{2, 2} \\
\end{bmatrix} = \\
&= \begin{bmatrix}
l_{0, 0}^2 + l_{0, 1}l_{1, 0} + l_{0, 2}l_{2, 0} &
l_{0, 0}l_{0, 1} + l_{0, 1}l_{1, 1} + l_{0, 2}l_{2, 1} &
l_{0, 0}l_{0, 2} + l_{0, 1}l_{1, 2} + l_{0, 2}l_{2, 2} \\
l_{1, 0}l_{0, 0} + l_{1, 1}l_{1, 0} + l_{1, 2} l_{2, 0} &
l_{1, 0}l_{0, 1} + l_{1, 1}^2 + l_{1, 2}l_{2, 1} &
l_{1, 0}l_{0, 2} + l_{1, 1}l_{1, 2} + l_{1, 2}l_{2, 2} \\
l_{2, 0}l_{0, 0} + l_{2, 1}l_{1, 0}+ l_{2, 2} l_{2, 0} &
l_{2, 0}l_{0, 1} + l_{2, 1}l_{1, 1} + l_{2, 2}l_{2, 1} &
l_{2, 0}l_{0, 2} + l_{2, 1}l_{1, 2} + l_{2, 2}^2
\end{bmatrix}
\end{aligned}
$$

As we can see, information from neighbouring connections leaks into the powers of the
Laplacian. In the extreme case, it's possible to show that if the distance
between two nodes is greater than the power of the Laplacian matrix, the
corresponding entry is zero, i.e. they are treated as detached at that power.

This means that by manipulating the degree of the polynomial, we can choose the
diffusion distance for information. For example, if we want info from vertices
at most 3 steps away, we will use a polynomial of 3rd degree.

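
A minimal sketch of such a polynomial filter (the coefficients, features and graph are made-up values; `poly_filter` is a hypothetical helper, not a library function):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
L = np.diag(A.sum(axis=1)) - A

x = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])   # stacked node embeddings, shape (3, 2)

w = [0.5, 1.0, 0.25]         # polynomial coefficients w_0, w_1, w_2

def poly_filter(L, w, x):
    # x' = (w_0 I + w_1 L + w_2 L^2 + ...) @ x
    p = sum(w_i * np.linalg.matrix_power(L, i) for i, w_i in enumerate(w))
    return p @ x

print(poly_filter(L, w, x))
```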
Another interesting property of these polynomials is that they do not depend on
the order of the connections, but only on their existence, so they are order
equivariant.

> [!TIP]
> Go [here](./python-experiments/laplacian_graph.ipynb) to see some experiments.
>
> In particular, see the second one

### ChebNet

The polynomial in ChebNet becomes:

$$
\begin{aligned}
p_{\vec{w}}(L) &= \sum_{i = 1}^{d} w_{i} T_{i}(\tilde{L}) \\
T_{i}(\cos\theta) &= \cos(i\theta) \\
\tilde{L} &= \frac{2L}{\lambda_{\max}(L)} - I_{n}
\end{aligned}
$$

- $T_{i}$ is the Chebyshev polynomial of the first kind
- the coefficients $w_{i}$ are the weights to be learnt[^chebnet]
- $\tilde{L}$ is a rescaled version of $L$: since $L$ has no negative eigenvalues
  (it is positive semi-definite), dividing by its largest eigenvalue and
  subtracting $I_{n}$ keeps the eigenvalues of $\tilde{L}$ in the range $[-1, 1]$

These polynomials are more stable, as they do not explode at higher powers.

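
A sketch of how the Chebyshev basis could be evaluated with the usual three-term recurrence $T_0 = I$, $T_1 = \tilde{L}$, $T_i = 2\tilde{L}T_{i-1} - T_{i-2}$; the graph, the degree and the coefficients are made up, and the $i = 0$ term is included for simplicity:

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
L = np.diag(A.sum(axis=1)) - A

lam_max = np.linalg.eigvalsh(L).max()
L_tilde = 2 * L / lam_max - np.eye(len(L))   # eigenvalues rescaled into [-1, 1]

def cheb_basis(L_tilde, degree):
    # Chebyshev polynomials of the first kind via the three-term recurrence
    T = [np.eye(len(L_tilde)), L_tilde]
    for _ in range(2, degree + 1):
        T.append(2 * L_tilde @ T[-1] - T[-2])
    return T[: degree + 1]

w = [0.5, 1.0, 0.25]                         # coefficients to be learnt
p = sum(w_i * T_i for w_i, T_i in zip(w, cheb_basis(L_tilde, len(w) - 1)))
print(p.shape)                               # (3, 3) filter acting on node features
```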
> [!NOTE]
>
> Even though we said that we can control the radius of neighbours leaking info,
> technically speaking, if we stack $N$ such layers, a node gets info from nodes up to
> $N \cdot hops$ away, assuming each layer takes from at most $hops$ steps.
>
> So, let's say we have 2 layers taking from distance 3: after the computation a
> node gets info from nodes at 6 hops distance.

### Embedding Computation

<!-- TODO: Read PDF 14 Anelli from 81 to 83 -->

Whenever we compute over nodes, we first need to compute their embeddings.
From there on, all embeddings will be treated as inputs for our networks.

$$
\begin{aligned}
g(x) &\coloneqq \text{Any non linear function} \\
\vec{x}' &= g\left(p_{\vec{w}}(L) \times \vec{x}\right)
\end{aligned}
$$

For each layer the same weights are applied, as in
[CNNs](./../7-Convolutional-Networks/INDEX.md#convolutional-layer).

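
Putting the last two pieces together, a single polynomial-filter layer could look like the sketch below (the choice of `tanh` as $g$, the weights and the graph are arbitrary assumptions):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
L = np.diag(A.sum(axis=1)) - A
x = np.random.default_rng(0).normal(size=(3, 4))   # node embeddings

def layer(x, L, w, g=np.tanh):
    # x' = g(p_w(L) @ x): polynomial filter followed by a non-linearity
    p = sum(w_i * np.linalg.matrix_power(L, i) for i, w_i in enumerate(w))
    return g(p @ x)

h = layer(x, L, w=[0.2, 0.5])   # one layer, 1-hop diffusion
h = layer(h, L, w=[0.2, 0.5])   # stacking layers widens the receptive field
print(h.shape)                  # (3, 4)
```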
## Real Graph Networks Configurations

To sum up what we have seen until now: to make computations over graphs we
usually take info from neighbours (Node Aggregation), transform the old info
of our node, and combine the two (Node Update).

- <span style="color:skyblue">(Potentially) Learnable parameters</span>
- <span style="color:orange">Embeddings of node v</span>
- <span style="color:violet">Embeddings of neighbours of v</span>

### Graph Neural Network

$$
\textcolor{orange}{h_v^{(k)}} =
\textcolor{skyblue}{f_2^{(k)}} \left(
\underbrace{
\sum_{n\in\mathcal{N}(v)} \textcolor{skyblue}{f_1^{(k)}}(
\textcolor{orange}{h_n^{(k-1)}},
\textcolor{orange}{h_v^{(k-1)}},
\textcolor{orange}{e_{v, n}}
),
}_{\text{message passing of edges to nodes}}\,\,
\textcolor{orange}{h_v^{(k -1)}}
\right)
$$

### Graph Convolutional Network

$$
\textcolor{orange}{h_{v}^{(k)}} =
\textcolor{skyblue}{f^{(k)}} \left(
\underbrace{\textcolor{skyblue}{W^{(k)}} \cdot
\frac{
\sum_{u \in \mathcal{N}(v)} \textcolor{violet}{h_{u}^{(k-1)}}
}{
|\mathcal{N}(v)|
}}_{\text{mean of previous neighbour embeddings}} + \underbrace{\textcolor{skyblue}{B^{(k)}} \cdot
\textcolor{orange}{h_{v}^{(k - 1)}}}_{\text{previous embeddings}}
\right) \quad \forall v \in V
$$

> [!TIP]
>
> $\textcolor{skyblue}{f^{(k)}}$ here is a
> **<ins>Non-Linear Activation function</ins>**

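
A minimal dense-matrix sketch of this update, with random weights, `ReLU` standing in for $f^{(k)}$ and a row-vector convention; a real implementation would use sparse operations:

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
deg = A.sum(axis=1, keepdims=True)               # |N(v)| for each node
rng = np.random.default_rng(0)

H = rng.normal(size=(3, 4))                      # previous embeddings h^(k-1)
W = rng.normal(size=(4, 4))                      # neighbour transform W^(k)
B = rng.normal(size=(4, 4))                      # self transform B^(k)

def gcn_layer(H, A, deg, W, B):
    neigh_mean = (A @ H) / deg                   # mean of neighbour embeddings
    # right-multiplication because embeddings are stored as rows
    return np.maximum(neigh_mean @ W + H @ B, 0) # f = ReLU

H_next = gcn_layer(H, A, deg, W, B)
print(H_next.shape)                              # (3, 4)
```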
### Graph Attention Networks

$$
\textcolor{orange}{h_{v}^{(k)}} =
\textcolor{skyblue}{f^{(k)}} \left(
\textcolor{skyblue}{W^{(k)}} \cdot \left[
\underbrace{
\sum_{u \in \mathcal{N}(v)} \alpha^{(k-1)}_{v,u}
\textcolor{violet}{h_{u}^{(k-1)}}
}_{\text{weighted mean of previous neighbour embeddings}} +
\underbrace{\alpha^{(k-1)}_{v,v}
\textcolor{orange}{h_{v}^{(k-1)}}}_{\text{previous embeddings}}
\right] \right) \quad \forall v \in V
$$

where the $\alpha^{(k)}_{v,u}$ are weights generated by an attention mechanism
$A^{(k)}$, normalized so that the $\alpha^{(k)}_{v,u}$ of each node's
neighbourhood sum to 1.

$$
\alpha^{(k)}_{v,u} = \frac{
\textcolor{skyblue}{A^{(k)}}(
\textcolor{orange}{h_{v}^{(k)}},
\textcolor{violet}{h_{u}^{(k)}}
)
}{
\sum_{w \in \mathcal{N}(v)} \textcolor{skyblue}{A^{(k)}}(
\textcolor{orange}{h_{v}^{(k)}},
\textcolor{violet}{h_{w}^{(k)}}
)
} \quad \forall (v, u) \in E
$$

> [!TIP]
> To make computations faster, we can compute all of the
> $\textcolor{skyblue}{A^{(k)}}(
> \textcolor{orange}{h_{v}^{(k)}},
> \textcolor{violet}{h_{u}^{(k)}}
> )$ first and then normalize them at a later stage to get $\alpha^{(k)}_{v,u}$
>
> Also, this system (GAT) is flexible enough to let us change the attention
> mechanism and include more heads, as the one above uses just a single
> attention head

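
A sketch of a single-head computation of the $\alpha_{v,u}$, where a softmax over dot products is an arbitrary stand-in for the attention mechanism $A^{(k)}$:

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))                       # node embeddings h^(k)

def attention_coefficients(H, A, score=lambda a, b: float(a @ b)):
    n = len(H)
    alpha = np.zeros((n, n))
    for v in range(n):
        neigh = np.flatnonzero(A[v])
        scores = np.array([np.exp(score(H[v], H[u])) for u in neigh])
        alpha[v, neigh] = scores / scores.sum()   # normalised over N(v)
    return alpha

alpha = attention_coefficients(H, A)
print(alpha.sum(axis=1))                          # each row sums to 1
```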
### Graph Sample and Aggregate (GraphSAGE)

Here, instead of taking all neighbours, we take a fixed-size set of neighbours, called
the `pool`, sampled at random. This increases variance, but allows the method to be
applied to large graphs.

$$
\textcolor{orange}{h_v^{(k)}} =
\textcolor{skyblue}{f^{(k)}} \left(
\textcolor{skyblue}{W^{(k)}}
\cdot
\underbrace{
\left[
\underbrace{
\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
\textcolor{violet}{h_u^{(k-1)}}
)
}_{\text{Aggregation of v's neighbours}}
,
\textcolor{orange}{h_v^{(k-1)}}
\right]
}_{\text{Concatenation of aggr. and prev. embedding}}
\right) \quad \forall v \in V
$$

where
$\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
\textcolor{violet}{h_u^{(k-1)}}
)$ may be one of the following:

#### Mean

This is similar to what we had in GCN above:

$$
\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
\textcolor{violet}{h_u^{(k-1)}}
) = \textcolor{skyblue}{W_{pool}^{(k)}} \cdot
\frac{
\textcolor{orange}{h_v^{(k-1)}}+
\sum_{u\in \mathcal{N}(v)}\textcolor{violet}{h_u^{(k-1)}}
}{
1 + |\mathcal{N}(v)|
}
$$

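
A sketch of the mean aggregator combined with the random neighbour sampling described above (pool size, weights and graph are illustrative, and the sampling is folded into the aggregator for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
neighbours = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}   # toy adjacency
H = rng.normal(size=(4, 4))                           # h^(k-1)
W_pool = rng.normal(size=(4, 4))

def mean_aggregate(v, H, neighbours, pool_size=2):
    # Sample a fixed-size pool of neighbours instead of using all of them
    pool = rng.choice(neighbours[v], size=min(pool_size, len(neighbours[v])),
                      replace=False)
    stacked = np.vstack([H[v]] + [H[u] for u in pool])
    return stacked.mean(axis=0) @ W_pool              # row-vector convention

print(mean_aggregate(0, H, neighbours))
```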
#### Dimension Wise Maximum

$$
\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
\textcolor{violet}{h_u^{(k-1)}}
) = \max_{u \in \mathcal{N}(v)} \left\{
\sigma(
\textcolor{skyblue}{W_{pool}^{(k)}} \cdot
\textcolor{violet}{h_u^{(k-1)}} +
\textcolor{skyblue}{b}
)
\right\}
$$

#### LSTM Aggregator

This is another aggregator, based on an
[LSTM](./../8-Recurrent-Networks/INDEX.md#long-short-term-memory--lstm). This
method **requires ordering the sequence of neighbours**.

### Graph Isomorphism Network (GIN)

$$
\textcolor{orange}{h_{v}^{(k)}} =
\textcolor{skyblue}{f^{(k)}}
\left(
\sum_{u \in \mathcal{N}(v)}
\textcolor{violet}{h_{u}^{(k - 1)}} +
(
1 +
\textcolor{skyblue}{\epsilon^{(k)}}
) \cdot \textcolor{orange}{h_{v}^{(k - 1)}}
\right)
\quad \forall v \in V
$$

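
A minimal sketch of the GIN update, where a single ReLU layer stands in for the MLP $f^{(k)}$ and $\epsilon$, the weights and the graph are arbitrary:

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))           # h^(k-1)
W = rng.normal(size=(4, 4))           # parameters of f^(k) (here a single layer)
eps = 0.1                             # learnable epsilon^(k)

def gin_layer(H, A, W, eps):
    aggregated = A @ H + (1 + eps) * H      # sum of neighbours + scaled self term
    return np.maximum(aggregated @ W, 0)    # f^(k) = ReLU(linear)

print(gin_layer(H, A, W, eps).shape)        # (3, 4)
```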
[^chebnet]: [Data Warrior | Graph Convolutional Neural Network (Part II) | 9th November 2025](https://datawarrior.wordpress.com/2018/08/12/graph-convolutional-neural-network-part-ii/)