From 8a023a607963715f2317dfb02208ee3f399b31ce Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Sun, 9 Nov 2025 18:54:35 +0100
Subject: [PATCH] Revised Notes
---
Chapters/14-GNN-GCN/INDEX.md | 421 +++++++++++++++++++++++++++++------
1 file changed, 353 insertions(+), 68 deletions(-)
diff --git a/Chapters/14-GNN-GCN/INDEX.md b/Chapters/14-GNN-GCN/INDEX.md
index 04f410b..57f8a9e 100644
--- a/Chapters/14-GNN-GCN/INDEX.md
+++ b/Chapters/14-GNN-GCN/INDEX.md
@@ -4,19 +4,25 @@
- **Nodes**: Pieces of Information
- **Edges**: Relationship between nodes
- - **Mutual**
- - **One-Sided**
+ - **Mutual**
+ - **One-Sided**
- **Directionality**
- - **Directed**: We care about the order of connections
- - **Unidirectional**
- - **Bidirectional**
- - **Undirected**: We don't care about order of connections
+ - **Directed**: We care about the order of connections
+ - **Unidirectional**
+ - **Bidirectional**
+ - **Undirected**: We don't care about order of connections
Now, we can have attributes over
- **nodes**
+ - identity
+ - number of neighbours
- **edges**
+ - identity
+ - weight
- **master nodes** (a collection of nodes and edges)
+ - number of nodes
+ - longest path
for example images may be represented as a graph where each non edge pixel is a vertex connected to other 8 ones.
Its information at the vertex is a 3 (or 4) dimensional vector (think of RGB and RGBA)
@@ -50,11 +56,20 @@ We want to predict relationships between nodes such as if they share an edge, or
For this task we may start with a fully connected graph and then prune edges, as predictions go on, to come to a
sparse graph
-### Downsides of Graphs
+### Challenges of Dealing with Graphs
-- They are not consistent in their structure and sometimes representing something as a graph is difficult
-- If we don't care about order of nodes, we need to find a way to represent this **node-order equivariance**
-- Graphs may be too large
+While graphs are very powerful at representing structures in a compact and
+natural way, they come with several challenges.
+
+**The number of nodes may vary wildly from graph to graph**, making it
+difficult to work across different graphs.
+
+**Nodes often have no meaningful order**, so we need to treat different
+orderings in the same way.
+
+**Graphs can be very large**, and thus take a lot of space. However, they are
+usually sparse in nature, which makes it possible to compress their
+representation.
## Representing Graphs
@@ -70,12 +85,12 @@ We store info about:
```python
nodes: list[any] = [
- "forchetta", "spaghetti", "coltello", "cucchiao", "brodo"
+ "fork", "spaghetti", "knife", "spoon", "soup"
]
edges: list[any] = [
- "serve per mangiare", "strumento", "cibo",
- "strumento", "strumento", "serve per mangiare"
+ "needed to eat", "cutlery", "food",
+ "cutlery", "cutlery", "needed to eat"
]
adj_list: list[(int, int)] = [
@@ -83,15 +98,22 @@ adj_list: list[(int, int)] = [
(0, 3), (2, 3), (3, 4)
]
-graph: any = "tavola"
+graph: any = "dining table"
```
If we find some parts of the graph that are disconnected, we can just avoid storing and computing those parts
+> [!CAUTION]
+> Even though in this example we used single values, nodes, edges and graphs
+> may carry tensors or structured data
+
## Graph Neural Networks (GNNs)
-At the simpkest form we take a **graph-in** and **graph-out** approach with MLPs separate for
-vertices, edges and master nodes that we apply **one at a time** over each element
+In its simplest form, we take a **graph-in**, **graph-out** approach with separate
+`MLPs`
+([Multilayer Perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron),
+i.e. plain feed-forward neural networks)
+for vertices, edges and master
+nodes, applied **one at a time** over each element
$$
\begin{aligned}
@@ -101,20 +123,74 @@ $$
\end{aligned}
$$
+This also means that the output is itself a graph, and we need further processing
+to get other kinds of outputs. For example, we can apply a classifier to each node
+embedding to get a class for that node.
+
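+To make this concrete, here is a minimal sketch (assuming NumPy; the per-type
+MLPs `mlp_v`, `mlp_e`, `mlp_g` are toy stand-ins, not part of the notes) of a
+graph-in, graph-out layer that updates each element type independently:
+
+```python
+# Minimal sketch of a graph-in / graph-out layer (assumption: NumPy,
+# toy "MLPs" implemented as a single weight matrix + ReLU).
+import numpy as np
+
+def relu(x):
+    return np.maximum(x, 0.0)
+
+def make_mlp(d_in, d_out, rng):
+    W = rng.normal(size=(d_in, d_out)) * 0.1
+    return lambda x: relu(x @ W)
+
+rng = np.random.default_rng(0)
+mlp_v = make_mlp(4, 8, rng)   # applied to every vertex embedding
+mlp_e = make_mlp(3, 8, rng)   # applied to every edge embedding
+mlp_g = make_mlp(2, 8, rng)   # applied to the master-node embedding
+
+V = rng.normal(size=(5, 4))   # 5 vertices, 4 features each
+E = rng.normal(size=(6, 3))   # 6 edges, 3 features each
+g = rng.normal(size=(1, 2))   # 1 master node, 2 features
+
+# Graph in, graph out: each element type is updated by its own MLP,
+# one element at a time (here vectorized over rows).
+V_out, E_out, g_out = mlp_v(V), mlp_e(E), mlp_g(g)
+print(V_out.shape, E_out.shape, g_out.shape)  # (5, 8) (6, 8) (1, 8)
+```
+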
### Pooling
-> [!CAUTION]
-> This step comes after the embedding phase described above
+This step allows us to take info from a graph element type different from the
+one we are working on. For example, we may want info coming from
+edges, but brought over to the vertices.
-This is a step that can be used to take info about other elements, different from what we were considering
-(for example, taking info from edges while making the computation over vertices).
+```python
+# Pseudo code for pooling
+def pool(items):
-By using this approach we usually gather some info from edges of a vertex, then we concat them in a matrix and
-aggregate by summing them.
+    # Accumulator with the same shape as a single embedding
+    # (a 1xD tensor, where D is the embedding dimension)
+    pooled_value = zeros_like(items[0].value)
+
+ for item in items:
+
+ pooled_value += item.value
+
+ return pooled_value
+```
+
+However, to make use of parallel computation, we can also do the following:
+
+```python
+# Parallel pseudo code
+def parallel_pool(items):
+
+ # NxD matrix
+ # Each row of this matrix is an embedding
+ item_embedding_matrix = items.concat()
+
+ # Sum over rows
+ return sum(item_embedding_matrix, dim=0)
+```
+
+At its core, this is useful when we lack features for a portion of the graph,
+for example edges or the hypernode, and we use data coming from other parts to
+enable computations there.
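+
+As a concrete illustration of the parallel version, a small sketch assuming
+NumPy and three edge embeddings incident to the same vertex:
+
+```python
+# Sketch of pooling edge embeddings onto a vertex (assumption: NumPy).
+import numpy as np
+
+# Three edge embeddings incident to the same vertex, dimension D = 4
+edge_embeddings = [np.array([1.0, 0.0, 2.0, 1.0]),
+                   np.array([0.5, 1.0, 0.0, 3.0]),
+                   np.array([2.0, 1.0, 1.0, 0.0])]
+
+# Concatenate into an N x D matrix, then sum over rows (dim 0)
+item_embedding_matrix = np.stack(edge_embeddings)     # shape (3, 4)
+pooled_value = item_embedding_matrix.sum(axis=0)      # shape (4,)
+print(pooled_value)  # [3.5 2.  3.  4. ]
+```
+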
### Message Passing
-Take all node embeddings that are in the neighbouroud and do similar steps as the pooling function.
+However, if we already have some info, we can use [`pooling`](#pooling) to augment
+it, taking into account the connectivity of the graph by pooling **only
+adjacent info of the same type**.
+
+This means that at each step, a node receives info from its neighbours in the
+same fashion. After step $k$, our node will have received partial information
+from nodes located up to $k$ steps away
+
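+A minimal sketch of one message-passing step, assuming NumPy, sum aggregation
+and an adjacency list like the one used earlier:
+
+```python
+# One message-passing step: each node sums its neighbours' embeddings
+# and combines them with its own (assumption: NumPy, sum aggregation).
+import numpy as np
+
+adj_list = [(0, 1), (0, 2), (1, 2)]                # undirected edges
+h = np.random.default_rng(0).normal(size=(3, 4))   # 3 nodes, dim 4
+
+def message_passing_step(h, adj_list):
+    messages = np.zeros_like(h)
+    for u, v in adj_list:
+        # each edge delivers the neighbour's embedding in both directions
+        messages[u] += h[v]
+        messages[v] += h[u]
+    # combine pooled messages with the node's previous embedding
+    return h + messages
+
+h_new = message_passing_step(h, adj_list)
+print(h_new.shape)  # (3, 4)
+```
+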
+### Weaving
+
+If we combine the previous techniques, we can merge info coming from
+many parts of the graph. However, if the embeddings do not share the same
+dimension, we need a linear layer to project them to matching dimensions.
+
+As we are going to use a linear layer, this means that we are using a **learned**
+representation, **which must be trained**.
+
+> [!CAUTION]
+> Because graphs may be very sparsely connected and the longest path may exceed
+> the number of layers, to bypass this we weave info through the hypernode as well.
+>
+> The graph node is used as if it were a node connected to all nodes, giving
+> each node an overview of the information of every other node
### Special Layers
@@ -134,68 +210,152 @@ D_{v,v} = \sum_{u} A_{v,u}
$$
-In other words, $D_{v, v}$ is the number of nodes connected ot that one
+In other words, $D_{v, v}$ is the number of nodes connected to node $v$. In
+fact, notice that we are summing over row $v$ of the **Adjacency Matrix** $A$.
-The **graph Laplacian** of the graph will be
+The [**Graph Laplacian**](https://en.wikipedia.org/wiki/Laplacian_matrix)
+$L$ of the graph will be
$$
L = D - A
$$
-### Polynomials of Laplacian
+As we can see, $L$ keeps all the elements of $D$ untouched, as each node has no
+connection to itself in the adjacency matrix, and contains all the elements of
+the adjacency matrix with opposite sign (so they are all non-positive)
-These polynomials, which have the same dimensions of $L$, can be though as being **filter** like in
-[CNNs](./../7-Convolutional-Networks/INDEX.md#convolutional-networks)
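+
+A small sketch, assuming NumPy, that builds $A$, $D$ and $L$ for a three-node
+graph (vertex 0 connected to vertices 1 and 2, the same toy graph used in the
+example below):
+
+```python
+# Building the adjacency matrix, degree matrix and graph Laplacian
+# for a small undirected graph (assumption: NumPy).
+import numpy as np
+
+A = np.array([[0, 1, 1],
+              [1, 0, 0],
+              [1, 0, 0]], dtype=float)   # vertex 0 <-> 1, vertex 0 <-> 2
+
+D = np.diag(A.sum(axis=1))               # D[v, v] = degree of v
+L = D - A                                # graph Laplacian
+print(L)
+# [[ 2. -1. -1.]
+#  [-1.  1.  0.]
+#  [-1.  0.  1.]]
+```
+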
+### Convolution through Polynomials of the Laplacian
+
+Let's construct some polynomials by using the Laplacian Matrix:
$$
p_{\vec{w}}(L) = w_{0}I_{n} + w_{1}L^{1} + \dots + w_{d}L^{d} = \sum_{i=0}^{d} w_{i}L^{i}
$$
-We then can get a ***filtered node*** by simply multiplying the polynomial with the node value
+Here $\vec{w} = [w_0, w_1, \dots, w_d]$ is the vector of the $d + 1$ polynomial
+coefficients. We can then define the convolution as $p_{\vec{w}}(L) \times \vec{x}$,
+where **$\vec{x}$ is the matrix of all stacked vertex embeddings**, one row per
+vertex
$$
\begin{aligned}
- \vec{x}' = p_{\vec{w}}(L) \vec{x}
+    \vec{x} &\in \R^{n\times d}\\
+ \vec{x}' &= p_{\vec{w}}(L) \vec{x}
\end{aligned}
$$
-> [!NOTE]
-> In order to extract new features for a single vertex, supposing only $w_1 \neq 0$
->
-> Observe that we are only taking $L_{v}$
+To explain the ***convolutional effect***, let's see what happens for a polynomial
+with weights $\vec{w}_0 = [w_0, 0, \dots, 0]$:
+
+$$
+\vec{x}' = p_{\vec{w}_0}(L)\,\vec{x} = w_0 I_{n} \vec{x}
+$$
+
+Here each row of $\vec{x}'$ is just a scaled version of itself. Now consider a
+polynomial with weights $\vec{w}_1 = [0, w_1, 0, \dots, 0]$:
+
+$$
+\vec{x}' = p_{\vec{w}_1}(L)\,\vec{x} = w_1 L \vec{x}
+$$
+
+Here each vertex combines the info coming from its connected vertices with a
+contribution from itself (scaled by its number of connections).
+
+> [!TIP]
+> In case this was not clear from the formulas, let's work through a practical
+> example, where vertex 0 is connected to both vertices 1 and 2:
>
> $$
> \begin{aligned}
-> \vec{x}'_{v} &= (L\vec{x})_{v} \\
-> &= \sum_{u \in G} L_{v,u} \vec{x}_{u} \\
-> &= \sum_{u \in G} (D_{v,u} - A_{v,u}) \vec{x}_{u} \\
-> &= \sum_{u \in G} D_{v,u} \vec{x}_{u} - A_{v,u} \vec{x}_{u} \\
-> &= D_{v, v} \vec{x}_{v} - \sum_{u \in \mathcal{N}(v)} \vec{x}_{u}
+> \vec{x} &= \begin{bmatrix}
+> \vec{x}_0 \in \R^{1 \times d} \\
+> \vec{x}_1 \in \R^{1 \times d} \\
+> \vec{x}_2 \in \R^{1 \times d} \\
+> \end{bmatrix}
+> \\
+> L &= \begin{bmatrix}
+> 2 & -1 & -1 \\
+> -1 & 1 & 0 \\
+> -1 & 0 & 1
+> \end{bmatrix}
+> \\
+> \vec{x}' &= w_1 L \vec{x} = \begin{bmatrix}
+> 2w_1 & -w_1 & -w_1 \\
+> -w_1 & w_1 & 0 \\
+> -w_1 & 0 & w_1
+> \end{bmatrix}
+> \times
+> \begin{bmatrix}
+> \vec{x}_0 \\
+> \vec{x}_1 \\
+> \vec{x}_2
+> \end{bmatrix}
+> \\
+> &= \begin{bmatrix}
+> 2w_1\vec{x}_0 - w_1\vec{x}_1 - w_1 \vec{x}_2 \in \R^{1 \times d} \\
+> -w_1\vec{x}_0 + w_1\vec{x}_1 + 0\vec{x}_2 \in \R^{1 \times d} \\
+> -w_1\vec{x}_0 + 0\vec{x}_1 + w_1\vec{x}_2 \in \R^{1 \times d} \\
+> \end{bmatrix}
> \end{aligned}
> $$
>
-> Where the last step holds as $D$ is a diagonal matrix, and in the summatory we are only considering the neighbours
-> of v
+> For simplicity we wrote the vertices as vectors rather than decomposing them
+> into their elements. As we can see, each row combines a vertex with its
+> adjacent vertices.
+
+Now let's see what happens when we raise the Laplacian Matrix to the power of 2:
+
+$$
+\begin{aligned}
+ L &= \begin{bmatrix}
+ l_{0, 0} & l_{0, 1} & l_{0, 2} \\
+ l_{1, 0} & l_{1, 1} & l_{1, 2} \\
+ l_{2, 0} & l_{2, 1} & l_{2, 2} \\
+ \end{bmatrix}
+ \\
+ L \times L &=
+ \begin{bmatrix}
+ l_{0, 0} & l_{0, 1} & l_{0, 2} \\
+ l_{1, 0} & l_{1, 1} & l_{1, 2} \\
+ l_{2, 0} & l_{2, 1} & l_{2, 2} \\
+ \end{bmatrix}
+ \times
+ \begin{bmatrix}
+ l_{0, 0} & l_{0, 1} & l_{0, 2} \\
+ l_{1, 0} & l_{1, 1} & l_{1, 2} \\
+ l_{2, 0} & l_{2, 1} & l_{2, 2} \\
+ \end{bmatrix} = \\
+ &= \begin{bmatrix}
+ l_{0, 0}^2 + l_{0, 1}l_{1, 0} + l_{0, 2}l_{2, 0} &
+ l_{0, 0}l_{0, 1} + l_{0, 1}l_{1, 1} + l_{0, 2}l_{2, 1} &
+ l_{0, 0}l_{0, 2} + l_{0, 1}l_{1, 2} + l_{0, 2}l_{2, 2} \\
+ l_{1, 0}l_{0, 0} + l_{1, 1}l_{1, 0} + l_{1, 2} l_{2, 0} &
+ l_{1, 0}l_{0, 1} + l_{1, 1}^2 + l_{1, 2}l_{2, 1} &
+ l_{1, 0}l_{0, 2} + l_{1, 1}l_{1, 2} + l_{1, 2}l_{2, 2} \\
+ l_{2, 0}l_{0, 0} + l_{2, 1}l_{1, 0}+ l_{2, 2} l_{2, 0} &
+ l_{2, 0}l_{0, 1} + l_{2, 1}l_{1, 1} + l_{2, 2}l_{2, 1} &
+ l_{2, 0}l_{0, 2} + l_{2, 1}l_{1, 2} + l_{2, 2}^2
+ \end{bmatrix}
+\end{aligned}
+$$
+
+As we can see, information from neighbouring connections leaks into the powers of
+the Laplacian. More precisely, it's possible to demonstrate that if the distance
+between two nodes is greater than the power of the Laplacian Matrix, the
+corresponding entry is zero: $dist_{G}(v, u) > i \rightarrow L_{v, u}^{i} = 0$.
+
+This means that by manipulating the degree of the polynomial, we can choose the
+diffusion distance for information. For example, if we want info from vertices
+at most 3 steps away, we will use a polynomial of 3rd degree.
+
+Another interesting property of these polynomials is that they do not depend on
+the order of the nodes, but only on which connections exist, so they are
+node-order equivariant.
+
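+A quick sketch, assuming NumPy and a path graph 0–1–2–3, showing both properties:
+entries of $L^i$ vanish for node pairs farther than $i$ hops, and a degree-2
+polynomial filter never mixes embeddings of nodes more than 2 hops apart:
+
+```python
+# The power of the Laplacian controls how far information can travel
+# (assumption: NumPy; path graph 0 - 1 - 2 - 3).
+import numpy as np
+
+A = np.array([[0, 1, 0, 0],
+              [1, 0, 1, 0],
+              [0, 1, 0, 1],
+              [0, 0, 1, 0]], dtype=float)
+L = np.diag(A.sum(axis=1)) - A
+
+# dist(0, 3) = 3, so the (0, 3) entry is zero for powers below 3
+print(np.linalg.matrix_power(L, 2)[0, 3])   # 0.0
+print(np.linalg.matrix_power(L, 3)[0, 3])   # non-zero (-1.0)
+
+# Polynomial filter of degree 2: node 0 never sees node 3's embedding
+w = [0.5, 0.3, 0.2]                          # w_0, w_1, w_2
+p_L = sum(w_i * np.linalg.matrix_power(L, i) for i, w_i in enumerate(w))
+x = np.eye(4)                                # one-hot "embeddings" per node
+x_filtered = p_L @ x
+print(x_filtered[0, 3])                      # 0.0: no info from node 3
+```
+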
+> [!TIP]
+> Go [here](./python-experiments/laplacian_graph.ipynb) to see some experiments.
>
-> It can be demonstrated that in any graph
->
-> $$
-> dist_{G}(v, u) > i \rightarrow L_{v, u}^{i} = 0
-> $$
->
-> More in general it holds
->
-> $$
-> \begin{aligned}
-> \vec{x}'_{v} = (p_{\vec{w}}(L)\vec{x})_{v} &= (p_{\vec{w}}(L))_{v} \vec{x} \\
-> &= \sum_{i = 0}^{d} w_{i}L_{v}^{i} \vec{x} \\
-> &= \sum_{i = 0}^{d} w_{i} \sum_{u \in G} L_{v,u}^{i}\vec{x}_{u} \\
-> &= \sum_{i = 0}^{d} w_{i} \sum_{\substack{u \in G \\ dist_{G}(v, u) \leq i}} L_{v,u}^{i}\vec{x}_{u} \\
-> \end{aligned}
-> $$
->
-> So this shows that the degree of the polynomial decides the max number of hops
-> to be included during the filtering stage, like if it were defining a [kernel](./../7-Convolutional-Networks/INDEX.md#filters)
+> In particular, see the second one
### ChebNet
@@ -210,23 +370,65 @@ T_{i} &= cos(i\theta) \\
$$
- $T_{i}$ is Chebischev first kind polynomial
+- $\theta$ are the weights to be learnt[^chebnet]
- $\tilde{L}$ is a reduced version of $L$ because we divide for its max eigenvalue,
keeping it in range $[-1, 1]$. Moreover $L$ ha no negative eigenvalues, so it's
positive semi-definite
These polynomials are more stable as they do not explode with higher powers
+> [!NOTE]
+>
+> Even though we said that we could control the radius from which neighbours leak
+> info, technically speaking, if we stack $N$ GCN layers together, we get info from
+> nodes up to $N \cdot hops$ away, assuming each layer reaches at most $hops$ steps.
+>
+> So, if we have 2 layers each taking from distance 3, after the computation a
+> node gets info from nodes up to 6 hops away.
+
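+A sketch of how the Chebyshev basis of the scaled Laplacian can be computed,
+assuming NumPy and the standard three-term recurrence for first-kind Chebyshev
+polynomials (equivalent to the $\cos(i\theta)$ definition above); the scaling
+convention $2L/\lambda_{max} - I$ is an assumption here:
+
+```python
+# Sketch: Chebyshev basis of the scaled Laplacian via the recurrence
+# T_0 = I, T_1 = L~, T_k = 2 L~ T_{k-1} - T_{k-2} (assumption: NumPy).
+import numpy as np
+
+def chebyshev_basis(L, K):
+    # common ChebNet scaling (2L / lambda_max - I), eigenvalues in [-1, 1]
+    lam_max = np.linalg.eigvalsh(L).max()
+    L_tilde = (2.0 / lam_max) * L - np.eye(L.shape[0])
+    T = [np.eye(L.shape[0]), L_tilde]
+    for _ in range(2, K + 1):
+        T.append(2 * L_tilde @ T[-1] - T[-2])
+    return T[: K + 1]
+
+A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
+L = np.diag(A.sum(axis=1)) - A
+basis = chebyshev_basis(L, K=2)          # T_0, T_1, T_2 of the scaled L
+print([T.shape for T in basis])          # [(3, 3), (3, 3), (3, 3)]
+```
+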
### Embedding Computation
+Whenever we compute over nodes, we first need to compute their embeddings.
+From there on, these embeddings are treated as the inputs of our networks.
-## Other methods
+$$
+\begin{aligned}
+    g(x) &\coloneqq \text{Any non-linear function} \\
+ \vec{x}' &= g\left(p_{\vec{w}}(L) \times \vec{x}\right)
+\end{aligned}
+$$
-- Learnable parameters
+Within each layer, the same weights are applied to every node, like in
+[CNNs](./../7-Convolutional-Networks/INDEX.md#convolutional-layer)
+
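+A minimal sketch of one such layer, assuming NumPy, with ReLU playing the role
+of the non-linear $g$ and the polynomial filter from the previous section:
+
+```python
+# One polynomial-filter layer: x' = g(p_w(L) @ x), with g = ReLU
+# (assumption: NumPy; the same weights w are reused for every node).
+import numpy as np
+
+def poly_filter_layer(L, x, w):
+    p_L = sum(w_i * np.linalg.matrix_power(L, i) for i, w_i in enumerate(w))
+    return np.maximum(p_L @ x, 0.0)          # g = ReLU
+
+A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
+L = np.diag(A.sum(axis=1)) - A
+x = np.random.default_rng(0).normal(size=(3, 4))   # 3 nodes, dim 4
+
+w = [0.5, 0.1]                                      # degree-1 polynomial
+x_new = poly_filter_layer(L, x, w)
+print(x_new.shape)                                  # (3, 4)
+```
+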
+## Real Graph Network Configurations
+
+To sum up what we have seen so far: to compute over graphs we usually
+gather info from the neighbours (**Node Aggregation**), then transform the old
+info of our node and combine the two (**Node Update**), typically using:
+
+- (Potentially) Learnable parameters
- Embeddings of node v
- Embeddings of neighbours of v
-### Graph Convolutional Networks
+### Graph Neural Network
+
+$$
+\textcolor{orange}{h_v^{(k)}} =
+\textcolor{skyblue}{f_2^{(k)}} \left(
+    \underbrace{
+        \sum_{n\in\mathcal{N}(v)} \textcolor{skyblue}{f_1^{(k)}}(
+            \textcolor{orange}{h_n^{(k-1)}},
+            \textcolor{orange}{h_v^{(k-1)}},
+            \textcolor{orange}{e_{v, n}}
+        )
+    }_{\text{message passing from edges to nodes}},\,\,
+    \textcolor{orange}{h_v^{(k -1)}}
+\right) \forall v \in V
+$$
+
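+A minimal sketch of this update, assuming NumPy; $f_1$ and $f_2$ are toy linear
+stand-ins just to show the data flow:
+
+```python
+# Sketch of the generic GNN update: messages f_1(h_n, h_v, e_vn) are summed
+# over the neighbourhood, then f_2 combines them with the old embedding
+# (assumption: NumPy; f_1 and f_2 are toy stand-ins).
+import numpy as np
+
+rng = np.random.default_rng(0)
+d = 4
+W1 = rng.normal(size=(3 * d, d)) * 0.1      # f_1: concat(h_n, h_v, e) -> d
+W2 = rng.normal(size=(2 * d, d)) * 0.1      # f_2: concat(msg, h_v)    -> d
+
+def f1(h_n, h_v, e):
+    return np.concatenate([h_n, h_v, e]) @ W1
+
+def f2(msg, h_v):
+    return np.tanh(np.concatenate([msg, h_v]) @ W2)
+
+def gnn_update(v, h, edge_feat, neighbours):
+    msg = sum(f1(h[n], h[v], edge_feat[(v, n)]) for n in neighbours[v])
+    return f2(msg, h[v])
+
+h = {0: rng.normal(size=d), 1: rng.normal(size=d), 2: rng.normal(size=d)}
+edge_feat = {(0, 1): rng.normal(size=d), (0, 2): rng.normal(size=d)}
+neighbours = {0: [1, 2]}
+print(gnn_update(0, h, edge_feat, neighbours).shape)   # (4,)
+```
+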
+### Graph Convolutional Network
$$
\textcolor{orange}{h_{v}^{(k)}} =
@@ -241,6 +443,11 @@ $$
\right) \forall v \in V
$$
+> [!TIP]
+>
+> $\textcolor{skyblue}{f^{(k)}}$ here is a
+> **Non-Linear Activation function**
+
### Graph Attention Networks
$$
@@ -256,25 +463,101 @@ $$
\right] \right) \forall v \in V
$$
-where
+where $\alpha^{(k)}_{v,u}$ are weights generated by an attention mechanism
+$\textcolor{skyblue}{A^{(k)}}$, normalized so that the $\alpha^{(k)}_{v,u}$ sum
+to 1 over the neighbourhood of each node.
$$
\alpha^{(k)}_{v,u} = \frac{
\textcolor{skyblue}{A^{(k)}}(
\textcolor{orange}{h_{v}^{(k)}},
- \textcolor{violet}{h_{u}^{(k)}},
+ \textcolor{violet}{h_{u}^{(k)}}
)
}{
\sum_{w \in \mathcal{N}(v)} \textcolor{skyblue}{A^{(k)}}(
\textcolor{orange}{h_{v}^{(k)}},
- \textcolor{violet}{h_{w}^{(k)}},
+ \textcolor{violet}{h_{w}^{(k)}}
)
} \forall (v, u) \in E
$$
+> [!TIP]
+> To make computations faster, we can first compute all the
+> $\textcolor{skyblue}{A^{(k)}}(
+> \textcolor{orange}{h_{v}^{(k)}},
+> \textcolor{violet}{h_{u}^{(k)}}
+> )$ values and then normalize them (dividing by their sum) at a later stage to
+> compute $\alpha^{(k)}_{v,u}$
+>
+> Also, this system (GAT) is flexible enough to let us swap the attention
+> mechanism and include more heads, as the formulation above uses a single
+> attention head
+
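+A sketch, assuming NumPy, of computing the normalized coefficients
+$\alpha^{(k)}_{v,u}$ for a single node with a single head; the scoring function
+here is a toy stand-in for $\textcolor{skyblue}{A^{(k)}}$:
+
+```python
+# Single-head attention coefficients for node 0 over its neighbourhood
+# (assumption: NumPy; the score function is a toy stand-in for A^(k)).
+import numpy as np
+
+rng = np.random.default_rng(0)
+d = 4
+a = rng.normal(size=2 * d)                      # attention parameters
+
+def attention_score(h_v, h_u):
+    # un-normalized score A(h_v, h_u): LeakyReLU of a dot product, then exp
+    s = a @ np.concatenate([h_v, h_u])
+    return float(np.exp(np.where(s > 0, s, 0.2 * s)))
+
+h = rng.normal(size=(3, d))                     # embeddings of nodes 0, 1, 2
+neighbours = {0: [1, 2]}
+
+scores = {u: attention_score(h[0], h[u]) for u in neighbours[0]}
+total = sum(scores.values())
+alpha = {u: s / total for u, s in scores.items()}   # sums to 1 over N(0)
+print(alpha)
+```
+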
### Graph Sample and Aggregate (GraphSAGE)
-
+Here, instead of taking all neighbours, we take a fixed-size subset of
+neighbours, called the `pool`, sampled at random. This increases variance but
+allows the method to be applied to large graphs
+
+$$
+\textcolor{orange}{h_v^{(k)}} =
+\textcolor{skyblue}{f^{(k)}} \left(
+ \textcolor{skyblue}{W^{(k)}}
+ \cdot
+ \underbrace{
+ \left[
+ \underbrace{
+ \textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+ \textcolor{violet}{h_u^{(k-1)}}
+ )
+ }_{\text{Aggregation of v's neighbours}}
+ ,
+ \textcolor{orange}{h_v^{(k-1)}}
+ \right]
+ }_{\text{Concatenation of aggr. and prev. embedding}}
+
+\right) \forall v \in V
+$$
+
+where
+$\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+ \textcolor{violet}{h_u^{(k-1)}}
+)$ may be one of the following:
+
+#### Mean
+
+This is similar to what we had in the GCN above
+$$
+\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+ \textcolor{violet}{h_u^{(k-1)}}
+) = \textcolor{skyblue}{W_{pool}^{(k)}} \cdot
+\frac{
+ \textcolor{orange}{h_v^{(k-1)}}+
+ \sum_{u\in \mathcal{N}(v)}\textcolor{violet}{h_u^{(k-1)}}
+
+}{
+ 1 + |\mathcal{N(v)}|
+}
+$$
+
+#### Dimension Wise Maximum
+
+$$
+\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+ \textcolor{violet}{h_u^{(k-1)}}
+) = \max_{u \in \mathcal{N}(v)} \left\{
+ \sigma(
+ \textcolor{skyblue}{W_{pool}^{(k)}} \cdot
+ \textcolor{violet}{h_u^{(k-1)}} +
+ \textcolor{skyblue}{b}
+ )
+\right\}
+$$
+
+#### LSTM Aggregator
+
+This aggregator is another network, based on an
+[LSTM](./../8-Recurrent-Networks/INDEX.md#long-short-term-memory--lstm). This
+method **requires imposing an ordering on the sequence of neighbours**
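+
+A sketch of the sampling step followed by the mean aggregator, assuming NumPy;
+`W_pool` and the sample size are toy choices:
+
+```python
+# GraphSAGE-style step: sample a fixed number of neighbours, then apply
+# the mean aggregator (assumption: NumPy; W_pool is a toy weight matrix).
+import numpy as np
+
+rng = np.random.default_rng(0)
+d = 4
+W_pool = rng.normal(size=(d, d)) * 0.1
+
+def sample_neighbours(neighbours, pool_size):
+    # fixed-size random sample (with replacement if the node has few neighbours)
+    return rng.choice(neighbours, size=pool_size,
+                      replace=len(neighbours) < pool_size)
+
+def mean_aggregate(h_v, h_neigh):
+    stacked = np.vstack([h_v[None, :], h_neigh])        # (1 + |N(v)|, d)
+    return stacked.mean(axis=0) @ W_pool                # mean, then W_pool
+
+h = rng.normal(size=(5, d))                             # 5 node embeddings
+neigh_of_0 = [1, 2, 3, 4]
+sampled = sample_neighbours(neigh_of_0, pool_size=2)
+agg = mean_aggregate(h[0], h[sampled])
+print(sampled, agg.shape)                               # e.g. [3 1] (4,)
+```
+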
### Graph Isomorphism Network (GIN)
@@ -290,4 +573,6 @@ $$
) \cdot \textcolor{orange}{h_{v}^{(k - 1)}}
\right)
\forall v \in V
-$$
\ No newline at end of file
+$$
+
+[^chebnet]: [Data Warrior | Graph Convolutional Neural Network (Part II) | 9th November 2025](https://datawarrior.wordpress.com/2018/08/12/graph-convolutional-neural-network-part-ii/)