From 8a023a607963715f2317dfb02208ee3f399b31ce Mon Sep 17 00:00:00 2001
From: Christian Risi <75698846+CnF-Gris@users.noreply.github.com>
Date: Sun, 9 Nov 2025 18:54:35 +0100
Subject: [PATCH] Revised Notes
---
Chapters/14-GNN-GCN/INDEX.md | 421 +++++++++++++++++++++++++++++------
1 file changed, 353 insertions(+), 68 deletions(-)
diff --git a/Chapters/14-GNN-GCN/INDEX.md b/Chapters/14-GNN-GCN/INDEX.md
index 04f410b..57f8a9e 100644
--- a/Chapters/14-GNN-GCN/INDEX.md
+++ b/Chapters/14-GNN-GCN/INDEX.md
@@ -4,19 +4,25 @@
- **Nodes**: Pieces of Information
- **Edges**: Relationship between nodes
- - **Mutual**
- - **One-Sided**
+ - **Mutual**
+ - **One-Sided**
- **Directionality**
- - **Directed**: We care about the order of connections
- - **Unidirectional**
- - **Bidirectional**
- - **Undirected**: We don't care about order of connections
+ - **Directed**: We care about the order of connections
+ - **Unidirectional**
+ - **Bidirectional**
+ - **Undirected**: We don't care about order of connections
Now, we can have attributes over
- **nodes**
+ - identity
+ - number of neighbours
- **edges**
+ - identity
+ - weight
- **master nodes** (a collection of nodes and edges)
+ - number of nodes
+ - longest path
for example images may be represented as a graph where each non edge pixel is a vertex connected to other 8 ones.
Its information at the vertex is a 3 (or 4) dimensional vector (think of RGB and RGBA)
@@ -50,11 +56,20 @@ We want to predict relationships between nodes such as if they share an edge, or
For this task we may start with a fully connected graph and then prune edges, as predictions go on, to come to a
sparse graph
-### Downsides of Graphs
+### Challenges of Dealing with Graphs
-- They are not consistent in their structure and sometimes representing something as a graph is difficult
-- If we don't care about order of nodes, we need to find a way to represent this **node-order equivariance**
-- Graphs may be too large
+While graphs are very powerful at representing structures in a compact and
+natural way, they come with several challenges.
+
+**The number of nodes may vary wildly from graph to graph**, making it
+difficult to work across different graphs.
+
+**Nodes often have no meaningful order**, so we need to treat different
+orderings in the same way.
+
+**Graphs can be very large**, and thus take a lot of space. However, they are
+usually sparse in nature, which makes it possible to compress their
+representation.
## Representing Graphs
@@ -70,12 +85,12 @@ We store info about:
```python
nodes: list[any] = [
- "forchetta", "spaghetti", "coltello", "cucchiao", "brodo"
+ "fork", "spaghetti", "knife", "spoon", "soup"
]
edges: list[any] = [
- "serve per mangiare", "strumento", "cibo",
- "strumento", "strumento", "serve per mangiare"
+ "needed to eat", "cutlery", "food",
+ "cutlery", "cutlery", "needed to eat"
]
adj_list: list[(int, int)] = [
@@ -83,15 +98,22 @@ adj_list: list[(int, int)] = [
(0, 3), (2, 3), (3, 4)
]
-graph: any = "tavola"
+graph: any = "dining table"
```
If we find some parts of the graph that are disconnected, we can just avoid storing and computing those parts
+> [!CAUTION]
+> Even though in this example we used single values, nodes, edges and graphs
+> may carry tensors or structured data
+
## Graph Neural Networks (GNNs)
-At the simpkest form we take a **graph-in** and **graph-out** approach with MLPs separate for
-vertices, edges and master nodes that we apply **one at a time** over each element
+In its simplest form, we take a **graph-in**, **graph-out** approach with separate
+`MLPs`
+([Multilayer Perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron),
+i.e. plain feed-forward neural networks)
+for vertices, edges and master
+nodes, applied **one at a time** over each element
$$
\begin{aligned}
@@ -101,20 +123,74 @@ $$
\end{aligned}
$$
+This also means that the output is itself a graph, and we need further processing
+to get other kinds of outputs. For example, we can apply a classifier to each node
+embedding to get a class for that node.
+
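+To make this concrete, here is a minimal sketch (assuming NumPy; the per-type
+MLPs `mlp_v`, `mlp_e`, `mlp_g` are toy stand-ins, not part of the notes) of a
+graph-in, graph-out layer that updates each element type independently:
+
+```python
+# Minimal sketch of a graph-in / graph-out layer (assumption: NumPy,
+# toy "MLPs" implemented as a single weight matrix + ReLU).
+import numpy as np
+
+def relu(x):
+    return np.maximum(x, 0.0)
+
+def make_mlp(d_in, d_out, rng):
+    W = rng.normal(size=(d_in, d_out)) * 0.1
+    return lambda x: relu(x @ W)
+
+rng = np.random.default_rng(0)
+mlp_v = make_mlp(4, 8, rng)   # applied to every vertex embedding
+mlp_e = make_mlp(3, 8, rng)   # applied to every edge embedding
+mlp_g = make_mlp(2, 8, rng)   # applied to the master-node embedding
+
+V = rng.normal(size=(5, 4))   # 5 vertices, 4 features each
+E = rng.normal(size=(6, 3))   # 6 edges, 3 features each
+g = rng.normal(size=(1, 2))   # 1 master node, 2 features
+
+# Graph in, graph out: each element type is updated by its own MLP,
+# one element at a time (here vectorized over rows).
+V_out, E_out, g_out = mlp_v(V), mlp_e(E), mlp_g(g)
+print(V_out.shape, E_out.shape, g_out.shape)  # (5, 8) (6, 8) (1, 8)
+```
+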
### Pooling
-> [!CAUTION]
-> This step comes after the embedding phase described above
+This step allows us to take info from a graph element type different from the
+one we are working on. For example, we may want info coming from
+edges, but brought over to the vertices.
-This is a step that can be used to take info about other elements, different from what we were considering
-(for example, taking info from edges while making the computation over vertices).
+```python
+# Pseudo code for pooling
+def pool(items):
-By using this approach we usually gather some info from edges of a vertex, then we concat them in a matrix and
-aggregate by summing them.
+    # Accumulator with the same shape as a single embedding
+    # (a 1xD tensor, where D is the embedding dimension)
+    pooled_value = zeros_like(items[0].value)
+
+ for item in items:
+
+ pooled_value += item.value
+
+ return pooled_value
+```
+
+However, to make use of parallel computation, we can also do the following:
+
+```python
+# Parallel pseudo code
+def parallel_pool(items):
+
+ # NxD matrix
+ # Each row of this matrix is an embedding
+ item_embedding_matrix = items.concat()
+
+ # Sum over rows
+ return sum(item_embedding_matrix, dim=0)
+```
+
+At its core, this is useful when we lack features for a portion of the graph,
+for example edges or the hypernode, and we use data coming from other parts to
+enable computations there.
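+
+As a concrete illustration of the parallel version, a small sketch assuming
+NumPy and three edge embeddings incident to the same vertex:
+
+```python
+# Sketch of pooling edge embeddings onto a vertex (assumption: NumPy).
+import numpy as np
+
+# Three edge embeddings incident to the same vertex, dimension D = 4
+edge_embeddings = [np.array([1.0, 0.0, 2.0, 1.0]),
+                   np.array([0.5, 1.0, 0.0, 3.0]),
+                   np.array([2.0, 1.0, 1.0, 0.0])]
+
+# Concatenate into an N x D matrix, then sum over rows (dim 0)
+item_embedding_matrix = np.stack(edge_embeddings)     # shape (3, 4)
+pooled_value = item_embedding_matrix.sum(axis=0)      # shape (4,)
+print(pooled_value)  # [3.5 2.  3.  4. ]
+```
+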
### Message Passing
-Take all node embeddings that are in the neighbouroud and do similar steps as the pooling function.
+However, if we already have some info, we can use [`pooling`](#pooling) to augment
+it, taking into account the connectivity of the graph by pooling **only
+adjacent info of the same type**.
+
+This means that at each step, a node receives info from its neighbours in the
+same fashion. After step $k$, our node will have received partial information
+from nodes located up to $k$ steps away
+
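+A minimal sketch of one message-passing step, assuming NumPy, sum aggregation
+and an adjacency list like the one used earlier:
+
+```python
+# One message-passing step: each node sums its neighbours' embeddings
+# and combines them with its own (assumption: NumPy, sum aggregation).
+import numpy as np
+
+adj_list = [(0, 1), (0, 2), (1, 2)]                # undirected edges
+h = np.random.default_rng(0).normal(size=(3, 4))   # 3 nodes, dim 4
+
+def message_passing_step(h, adj_list):
+    messages = np.zeros_like(h)
+    for u, v in adj_list:
+        # each edge delivers the neighbour's embedding in both directions
+        messages[u] += h[v]
+        messages[v] += h[u]
+    # combine pooled messages with the node's previous embedding
+    return h + messages
+
+h_new = message_passing_step(h, adj_list)
+print(h_new.shape)  # (3, 4)
+```
+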
+### Weaving
+
+If we combine the previous techniques, we can merge info coming from
+many parts of the graph. However, if the embeddings do not share the same
+dimension, we need a linear layer to project them to matching dimensions.
+
+As we are going to use a linear layer, this means that we are using a **learned**
+representation, **which must be trained**.
+
+> [!CAUTION]
+> Because graphs may be very sparsely connected and the longest path may exceed
+> the number of layers, to bypass this we weave info through the hypernode as well.
+>
+> The graph node is used as if it were a node connected to all nodes, giving
+> each node an overview of the information of every other node
### Special Layers
@@ -134,68 +210,152 @@ D_{v,v} = \sum_{u} A_{v,u}
$$
-In other words, $D_{v, v}$ is the number of nodes connected ot that one
+In other words, $D_{v, v}$ is the number of nodes connected to node $v$. In
+fact, notice that we are summing over row $v$ of the **Adjacency Matrix** $A$.
-The **graph Laplacian** of the graph will be
+The [**Graph Laplacian**](https://en.wikipedia.org/wiki/Laplacian_matrix)
+$L$ of the graph will be
$$
L = D - A
$$
-### Polynomials of Laplacian
+As we can see, $L$ keeps all the elements of $D$ untouched, as each node has no
+connection to itself in the adjacency matrix, and contains all the elements of
+the adjacency matrix with opposite sign (so they are all non-positive)
-These polynomials, which have the same dimensions of $L$, can be though as being **filter** like in
-[CNNs](./../7-Convolutional-Networks/INDEX.md#convolutional-networks)
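+
+A small sketch, assuming NumPy, that builds $A$, $D$ and $L$ for a three-node
+graph (vertex 0 connected to vertices 1 and 2, the same toy graph used in the
+example below):
+
+```python
+# Building the adjacency matrix, degree matrix and graph Laplacian
+# for a small undirected graph (assumption: NumPy).
+import numpy as np
+
+A = np.array([[0, 1, 1],
+              [1, 0, 0],
+              [1, 0, 0]], dtype=float)   # vertex 0 <-> 1, vertex 0 <-> 2
+
+D = np.diag(A.sum(axis=1))               # D[v, v] = degree of v
+L = D - A                                # graph Laplacian
+print(L)
+# [[ 2. -1. -1.]
+#  [-1.  1.  0.]
+#  [-1.  0.  1.]]
+```
+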
+### Convolution through Polynomials of the Laplacian
+
+Let's construct some polynomials by using the Laplacian Matrix:
$$
p_{\vec{w}}(L) = w_{0}I_{n} + w_{1}L^{1} + \dots + w_{d}L^{d} = \sum_{i=0}^{d} w_{i}L^{i}
$$
-We then can get a ***filtered node*** by simply multiplying the polynomial with the node value
+Here $\vec{w} = [w_0, w_1, \dots, w_d]$ is the vector of the $d + 1$ polynomial
+coefficients. We can then define the convolution as $p_{\vec{w}}(L) \times \vec{x}$,
+where **$\vec{x}$ is the matrix of all stacked vertex embeddings**, one row per
+vertex
$$
\begin{aligned}
- \vec{x}' = p_{\vec{w}}(L) \vec{x}
+    \vec{x} &\in \R^{n\times d}\\
+ \vec{x}' &= p_{\vec{w}}(L) \vec{x}
\end{aligned}
$$
-> [!NOTE]
-> In order to extract new features for a single vertex, supposing only $w_1 \neq 0$
->
-> Observe that we are only taking $L_{v}$
+To explain the ***convolutional effect***, let's see what happens for a polynomial
+with weights $\vec{w}_0 = [w_0, 0, \dots, 0]$:
+
+$$
+\vec{x}' = p_{\vec{w}_0}(L)\,\vec{x} = w_0 I_{n} \vec{x}
+$$
+
+Here each row of $\vec{x}'$ is just a scaled version of itself. Now consider a
+polynomial with weights $\vec{w}_1 = [0, w_1, 0, \dots, 0]$:
+
+$$
+\vec{x}' = p_{\vec{w}_1}(L)\,\vec{x} = w_1 L \vec{x}
+$$
+
+Here each vertex combines the info coming from its connected vertices with a
+contribution from itself (scaled by its number of connections).
+
+> [!TIP]
+> In case this was not clear from the formulas, let's work through a practical
+> example, where vertex 0 is connected to both vertices 1 and 2:
>
> $$
> \begin{aligned}
-> \vec{x}'_{v} &= (L\vec{x})_{v} \\
-> &= \sum_{u \in G} L_{v,u} \vec{x}_{u} \\
-> &= \sum_{u \in G} (D_{v,u} - A_{v,u}) \vec{x}_{u} \\
-> &= \sum_{u \in G} D_{v,u} \vec{x}_{u} - A_{v,u} \vec{x}_{u} \\
-> &= D_{v, v} \vec{x}_{v} - \sum_{u \in \mathcal{N}(v)} \vec{x}_{u}
+> \vec{x} &= \begin{bmatrix}
+> \vec{x}_0 \in \R^{1 \times d} \\
+> \vec{x}_1 \in \R^{1 \times d} \\
+> \vec{x}_2 \in \R^{1 \times d} \\
+> \end{bmatrix}
+> \\
+> L &= \begin{bmatrix}
+> 2 & -1 & -1 \\
+> -1 & 1 & 0 \\
+> -1 & 0 & 1
+> \end{bmatrix}
+> \\
+> \vec{x}' &= w_1 L \vec{x} = \begin{bmatrix}
+> 2w_1 & -w_1 & -w_1 \\
+> -w_1 & w_1 & 0 \\
+> -w_1 & 0 & w_1
+> \end{bmatrix}
+> \times
+> \begin{bmatrix}
+> \vec{x}_0 \\
+> \vec{x}_1 \\
+> \vec{x}_2
+> \end{bmatrix}
+> \\
+> &= \begin{bmatrix}
+> 2w_1\vec{x}_0 - w_1\vec{x}_1 - w_1 \vec{x}_2 \in \R^{1 \times d} \\
+> -w_1\vec{x}_0 + w_1\vec{x}_1 + 0\vec{x}_2 \in \R^{1 \times d} \\
+> -w_1\vec{x}_0 + 0\vec{x}_1 + w_1\vec{x}_2 \in \R^{1 \times d} \\
+> \end{bmatrix}
> \end{aligned}
> $$
>
-> Where the last step holds as $D$ is a diagonal matrix, and in the summatory we are only considering the neighbours
-> of v
+> For simplicity we wrote the vertices as vectors rather than decomposing them
+> into their elements. As we can see, each row combines a vertex with its
+> adjacent vertices.
+
+Now let's see what happens when we raise the Laplacian Matrix to the power of 2:
+
+$$
+\begin{aligned}
+ L &= \begin{bmatrix}
+ l_{0, 0} & l_{0, 1} & l_{0, 2} \\
+ l_{1, 0} & l_{1, 1} & l_{1, 2} \\
+ l_{2, 0} & l_{2, 1} & l_{2, 2} \\
+ \end{bmatrix}
+ \\
+ L \times L &=
+ \begin{bmatrix}
+ l_{0, 0} & l_{0, 1} & l_{0, 2} \\
+ l_{1, 0} & l_{1, 1} & l_{1, 2} \\
+ l_{2, 0} & l_{2, 1} & l_{2, 2} \\
+ \end{bmatrix}
+ \times
+ \begin{bmatrix}
+ l_{0, 0} & l_{0, 1} & l_{0, 2} \\
+ l_{1, 0} & l_{1, 1} & l_{1, 2} \\
+ l_{2, 0} & l_{2, 1} & l_{2, 2} \\
+ \end{bmatrix} = \\
+ &= \begin{bmatrix}
+ l_{0, 0}^2 + l_{0, 1}l_{1, 0} + l_{0, 2}l_{2, 0} &
+ l_{0, 0}l_{0, 1} + l_{0, 1}l_{1, 1} + l_{0, 2}l_{2, 1} &
+ l_{0, 0}l_{0, 2} + l_{0, 1}l_{1, 2} + l_{0, 2}l_{2, 2} \\
+ l_{1, 0}l_{0, 0} + l_{1, 1}l_{1, 0} + l_{1, 2} l_{2, 0} &
+ l_{1, 0}l_{0, 1} + l_{1, 1}^2 + l_{1, 2}l_{2, 1} &
+ l_{1, 0}l_{0, 2} + l_{1, 1}l_{1, 2} + l_{1, 2}l_{2, 2} \\
+ l_{2, 0}l_{0, 0} + l_{2, 1}l_{1, 0}+ l_{2, 2} l_{2, 0} &
+ l_{2, 0}l_{0, 1} + l_{2, 1}l_{1, 1} + l_{2, 2}l_{2, 1} &
+ l_{2, 0}l_{0, 2} + l_{2, 1}l_{1, 2} + l_{2, 2}^2
+ \end{bmatrix}
+\end{aligned}
+$$
+
+As we can see, information from neighbouring connections leaks into the powers of
+the Laplacian. More precisely, it's possible to demonstrate that if the distance
+between two nodes is greater than the power of the Laplacian Matrix, the
+corresponding entry is zero: $dist_{G}(v, u) > i \rightarrow L_{v, u}^{i} = 0$.
+
+This means that by manipulating the degree of the polynomial, we can choose the
+diffusion distance for information. For example, if we want info from vertices
+at most 3 steps away, we will use a polynomial of 3rd degree.
+
+Another interesting property of these polynomials is that they do not depend on
+the order of the nodes, but only on which connections exist, so they are
+node-order equivariant.
+
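+A quick sketch, assuming NumPy and a path graph 0–1–2–3, showing both properties:
+entries of $L^i$ vanish for node pairs farther than $i$ hops, and a degree-2
+polynomial filter never mixes embeddings of nodes more than 2 hops apart:
+
+```python
+# The power of the Laplacian controls how far information can travel
+# (assumption: NumPy; path graph 0 - 1 - 2 - 3).
+import numpy as np
+
+A = np.array([[0, 1, 0, 0],
+              [1, 0, 1, 0],
+              [0, 1, 0, 1],
+              [0, 0, 1, 0]], dtype=float)
+L = np.diag(A.sum(axis=1)) - A
+
+# dist(0, 3) = 3, so the (0, 3) entry is zero for powers below 3
+print(np.linalg.matrix_power(L, 2)[0, 3])   # 0.0
+print(np.linalg.matrix_power(L, 3)[0, 3])   # non-zero (-1.0)
+
+# Polynomial filter of degree 2: node 0 never sees node 3's embedding
+w = [0.5, 0.3, 0.2]                          # w_0, w_1, w_2
+p_L = sum(w_i * np.linalg.matrix_power(L, i) for i, w_i in enumerate(w))
+x = np.eye(4)                                # one-hot "embeddings" per node
+x_filtered = p_L @ x
+print(x_filtered[0, 3])                      # 0.0: no info from node 3
+```
+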
+> [!TIP]
+> Go [here](./python-experiments/laplacian_graph.ipynb) to see some experiments.
>
-> It can be demonstrated that in any graph
->
-> $$
-> dist_{G}(v, u) > i \rightarrow L_{v, u}^{i} = 0
-> $$
->
-> More in general it holds
->
-> $$
-> \begin{aligned}
-> \vec{x}'_{v} = (p_{\vec{w}}(L)\vec{x})_{v} &= (p_{\vec{w}}(L))_{v} \vec{x} \\
-> &= \sum_{i = 0}^{d} w_{i}L_{v}^{i} \vec{x} \\
-> &= \sum_{i = 0}^{d} w_{i} \sum_{u \in G} L_{v,u}^{i}\vec{x}_{u} \\
-> &= \sum_{i = 0}^{d} w_{i} \sum_{\substack{u \in G \\ dist_{G}(v, u) \leq i}} L_{v,u}^{i}\vec{x}_{u} \\
-> \end{aligned}
-> $$
->
-> So this shows that the degree of the polynomial decides the max number of hops
-> to be included during the filtering stage, like if it were defining a [kernel](./../7-Convolutional-Networks/INDEX.md#filters)
+> In particular, see the second one
### ChebNet
@@ -210,23 +370,65 @@ T_{i} &= cos(i\theta) \\
$$
- $T_{i}$ is Chebischev first kind polynomial
+- $\theta$ are the weights to be learnt[^chebnet]
- $\tilde{L}$ is a reduced version of $L$ because we divide for its max eigenvalue,
keeping it in range $[-1, 1]$. Moreover $L$ ha no negative eigenvalues, so it's
positive semi-definite
These polynomials are more stable as they do not explode with higher powers
+> [!NOTE]
+>
+> Even though we said that we could control the radius from which neighbours leak
+> info, technically speaking, if we stack $N$ GCN layers together, we get info from
+> nodes up to $N \cdot hops$ away, assuming each layer reaches at most $hops$ steps.
+>
+> So, if we have 2 layers each taking from distance 3, after the computation a
+> node gets info from nodes up to 6 hops away.
+
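+A sketch of how the Chebyshev basis of the scaled Laplacian can be computed,
+assuming NumPy and the standard three-term recurrence for first-kind Chebyshev
+polynomials (equivalent to the $\cos(i\theta)$ definition above); the scaling
+convention $2L/\lambda_{max} - I$ is an assumption here:
+
+```python
+# Sketch: Chebyshev basis of the scaled Laplacian via the recurrence
+# T_0 = I, T_1 = L~, T_k = 2 L~ T_{k-1} - T_{k-2} (assumption: NumPy).
+import numpy as np
+
+def chebyshev_basis(L, K):
+    # common ChebNet scaling (2L / lambda_max - I), eigenvalues in [-1, 1]
+    lam_max = np.linalg.eigvalsh(L).max()
+    L_tilde = (2.0 / lam_max) * L - np.eye(L.shape[0])
+    T = [np.eye(L.shape[0]), L_tilde]
+    for _ in range(2, K + 1):
+        T.append(2 * L_tilde @ T[-1] - T[-2])
+    return T[: K + 1]
+
+A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
+L = np.diag(A.sum(axis=1)) - A
+basis = chebyshev_basis(L, K=2)          # T_0, T_1, T_2 of the scaled L
+print([T.shape for T in basis])          # [(3, 3), (3, 3), (3, 3)]
+```
+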
### Embedding Computation
+Whenever we compute over nodes, we first need to compute their embeddings.
+From there on, these embeddings are treated as the inputs of our networks.
-## Other methods
+$$
+\begin{aligned}
+    g(x) &\coloneqq \text{Any non-linear function} \\
+ \vec{x}' &= g\left(p_{\vec{w}}(L) \times \vec{x}\right)
+\end{aligned}
+$$
-- Learnable parameters
+Within each layer, the same weights are applied to every node, like in
+[CNNs](./../7-Convolutional-Networks/INDEX.md#convolutional-layer)
+
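+A minimal sketch of one such layer, assuming NumPy, with ReLU playing the role
+of the non-linear $g$ and the polynomial filter from the previous section:
+
+```python
+# One polynomial-filter layer: x' = g(p_w(L) @ x), with g = ReLU
+# (assumption: NumPy; the same weights w are reused for every node).
+import numpy as np
+
+def poly_filter_layer(L, x, w):
+    p_L = sum(w_i * np.linalg.matrix_power(L, i) for i, w_i in enumerate(w))
+    return np.maximum(p_L @ x, 0.0)          # g = ReLU
+
+A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
+L = np.diag(A.sum(axis=1)) - A
+x = np.random.default_rng(0).normal(size=(3, 4))   # 3 nodes, dim 4
+
+w = [0.5, 0.1]                                      # degree-1 polynomial
+x_new = poly_filter_layer(L, x, w)
+print(x_new.shape)                                  # (3, 4)
+```
+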
+## Real Graph Network Configurations
+
+To sum up what we have seen so far: to compute over graphs we usually
+gather info from the neighbours (**Node Aggregation**), then transform the old
+info of our node and combine the two (**Node Update**), typically using:
+
+- (Potentially) Learnable parameters
- Embeddings of node v
- Embeddings of neighbours of v
-### Graph Convolutional Networks
+### Graph Neural Network
+
+$$
+\textcolor{orange}{h_v^{(k)}} =
+\textcolor{skyblue}{f_2^{(k)}} \left(
+    \underbrace{
+        \sum_{n\in\mathcal{N}(v)} \textcolor{skyblue}{f_1^{(k)}}(
+            \textcolor{orange}{h_n^{(k-1)}},
+            \textcolor{orange}{h_v^{(k-1)}},
+            \textcolor{orange}{e_{v, n}}
+        )
+    }_{\text{message passing from edges to nodes}},\,\,
+    \textcolor{orange}{h_v^{(k -1)}}
+\right) \forall v \in V
+$$
+
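+A minimal sketch of this update, assuming NumPy; $f_1$ and $f_2$ are toy linear
+stand-ins just to show the data flow:
+
+```python
+# Sketch of the generic GNN update: messages f_1(h_n, h_v, e_vn) are summed
+# over the neighbourhood, then f_2 combines them with the old embedding
+# (assumption: NumPy; f_1 and f_2 are toy stand-ins).
+import numpy as np
+
+rng = np.random.default_rng(0)
+d = 4
+W1 = rng.normal(size=(3 * d, d)) * 0.1      # f_1: concat(h_n, h_v, e) -> d
+W2 = rng.normal(size=(2 * d, d)) * 0.1      # f_2: concat(msg, h_v)    -> d
+
+def f1(h_n, h_v, e):
+    return np.concatenate([h_n, h_v, e]) @ W1
+
+def f2(msg, h_v):
+    return np.tanh(np.concatenate([msg, h_v]) @ W2)
+
+def gnn_update(v, h, edge_feat, neighbours):
+    msg = sum(f1(h[n], h[v], edge_feat[(v, n)]) for n in neighbours[v])
+    return f2(msg, h[v])
+
+h = {0: rng.normal(size=d), 1: rng.normal(size=d), 2: rng.normal(size=d)}
+edge_feat = {(0, 1): rng.normal(size=d), (0, 2): rng.normal(size=d)}
+neighbours = {0: [1, 2]}
+print(gnn_update(0, h, edge_feat, neighbours).shape)   # (4,)
+```
+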
+### Graph Convolutional Network
$$
\textcolor{orange}{h_{v}^{(k)}} =
@@ -241,6 +443,11 @@ $$
\right) \forall v \in V
$$
+> [!TIP]
+>
+> $\textcolor{skyblue}{f^{(k)}}$ here is a
+> **Non-Linear Activation function**
+
### Graph Attention Networks
$$
@@ -256,25 +463,101 @@ $$
\right] \right) \forall v \in V
$$
-where
+where $\alpha^{(k)}_{v,u}$ are weights generated by an attention mechanism
+$\textcolor{skyblue}{A^{(k)}}$, normalized so that the $\alpha^{(k)}_{v,u}$ sum
+to 1 over the neighbourhood of each node.
$$
\alpha^{(k)}_{v,u} = \frac{
\textcolor{skyblue}{A^{(k)}}(
\textcolor{orange}{h_{v}^{(k)}},
- \textcolor{violet}{h_{u}^{(k)}},
+ \textcolor{violet}{h_{u}^{(k)}}
)
}{
\sum_{w \in \mathcal{N}(v)} \textcolor{skyblue}{A^{(k)}}(
\textcolor{orange}{h_{v}^{(k)}},
- \textcolor{violet}{h_{w}^{(k)}},
+ \textcolor{violet}{h_{w}^{(k)}}
)
} \forall (v, u) \in E
$$
+> [!TIP]
+> To make computations faster, we can first compute all the
+> $\textcolor{skyblue}{A^{(k)}}(
+> \textcolor{orange}{h_{v}^{(k)}},
+> \textcolor{violet}{h_{u}^{(k)}}
+> )$ values and then normalize them (dividing by their sum) at a later stage to
+> compute $\alpha^{(k)}_{v,u}$
+>
+> Also, this system (GAT) is flexible enough to let us swap the attention
+> mechanism and include more heads, as the formulation above uses a single
+> attention head
+
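+A sketch, assuming NumPy, of computing the normalized coefficients
+$\alpha^{(k)}_{v,u}$ for a single node with a single head; the scoring function
+here is a toy stand-in for $\textcolor{skyblue}{A^{(k)}}$:
+
+```python
+# Single-head attention coefficients for node 0 over its neighbourhood
+# (assumption: NumPy; the score function is a toy stand-in for A^(k)).
+import numpy as np
+
+rng = np.random.default_rng(0)
+d = 4
+a = rng.normal(size=2 * d)                      # attention parameters
+
+def attention_score(h_v, h_u):
+    # un-normalized score A(h_v, h_u): LeakyReLU of a dot product, then exp
+    s = a @ np.concatenate([h_v, h_u])
+    return float(np.exp(np.where(s > 0, s, 0.2 * s)))
+
+h = rng.normal(size=(3, d))                     # embeddings of nodes 0, 1, 2
+neighbours = {0: [1, 2]}
+
+scores = {u: attention_score(h[0], h[u]) for u in neighbours[0]}
+total = sum(scores.values())
+alpha = {u: s / total for u, s in scores.items()}   # sums to 1 over N(0)
+print(alpha)
+```
+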
### Graph Sample and Aggregate (GraphSAGE)
-
+Here, instead of taking all neighbours, we take a fixed-size subset of
+neighbours, called the `pool`, sampled at random. This increases variance but
+allows the method to be applied to large graphs
+
+$$
+\textcolor{orange}{h_v^{(k)}} =
+\textcolor{skyblue}{f^{(k)}} \left(
+ \textcolor{skyblue}{W^{(k)}}
+ \cdot
+ \underbrace{
+ \left[
+ \underbrace{
+ \textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+ \textcolor{violet}{h_u^{(k-1)}}
+ )
+ }_{\text{Aggregation of v's neighbours}}
+ ,
+ \textcolor{orange}{h_v^{(k-1)}}
+ \right]
+ }_{\text{Concatenation of aggr. and prev. embedding}}
+
+\right) \forall v \in V
+$$
+
+where
+$\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+ \textcolor{violet}{h_u^{(k-1)}}
+)$ may be one of the following:
+
+#### Mean
+
+This is similar to what we had in the GCN above
+$$
+\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+ \textcolor{violet}{h_u^{(k-1)}}
+) = \textcolor{skyblue}{W_{pool}^{(k)}} \cdot
+\frac{
+ \textcolor{orange}{h_v^{(k-1)}}+
+ \sum_{u\in \mathcal{N}(v)}\textcolor{violet}{h_u^{(k-1)}}
+
+}{
+ 1 + |\mathcal{N(v)}|
+}
+$$
+
+#### Dimension Wise Maximum
+
+$$
+\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+ \textcolor{violet}{h_u^{(k-1)}}
+) = \max_{u \in \mathcal{N}(v)} \left\{
+ \sigma(
+ \textcolor{skyblue}{W_{pool}^{(k)}} \cdot
+ \textcolor{violet}{h_u^{(k-1)}} +
+ \textcolor{skyblue}{b}
+ )
+\right\}
+$$
+
+#### LSTM Aggregator
+
+This aggregator is another network, based on an
+[LSTM](./../8-Recurrent-Networks/INDEX.md#long-short-term-memory--lstm). This
+method **requires imposing an ordering on the sequence of neighbours**
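+
+A sketch of the sampling step followed by the mean aggregator, assuming NumPy;
+`W_pool` and the sample size are toy choices:
+
+```python
+# GraphSAGE-style step: sample a fixed number of neighbours, then apply
+# the mean aggregator (assumption: NumPy; W_pool is a toy weight matrix).
+import numpy as np
+
+rng = np.random.default_rng(0)
+d = 4
+W_pool = rng.normal(size=(d, d)) * 0.1
+
+def sample_neighbours(neighbours, pool_size):
+    # fixed-size random sample (with replacement if the node has few neighbours)
+    return rng.choice(neighbours, size=pool_size,
+                      replace=len(neighbours) < pool_size)
+
+def mean_aggregate(h_v, h_neigh):
+    stacked = np.vstack([h_v[None, :], h_neigh])        # (1 + |N(v)|, d)
+    return stacked.mean(axis=0) @ W_pool                # mean, then W_pool
+
+h = rng.normal(size=(5, d))                             # 5 node embeddings
+neigh_of_0 = [1, 2, 3, 4]
+sampled = sample_neighbours(neigh_of_0, pool_size=2)
+agg = mean_aggregate(h[0], h[sampled])
+print(sampled, agg.shape)                               # e.g. [3 1] (4,)
+```
+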
### Graph Isomorphism Network (GIN)
@@ -290,4 +573,6 @@ $$
) \cdot \textcolor{orange}{h_{v}^{(k - 1)}}
\right)
\forall v \in V
-$$
\ No newline at end of file
+$$
+
+[^chebnet]: [Data Warrior | Graph Convolutional Neural Network (Part II) | 9th November 2025](https://datawarrior.wordpress.com/2018/08/12/graph-convolutional-neural-network-part-ii/)