- **Nodes**: Pieces of Information
- **Edges**: Relationship between nodes
  - **Mutual**
  - **One-Sided**
- **Directionality**
  - **Directed**: We care about the order of connections
    - **Unidirectional**
    - **Bidirectional**
  - **Undirected**: We don't care about the order of connections

Now, we can attach attributes to:
- **nodes**
  - identity
  - number of neighbours
- **edges**
  - identity
  - weight
- **master nodes** (a collection of nodes and edges)
  - number of nodes
  - longest path

For example, images may be represented as graphs where each non-border pixel is
a vertex connected to its 8 surrounding pixels.
The information at each vertex is a 3- (or 4-) dimensional vector (think of RGB and RGBA)
We want to predict relationships between nodes, such as whether they share an edge.
For this task we may start with a fully connected graph and then prune edges,
as predictions go on, ending up with a sparse graph
### Challenges of Dealing with Graphs

While graphs are very powerful at representing structures in a compact and
natural way, they have several challenges.

**The number of nodes in a graph may change wildly**, making it difficult to
work with different graphs.

**Sometimes nodes have no meaningful order**, thus we need to treat
different orders in the same way.

**Graphs can be very large**, thus take a lot of space. However, they are
usually sparse in nature, making it possible to find ways to compress their
representation.
## Representing Graphs
We store info about:
```python
from typing import Any

nodes: list[Any] = [
    "fork", "spaghetti", "knife", "spoon", "soup"
]

edges: list[Any] = [
    "needed to eat", "cutlery", "food",
    "cutlery", "cutlery", "needed to eat"
]

adj_list: list[tuple[int, int]] = [
    # ...
    (0, 3), (2, 3), (3, 4)
]

graph: Any = "dining table"
```
If we find some parts of the graph that are disconnected, we can just avoid storing and computing those parts
> [!CAUTION]
> Even though in this example we used single values, edges, nodes and graphs may
> be made of Tensors or structured data
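
As an aside, a minimal NumPy sketch (not part of the original example) of how such an
adjacency list could be turned into a dense, symmetric adjacency matrix; only the pairs
shown above are used, and all variable names are illustrative:

```python
import numpy as np

# Undirected adjacency list (only the pairs shown above)
adj_list = [(0, 3), (2, 3), (3, 4)]
n_nodes = 5

# Dense, symmetric adjacency matrix
A = np.zeros((n_nodes, n_nodes), dtype=int)
for u, v in adj_list:
    A[u, v] = 1
    A[v, u] = 1  # mutual edge: store both directions

print(A)
```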
## Graph Neural Networks (GNNs)
In its simplest form, we take a **graph-in** and **graph-out** approach with `MLPs`
([Multi Layer Perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron),
aka Neural Network), separate for vertices, edges and master
nodes, that we apply **one at a time** over each element
$$
\begin{aligned}
h_{v}' &= \text{MLP}_{V}(h_{v}) \\
h_{e}' &= \text{MLP}_{E}(h_{e}) \\
h_{g}' &= \text{MLP}_{G}(h_{g})
\end{aligned}
$$
This also means that its output is a graph, and we need further refinement
to get other kinds of outputs. For example, we can apply a classifier to each node
embedding to get the class of that node.
### Pooling
> [!CAUTION]
> This step comes after the embedding phase described above
This is a step that can be used to take info about elements of a different type
from the ones we are considering (for example, taking info from edges while
making the computation over vertices).
```python
# Pseudo code for pooling
def pool(items):
    # This is usually a tensor 1xD
    # where D is the embedding dimension
    pooled_value = init_from(items.type)
    for item in items:
        pooled_value += item.value
    return pooled_value
```
However, to make use of parallel computation, we can also think of doing this:
```python
# Parallel pseudo code
def parallel_pool(items):
    # NxD matrix
    # Each row of this matrix is an embedding
    item_embedding_matrix = items.concat()
    # Sum over rows
    return sum(item_embedding_matrix, dim=0)
```
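
As a concrete counterpart to the two pseudocode snippets above, here is a NumPy sketch of
both variants (the helper names are illustrative); they return the same pooled embedding:

```python
import numpy as np

def pool(items: list[np.ndarray]) -> np.ndarray:
    # Sequential version: accumulate into a 1xD vector
    pooled_value = np.zeros_like(items[0])
    for item in items:
        pooled_value += item
    return pooled_value

def parallel_pool(items: list[np.ndarray]) -> np.ndarray:
    # Vectorised version: stack into an NxD matrix, then sum over rows
    item_embedding_matrix = np.stack(items)
    return item_embedding_matrix.sum(axis=0)

edge_embeddings = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
assert np.allclose(pool(edge_embeddings), parallel_pool(edge_embeddings))
```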
This, at its core, is useful when we lack properties for a portion of the data,
for example edges or the hypernode, and we use data coming from other parts to
enable us to do computations.
### Message Passing
However, if we already have some info, we can use [`pooling`](#pooling) to augment
that info, taking into account the connectivity of the graph by pooling **only
adjacent info of the same type**.
This means that at each step, a node receives info from its neighbours in the
same fashion. After step $k$, our node will have received partial information
from nodes located $k$ steps away
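
A minimal sketch of one message-passing round, assuming a dense adjacency matrix `A`
and a matrix `H` of node embeddings (both names are illustrative):

```python
import numpy as np

def message_passing_step(A: np.ndarray, H: np.ndarray) -> np.ndarray:
    # Row v of (A @ H) is the sum (pooling) of the embeddings of v's neighbours
    return A @ H

# Path graph 0 - 1 - 2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3)  # one-hot "embeddings", to make the information flow visible

H1 = message_passing_step(A, H)
H2 = message_passing_step(A, H1)
print(H1[0])  # node 0 only sees node 1 after one step
print(H2[0])  # after two steps it also contains a contribution from node 2
```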
### Weaving
If we combine the previous techniques together, we can merge info coming from
many parts of the graph. However, if the embeddings do not share the same
dimension, we need a linear layer to make them match.

As we are going to use a linear layer, this means that we are using a **learned**
representation, **which must be trained**.
> [!CAUTION]
> Because graphs may be very sparsely connected and the longest path may exceed
> the number of layers we have, to bypass this, we weave info over the hypernode as well.
>
> The graph node will be used as if it were a node connected to all nodes, giving
> each node an overview of all node information
### Special Layers
The **degree matrix** $D$ is a diagonal matrix whose entries count the
connections of each node in the **adjacency matrix** $A$:

$$
D_{v,v} = \sum_{u} A_{v,u}
$$
In other words, $D_{v, v}$ is the number of nodes connected to the node $v$. In
fact, notice that we are summing over rows of the **Adjacency Matrix** $A$.
The [**Graph Laplacian**](https://en.wikipedia.org/wiki/Laplacian_matrix)
$L$ of the graph will be
$$
L = D - A
$$
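
To make the definitions concrete, a small NumPy sketch using the same 3-vertex graph
as the worked example below (vertex 0 connected to vertices 1 and 2):

```python
import numpy as np

# Vertex 0 connected to vertices 1 and 2
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])

# Degree matrix: D[v, v] is the sum of row v of A
D = np.diag(A.sum(axis=1))

# Graph Laplacian
L = D - A
print(L)
# [[ 2 -1 -1]
#  [-1  1  0]
#  [-1  0  1]]
```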
As we can see, $L$ keeps all elements of $D$ untouched, as each node has no
connection to itself in the adjacency matrix, and contains all elements of the
adjacency matrix with opposite sign (which means they are all negative).

### Convolution through Polynomials of the Laplacian

<!-- TODO: Study Laplacian Filters -->

These polynomials, which have the same dimensions as $L$, can be thought of as
**filters**, like in
[CNNs](./../7-Convolutional-Networks/INDEX.md#convolutional-networks).

Let's construct some polynomials by using the Laplacian Matrix:
$$
p_{\vec{w}}(L) = w_{0}I_{n} + w_{1}L^{1} + \dots + w_{d}L^{d} = \sum_{i=0}^{d} w_{i}L^{i}
$$
Here each $w_i$ is an entry of a weight vector $\vec{w} = [w_0, \dots, w_d]$.

We then can define the convolution as $p_{\vec{w}}(L) \times \vec{x}$, where
**<ins>$\vec{x}$ is the matrix of all stacked vertex embeddings</ins>**
$$
\begin{aligned}
\vec{x} &\in \R^{l\times d}\\
\vec{x}' &= p_{\vec{w}}(L) \vec{x}
\end{aligned}
$$
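
A minimal sketch of this convolution, assuming `L` is the Laplacian above and `x` is the
$l \times d$ matrix of stacked vertex embeddings (all names are illustrative):

```python
import numpy as np

def laplacian_polynomial(L: np.ndarray, w: np.ndarray) -> np.ndarray:
    # p_w(L) = w_0 I + w_1 L + ... + w_d L^d
    p = np.zeros_like(L, dtype=float)
    L_power = np.eye(L.shape[0])      # L^0
    for w_i in w:
        p += w_i * L_power
        L_power = L_power @ L         # next power of L
    return p

L = np.array([[2., -1., -1.],
              [-1., 1., 0.],
              [-1., 0., 1.]])
x = np.random.randn(3, 4)             # 3 vertices, embedding dimension 4
w = np.array([0.5, 0.1])              # a degree-1 polynomial

x_filtered = laplacian_polynomial(L, w) @ x   # still shape (3, 4)
```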
To explain the ***convolutional effect***, let's see what happens for a polynomial
of $\vec{w}_0 = [w_0, 0, \dots, 0] \in \R^n$:
$$
\vec{x}' = p_{\vec{w}_0}(L) \times \vec{x} = w_0I_{l} \times \vec{x}
$$
Here each item of $\vec{x}'$ is just a scaled version of the corresponding item
of $\vec{x}$. Now consider a polynomial of $\vec{w}_1 = [0, w_1, 0, \dots, 0]$:
$$
\vec{x}' = p_{\vec{w}_1}(L) \times \vec{x} = w_1L^1\times \vec{x}
$$
Here vertices merge info coming from their connected vertices plus a contribution
from themselves (scaled by the number of connections).
> [!TIP]
> As this may have not been very clear from these formulas, let's make a practical
> example, where vertex 0 is connected to both vertices 1 and 2:
>
> $$
> \begin{aligned}
> \vec{x} &= \begin{bmatrix}
> \vec{x}_0 \in \R^{1 \times d} \\
> \vec{x}_1 \in \R^{1 \times d} \\
> \vec{x}_2 \in \R^{1 \times d} \\
> \end{bmatrix}
> \\
> L &= \begin{bmatrix}
> 2 & -1 & -1 \\
> -1 & 1 & 0 \\
> -1 & 0 & 1
> \end{bmatrix}
> \\
> \vec{x}' &= w_1 L \vec{x} = \begin{bmatrix}
> 2w_1 & -w_1 & -w_1 \\
> -w_1 & w_1 & 0 \\
> -w_1 & 0 & w_1
> \end{bmatrix}
> \times
> \begin{bmatrix}
> \vec{x}_0 \\
> \vec{x}_1 \\
> \vec{x}_2
> \end{bmatrix}
> \\
> &= \begin{bmatrix}
> 2w_1\vec{x}_0 - w_1\vec{x}_1 - w_1 \vec{x}_2 \in \R^{1 \times d} \\
> -w_1\vec{x}_0 + w_1\vec{x}_1 + 0\vec{x}_2 \in \R^{1 \times d} \\
> -w_1\vec{x}_0 + 0\vec{x}_1 + w_1\vec{x}_2 \in \R^{1 \times d} \\
> \end{bmatrix}
> \end{aligned}
> $$
>
> For simplicity we wrote vertices as vectors rather than decomposing into their
> elements. As we can notice here, we combined adjacent vertices.
Now let's see what happens to the Laplacian matrix when we raise it to the power of 2:
$$
\begin{aligned}
L &= \begin{bmatrix}
l_{0, 0} & l_{0, 1} & l_{0, 2} \\
l_{1, 0} & l_{1, 1} & l_{1, 2} \\
l_{2, 0} & l_{2, 1} & l_{2, 2} \\
\end{bmatrix}
\\
L \times L &=
\begin{bmatrix}
l_{0, 0} & l_{0, 1} & l_{0, 2} \\
l_{1, 0} & l_{1, 1} & l_{1, 2} \\
l_{2, 0} & l_{2, 1} & l_{2, 2} \\
\end{bmatrix}
\times
\begin{bmatrix}
l_{0, 0} & l_{0, 1} & l_{0, 2} \\
l_{1, 0} & l_{1, 1} & l_{1, 2} \\
l_{2, 0} & l_{2, 1} & l_{2, 2} \\
\end{bmatrix} = \\
&= \begin{bmatrix}
l_{0, 0}^2 + l_{0, 1}l_{1, 0} + l_{0, 2}l_{2, 0} &
l_{0, 0}l_{0, 1} + l_{0, 1}l_{1, 1} + l_{0, 2}l_{2, 1} &
l_{0, 0}l_{0, 2} + l_{0, 1}l_{1, 2} + l_{0, 2}l_{2, 2} \\
l_{1, 0}l_{0, 0} + l_{1, 1}l_{1, 0} + l_{1, 2} l_{2, 0} &
l_{1, 0}l_{0, 1} + l_{1, 1}^2 + l_{1, 2}l_{2, 1} &
l_{1, 0}l_{0, 2} + l_{1, 1}l_{1, 2} + l_{1, 2}l_{2, 2} \\
l_{2, 0}l_{0, 0} + l_{2, 1}l_{1, 0}+ l_{2, 2} l_{2, 0} &
l_{2, 0}l_{0, 1} + l_{2, 1}l_{1, 1} + l_{2, 2}l_{2, 1} &
l_{2, 0}l_{0, 2} + l_{2, 1}l_{1, 2} + l_{2, 2}^2
\end{bmatrix}
\end{aligned}
$$
As we can see, information from neighbouring connections leaks into the powers of
the Laplacian matrix. In the extreme cases, it's possible to demonstrate that if the
distance between 2 nodes is greater than the power of the Laplacian matrix, the
corresponding entry is zero, so the nodes are treated as detached.
This means that by manipulating the degree of the polynomial, we can choose the
diffusion distance for information. For example, if we want info from vertices
at most 3 steps away, we will use a polynomial of 3rd degree.
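
A quick numerical check of this locality property, sketched on a path graph
`0 - 1 - 2 - 3`, where nodes 0 and 3 are 3 hops apart:

```python
import numpy as np

# Path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
L = np.diag(A.sum(axis=1)) - A

L2 = np.linalg.matrix_power(L, 2)
L3 = np.linalg.matrix_power(L, 3)

# dist(0, 3) = 3 > 2, so the entry is still zero in L^2 ...
print(L2[0, 3])   # 0
# ... and becomes non-zero once the power reaches the distance
print(L3[0, 3])   # -1 (non-zero)
```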
Another interesting property of these polynomials is that they do not depend on
the order of connections, but only on their existence, so they are order
equivariant.
> [!TIP]
> Go [here](./python-experiments/laplacian_graph.ipynb) to see some experiments.
>
> It can be demonstrated that in any graph
>
> $$
> dist_{G}(v, u) > i \rightarrow L_{v, u}^{i} = 0
> $$
>
> More in general it holds
>
> $$
> \begin{aligned}
> \vec{x}'_{v} = (p_{\vec{w}}(L)\vec{x})_{v} &= (p_{\vec{w}}(L))_{v} \vec{x} \\
> &= \sum_{i = 0}^{d} w_{i}L_{v}^{i} \vec{x} \\
> &= \sum_{i = 0}^{d} w_{i} \sum_{u \in G} L_{v,u}^{i}\vec{x}_{u} \\
> &= \sum_{i = 0}^{d} w_{i} \sum_{\substack{u \in G \\ dist_{G}(v, u) \leq i}} L_{v,u}^{i}\vec{x}_{u} \\
> \end{aligned}
> $$
>
> So this shows that the degree of the polynomial decides the max number of hops
> to be included during the filtering stage, as if it were defining a [kernel](./../7-Convolutional-Networks/INDEX.md#filters).
> In particular, see the second one
### ChebNet
ChebNet builds the filter out of Chebyshev polynomials of a rescaled Laplacian:

$$
\begin{aligned}
p_{\vec{w}}(L) &= \sum_{i=1}^{d} w_{i} T_{i}(\tilde{L}) \\
T_{i} &= \cos(i\theta) \\
\tilde{L} &= \frac{L}{\lambda_{max}(L)}
\end{aligned}
$$
- $T_{i}$ is the Chebyshev polynomial of the first kind
- $\theta$ are the weights to be learnt[^chebnet]
- $\tilde{L}$ is a reduced version of $L$, obtained by dividing $L$ by its max
  eigenvalue, which keeps its spectrum in the range $[-1, 1]$. Moreover, $L$ has
  no negative eigenvalues, so it's positive semi-definite
These polynomials are more stable as they do not explode with higher powers
> [!NOTE]
>
> Even though we said that we could control the radius of the neighbours leaking
> info, technically speaking, if we stack many GCN layers together, we get info
> from nodes at $N \cdot hops$, assuming each layer takes from at most $hops$ away.
>
> So, let's say we have 2 layers, each taking from distance 3: after the computation
> a node gets info from nodes at 6 hops of distance.
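
A sketch of how the $T_i(\tilde{L})$ matrices can be generated with the standard
Chebyshev recurrence $T_i(x) = 2x\,T_{i-1}(x) - T_{i-2}(x)$; the rescaling by the
largest eigenvalue follows the description above, and all names are illustrative:

```python
import numpy as np

def chebyshev_terms(L: np.ndarray, degree: int) -> list[np.ndarray]:
    # Rescale L by its largest eigenvalue so its spectrum stays in [-1, 1]
    lambda_max = np.max(np.linalg.eigvalsh(L))
    L_tilde = L / lambda_max

    n = L.shape[0]
    T = [np.eye(n), L_tilde]                    # T_0, T_1
    for _ in range(2, degree + 1):
        # Chebyshev recurrence: T_i = 2 L~ T_{i-1} - T_{i-2}
        T.append(2 * L_tilde @ T[-1] - T[-2])
    return T[: degree + 1]

L = np.array([[2., -1., -1.],
              [-1., 1., 0.],
              [-1., 0., 1.]])
terms = chebyshev_terms(L, degree=3)            # T_0(L~), ..., T_3(L~)
```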
### Embedding Computation
<!-- TODO: Read PDF 14 Anelli from 81 to 83 -->
Whenever we are computing over nodes, we firstly need to compute their embedding.
From there on, all embeddings will be treated as inputs for our networks.
$$
\begin{aligned}
g(x) &\coloneqq \text{Any non linear function} \\
\vec{x}' &= g\left(p_{\vec{w}}(L) \times \vec{x}\right)
\end{aligned}
$$
- <span style="color:skyblue">Learnable parameters</span>
For each layer the same weights are applied, like in
[CNNs](./../7-Convolutional-Networks/INDEX.md#convolutional-layer)
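
A sketch of this embedding computation, stacking layers that each apply a polynomial
filter (shared by all nodes of that layer) followed by a non-linearity; ReLU is an
illustrative choice for `g`, and all names are assumptions:

```python
import numpy as np

def poly_filter(L: np.ndarray, w: np.ndarray) -> np.ndarray:
    # p_w(L) = sum_i w_i L^i
    p, L_pow = np.zeros_like(L, dtype=float), np.eye(L.shape[0])
    for w_i in w:
        p += w_i * L_pow
        L_pow = L_pow @ L
    return p

def g(x: np.ndarray) -> np.ndarray:
    # "Any non-linear function": ReLU as an example
    return np.maximum(x, 0.0)

def embed(x: np.ndarray, L: np.ndarray, layer_weights: list[np.ndarray]) -> np.ndarray:
    # Each layer applies the same filter weights to every node, then g
    for w in layer_weights:
        x = g(poly_filter(L, w) @ x)
    return x

L = np.array([[2., -1., -1.],
              [-1., 1., 0.],
              [-1., 0., 1.]])
x = np.random.randn(3, 4)
out = embed(x, L, layer_weights=[np.array([0.5, 0.1]), np.array([0.2, 0.3])])
```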
## Real Graph Networks Configurations
To sum up all we have seen until now, to make computations over graphs we
usually take info from the neighbours (Node Aggregation), then transform the old
info of our node and combine the two (Node Update).
- <span style="color:skyblue">(Potentially) Learnable parameters</span>
- <span style="color:orange">Embeddings of node v</span>
- <span style="color:violet">Embeddings of neighbours of v</span>
### Graph Neural Network
$$
\textcolor{orange}{h_v^{(k)}} =
\textcolor{skyblue}{f_2^{(k)}} \left(
\underbrace{
\sum_{n\in\mathcal{N}(v)} \textcolor{skyblue}{f_1^{(k)}}(
\textcolor{violet}{h_n^{(k-1)}},
\textcolor{orange}{h_v^{(k-1)}},
\textcolor{orange}{e_{v, n}}
),
}_{\text{message passing from edges and neighbours to the node}}\,\,
\textcolor{orange}{h_v^{(k -1)}}
\right)
$$
### Graph Convolutional Network
$$
\textcolor{orange}{h_{v}^{(k)}} =
\textcolor{skyblue}{f^{(k)}} \left(
\textcolor{skyblue}{W^{(k)}}
\cdot
\frac{
\textcolor{orange}{h_v^{(k-1)}} +
\sum_{u\in \mathcal{N}(v)}\textcolor{violet}{h_u^{(k-1)}}
}{
1 + |\mathcal{N}(v)|
}
\right) \forall v \in V
$$
> [!TIP]
>
> $\textcolor{skyblue}{f^{(k)}}$ here is a
> **<ins>Non-Linear Activation function</ins>**
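
A minimal NumPy sketch of one such update, assuming the averaged-neighbours-plus-self
form written above and ReLU as $\textcolor{skyblue}{f^{(k)}}$; the function and
argument names are illustrative:

```python
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Average each node with its neighbours, apply the weights, then the non-linearity
    n = A.shape[0]
    A_hat = A + np.eye(n)                       # add self-connections
    H_avg = (A_hat @ H) / A_hat.sum(axis=1, keepdims=True)
    return np.maximum(H_avg @ W, 0.0)           # ReLU as f^(k)

A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
H = np.random.randn(3, 4)                       # previous embeddings, D = 4
W = np.random.randn(4, 8)                       # learnable weights
H_next = gcn_layer(A, H, W)                     # new embeddings, shape (3, 8)
```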
### Graph Attention Networks
$$
\textcolor{orange}{h_v^{(k)}} =
\textcolor{skyblue}{f^{(k)}} \left(
\textcolor{skyblue}{W^{(k)}}
\cdot
\left[
\sum_{u\in \mathcal{N}(v)}
\alpha^{(k)}_{v,u}
\textcolor{violet}{h_u^{(k-1)}}
,
\textcolor{orange}{h_v^{(k-1)}}
\right] \right) \forall v \in V
$$
where $\alpha^{(k)}_{v,u}$ are weights generated by an attention mechanism
$\textcolor{skyblue}{A^{(k)}}$, normalized so that the $\alpha^{(k)}_{v,u}$ of
each node sum to 1.
$$
\alpha^{(k)}_{v,u} = \frac{
\textcolor{skyblue}{A^{(k)}}(
\textcolor{orange}{h_{v}^{(k)}},
\textcolor{violet}{h_{u}^{(k)}},
\textcolor{violet}{h_{u}^{(k)}}
)
}{
\sum_{w \in \mathcal{N}(v)} \textcolor{skyblue}{A^{(k)}}(
\textcolor{orange}{h_{v}^{(k)}},
\textcolor{violet}{h_{w}^{(k)}},
\textcolor{violet}{h_{w}^{(k)}}
)
} \forall (v, u) \in E
$$
> [!TIP]
> To make computations faster, we can just compute all
> $\textcolor{skyblue}{A^{(k)}}(
> \textcolor{orange}{h_{v}^{(k)}},
> \textcolor{violet}{h_{u}^{(k)}}
> )$ and then sum them up at a later stage to compute $\alpha^{(k)}_{v,u}$
>
> Also, this system (GAT) is flexible enough to let us change the attention
> mechanism and include more heads; the formulation above uses just a single
> attention head
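
A sketch of how the $\alpha^{(k)}_{v,u}$ of a single node could be computed; the
scoring function standing in for $A^{(k)}$ is an illustrative choice (a learned vector
`a` applied to the concatenated embeddings, kept positive so that the plain
normalization sums to 1):

```python
import numpy as np

def attention_score(h_v: np.ndarray, h_u: np.ndarray, a: np.ndarray) -> float:
    # Stand-in for A^(k): always positive, so normalising behaves like a softmax
    return float(np.exp(np.tanh(a @ np.concatenate([h_v, h_u]))))

def attention_coefficients(v: int, neighbours: list[int],
                           H: np.ndarray, a: np.ndarray) -> dict[int, float]:
    # Raw scores for every neighbour of v ...
    scores = {u: attention_score(H[v], H[u], a) for u in neighbours}
    # ... normalised so that the alphas of node v sum to 1
    total = sum(scores.values())
    return {u: s / total for u, s in scores.items()}

H = np.random.randn(3, 4)      # node embeddings
a = np.random.randn(8)         # attention parameters (one head)
alphas = attention_coefficients(0, neighbours=[1, 2], H=H, a=a)
assert abs(sum(alphas.values()) - 1.0) < 1e-9
```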
### Graph Sample and Aggregate (GraphSAGE)
<!-- TODO: See PDF 14 Anelli from 98 to 102 -->
Here, instead of taking all neighbours, we take a fixed-size set of neighbours,
called the `pool`, sampled at random. This increases variance but allows the
method to be applied to large graphs
$$
\textcolor{orange}{h_v^{(k)}} =
\textcolor{skyblue}{f^{(k)}} \left(
\textcolor{skyblue}{W^{(k)}}
\cdot
\underbrace{
\left[
\underbrace{
\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
\textcolor{violet}{h_u^{(k-1)}}
)
}_{\text{Aggregation of v's neighbours}}
,
\textcolor{orange}{h_v^{(k-1)}}
\right]
}_{\text{Concatenation of aggr. and prev. embedding}}
\right) \forall v \in V
$$
where
$\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
\textcolor{violet}{h_u^{(k-1)}}
)$ may be one of the following:
#### Mean
This is similar to what we had on GCN above
$$
\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
\textcolor{violet}{h_u^{(k-1)}}
) = \textcolor{skyblue}{W_{pool}^{(k)}} \cdot
\frac{
\textcolor{orange}{h_v^{(k-1)}}+
\sum_{u\in \mathcal{N}(v)}\textcolor{violet}{h_u^{(k-1)}}
}{
1 + |\mathcal{N}(v)|
}
$$
#### Dimension Wise Maximum
$$
\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
\textcolor{violet}{h_u^{(k-1)}}
) = \max_{u \in \mathcal{N}(v)} \left\{
\sigma(
\textcolor{skyblue}{W_{pool}^{(k)}} \cdot
\textcolor{violet}{h_u^{(k-1)}} +
\textcolor{skyblue}{b}
)
\right\}
$$
#### LSTM Aggregator
This is another network, based on
[LSTM](./../8-Recurrent-Networks/INDEX.md#long-short-term-memory--lstm). This
method **requires ordering the sequence of neighbours**
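
A sketch of the sampling step together with the mean aggregator; `pool_size`, `W_pool`
and the helper names are illustrative:

```python
import numpy as np

def sample_neighbours(neighbours: list[int], pool_size: int,
                      rng: np.random.Generator) -> list[int]:
    # Fixed-size random sample (with replacement if the neighbourhood is too small)
    replace = len(neighbours) < pool_size
    return list(rng.choice(neighbours, size=pool_size, replace=replace))

def sage_mean_aggregate(v: int, sampled: list[int],
                        H: np.ndarray, W_pool: np.ndarray) -> np.ndarray:
    # Mean aggregator: (h_v + sum of sampled neighbours) / (1 + pool size), then W_pool
    mean = (H[v] + H[sampled].sum(axis=0)) / (1 + len(sampled))
    return W_pool @ mean

rng = np.random.default_rng(0)
H = np.random.randn(6, 4)                  # 6 nodes, embedding dimension 4
W_pool = np.random.randn(4, 4)
sampled = sample_neighbours([1, 2, 3, 5], pool_size=2, rng=rng)
aggregated = sage_mean_aggregate(0, sampled, H, W_pool)
```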
### Graph Isomorphism Network (GIN)
$$
\textcolor{orange}{h_{v}^{(k)}} =
\textcolor{skyblue}{f^{(k)}} \left(
\sum_{u \in \mathcal{N}(v)} \textcolor{violet}{h_{u}^{(k - 1)}} +
(
1 + \textcolor{skyblue}{\epsilon^{(k)}}
) \cdot \textcolor{orange}{h_{v}^{(k - 1)}}
\right)
\forall v \in V
$$
[^chebnet]: [Data Warrior | Graph Convolutional Neural Network (Part II) | 9th November 2025](https://datawarrior.wordpress.com/2018/08/12/graph-convolutional-neural-network-part-ii/)