diff --git a/Chapters/14-GNN-GCN/INDEX.md b/Chapters/14-GNN-GCN/INDEX.md
index 04f410b..57f8a9e 100644
--- a/Chapters/14-GNN-GCN/INDEX.md
+++ b/Chapters/14-GNN-GCN/INDEX.md
@@ -4,19 +4,25 @@
 
 - **Nodes**: Pieces of Information
 - **Edges**: Relationship between nodes
-  - **Mutual**
-  - **One-Sided**
+    - **Mutual**
+    - **One-Sided**
 - **Directionality**
-  - **Directed**: We care about the order of connections
-    - **Unidirectional**
-    - **Bidirectional**
-  - **Undirected**: We don't care about order of connections
+    - **Directed**: We care about the order of connections
+        - **Unidirectional**
+        - **Bidirectional**
+    - **Undirected**: We don't care about the order of connections
 
 Now, we can have attributes over
 
 - **nodes**
+  - identity
+  - number of neighbours
 - **edges**
+  - identity
+  - weight
 - **master nodes** (a collection of nodes and edges)
+  - number of nodes
+  - longest path
 
 for example images may be represented as a graph where each non edge pixel is a vertex connected to other 8 ones. Its information at the vertex is a 3 (or 4) dimensional vector (think of RGB and RGBA)
@@ -50,11 +56,20 @@ We want to predict relationships between nodes such as if they share an edge, or
 
 For this task we may start with a fully connected graph and then prune edges, as predictions go on, to come to a sparse graph
 
-### Downsides of Graphs
+### Challenges of Dealing with Graphs
 
-- They are not consistent in their structure and sometimes representing something as a graph is difficult
-- If we don't care about order of nodes, we need to find a way to represent this **node-order equivariance**
-- Graphs may be too large
+While graphs are very powerful at representing structures in a compact and
+natural way, they come with several challenges.
+
+**The number of nodes may change wildly from graph to graph**, making it
+difficult to work with different graphs.
+
+**Sometimes nodes have no meaningful order**, so we need to treat different
+orderings of the same graph in the same way (**node-order equivariance**).
+
+**Graphs can be very large** and thus take a lot of space. However, they are
+usually sparse in nature, which makes it possible to find compressed
+representations.
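+
+As a tiny illustration of this last point (the numbers below are made up and the
+snippet is only a sketch, not part of the original notes), an edge list stores a
+sparse graph far more compactly than a dense adjacency matrix:
+
+```python
+import numpy as np
+
+n_nodes = 1_000
+# A sparse graph: each node is connected to only 3 other nodes
+edges = [(u, (u + k) % n_nodes) for u in range(n_nodes) for k in (1, 2, 3)]
+
+# Dense adjacency matrix: n^2 entries, almost all of them zero
+dense = np.zeros((n_nodes, n_nodes), dtype=np.int8)
+for u, v in edges:
+    dense[u, v] = 1
+
+print(dense.size)      # 1_000_000 stored entries
+print(2 * len(edges))  # 6_000 integers for the edge list
+```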
 ## Representing Graphs
@@ -70,12 +85,12 @@ We store info about:
 
 ```python
 nodes: list[any] = [
-    "forchetta", "spaghetti", "coltello", "cucchiao", "brodo"
+    "fork", "spaghetti", "knife", "spoon", "soup"
 ]
 
 edges: list[any] = [
-    "serve per mangiare", "strumento", "cibo",
-    "strumento", "strumento", "serve per mangiare"
+    "needed to eat", "cutlery", "food",
+    "cutlery", "cutlery", "needed to eat"
 ]
 
 adj_list: list[(int, int)] = [
@@ -83,15 +98,22 @@ adj_list: list[(int, int)] = [
     (0, 1), (0, 2), (1, 2),
     (0, 3), (2, 3), (3, 4)
 ]
 
-graph: any = "tavola"
+graph: any = "dining table"
 ```
 
 If we find some parts of the graph that are disconnected, we can just avoid storing and computing those parts
 
+> [!CAUTION]
+> Even if in this example we used single values, edges, nodes and graphs may
+> be made of tensors or structured data
+
 ## Graph Neural Networks (GNNs)
 
-At the simpkest form we take a **graph-in** and **graph-out** approach with MLPs separate for
-vertices, edges and master nodes that we apply **one at a time** over each element
+In its simplest form we take a **graph-in** and **graph-out** approach with `MLPs`
+([Multilayer Perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron),
+aka plain neural networks): one separate `MLP` for vertices, one for edges and
+one for master nodes, each applied **one element at a time**
 
 $$
 \begin{aligned}
@@ -101,20 +123,74 @@ $$
 \end{aligned}
 $$
 
+This also means that the output is itself a graph, and we need further refinement
+to get other kinds of outputs. For example, we can apply a classifier to each node
+embedding to get a class for that node.
+
 ### Pooling
 
-> [!CAUTION]
-> This step comes after the embedding phase described above
+This is a step that allows us to take info from a graph element type different
+from the one we need it on. For example, we may want info that lives on the
+edges, but need to bring it over to the vertices.
 
-This is a step that can be used to take info about other elements, different from what we were considering
-(for example, taking info from edges while making the computation over vertices).
+```python
+# Pooling by summation (sketched here with torch tensors)
+import torch
 
-By using this approach we usually gather some info from edges of a vertex, then we concat them in a matrix and
-aggregate by summing them.
+def pool(items: list[torch.Tensor]) -> torch.Tensor:
+
+    # The accumulator is a tensor of shape (D,),
+    # where D is the embedding dimension
+    pooled_value = torch.zeros_like(items[0])
+
+    for item in items:
+        pooled_value += item
+
+    return pooled_value
+```
+
+However, to make use of parallel computation, we can also do the following
+
+```python
+# Parallel version: stack the embeddings and sum them in one shot
+def parallel_pool(items: list[torch.Tensor]) -> torch.Tensor:
+
+    # NxD matrix: each row is an embedding
+    item_embedding_matrix = torch.stack(items)
+
+    # Sum over rows
+    return item_embedding_matrix.sum(dim=0)
+```
+
+This is, at its core, useful when we lack features for a portion of the data,
+for example the edges or the hypernode, and we use data coming from other parts
+to enable computations there.
 
 ### Message Passing
 
-Take all node embeddings that are in the neighbouroud and do similar steps as the pooling function.
+If we already have some info, we can instead use [`pooling`](#pooling) to augment
+it, taking into account the connectivity of the graph by pooling **only
+adjacent info of the same type**.
+
+This means that at each step a node receives info from its neighbours, always in
+the same fashion.
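+
+As a minimal sketch of one such step (the names `h`, `adj` and `update` are
+illustrative, not from the original notes), assuming node embeddings stacked in a
+matrix and an undirected edge list:
+
+```python
+import torch
+
+def message_passing_step(h: torch.Tensor,
+                         adj: list[tuple[int, int]],
+                         update: torch.nn.Module) -> torch.Tensor:
+    # h: (N, D) node embeddings, adj: undirected edge list
+    messages = torch.zeros_like(h)
+    for u, v in adj:
+        # each node pools the embeddings of its neighbours
+        messages[u] += h[v]
+        messages[v] += h[u]
+    # combine the pooled messages with the node's own embedding,
+    # e.g. update = torch.nn.Linear(2 * D, D)
+    return update(torch.cat([h, messages], dim=-1))
+```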
+After $step_k$ our node will have received partial information
+from nodes located up to $k$ steps away
+
+### Weaving
+
+If we combine the previous techniques together, we can merge info coming from
+many parts of the graph. However, if the embeddings do not have the same
+dimension, we need a linear layer to make their dimensions match.
+
+As we are going to use a linear layer, this means that we are using a **learned**
+representation, **which must be trained**.
+
+> [!CAUTION]
+> Because graphs may be very sparsely connected and the longest path may exceed
+> our number of layers, to bypass this, we weave info through the hypernode as well.
+>
+> The graph node will be used as if it were a node connected to all nodes, giving
+> each node an overview of all node information
 
 ### Special Layers
@@ -134,68 +210,152 @@ D_{v,v} = \sum_{u} A_{v,u}
 $$
 
-In other words, $D_{v, v}$ is the number of nodes connected ot that one
+In other words, $D_{v, v}$ is the number of nodes connected to the node $v$. In
+fact, notice that we are summing the $v$-th row of the **Adjacency Matrix** $A$.
 
-The **graph Laplacian** of the graph will be
+The [**Graph Laplacian**](https://en.wikipedia.org/wiki/Laplacian_matrix)
+$L$ of the graph will be
 
 $$
 L = D - A
 $$
 
-### Polynomials of Laplacian
+As we can see, $L$ keeps all elements of $D$ untouched, since no node has a
+connection to itself in the adjacency matrix, and contains all off-diagonal
+elements of the adjacency matrix with opposite sign (which means they are all
+non-positive)
 
-These polynomials, which have the same dimensions of $L$, can be though as being **filter** like in
-[CNNs](./../7-Convolutional-Networks/INDEX.md#convolutional-networks)
+### Convolution through Polynomials of the Laplacian
+
+Let's construct some polynomials by using the Laplacian Matrix:
 
 $$
 p_{\vec{w}}(L) = w_{0}I_{n} + w_{1}L^{1} + \dots + w_{d}L^{d} = \sum_{i=0}^{d} w_{i}L^{i}
 $$
 
-We then can get a ***filtered node*** by simply multiplying the polynomial with the node value
+Here each $w_i$ is a component of the weight vector $\vec{w} = [w_0, \dots, w_d] \in \R^{d+1}$.
+We can then define the convolution as $p_{\vec{w}}(L) \times \vec{x}$, where
+**$\vec{x}$ is the matrix of all stacked vertex embeddings**
 
 $$
 \begin{aligned}
-    \vec{x}' = p_{\vec{w}}(L) \vec{x}
+    \vec{x} &\in \R^{n\times d}\\
+    \vec{x}' &= p_{\vec{w}}(L) \vec{x}
 \end{aligned}
 $$
 
-> [!NOTE]
-> In order to extract new features for a single vertex, supposing only $w_1 \neq 0$
->
-> Observe that we are only taking $L_{v}$
+To explain the ***convolutional effect***, let's see what happens for a polynomial
+with $\vec{w}_0 = [w_0, 0, \dots, 0] \in \R^{d+1}$:
+
+$$
+\vec{x}' = p_{\vec{w}_0}(L) \times \vec{x} = w_0 I_{n} \times \vec{x}
+$$
+
+Here each row of $\vec{x}'$ is just a scaled version of the corresponding vertex
+embedding. Now consider a polynomial with $\vec{w}_1 = [0, w_1, 0, \dots, 0] \in \R^{d+1}$:
+
+$$
+\vec{x}' = p_{\vec{w}_1}(L) \times \vec{x} = w_1 L^1 \times \vec{x}
+$$
+
+Here each vertex merges the info coming from its connected vertices (with a
+negative sign) with its own embedding, scaled by its number of connections.
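+
+As a minimal sketch of this first-order filter (the toy adjacency matrix, the
+random embeddings and the weight value below are made up for illustration), using
+the same small graph as the tip that follows:
+
+```python
+import numpy as np
+
+# Toy graph: vertex 0 is connected to vertices 1 and 2
+A = np.array([[0, 1, 1],
+              [1, 0, 0],
+              [1, 0, 0]])
+D = np.diag(A.sum(axis=1))  # degree matrix
+L = D - A                   # graph Laplacian
+
+x = np.random.randn(3, 4)   # 3 vertices, embedding dimension 4
+w1 = 0.5                    # illustrative weight
+
+# First-order filter: each vertex mixes its own embedding (scaled by its
+# degree) with the negated sum of its neighbours' embeddings
+x_filtered = w1 * (L @ x)
+```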
+
+> [!TIP]
+> As this may not have been very clear from the formulas alone, let's make a
+> practical example, where vertex 0 is connected to both vertices 1 and 2:
 >
 > $$
 > \begin{aligned}
-> \vec{x}'_{v} &= (L\vec{x})_{v} \\
-> &= \sum_{u \in G} L_{v,u} \vec{x}_{u} \\
-> &= \sum_{u \in G} (D_{v,u} - A_{v,u}) \vec{x}_{u} \\
-> &= \sum_{u \in G} D_{v,u} \vec{x}_{u} - A_{v,u} \vec{x}_{u} \\
-> &= D_{v, v} \vec{x}_{v} - \sum_{u \in \mathcal{N}(v)} \vec{x}_{u}
+> \vec{x} &= \begin{bmatrix}
+>     \vec{x}_0 \in \R^{1 \times d} \\
+>     \vec{x}_1 \in \R^{1 \times d} \\
+>     \vec{x}_2 \in \R^{1 \times d} \\
+> \end{bmatrix}
+> \\
+> L &= \begin{bmatrix}
+>     2 & -1 & -1 \\
+>     -1 & 1 & 0 \\
+>     -1 & 0 & 1
+> \end{bmatrix}
+> \\
+> \vec{x}' &= w_1 L \vec{x} = \begin{bmatrix}
+>     2w_1 & -w_1 & -w_1 \\
+>     -w_1 & w_1 & 0 \\
+>     -w_1 & 0 & w_1
+> \end{bmatrix}
+> \times
+> \begin{bmatrix}
+>     \vec{x}_0 \\
+>     \vec{x}_1 \\
+>     \vec{x}_2
+> \end{bmatrix}
+> \\
+> &= \begin{bmatrix}
+>     2w_1\vec{x}_0 - w_1\vec{x}_1 - w_1 \vec{x}_2 \in \R^{1 \times d} \\
+>     -w_1\vec{x}_0 + w_1\vec{x}_1 + 0\vec{x}_2 \in \R^{1 \times d} \\
+>     -w_1\vec{x}_0 + 0\vec{x}_1 + w_1\vec{x}_2 \in \R^{1 \times d} \\
+> \end{bmatrix}
+> \end{aligned}
+> $$
 >
-> Where the last step holds as $D$ is a diagonal matrix, and in the summatory we are only considering the neighbours
-> of v
+> For simplicity we wrote the vertices as row vectors rather than decomposing them
+> into their elements. As we can see, each resulting row combines adjacent vertices.
+
+Now let's see what happens to the Laplacian Matrix when we raise it to the power of 2:
+
+$$
+\begin{aligned}
+    L &= \begin{bmatrix}
+        l_{0, 0} & l_{0, 1} & l_{0, 2} \\
+        l_{1, 0} & l_{1, 1} & l_{1, 2} \\
+        l_{2, 0} & l_{2, 1} & l_{2, 2} \\
+    \end{bmatrix}
+    \\
+    L \times L &=
+    \begin{bmatrix}
+        l_{0, 0} & l_{0, 1} & l_{0, 2} \\
+        l_{1, 0} & l_{1, 1} & l_{1, 2} \\
+        l_{2, 0} & l_{2, 1} & l_{2, 2} \\
+    \end{bmatrix}
+    \times
+    \begin{bmatrix}
+        l_{0, 0} & l_{0, 1} & l_{0, 2} \\
+        l_{1, 0} & l_{1, 1} & l_{1, 2} \\
+        l_{2, 0} & l_{2, 1} & l_{2, 2} \\
+    \end{bmatrix} \\
+    &= \begin{bmatrix}
+        l_{0, 0}^2 + l_{0, 1}l_{1, 0} + l_{0, 2}l_{2, 0} &
+        l_{0, 0}l_{0, 1} + l_{0, 1}l_{1, 1} + l_{0, 2}l_{2, 1} &
+        l_{0, 0}l_{0, 2} + l_{0, 1}l_{1, 2} + l_{0, 2}l_{2, 2} \\
+        l_{1, 0}l_{0, 0} + l_{1, 1}l_{1, 0} + l_{1, 2} l_{2, 0} &
+        l_{1, 0}l_{0, 1} + l_{1, 1}^2 + l_{1, 2}l_{2, 1} &
+        l_{1, 0}l_{0, 2} + l_{1, 1}l_{1, 2} + l_{1, 2}l_{2, 2} \\
+        l_{2, 0}l_{0, 0} + l_{2, 1}l_{1, 0} + l_{2, 2} l_{2, 0} &
+        l_{2, 0}l_{0, 1} + l_{2, 1}l_{1, 1} + l_{2, 2}l_{2, 1} &
+        l_{2, 0}l_{0, 2} + l_{2, 1}l_{1, 2} + l_{2, 2}^2
+    \end{bmatrix}
+\end{aligned}
+$$
+
+As we can see, information from the neighbours' connections leaks into the powers
+of the Laplacian Matrix. In general, it can be demonstrated that if the distance
+between 2 nodes is greater than the power of the Laplacian Matrix, the
+corresponding entry is zero ($dist_{G}(v, u) > i \rightarrow L_{v, u}^{i} = 0$),
+so the two nodes are treated as detached at that power.
+
+This means that by manipulating the degree of the polynomial, we can choose the
+diffusion distance for information. For example, if we want info from vertices
+at most 3 steps away, we will use a polynomial of 3rd degree.
+
+Another interesting property of these polynomials is that they do not depend on
+the order of the connections, but only on their existence, so they are node-order
+equivariant.
+
+> [!TIP]
+> Go [here](./python-experiments/laplacian_graph.ipynb) to see some experiments.
 >
-> It can be demonstrated that in any graph
->
-> $$
-> dist_{G}(v, u) > i \rightarrow L_{v, u}^{i} = 0
-> $$
->
-> More in general it holds
->
-> $$
-> \begin{aligned}
-> \vec{x}'_{v} = (p_{\vec{w}}(L)\vec{x})_{v} &= (p_{\vec{w}}(L))_{v} \vec{x} \\
-> &= \sum_{i = 0}^{d} w_{i}L_{v}^{i} \vec{x} \\
-> &= \sum_{i = 0}^{d} w_{i} \sum_{u \in G} L_{v,u}^{i}\vec{x}_{u} \\
-> &= \sum_{i = 0}^{d} w_{i} \sum_{\substack{u \in G \\ dist_{G}(v, u) \leq i}} L_{v,u}^{i}\vec{x}_{u} \\
-> \end{aligned}
-> $$
->
-> So this shows that the degree of the polynomial decides the max number of hops
-> to be included during the filtering stage, like if it were defining a [kernel](./../7-Convolutional-Networks/INDEX.md#filters)
+> In particular, see the second one
 
 ### ChebNet
@@ -210,23 +370,65 @@ T_{i} &= cos(i\theta) \\
 $$
 
 - $T_{i}$ is Chebischev first kind polynomial
+- $\theta$ are the weights to be learnt[^chebnet]
 - $\tilde{L}$ is a reduced version of $L$ because we divide for its max eigenvalue, keeping it
   in range $[-1, 1]$. Moreover $L$ ha no negative eigenvalues, so it's positive semi-definite
 
 These polynomials are more stable as they do not explode with higher powers
 
+> [!NOTE]
+>
+> Even though we said that we can control the radius of the neighbourhood leaking
+> info, technically speaking, if we stack $N$ GCN layers together, we get info from
+> nodes up to $N \cdot hops$ away, assuming each layer takes from at most $hops$ away.
+>
+> So, if we have 2 layers, each taking from distance 3, after the computation a
+> node gets info from nodes up to 6 hops away.
+
 ### Embedding Computation
 
+Whenever we compute over nodes, we first need to compute their embeddings.
+From there on, all embeddings are treated as inputs for our networks.
 
-## Other methods
+$$
+\begin{aligned}
+    g(x) &\coloneqq \text{Any non linear function} \\
+    \vec{x}' &= g\left(p_{\vec{w}}(L) \times \vec{x}\right)
+\end{aligned}
+$$
 
-- Learnable parameters
+In each layer the same weights are applied to every node, like in
+[CNNs](./../7-Convolutional-Networks/INDEX.md#convolutional-layer)
+
+## Real Graph Network Configurations
+
+To sum up all we have seen so far, to compute over graphs we usually gather info
+from the neighbours (Node Aggregation), then transform the old info of our node
+and combine the two (Node Update), using:
+
+- (Potentially) Learnable parameters
 - Embeddings of node v
 - Embeddings of neighbours of v
 
-### Graph Convolutional Networks
+### Graph Neural Network
+
+$$
+\textcolor{orange}{h_v^{(k)}} =
+\textcolor{skyblue}{f_2^{(k)}} \left(
+    \underbrace{
+        \sum_{n\in\mathcal{N}(v)} \textcolor{skyblue}{f_1^{(k)}}(
+            \textcolor{orange}{h_n^{(k-1)}},
+            \textcolor{orange}{h_v^{(k-1)}},
+            \textcolor{orange}{e_{v, n}}
+        ),
+    }_{\text{message passing of edges to nodes}}\,\,
+    \textcolor{orange}{h_v^{(k-1)}}
+\right)
+$$
+
+### Graph Convolutional Network
 
 $$
 \textcolor{orange}{h_{v}^{(k)}} =
@@ -241,6 +443,11 @@ $$
 \right) \forall v \in V
 $$
 
+> [!TIP]
+>
+> $\textcolor{skyblue}{f^{(k)}}$ here is a
+> **Non-Linear Activation function**
+
 ### Graph Attention Networks
 
 $$
@@ -256,25 +463,101 @@ $$
 \right] \right) \forall v \in V
 $$
 
-where
+where the $\alpha^{(k)}_{v,u}$ are weights generated by an attention mechanism
+$\textcolor{skyblue}{A^{(k)}}$, normalized so that the $\alpha^{(k)}_{v,u}$ of each
+node sum to 1.
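+
+As a minimal sketch of how these coefficients could be computed for a single node
+(the pairing by concatenation and the scoring module `A` are illustrative
+assumptions, and the normalization below assumes `A` outputs positive scores),
+formalized by the formula right after:
+
+```python
+import torch
+
+def attention_coefficients(h_v: torch.Tensor,
+                           neighbours: list[torch.Tensor],
+                           A: torch.nn.Module) -> torch.Tensor:
+    # Unnormalized scores A(h_v, h_u) for every neighbour u of v
+    scores = torch.stack([A(torch.cat([h_v, h_u])) for h_u in neighbours])
+    # Normalize so that the coefficients of node v sum to 1
+    return scores / scores.sum()
+```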
 $$
 \alpha^{(k)}_{v,u} = \frac{
     \textcolor{skyblue}{A^{(k)}}(
         \textcolor{orange}{h_{v}^{(k)}},
-        \textcolor{violet}{h_{u}^{(k)}},
+        \textcolor{violet}{h_{u}^{(k)}}
     )
 }{
     \sum_{w \in \mathcal{N}(v)} \textcolor{skyblue}{A^{(k)}}(
         \textcolor{orange}{h_{v}^{(k)}},
-        \textcolor{violet}{h_{w}^{(k)}},
+        \textcolor{violet}{h_{w}^{(k)}}
    )
 } \forall (v, u) \in E
 $$
 
+> [!TIP]
+> To make computations faster, we can just compute all the
+> $\textcolor{skyblue}{A^{(k)}}(
+> \textcolor{orange}{h_{v}^{(k)}},
+> \textcolor{violet}{h_{u}^{(k)}}
+> )$ once and only sum them up at a later stage to compute $\alpha^{(k)}_{v,u}$
+>
+> Also, this system (GAT) is flexible enough to let us change the attention
+> mechanism and include more heads; the one above uses just a single
+> attention head
+
 ### Graph Sample and Aggregate (GraphSAGE)
 
-
+Here, instead of taking all the neighbours, we take a fixed-size subset of them,
+called the `pool`, sampled at random. This increases variance but allows the
+method to be applied to large graphs
+
+$$
+\textcolor{orange}{h_v^{(k)}} =
+\textcolor{skyblue}{f^{(k)}} \left(
+    \textcolor{skyblue}{W^{(k)}}
+    \cdot
+    \underbrace{
+        \left[
+            \underbrace{
+                \textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+                    \textcolor{violet}{h_u^{(k-1)}}
+                )
+            }_{\text{Aggregation of v's neighbours}}
+            ,
+            \textcolor{orange}{h_v^{(k-1)}}
+        \right]
+    }_{\text{Concatenation of aggr. and prev. embedding}}
+\right) \forall v \in V
+$$
+
+where
+$\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+    \textcolor{violet}{h_u^{(k-1)}}
+)$ may be one of the following:
+
+#### Mean
+
+This is similar to what we had in the GCN above
+
+$$
+\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+    \textcolor{violet}{h_u^{(k-1)}}
+) = \textcolor{skyblue}{W_{pool}^{(k)}} \cdot
+\frac{
+    \textcolor{orange}{h_v^{(k-1)}} +
+    \sum_{u\in \mathcal{N}(v)}\textcolor{violet}{h_u^{(k-1)}}
+}{
+    1 + |\mathcal{N}(v)|
+}
+$$
+
+#### Dimension Wise Maximum
+
+$$
+\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
+    \textcolor{violet}{h_u^{(k-1)}}
+) = \max_{u \in \mathcal{N}(v)} \left\{
+    \sigma(
+        \textcolor{skyblue}{W_{pool}^{(k)}} \cdot
+        \textcolor{violet}{h_u^{(k-1)}}
+        + \textcolor{skyblue}{b}
+    )
+\right\}
+$$
+
+#### LSTM Aggregator
+
+This aggregator is itself another network, based on an
+[LSTM](./../8-Recurrent-Networks/INDEX.md#long-short-term-memory--lstm). This
+method **requires ordering the sequence of neighbours**
 
 ### Graph Isomorphism Network (GIN)
 
 $$
 \textcolor{orange}{h_{v}^{(k)}} =
 \textcolor{skyblue}{f^{(k)}} \left(
     \sum_{u \in \mathcal{N}(v)} \textcolor{violet}{h_{u}^{(k - 1)}}
     +
     \left(
         1 + \textcolor{skyblue}{\epsilon^{(k)}}
     \right) \cdot \textcolor{orange}{h_{v}^{(k - 1)}}
 \right) \forall v \in V
-$$
\ No newline at end of file
+$$
+
+[^chebnet]: [Data Warrior | Graph Convolutional Neural Network (Part II) | 9th November 2025](https://datawarrior.wordpress.com/2018/08/12/graph-convolutional-neural-network-part-ii/)
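+
+To tie things together, here is a minimal sketch of the GIN update above
+(the class name, the edge-list input format and the choice of a small `MLP`
+for $\textcolor{skyblue}{f^{(k)}}$ are illustrative assumptions, not the
+reference implementation):
+
+```python
+import torch
+
+class GINLayer(torch.nn.Module):
+    def __init__(self, dim: int):
+        super().__init__()
+        # learnable epsilon, initialized to 0
+        self.eps = torch.nn.Parameter(torch.zeros(1))
+        # a small MLP playing the role of f^(k)
+        self.f = torch.nn.Sequential(
+            torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim)
+        )
+
+    def forward(self, h: torch.Tensor, adj: list[tuple[int, int]]) -> torch.Tensor:
+        # h: (N, D) node embeddings, adj: undirected edge list
+        neigh_sum = torch.zeros_like(h)
+        for u, v in adj:
+            neigh_sum[u] += h[v]
+            neigh_sum[v] += h[u]
+        # sum of the neighbours plus the (1 + eps)-scaled own embedding
+        return self.f(neigh_sum + (1 + self.eps) * h)
+```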