Graph ML

Graph Introduction

  • Nodes: Pieces of Information
  • Edges: Relationship between nodes
    • Mutual
    • One-Sided
  • Directionality
    • Directed: We care about the order of connections
      • Unidirectional
      • Bidirectional
    • Undirected: We don't care about order of connections

Now, we can have attributes over

  • nodes
    • identity
    • number of neighbours
  • edges
    • identity
    • weight
  • master nodes (a collection of nodes and edges)
    • number of nodes
    • longest path

For example, images may be represented as a graph where each non-border pixel is a vertex connected to its 8 neighbouring pixels. The information at each vertex is a 3 (or 4) dimensional vector (think of RGB and RGBA)

Adjacency Matrix

Take a picture and build a matrix of dimension \{0, 1\}^{(h \cdot w) \times (h \cdot w)}: we put a 1 if two nodes are connected (share an edge), or a 0 if they do not.

Note

For a 300 \times 250 image our matrix would be \{0, 1\}^{(250 \cdot 300) \times (250 \cdot 300)}

The way we put a 1 or a 0 follows these rules:

  • The row element has a connection towards the column element
  • The column element has a connection coming from the row element
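
As a minimal sketch (assuming NumPy and the undirected 8-neighbour pixel graph from the image example above, so both directions get a 1), we could build such a matrix as follows; the helper pixel_index is just an illustrative name:

# Sketch: adjacency matrix of an h x w pixel grid with 8-neighbour connectivity
import numpy as np

def image_adjacency(h: int, w: int) -> np.ndarray:
    n = h * w
    A = np.zeros((n, n), dtype=np.uint8)

    def pixel_index(r: int, c: int) -> int:
        # Row-major flattening of pixel (r, c)
        return r * w + c

    for r in range(h):
        for c in range(w):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        # Row pixel has a connection towards the column pixel
                        A[pixel_index(r, c), pixel_index(rr, cc)] = 1
    return A

A = image_adjacency(3, 4)   # 12 x 12 matrix
print(A.sum(axis=1))        # degree of each pixel: 3 (corner), 5 (border) or 8 (interior)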

Tasks

Graph-Level

We want to predict a graph property

Node-Level

We want to predict a node property, such as classification

Edge-Level

We want to predict relationships between nodes, such as whether they share an edge, or the value of the edge they share.

For this task we may start with a fully connected graph and then prune edges as predictions go on, arriving at a sparse graph

Challenges of dealing with graphs

While graphs are very powerful at representing structures in a compact and natural way, they have several challenges.

The number of nodes in a graph may change wildly, making it difficult to work with different graphs.

Sometimes nodes have no meaningful order, thus we need to treat different orders in the same way.

Graphs can be very large, thus take a lot of space. However they are, usually, sparse in nature, making it possible to find ways to compress their representation.

Representing Graphs

Adjacency List

We store info about:

  • Nodes: list of values; the element at index k is the value of node k
  • Edges: list of values; the element at index k is the value of edge k
  • Adjacency list: list of tuples of node indices; the tuple at index k contains the nodes involved in the k-th edge
  • Graph: the value of the graph itself
from typing import Any

nodes: list[Any] = [
    "fork", "spaghetti", "knife", "spoon", "soup"
]

edges: list[Any] = [
    "needed to eat", "cutlery", "food",
    "cutlery", "cutlery", "needed to eat"
]

adj_list: list[tuple[int, int]] = [
    (0, 1), (0, 2), (1, 4),
    (0, 3), (2, 3), (3, 4)
]

graph: Any = "dining table"

If we find some parts of the graph that are disconnected, we can just avoid storing and computing those parts

Caution

Even though in this example we used single values, edges, nodes and graphs may carry Tensors or structured data

Graph Neural Networks (GNNs)

In its simplest form we take a graph-in, graph-out approach with separate MLPs (Multi-Layer Perceptrons, i.e. neural networks) for vertices, edges and master nodes, applied one at a time over each element


\begin{aligned}
    V_{i + 1} &= MLP_{V_{i}}(V_{i}) \\
    E_{i + 1} &= MLP_{E_{i}}(E_{i}) \\
    U_{i + 1} &= MLP_{U_{i}}(U_{i}) \\
\end{aligned}

This also means that its output is a graph and we need further refining to get other kind of outputs. For example we can apply a classifier for each node embedding to get a class of that node.
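
A minimal sketch of this graph-in, graph-out layer, assuming NumPy and a toy make_mlp helper; the helper, the sizes and the random data are illustrative, not a reference implementation:

# Sketch: one graph-in, graph-out layer with separate MLPs for V, E and U
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(d_in: int, d_out: int, d_hidden: int = 16):
    # A toy one-hidden-layer ReLU MLP returned as a callable
    W1 = rng.normal(size=(d_in, d_hidden)) * 0.1
    W2 = rng.normal(size=(d_hidden, d_out)) * 0.1
    return lambda X: np.maximum(X @ W1, 0) @ W2

d = 8                           # embedding dimension
V = rng.normal(size=(5, d))     # 5 vertex embeddings
E = rng.normal(size=(6, d))     # 6 edge embeddings
U = rng.normal(size=(1, d))     # master-node embedding

mlp_V, mlp_E, mlp_U = make_mlp(d, d), make_mlp(d, d), make_mlp(d, d)

# Each element type is updated independently, one element (row) at a time
V_next, E_next, U_next = mlp_V(V), mlp_E(E), mlp_U(U)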

Pooling

This is a step that allows us to take information from one type of graph element and bring it to a different one. For example, we may want information coming from edges, but bring it over to vertices.

# Pooling: sum the embeddings of the given items
import numpy as np

def pool(items: list[np.ndarray]) -> np.ndarray:

    # This is usually a tensor 1xD
    #   where D is the embedding dimension
    pooled_value = np.zeros_like(items[0])

    for item in items:

        pooled_value += item

    return pooled_value

However, to make use of parallel computation, we can also do the following

# Parallel pooling: stack the embeddings and sum in one shot
def parallel_pool(items: list[np.ndarray]) -> np.ndarray:

    # NxD matrix
    # Each row of this matrix is an embedding
    item_embedding_matrix = np.stack(items, axis=0)

    # Sum over rows
    return item_embedding_matrix.sum(axis=0)

This, at its core, is useful when we lack features for a portion of the graph, for example edges or the hypernode, and we use data coming from other parts to enable us to do computations.

Message Passing

However, if we already have some information, we can use pooling to augment it, taking into account the connectivity of the graph by pooling only adjacent elements of the same type.

This means that at each step a node receives information from its neighbours in the same fashion. After step k our node will have received partial information from nodes located k steps away
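
A minimal sketch of one such step, assuming node embeddings stored as an NxD NumPy array and the undirected adj_list convention from the adjacency-list section above (sum pooling plus the node's own embedding; the names are illustrative):

# Sketch: one message-passing step over an undirected edge list
import numpy as np

def message_passing_step(node_emb: np.ndarray,
                         adj_list: list[tuple[int, int]]) -> np.ndarray:
    # Each node sums the embeddings of its neighbours
    pooled = np.zeros_like(node_emb)
    for src, dst in adj_list:
        pooled[dst] += node_emb[src]   # message src -> dst
        pooled[src] += node_emb[dst]   # message dst -> src (undirected)
    # Combine pooled messages with the node's own embedding
    return node_emb + pooled

node_emb = np.eye(5)   # 5 nodes with 5-dimensional one-hot embeddings
adj_list = [(0, 1), (0, 2), (1, 4), (0, 3), (2, 3), (3, 4)]
print(message_passing_step(node_emb, adj_list))

Applying this step k times lets information travel up to k hops, which is exactly the behaviour described above.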

Weaving

If we combine the previous techniques together, we can merge information coming from many parts of the graph. However, if the embeddings do not share the same dimension, we need a linear layer to make the dimensions match.

As we are going to use a linear layer, this means that we are using a learned representation, which must be trained.

Caution

Because graphs may be very sparsely connected and the longest path may exceed our number of layers, to bypass this we weave information through the hypernode as well.

The hypernode will be used as if it were a node connected to all nodes, giving each node an overview of the whole graph's information

Special Layers

Polynomial Filters

Each polynomial filter is order invariant

Graph Laplacian

Let's set an order over the nodes of a graph and let A be its adjacency matrix. We then define the (diagonal) degree matrix D as:


D_{v,v} = \sum_{u} A_{v,u}

In other words, D_{v, v} is the number of nodes connected to the node v. In fact, notice that we are summing over the rows of the adjacency matrix A.

The Graph Laplacian L of the graph will be


L = D - A

As we can see, L keeps all diagonal elements of D untouched, since each node has no connection to itself in the adjacency matrix, and contains all elements of the adjacency matrix with opposite sign (which means they are all non-positive)
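
A minimal NumPy sketch of D and L for a small undirected graph (the same 3-node example used in the worked example below):

# Sketch: degree matrix and graph Laplacian from an adjacency matrix
import numpy as np

# Vertex 0 connected to vertices 1 and 2 (undirected, no self-loops)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])

D = np.diag(A.sum(axis=1))   # D[v, v] = number of neighbours of v
L = D - A                    # graph Laplacian

print(L)
# [[ 2 -1 -1]
#  [-1  1  0]
#  [-1  0  1]]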

Convolution through Polynomials of the Laplacian

Let's construct some polynomials by using the Laplacian Matrix:


p_{\vec{w}}(L) = w_{0}I_{n} + w_{1}L^{1} + \dots + w_{d}L^{d} = \sum_{i=0}^{d} w_{i}L^{i}

Here each w_i is a weight of the vector \vec{w} = [w_0, \dots, w_d] \in \R^{d+1}. We can then define the convolution as p_{\vec{w}}(L) \times \vec{x}, where \vec{x} is the matrix of all stacked vertex embeddings


\begin{aligned}
    \vec{x} &\in \R^{l\times d}\\
    \vec{x}' &= p_{\vec{w}}(L) \vec{x}
\end{aligned}

To explain the convolutional effect, let's see what happens for the polynomial given by \vec{w}_0 = [w_0, 0, \dots, 0] \in \R^{d+1}:


\vec{x}' = p_{\vec{w}_0}(L) \times \vec{x} = w_0I_{l} \times \vec{x}

Here each item of \vec{x}' is just a scaled version of itself. Now consider the polynomial given by \vec{w}_1 = [0, w_1, 0, \dots, 0] \in \R^{d+1}:


\vec{x}' = p_{\vec{w}_1}(L) \times \vec{x} = w_1L^{1}\times \vec{x}

Here vertices merge information coming from their connected vertices plus a contribution of themselves (scaled by the number of connections).

Tip

As this may have not been very clear from these formulas, let's make a practical example, where vertex 0 is connected to both vertices 1 and 2:


\begin{aligned}
    \vec{x} &= \begin{bmatrix}
        \vec{x}_0 \in \R^{1 \times d} \\
        \vec{x}_1 \in \R^{1 \times d} \\
        \vec{x}_2 \in \R^{1 \times d} \\
    \end{bmatrix}
    \\
    L &= \begin{bmatrix}
        2 & -1 & -1 \\
        -1 & 1 & 0 \\
        -1 & 0 & 1
    \end{bmatrix}
    \\
    \vec{x}' &= w_1 L \vec{x} = \begin{bmatrix}
        2w_1 & -w_1 & -w_1 \\
        -w_1 & w_1 & 0 \\
        -w_1 & 0 & w_1
    \end{bmatrix}
    \times
    \begin{bmatrix}
        \vec{x}_0 \\
        \vec{x}_1 \\
        \vec{x}_2
    \end{bmatrix}
    \\
    &= \begin{bmatrix}
        2w_1\vec{x}_0 - w_1\vec{x}_1 - w_1 \vec{x}_2 \in \R^{1 \times d} \\
        -w_1\vec{x}_0 + w_1\vec{x}_1 + 0\vec{x}_2 \in \R^{1 \times d} \\
        -w_1\vec{x}_0 + 0\vec{x}_1 + w_1\vec{x}_2 \in \R^{1 \times d} \\
    \end{bmatrix}
\end{aligned}

For simplicity we wrote vertices as vectors rather than decomposing into their elements. As we can notice here, we combined adjacent vertices.

Now let's see what happens to the Laplacian Matrix for the power of 2:


\begin{aligned}
    L &= \begin{bmatrix}
        l_{0, 0} & l_{0, 1} & l_{0, 2} \\
        l_{1, 0} & l_{1, 1} & l_{1, 2} \\
        l_{2, 0} & l_{2, 1} & l_{2, 2} \\
    \end{bmatrix}
    \\
    L \times L &=
    \begin{bmatrix}
        l_{0, 0} & l_{0, 1} & l_{0, 2} \\
        l_{1, 0} & l_{1, 1} & l_{1, 2} \\
        l_{2, 0} & l_{2, 1} & l_{2, 2} \\
    \end{bmatrix}
    \times
    \begin{bmatrix}
        l_{0, 0} & l_{0, 1} & l_{0, 2} \\
        l_{1, 0} & l_{1, 1} & l_{1, 2} \\
        l_{2, 0} & l_{2, 1} & l_{2, 2} \\
    \end{bmatrix} = \\
    &= \begin{bmatrix}
        l_{0, 0}^2 + l_{0, 1}l_{1, 0} + l_{0, 2}l_{2, 0} &
        l_{0, 0}l_{0, 1} + l_{0, 1}l_{1, 1} + l_{0, 2}l_{2, 1} &
        l_{0, 0}l_{0, 2} + l_{0, 1}l_{1, 2} + l_{0, 2}l_{2, 2} \\
        l_{1, 0}l_{0, 0} + l_{1, 1}l_{1, 0} + l_{1, 2} l_{2, 0} &
        l_{1, 0}l_{0, 1} + l_{1, 1}^2 + l_{1, 2}l_{2, 1} &
        l_{1, 0}l_{0, 2} + l_{1, 1}l_{1, 2} + l_{1, 2}l_{2, 2} \\
        l_{2, 0}l_{0, 0} + l_{2, 1}l_{1, 0}+ l_{2, 2} l_{2, 0} &
        l_{2, 0}l_{0, 1} + l_{2, 1}l_{1, 1} + l_{2, 2}l_{2, 1} &
        l_{2, 0}l_{0, 2} + l_{2, 1}l_{1, 2} + l_{2, 2}^2
    \end{bmatrix}
\end{aligned}

As we can see, information from the neighbours of neighbours leaks into the powers of the Laplacian matrix. In the extreme case, it is possible to show that if the distance between 2 nodes is greater than the power of the Laplacian matrix, the corresponding entry is zero, so they are treated as detached at that power.

This means that by manipulating the degree of the polynomial, we can choose the diffusion distance of the information. For example, if we want information from vertices at most 3 steps away, we will use a polynomial of 3rd degree.

Another interesting property of these polynomials is that they do not depend on the order of the connections, only on their existence, so they are order equivariant.
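
A minimal NumPy sketch of building p_{\vec{w}}(L) and applying it to the stacked embeddings \vec{x}; the weights here are arbitrary illustrative values:

# Sketch: polynomial filter of the Laplacian applied to node embeddings
import numpy as np

def poly_filter(L: np.ndarray, w: np.ndarray) -> np.ndarray:
    # p_w(L) = w_0 I + w_1 L + ... + w_d L^d
    n = L.shape[0]
    P = np.zeros_like(L, dtype=float)
    L_power = np.eye(n)
    for w_i in w:
        P += w_i * L_power
        L_power = L_power @ L
    return P

L = np.array([[ 2., -1., -1.],
              [-1.,  1.,  0.],
              [-1.,  0.,  1.]])
x = np.eye(3)                    # 3 nodes, 3-dimensional embeddings
w = np.array([0.5, 0.25, 0.1])   # degree-2 polynomial: info from at most 2 hops away

x_new = poly_filter(L, w) @ x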

Tip

Go here to see some experiments.

In particular, see the second one

ChebNet

The polynomial in ChebNet becomes:


\begin{aligned}
p_{\vec{w}}(L) &= \sum_{i = 1}^{d} w_{i} T_{i}(\tilde{L}) \\
T_{i}(\cos\theta) &= \cos(i\theta) \\
\tilde{L} &= \frac{2L}{\lambda_{\max}(L)} - I_{n}
\end{aligned}
  • T_{i} is the Chebyshev polynomial of the first kind
  • \theta are the weights to be learnt
  • \tilde{L} is a rescaled version of L: we divide by its largest eigenvalue, keeping its spectrum in the range [-1, 1]. Moreover, L has no negative eigenvalues, so it is positive semi-definite

These polynomials are more stable as they do not explode with higher powers
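
A minimal NumPy sketch of the rescaled Laplacian and of evaluating the filter with the Chebyshev recurrence T_0 = I, T_1 = \tilde{L}, T_i = 2\tilde{L}T_{i-1} - T_{i-2}; here the sum also includes an i = 0 identity term, and the weights are illustrative:

# Sketch: ChebNet-style filter via the Chebyshev recurrence
import numpy as np

def chebnet_filter(L: np.ndarray, w: np.ndarray) -> np.ndarray:
    n = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()      # L is symmetric and positive semi-definite
    L_tilde = 2.0 * L / lam_max - np.eye(n)    # eigenvalues rescaled into [-1, 1]

    T_prev, T_curr = np.eye(n), L_tilde        # T_0 and T_1
    P = w[0] * T_prev + w[1] * T_curr
    for w_i in w[2:]:
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev
        P += w_i * T_curr
    return P

L = np.array([[ 2., -1., -1.],
              [-1.,  1.,  0.],
              [-1.,  0.,  1.]])
w = np.array([0.2, 0.5, 0.3])
x_new = chebnet_filter(L, w) @ np.eye(3)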

Note

Even though we said that we can control the radius of the neighbours leaking information, technically speaking, if we stack N GCN layers, each taking information from nodes at most K hops away, a node ends up receiving information from nodes up to N \cdot K hops away.

So, let's say we have 2 layers each taking from distance 3: after the computation a node gets information from nodes up to 6 hops away.

Embedding Computation

Whenever we are computing over nodes, we first need to compute their embeddings. From there on, all embeddings are treated as inputs for our networks.


\begin{aligned}
    g(x) &\coloneqq \text{Any non linear function} \\
    \vec{x}' &= g\left(p_{\vec{w}}(L) \times \vec{x}\right)
\end{aligned}

Within each layer the same weights are applied to every node, like the shared filters in CNNs

Real Graph Networks Configurations

To sum up all we have seen until now: to make computations over graphs we usually take information from neighbours (Node Aggregation) and transform the old information of our node and combine the two (Node Update). In the formulas below, each update involves:

  • (Potentially) learnable parameters
  • Embeddings of node v
  • Embeddings of the neighbours of v

Graph Neural Network


\textcolor{orange}{h_v^{(k)}} =
\textcolor{skyblue}{f_2^{(k)}} \left(
        \underbrace{
            \sum_{n\in\mathcal{N}(v)} \textcolor{skyblue}{f_1^{(k)}}(
                \textcolor{orange}{h_n^{(k-1)}},
                \textcolor{orange}{h_v^{(k-1)}},
                \textcolor{orange}{e_{v, n}}
            ),
        }_{\text{message passing of edges to nodes}}\,\,
        \textcolor{orange}{h_v^{(k -1)}}
\right)
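
A minimal sketch of this update for a single node v, with toy f_1 and f_2 functions; everything here (data layout, tanh choices) is illustrative:

# Sketch: one generic GNN update for node v
import numpy as np

def gnn_update(v, h, e, neighbours, f1, f2):
    # h: (num_nodes, d) node embeddings, e: dict edge -> embedding
    # Message passing: one message per incident edge, then sum-pool
    messages = sum(f1(h[n], h[v], e[(v, n)]) for n in neighbours[v])
    # Node update: combine the pooled messages with the previous embedding
    return f2(messages, h[v])

rng = np.random.default_rng(0)
d = 4
h = rng.normal(size=(3, d))
e = {(0, 1): rng.normal(size=d), (0, 2): rng.normal(size=d)}
neighbours = {0: [1, 2]}

f1 = lambda h_n, h_v, e_vn: np.tanh(h_n + h_v + e_vn)   # toy message function f_1
f2 = lambda m, h_v: np.tanh(m + h_v)                    # toy update function f_2

h0_next = gnn_update(0, h, e, neighbours, f1, f2)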

Graph Convolutional Network


\textcolor{orange}{h_{v}^{(k)}} =
\textcolor{skyblue}{f^{(k)}} \left(
    \underbrace{\textcolor{skyblue}{W^{(k)}} \cdot
    \frac{
        \sum_{u \in \mathcal{N}(v)} \textcolor{violet}{h_{u}^{(k-1)}}
    }{
        |\mathcal{N}(v)|
    }}_{\text{mean of previous neighbour embeddings}} + \underbrace{\textcolor{skyblue}{B^{(k)}} \cdot
\textcolor{orange}{h_{v}^{(k - 1)}}}_{\text{previous embeddings}}
\right) \forall v \in V

Tip

\textcolor{skyblue}{f^{(k)}} here is a Non-Linear Activation function
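
A minimal NumPy sketch of one GCN layer over all nodes at once, using the adjacency matrix to take the neighbour mean; W, B and the ReLU choice for f^{(k)} are illustrative:

# Sketch: one GCN layer for all nodes at once
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray,
              W: np.ndarray, B: np.ndarray) -> np.ndarray:
    # H: (num_nodes, d) previous embeddings, A: adjacency matrix without self-loops
    deg = A.sum(axis=1, keepdims=True)             # |N(v)| for every node v
    neigh_mean = (A @ H) / np.maximum(deg, 1)      # mean of previous neighbour embeddings
    return np.maximum(neigh_mean @ W + H @ B, 0)   # f = ReLU

rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 4)) * 0.1
B = rng.normal(size=(4, 4)) * 0.1

H_next = gcn_layer(A, H, W, B)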

Graph Attention Networks


\textcolor{orange}{h_{v}^{(k)}} =
\textcolor{skyblue}{f^{(k)}} \left(
    \textcolor{skyblue}{W^{(k)}} \cdot \left[
    \underbrace{
        \sum_{u \in \mathcal{N}(v)} \alpha^{(k-1)}_{v,u}
        \textcolor{violet}{h_{u}^{(k-1)}}
     }_{\text{weighted mean of previous neighbour embeddings}}  +
        \underbrace{\alpha^{(k-1)}_{v,v}
        \textcolor{orange}{h_{v}^{(k-1)}}}_{\text{previous embeddings}}
\right] \right) \forall v \in V

where \alpha^{(k)}_{v,u} are weights generated by an attention mechanism A^{(k)}, normalized so that the \alpha^{(k)}_{v,u} sum to 1 for each node.


\alpha^{(k)}_{v,u} = \frac{
    \textcolor{skyblue}{A^{(k)}}(
        \textcolor{orange}{h_{v}^{(k)}},
        \textcolor{violet}{h_{u}^{(k)}}
    )
}{
    \sum_{w \in \mathcal{N}(v)} \textcolor{skyblue}{A^{(k)}}(
        \textcolor{orange}{h_{v}^{(k)}},
        \textcolor{violet}{h_{w}^{(k)}}
    )
} \forall (v, u) \in E

Tip

To make computations faster, we can just compute all $\textcolor{skyblue}{A^{(k)}}( \textcolor{orange}{h_{v}^{(k)}}, \textcolor{violet}{h_{u}^{(k)}} )$ and then sum them up at a later stage to compute \alpha^{(k)}_{v,u}

Also, this system (GAT) is flexible enough to let us change the attention mechanism and include more heads, as the one above uses just a single attention head
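
A minimal sketch of the \alpha coefficients for a single node with one attention head; the exponentiated additive score used as A^{(k)} is an illustrative choice (it keeps the scores positive so they sum to 1), and here the node itself is included in the normalisation:

# Sketch: single-head attention coefficients for node v
import numpy as np

def attention_coefficients(v, H, neighbours, a):
    # Unnormalised positive score A(h_v, h_u) for an edge (v, u)
    def score(u):
        return np.exp(np.tanh(a @ np.concatenate([H[v], H[u]])))

    nodes = neighbours + [v]                  # include the node itself
    scores = {u: score(u) for u in nodes}
    total = sum(scores.values())
    return {u: s / total for u, s in scores.items()}

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))                   # 5 nodes, 4-dimensional embeddings
a = rng.normal(size=8)                        # attention parameters (illustrative)

alpha = attention_coefficients(0, H, neighbours=[1, 2, 3], a=a)
print(alpha)                                  # the coefficients sum to 1 for node 0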

Graph Sample and Aggregate (GraphSAGE)

Here, instead of taking all neighbours, we take a fixed-size set of neighbours (a pool) sampled at random. This increases variance but allows the method to be applied to large graphs.


\textcolor{orange}{h_v^{(k)}} =
\textcolor{skyblue}{f^{(k)}} \left(
    \textcolor{skyblue}{W^{(k)}}
    \cdot
    \underbrace{
        \left[
        \underbrace{
            \textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
                        \textcolor{violet}{h_u^{(k-1)}}
            )
        }_{\text{Aggregation of v's neighbours}}
        ,
        \textcolor{orange}{h_v^{(k-1)}}
        \right]
    }_{\text{Concatenation of aggr. and prev. embedding}}

\right) \forall v \in V

where $\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}( \textcolor{violet}{h_u^{(k-1)}} )$ may be one of the following:

Mean

This is similar to what we had in the GCN above


\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
    \textcolor{violet}{h_u^{(k-1)}}
) = \textcolor{skyblue}{W_{pool}^{(k)}} \cdot
\frac{
    \textcolor{orange}{h_v^{(k-1)}}+
    \sum_{u\in \mathcal{N}(v)}\textcolor{violet}{h_u^{(k-1)}}

}{
    1 + |\mathcal{N}(v)|
}
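
A minimal NumPy sketch of neighbour sampling plus the mean aggregator; the sample size and the weights are illustrative:

# Sketch: GraphSAGE-style sampling and mean aggregation for node v
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbours(neighbours: list[int], k: int) -> list[int]:
    # Fixed-size pool, sampled with replacement only if there are fewer than k neighbours
    return list(rng.choice(neighbours, size=k, replace=len(neighbours) < k))

def mean_aggregate(v: int, H: np.ndarray, pool: list[int],
                   W_pool: np.ndarray) -> np.ndarray:
    # Mean of the node's own embedding and the sampled neighbour embeddings
    stacked = np.vstack([H[v]] + [H[u] for u in pool])
    return stacked.mean(axis=0) @ W_pool

H = rng.normal(size=(6, 4))
W_pool = rng.normal(size=(4, 4)) * 0.1

pool = sample_neighbours([1, 2, 3, 5], k=3)
agg = mean_aggregate(0, H, pool, W_pool)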

Dimension Wise Maximum


\textcolor{skyblue}{AGGR}_{u\in \mathcal{N}(v)}(
    \textcolor{violet}{h_u^{(k-1)}}
) = \max_{u \in \mathcal{N}(v)} \left\{
    \sigma(
        \textcolor{skyblue}{W_{pool}^{(k)}} \cdot
        \textcolor{violet}{h_u^{(k-1)}} +
        \textcolor{skyblue}{b}
    )
\right\}

LSTM Aggregator

This aggregator is another network, based on an LSTM. This method requires imposing an order on the sequence of neighbours.

Graph Isomorphism Network (GIN)


\textcolor{orange}{h_{v}^{(k)}} =
 \textcolor{skyblue}{f^{(k)}}
\left(
    \sum_{u \in \mathcal{N}(v)}
    \textcolor{violet}{h_{u}^{(k - 1)}} +
    (
        1 +
        \textcolor{skyblue}{\epsilon^{(k)}}
    ) \cdot \textcolor{orange}{h_{v}^{(k - 1)}}
\right)
\forall v \in V
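
A minimal NumPy sketch of the GIN update over all nodes, with f^{(k)} as a toy MLP and \epsilon as a plain scalar; values are illustrative:

# Sketch: one GIN layer for all nodes at once
import numpy as np

def gin_layer(A: np.ndarray, H: np.ndarray, eps: float, mlp) -> np.ndarray:
    # H: (num_nodes, d) previous embeddings, A: adjacency matrix without self-loops
    neighbour_sum = A @ H                        # sum of previous neighbour embeddings
    return mlp(neighbour_sum + (1.0 + eps) * H)  # learnable update f^{(k)}

rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
H = rng.normal(size=(3, 4))
W1, W2 = rng.normal(size=(4, 8)) * 0.1, rng.normal(size=(8, 4)) * 0.1
mlp = lambda X: np.maximum(X @ W1, 0) @ W2       # toy two-layer MLP

H_next = gin_layer(A, H, eps=0.1, mlp=mlp)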