Graph neural networks (GNNs) have gained much popularity in recent years. A relatively new family of deep learning methods, GNNs derive insights from graphs and use the gathered information to predict outcomes. A graph represents data with two components, nodes and edges, and GNNs can be applied to graphs to perform node-level, edge-level, and graph-level analysis and prediction.
An advantage of GNNs is that while CNNs can operate only on regular Euclidean data such as images (2D grids) and text (1D sequences), GNNs operate on non-Euclidean, graph-structured data, which makes them suitable for studying and analysing 3D data.
That said, the performance of GNNs does not keep improving as the number of layers increases; beyond a point, it degrades. This effect is called oversmoothing.
What is oversmoothing?
Oversmoothing is a common phenomenon in GNNs. The paper by Qimai Li and his colleagues, titled “Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning”, showed that the graph convolution of the GCN model (a type of GNN architecture) is a special form of Laplacian smoothing. While Laplacian smoothing is the reason why GCNs work, it also brings the potential concern of oversmoothing.
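To make the Laplacian-smoothing view concrete, here is a minimal, self-contained sketch (the 4-node toy graph, the feature dimension, and the layer count are invented for illustration): repeatedly applying the symmetric GCN propagation rule H' = D^(-1/2) A D^(-1/2) H, with self-loops folded into A and D, pulls every node's representation towards a common direction.

```python
import numpy as np

# Toy 4-node graph (illustrative): a triangle 0-1-2 with node 3 hanging off node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)                            # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
P = D_inv_sqrt @ A_tilde @ D_inv_sqrt              # GCN propagation matrix

H = np.random.randn(4, 3)                          # random initial node features
for layer in range(1, 11):
    H = P @ H                                      # one smoothing step per "layer"
    directions = H / np.linalg.norm(H, axis=1, keepdims=True)
    spread = np.ptp(directions, axis=0).max()      # how far apart the rows point
    print(f"layer {layer:2d}: max spread across nodes = {spread:.4f}")
# The printed spread decays towards zero: after enough layers, all nodes carry
# essentially the same representation (up to scale) -- oversmoothing.
```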
Smoothing in a GNN is, up to a point, a desirable feature. Methods like graph convolutional networks are typically applied to node-level tasks and are often considered in the context of semi-supervised learning. In such settings, the entire dataset is treated as a single graph, and the network learns node representations that combine information from the node features and the graph structure.
Most state-of-the-art approaches for incorporating graph-structure information into a neural network enforce similarity between the representations of neighbouring nodes, implementing a local smoothing of neuron activations over the graph. Such smoothing operations are effective in a whole-graph setting but may degrade results in node-level tasks through oversmoothing: as the network architecture grows deeper, node representations become increasingly indistinguishable from one another.
Graph attention networks partially overcome this limitation: their message-passing operations compute adaptive weights for graph smoothing with an attention mechanism that is derived from node features and masked by the graph edges.
Even so, these networks still rely on enforcing similarity between neighbouring nodes, and their training is more intricate, since the attention mechanism requires gradient computations driven by both graph nodes and edges.
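For intuition, below is a minimal single-head sketch of such an edge-masked attention layer (the class name, dimensions, and toy inputs are assumptions for illustration; production implementations such as GATConv in PyTorch Geometric add multi-head attention, dropout, and bias terms):

```python
import torch
import torch.nn.functional as F

class GATLayer(torch.nn.Module):
    """One attention head: adaptive smoothing weights from node features,
    masked so that attention only flows along existing graph edges."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = torch.nn.Linear(in_dim, out_dim, bias=False)  # feature transform
        self.a = torch.nn.Linear(2 * out_dim, 1, bias=False)   # attention vector

    def forward(self, h, adj):
        z = self.W(h)                                          # (N, out_dim)
        n = z.size(0)
        # Attention logits e_ij = LeakyReLU(a^T [z_i || z_j]) for every node pair.
        pairs = torch.cat([z.repeat_interleave(n, dim=0),
                           z.repeat(n, 1)], dim=1).view(n, n, -1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), negative_slope=0.2)
        e = e.masked_fill(adj == 0, float('-inf'))             # mask by graph edges
        alpha = torch.softmax(e, dim=1)                        # adaptive edge weights
        return alpha @ z                                       # weighted smoothing

h = torch.randn(4, 8)                         # 4 nodes, 8 input features
adj = torch.tensor([[1., 1., 1., 0.],
                    [1., 1., 1., 0.],
                    [1., 1., 1., 1.],
                    [0., 0., 1., 1.]])        # toy adjacency with self-loops
print(GATLayer(8, 16)(h, adj).shape)          # torch.Size([4, 16])
```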
In a 2019 paper titled “Measuring and Relieving the Over-smoothing Problem for Graph Neural Networks from the Topological View”, the authors proposed Mean Average Distance (MAD), which computes the mean of the average cosine distance between node representations, to measure the smoothness of the graph. They reported that the MAD values of GNNs became smaller as the number of layers increased, supporting the argument that smoothing is the essential nature of GNNs.
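A compact version of the metric can be written as follows (this sketch averages the pairwise cosine distance over all node pairs by default; the paper additionally applies a target mask to select which pairs are counted):

```python
import torch

def mean_average_distance(h, mask=None):
    """MAD: mean over nodes of the average cosine distance from each node's
    representation to the (mask-selected) others. Small MAD = smooth graph."""
    h = torch.nn.functional.normalize(h, dim=1)
    dist = 1.0 - h @ h.t()                     # pairwise cosine distances, (N, N)
    if mask is None:
        mask = 1.0 - torch.eye(h.size(0))      # default: count all non-self pairs
    dist = dist * mask
    row_sum, row_cnt = dist.sum(dim=1), mask.sum(dim=1)
    valid = row_cnt > 0                        # skip nodes with no counted pair
    return (row_sum[valid] / row_cnt[valid]).mean()
```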
The authors further argued that it is the over-mixing of information and noise that leads to the over-smoothing issue. To measure the quality of the messages received by a node, they defined the information-to-noise ratio as the proportion of intra-class pairs among the node pairs that exchange messages. They then proposed MADGap to measure the over-smoothness of the graph.
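MADGap contrasts the smoothness of remote node pairs with that of nearby ones: the MAD computed over far-apart pairs minus the MAD over neighbouring pairs. A small gap means even distant nodes have collapsed together, the signature of oversmoothing. Here is a sketch building on the helper above (the 3-hop/8-hop thresholds mirror the low-order versus high-order split in the paper but are best treated as tunable assumptions):

```python
import torch

def hop_matrix(adj, max_hops):
    """hops[i, j] = shortest-path distance from i to j, up to max_hops (inf beyond)."""
    n = adj.size(0)
    reach = torch.eye(n, dtype=torch.bool)
    hops = torch.full((n, n), float('inf'))
    hops.fill_diagonal_(0)
    frontier = reach
    for k in range(1, max_hops + 1):
        frontier = (frontier.float() @ adj > 0) & ~reach   # newly reached at hop k
        hops[frontier] = k
        reach |= frontier
    return hops

def madgap(h, adj, neb_hops=3, rmt_hops=8):
    hops = hop_matrix(adj, rmt_hops)
    neb_mask = ((hops >= 1) & (hops <= neb_hops)).float()  # nearby pairs
    rmt_mask = (hops >= rmt_hops).float()                  # remote pairs (incl. inf)
    return mean_average_distance(h, rmt_mask) - mean_average_distance(h, neb_mask)
```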
The experimental results showed that MADGap has a high correlation with model performance. Furthermore, both model performance and MADGap values rise with the information-to-noise ratio, validating the assumption that the information-to-noise ratio affects the smoothness of the graph representation to a large extent. A low information-to-noise ratio was traced to a discrepancy between the graph topology and the objective of the downstream task: in node classification tasks, if there are many inter-class edges, nodes receive too many messages from nodes of other classes after many propagation steps, which results in oversmoothing.
What can be done about it?
So, why should we care about the oversmoothing issue in GNNs? The goal of learning the embeddings is to feed them to a classifier that ultimately predicts the labels. With oversmoothing, one may end up with similar embeddings for nodes that don't have the same label, which results in mislabelling.
While one may think that reducing the number of layers would reduce the effect of oversmoothing, it is not that simple: doing so means the model no longer exploits multi-hop information in complex structured data, which in turn will not boost end-task performance.
One way to tackle the problem is to quantify it, so that it can be tracked easily while training the GNN model. Quantifying it also yields a metric that can be used as a numeric penalty during training, as the sketch below illustrates.
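In the training step below (the model, data tensors, and penalty weight lam are hypothetical placeholders), MADGap is logged at every step and subtracted from the classification loss, so that the low MADGap values that signal oversmoothing are penalised; this is in the spirit of the MADGap-based regulariser proposed in the paper above.

```python
import torch

def training_step(model, x, adj, labels, train_mask, optimizer, lam=0.01):
    """One optimisation step with an oversmoothing penalty folded into the loss."""
    model.train()
    optimizer.zero_grad()
    h = model(x, adj)                                  # node embeddings / logits
    task_loss = torch.nn.functional.cross_entropy(h[train_mask], labels[train_mask])
    gap = madgap(h, adj)                               # differentiable w.r.t. h
    loss = task_loss - lam * gap                       # reward a larger gap
    loss.backward()
    optimizer.step()
    return task_loss.item(), gap.item()
```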