A Comprehensive Survey of Graph Neural Networks PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 22

JOURNAL OF LATEX CLASS FILES, VOL. X, NO.

X, DECEMBER 2018 1

A Comprehensive Survey on Graph Neural


Networks
Zonghan Wu, Shirui Pan, Member, IEEE, Fengwen Chen, Guodong Long,
Chengqi Zhang, Senior Member, IEEE, Philip S. Yu, Fellow, IEEE

Abstract—Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video
processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the
Euclidean space. However, there is an increasing number of applications where data are generated from non-Euclidean domains and
are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has
arXiv:1901.00596v2 [cs.LG] 10 Mar 2019

imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning
approaches for graph data have emerged. In this survey, we provide a comprehensive overview of graph neural networks (GNNs) in
data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art graph neural networks into different
categories. With a focus on graph convolutional networks, we review alternative architectures that have recently been developed; these
learning paradigms include graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal
networks. We further discuss the applications of graph neural networks across various domains and summarize the open source codes
and benchmarks of the existing algorithms on different learning tasks. Finally, we propose potential research directions in this
fast-growing field.

Index Terms—Deep Learning, graph neural networks, graph convolutional networks, graph representation learning, graph
autoencoder, network embedding

1 I NTRODUCTION meaningful features that are shared with the entire datasets
for various image analysis tasks.
While deep learning has achieved great success on Eu-
T HE recent success of neural networks has boosted re-
search on pattern recognition and data mining. Many
machine learning tasks such as object detection [1], [2], ma-
clidean data, there is an increasing number of applications
where data are generated from the non-Euclidean domain
chine translation [3], [4], and speech recognition [5], which and need to be effectively analyzed. For instance, in e-
once heavily relied on handcrafted feature engineering to commerce, a graph-based learning system is able to exploit
extract informative feature sets, has recently been revolu- the interactions between users and products [9], [10], [11]
tionized by various end-to-end deep learning paradigms, to make highly accurate recommendations. In chemistry,
i.e., convolutional neural networks (CNNs) [6], long short- molecules are modeled as graphs and their bioactivity needs
term memory (LSTM) [7], and autoencoders. The success to be identified for drug discovery [12], [13]. In a citation
of deep learning in many domains is partially attributed to network, papers are linked to each other via citations and
the rapidly developing computational resources (e.g., GPU) they need to be categorized into different groups [14],
and the availability of large training data, and is partially [15]. The complexity of graph data has imposed significant
due to the effectiveness of deep learning to extract latent challenges on existing machine learning algorithms. This is
representation from Euclidean data (e.g., images, text, and because graph data are irregular. Each graph has a variable
video). Taking image analysis as an example, an image can size of unordered nodes and each node in a graph has
be represented as a regular grid in the Euclidean space. a different number of neighbors, causing some important
A convolutional neural network (CNN) is able to exploit operations (e.g., convolutions), which are easy to compute
the shift-invariance, local connectivity, and compositionality in the image domain but are not directly applicable to the
of image data [8], and as a result, CNN can extract local graph domain anymore. Furthermore, a core assumption of
existing machine learning algorithms is that instances are
independent of each other. However, this is not the case for
• Z. Wu, F. Chen, G. Long, C. Zhang are with Centre for graph data where each instance (node) is related to others
Artificial Intelligence, FEIT, University of Technology Sydney,
NSW 2007, Australia (E-mail: [email protected];
(neighbors) via some complex linkage information, which is
[email protected]; [email protected]; used to capture the interdependence among data, including
[email protected]). citations, friendship, and interactions.
• S. Pan is with Faculty of Information Technology, Monash University, Recently, there is increasing interest in extending deep
Clayton, VIC 3800, Australia (Email: [email protected]).
• P. S. Yu is with Department of Computer Science, University of Illinois learning approaches for graph data. Driven by the success
at Chicago, Chicago, IL 60607-7053, USA (Email: [email protected]) of deep learning, researchers have borrowed ideas from
• Corresponding author: Shirui Pan. convolution networks, recurrent networks, and deep auto-
Manuscript received Dec xx, 2018; revised Dec xx, 201x. encoders to design the architecture of graph neural net-
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 2

works. To handle the complexity of graph data, new gen-


eralizations and definitions for important operations have
been rapidly developed over the past few years. For in-
stance, Figure 1 illustrates how a kind of graph convolution
is inspired by a standard 2D convolution. This survey aims
to provide a comprehensive overview of these methods, for
both interested researchers who want to enter this rapidly
developing field and experts who would like to compare
graph neural network algorithms.
A Brief History of Graph Neural Networks The nota-
(a) 2D Convolution. Analo- (b) Graph Convolution. To get
tion of graph neural networks was firstly outlined in Gori gous to a graph, each pixel a hidden representation of the
et al. (2005) [16], and further elaborated in Micheli (2009) in an image is taken as a red node, one simple solution
[17] and Scarselli et al. (2009) [18]. These early studies learn node where neighbors are de- of graph convolution opera-
termined by the filter size. tion takes the average value
a target node’s representation by propagating neighbor in- The 2D convolution takes a of node features of the red
formation via recurrent neural architectures in an iterative weighted average of pixel val- node along with its neighbors.
manner until a stable fixed point is reached. This process ues of the red node along with Different from image data, the
is computationally expensive, and recently there have been its neighbors. The neighbors of neighbors of a node are un-
a node are ordered and have a ordered and variable in size.
increasing efforts to overcome these challenges [19], [20]. In fixed size.
our survey, we generalize the term graph neural networks to
represent all deep learning approaches for graph data. Fig. 1: 2D Convolution vs. Graph Convolution.
Inspired by the huge success of convolutional networks
in the computer vision domain, a large number of methods
that re-define the notation of convolution for graph data have al. [30] position graph networks as the building blocks for
emerged recently. These approaches are under the umbrella learning from relational data, reviewing part of graph neu-
of graph convolutional networks (GCNs). The first promi- ral networks under a unified framework. However, their
nent research on GCNs is presented in Bruna et al. (2013), generalized framework is highly abstract, losing insights on
which develops a variant of graph convolution based on each method from its original paper. Lee et al. [31] conduct
spectral graph theory [21]. Since that time, there have been a partial survey on the graph attention model, which is
increasing improvements, extensions, and approximations one type of graph neural network. Most recently, Zhang et
on spectral-based graph convolutional networks [12], [14], al. [32] present a most up-to-date survey on deep learning
[22], [23], [24]. As spectral methods usually handle the for graphs, missing those studies on graph generative and
whole graph simultaneously and are difficult to parallel spatial-temporal networks. In summary, none of the existing
or scale to large graphs, spatial-based graph convolutional surveys provide a comprehensive overview of graph neural
networks have rapidly developed recently [25], [26], [27], networks, only covering some of the graph convolution
[28]. These methods directly perform the convolution in the neural networks and examining a limited number of works,
graph domain by aggregating the neighbor nodes’ informa- thereby missing the most recent development of alternative
tion. Together with sampling strategies, the computation can graph neural networks, such as graph generative networks
be performed in a batch of nodes instead of the whole graph and graph spatial-temporal networks.
[25], [28], which has the potential to improve efficiency.
Graph neural networks vs. network embedding The
In addition to graph convolutional networks, many alter-
research on graph neural networks is closely related to
native graph neural networks have been developed in the
graph embedding or network embedding, another topic
past few years. These approaches include graph attention
which attracts increasing attention from both the data min-
networks, graph autoencoders, graph generative networks,
ing and machine learning communities [33] [34] [35] [36],
and graph spatial-temporal networks. Details on the catego-
[37], [38]. Network embedding aims to represent network
rization of these methods are given in Section 3.
vertices into a low-dimensional vector space, by preserving
Related surveys on graph neural networks. There are both network topology structure and node content informa-
a limited number of existing reviews on the topic of graph tion, so that any subsequent graph analytics tasks such as
neural networks. Using the notation geometric deep learning, classification, clustering, and recommendation can be easily
Bronstein et al. [8] give an overview of deep learning performed by using simple off-the-shelf machine learning
methods in the non-Euclidean domain, including graphs algorithm (e.g., support vector machines for classification).
and manifolds. While being the first review on graph con- Many network embedding algorithms are typically unsu-
volution networks, this survey misses several important pervised algorithms and they can be broadly classified into
spatial-based approaches, including [15], [20], [25], [27], three groups [33], i.e., matrix factorization [39], [40], ran-
[28], [29], which update state-of-the-art benchmarks. Fur- dom walks [41], and deep learning approaches. The deep
thermore, this survey does not cover many newly devel- learning approaches for network embedding at the same
oped architectures which are equally important to graph time belong to graph neural networks, which include graph
convolutional networks. These learning paradigms, includ- autoencoder-based algorithms (e.g., DNGR [42] and SDNE
ing graph attention networks, graph autoencoders, graph [43]) and graph convolution neural networks with unsuper-
generative networks, and graph spatial-temporal networks, vised training(e.g., GraphSage [25]). Figure 2 describes the
are comprehensively reviewed in this article. Battaglia et differences between network embedding and graph neural
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 3

TABLE 1: Commonly used notations.


Notations Descriptions
|·| The length of a set
Element-wise product.
AT Transpose of vector/matrix A.
[A, B] Concatenation of A and B.
G A graph
V The set of nodes in a graph
vi A node vi ∈ V
N (v) The neighbors of node v
E The set of edges in a graph
eij An edge eij ∈ E
X ∈ RN ×D The feature matrix of a graph.
x ∈ RN The feature vector of a graph in the case of D = 1.
Xi ∈ R D The feature vector of the node vi .
N The number of nodes, N = |V|.
M The number of edges, M = |E|.
Fig. 2: Network Embedding v.s. Graph Neural Networks. D The dimension of a node vector.
T The total number of time steps in time series.

networks in this paper.

Our Contributions Our paper makes notable contribu-


tions summarized as follows:

• New taxonomy In light of the increasing number of


studies on deep learning for graph data, we propose
a new taxonomy of graph neural networks (GNNs).
In this taxonomy, GNNs are categorized into five
groups: graph convolution networks, graph atten-
tion networks, graph auto-encoders, graph genera-
tive networks, and graph spatial-temporal networks.
We pinpoint the differences between graph neural
networks and network embedding and draw the
connections between different graph neural network
architectures.
• Comprehensive review This survey provides the
most comprehensive overview of modern deep
learning techniques for graph data. For each type Fig. 3: Categorization of Graph Neural Networks.
of graph neural network, we provide detailed de-
scriptions on representative algorithms, and make
a necessary comparison and summarise the corre- 2 D EFINITION
sponding algorithms. In this section, we provide definitions of basic graph con-
• Abundant resources This survey provides abundant cepts. For easy retrieval, we summarize the commonly used
resources on graph neural networks, which include notations in Table 1.
state-of-the-art algorithms, benchmark datasets,
open-source codes, and practical applications. This Definition 1 (Graph). A Graph is G = (V, E, A) where V
survey can be used as a hands-on guide for under- is the set of nodes, E is the set of edges, and A is the
standing, using, and developing different deep learn- adjacency matrix. In a graph, let vi ∈ V to denote a node
ing approaches for various real-life applications. and eij = (vi , vj ) ∈ E to denote an edge. The adjacency
• Future directions This survey also highlights the cur- matrix A is a N × N matrix with Aij = wij > 0 if
rent limitations of the existing algorithms, and points eij ∈ E and Aij = 0 if eij ∈ / E. The degree of a node is
out possible directions in this rapidly developing the number of edges connected to it.
field. A graph can be associated with node attributes X 1 ,
where X ∈ RN ×D is a feature matrix with Xi ∈ RD
Organization of Our Survey The rest of this survey representing the feature vector of node vi . In the case of
is organized as follows. Section 2 defines a list of graph- D = 1, we replace x ∈ RN with X to denote the feature
related concepts. Section 3 clarifies the categorization of vector of the graph.
graph neural networks. Section 4 and Section 5 provides Definition 2 (Directed Graph). A directed graph is a graph
an overview of graph neural network models. Section 6 with all edges pointing from one node to another. For
presents a gallery of applications across various domains. a directed graph, Aij 6= Aji . An undirected graph is
Section 7 discusses the current challenges and suggests
future directions. Section 8 summarizes the paper. 1. Such graph is referred to an attributed graph in literature.
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 4

(a) Graph Convolution Networks with Pooling Modules for Graph


Classification [12]. A GCN layer [14] is followed by a pooling layer to
Fig. 4: A Variant of Graph Convolution Networks with Mul- coarsen a graph into sub-graphs so that node representations on coars-
tiple GCN layers [14]. A GCN layer encapsulates each node’s ened graphs represent higher graph-level representations. To calculate
hidden representation by aggregating feature information from the probability for each graph label, the output layer is a linear layer
its neighbors. After feature aggregation, a non-linear transfor- with the SoftMax function.
mation is applied to the resultant outputs. By stacking multiple
layers, the final hidden representation of each node receives
messages from a further neighborhood.

a graph with all edges undirected. For an undirected


graph, Aij = Aji .
Definition 3 (Spatial-Temporal Graph). A spatial-temporal (b) Graph Auto-encoder with GCN [62]. The encoder uses GCN layers
to get latent rerpesentations for each node. The decoder computes the
graph is an attributed graph where the feature matrix X pair-wise distance between node latent representations produced by the
evolves over time. It is defined as G = (V, E, A, X) with encoder. After applying a non-linear activation function, the decoder
X ∈ RT ×N ×D where T is the length of time steps. reconstructs the graph adjacency matrix.

3 C ATEGORIZATION AND F RAMEWORKS


In this section, we present our taxonomy of graph neural
networks. We consider any differentiable graph models
which incorporate neural architectures as graph neural net-
works. We categorize graph neural networks into graph con-
volution networks, graph attention networks, graph auto-
encoders, graph generative networks and graph spatial-
(c) Graph Spatial-Temporal Networks with GCN [74]. A GCN layer is
temporal networks. Of these, graph convolution networks followed by a 1D-CNN layer. The GCN layer operates on At and Xt
play a central role in capturing structural dependencies. to capture spatial dependency, while the 1D-CNN layer slides over X
As illustrated in Figure 3, methods under other categories along the time axis to capture the temporal dependency. The output
layer is a linear transformation, generating a prediction for each node.
partially utilize graph convolution networks as building
blocks. We summarize the representative methods in each Fig. 5: Different Graph Neural Network Models built with
category in Table 2, and we give a brief introduction of each GCNs.
category in the following.

3.1 Taxonomy of GNNs


Graph Convolution Networks (GCNs) generalize the oper- parameters within an end-to-end framework. Figure 6 illus-
ation of convolution from traditional data (images or grids) trates the difference between graph convolutional networks
to graph data. The key is to learn a function f to generate and graph attention networks in aggregating the neighbor
a node vi ’s representation by aggregating its own features node information.
Xi and neighbors’ features Xj , where j ∈ N (vi ). Figure 4
Graph Auto-encoders are unsupervised learning frame-
shows the process of GCNs for node representation learn-
works which aim to learn low dimensional node vectors
ing. Graph convolutional networks play a central role in
via an encoder, and then reconstruct the graph data via
building up many other complex graph neural network
a decoder. Graph autoencoders are a popular approach to
models, including auto-encoder-based models, generative
learn the graph embedding, for both plain graphs with-
models, and spatial-temporal networks, etc. Figure 5 illus-
out attributed information [42], [43] as well as attributed
trates several graph neural network models building on
graphs [64], [65]. For plain graphs, many algorithms directly
GCNs.
prepossess the adjacency matrix, by either constructing a
Graph Attention Networks are similar to GCNs and seek an new matrix (i.e., pointwise mutual information matrix) with
aggregation function to fuse the neighboring nodes, random rich information [42] or feeding the adjacency matrix to
walks, and candidate models in graphs to learn a new an autoencoder model and capturing both first order and
representation. The key difference is that graph attention second order information [43]. For attributed graphs, graph
networks employ attention mechanisms which assign larger autoencoder models tend to employ GCN [14] as a building
weights to the more important nodes, walks, or models. The block for the encoder and reconstruct the structure informa-
attention weight is learned together with neural network tion via a link prediction decoder [62], [64].
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 5

TABLE 2: Representative Publications of Graph Neural Networks


Category Publications
Spectral-based [12], [14], [21], [22], [23], [24], [44], [45], [46]
Graph Convolution Networks [13], [16], [17], [18], [19], [20], [25], [26], [27], [28], [47]
Spatial-based
[48], [49], [50], [51], [52], [53], [54], [55], [56], [57]
Pooling modules [12], [22], [58], [59]
Graph Attention Networks [15], [29], [60], [61]
Graph Auto-encoder [42], [43], [62], [63], [64], [65], [66]
Graph Generative Networks [67], [68], [69], [70], [71]
Graph Spatial-Temporal Networks [72], [73], [74], [75], [76]

3.2 Frameworks
Graph neural networks, graph convolution networks
(GCNs) in particular, try to replicate the success of CNN
in graph data by defining graph convolutions via graph
spectral theory or spatial locality. With the graph structure
and node content information as inputs, the outputs of GCN
can focus on different graph analytics task with one of the
following mechanisms:
• Node-level outputs relate to node regression and
classification tasks. As a graph convolution module
directly gives nodes’ latent representations, a multi-
(a) Graph Convolution Net- (b) Graph Attention Networks
works [14] explicitly assign a [15] implicitly capture the perceptron layer or softmax layer is used as the final
non-parametric weight aij = weight aij via an end-to-end layer of GCN. We review graph convolution modules
√ 1 neural network architecture,
to the neigh- in Section 4.1 and Section 4.2.
deg(vi )deg(vj )
so that more important nodes
bor vj of vi during the aggre-
receive larger weights.
• Edge-level outputs relate to the edge classifica-
gation process. tion and link prediction tasks. To predict the la-
bel/connection strength of an edge, an additional
Fig. 6: Differences between graph convolutional networks
function will take two nodes’ latent representations
and graph attention networks.
from the graph convolution module as inputs.
• Graph-level outputs relate to the graph classification
task. To obtain a compact representation on graph
level, a pooling module is used to coarse a graph into
Graph Generative Networks aim to generate plausible sub-graphs or to sum/average over the node repre-
structures from data. Generating graphs given a graph sentations. We review the graph pooling module in
empirical distribution is fundamentally challenging, mainly Section 4.3.
because graphs are complex data structures. To address this In Table 3, we list the details of the inputs and outputs
problem, researchers have explored to factor the generation of the main GCNs methods. In particular, we summarize
process as forming nodes and edges alternatively [67], [68], output mechanisms in between each GCN layer and in the
to employ generative adversarial training [69], [70]. One final layer of each method. The output mechanisms may
promising application domain of graph generative networks involve several pooling operations, which are discussed in
is chemical compound synthesis. In a chemical graph, atoms Section 4.3.
are treated as nodes and chemical bonds are treated as
edges. The task is to discover new synthesizable molecules End-to-end Training Frameworks. Graph convolutional net-
which possess certain chemical and physical properties. works can be trained in a (semi-) supervised or purely un-
supervised way within an end-to-end learning framework,
depending on the learning tasks and label information avail-
Graph Spatial-temporal Networks aim to learn unseen pat-
able at hand.
terns from spatial-temporal graphs, which are increasingly
important in many applications such as traffic forecasting • Semi-supervised learning for node-level classifi-
and human activity prediction. For instance, the underlying cation. Given a single network with partial nodes
road traffic network is a natural graph where each key loca- being labeled and others remaining unlabeled, graph
tion is a node whose traffic data is continuously monitored. convolutional networks can learn a robust model that
By developing effective graph spatial-temporal network effectively identify the class labels for the unlabeled
models, we can accurately predict the traffic status over nodes [14]. To this end, an end-to-end framework can
the whole traffic system [73], [74]. The key idea of graph be built by stacking a couple of graph convolutional
spatial-temporal networks is to consider spatial dependency layers followed by a softmax layer for multi-class
and temporal dependency at the same time. Many current classification.
approaches apply GCNs to capture the dependency together • Supervised learning for graph-level classification.
with some RNN [73] or CNN [74] to model the temporal Given a graph dataset, graph-level classification aims
dependency. to predict the class label(s) for an entire graph [58],
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 6

− 21 − 21
[59], [77], [78]. The end-to-end learning for this task defined as L = In − D AD P, where D is a diagonal
can be done with a framework which combines matrix of node degrees, Dii = j (Ai,j ). The normalized
both graph convolutional layers and the pooling graph Laplacian matrix possesses the property of being real
procedure [58], [59]. Specifically, by applying graph symmetric positive semidefinite. With this property, the nor-
convolutional layers, we obtain representation with malized Laplacian matrix can be factored as L = UΛUT ,
a fixed number of dimensions for each node in every where U = [u0 , u1 , · · · , un−1 ] ∈ RN ×N is the matrix of
single graph. Then, we can get the representation of eigenvectors ordered by eigenvalues and Λ is the diagonal
an entire graph through pooling which summarizes matrix of eigenvalues, Λii = λi . The eigenvectors of the
the representation vectors of all nodes in a graph. normalized Laplacian matrix form an orthonormal space, in
Finally, by applying linear layers and a softmax layer, mathematical words, UT U = I. In graph signal processing,
we can build an end-to-end framework for graph a graph signal x ∈ RN is a feature vector of the nodes of a
classification. An example is given in Fig 5a. graph where xi is the value of the ith node. The graph Fourier
• Unsupervised learning for graph embedding. When transform to a signal x is defined as F (x) = UT x and the in-
no class labels are available in graphs, we can learn verse graph Fourier transform is defined as F −1 (x̂) = Ux̂,
the graph embedding in a purely unsupervised way where x̂ represents the resulting signal from graph Fourier
in an end-to-end framework. These algorithms ex- transform. To understand graph Fourier transform, from its
ploit edge-level information in two ways. One simple definition we see that it indeed projects the input graph
way is to adopt an autoencoder framework where signal to the orthonormal space where the basis is formed by
the encoder employs graph convolutional layers to eigenvectors of the normalized graph Laplacian. Elements
embed the graph into the latent representation upon of the transformed signal x̂ are the coordinates of the graph
which a decoder is used to reconstruct the graph signal in the new spaceP so that the input signal can be
structure [62], [64]. Another way is to utilize the neg- represented as x = i x̂i ui , which is exactly the inverse
ative sampling approach which samples a portion graph Fourier transform. Now the graph convolution of the
of node pairs as negative pairs while existing node input signal x with a filter g ∈ RN is defined as
pairs with links in the graphs being positive pairs.
Then a logistic regression layer is applied after the x ∗G g = F −1 (F (x) F (g))
convolutional layers for end-to-end learning [25]. (1)
= U(UT x UT g)

where denotes the Hadamard product. If we denote a


4 G RAPH C ONVOLUTION N ETWORKS filter as gθ = diag(UT g), then the graph convolution is
In this section, we review graph convolution networks simplified as
(GCNs), the fundamental of many complex graph neural x ∗G gθ = Ugθ UT x (2)
network models. GCNs approaches fall into two categories,
spectral-based and spatial-based. Spectral-based approaches Spectral-based graph convolution networks all follow this
define graph convolutions by introducing filters from the definition. The key difference lies in the choice of the filter
perspective of graph signal processing [79] where the graph gθ .
convolution operation is interpreted as removing noise
from graph signals. Spatial-based approaches formulate
graph convolutions as aggregating feature information from 4.1.2 Methods of Spectral-based GCNs
neighbors. While GCNs operate on the node level, graph Spectral CNN. Bruna et al. [21] propose the first spectral
pooling modules can be interleaved with the GCN layer, to convolution neural network (Spectral CNN). Assuming the
coarsen graphs into high-level sub-structures. As shown in filter gθ = Θki,j is a set of learnable parameters and consid-
Fig 5a, such an architecture design can be used to extract ering graph signals of multi-dimension, they define a graph
graph-level representations and to perform graph classifi- convolution layer as
cation tasks. In the following, we introduce spectral-based
GCNs, spatial-based GCNs, and graph pooling modules fk−1
X
separately. Xk+1
:,j = σ( UΘki,j UT Xk:,i ) (j = 1, 2, · · · , fk ) (3)
i=1

4.1 Spectral-based Graph Convolutional Networks where Xk ∈ RN ×fk−1 is the input graph signal, N is the
Spectral-based methods have a solid foundation in graph number of nodes, fk−1 is the number of input channels and
signal processing [79]. We first give some basic knowledge fk is the number of output channels, Θki,j is a diagonal
background of graph signal processing, after which we matrix filled with learnable parameters, and σ is a non-
review the representative research on the spectral-based linear transformation.
GCNs.
Chebyshev Spectral CNN (ChebNet). Defferrard et al.
[12] propose ChebNet which defines a filter as Chebyshev
4.1.1 Backgrounds polynomials of the diagonal matrix of eigenvalues, i.e,
PK−1
In spectral-based models, graphs are assumed to be undi- gθ = i=0 θi Tk (Λ̃), where Λ̃ = 2Λ/λmax − IN . The
rected. A robust mathematical representation of an undi- Chebyshev polynomials are defined recursively by Tk (x) =
rected graph is the normalized graph Laplacian matrix, 2xTk−1 (x) − Tk−2 (x) with T0 (x) = 1 and T1 (x) = x. As a
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 7

TABLE 3: Summary of Graph Convolution Networks


Inputs Output Mechanisms
Category Approach (allow edge Outputs
features?) Intermediate Final
Spectral CNN (2014) [21] 7 Graph-level cluster+max pooling softmax function
Spectral ChebNet (2016) [12] 7 Graph-level efficient pooling mlp layer+softmax function
Based 1stChebNet (2017) [14] 7 Node-level activation function softmax function
AGCN (2018) [23] 7 Graph-level max pooling sum pooling
Node-level - mlp layer+softmax function
GNN (2009) [18] 3
Graph-level - add a dummy super node
Node-level - mlp layer/softmax function
GGNNs (2015) [19] 7
Graph-level - sum pooling
SSE (2018) [20] 7 Node-level - softmax function
Spatial Node-level softmax function
MPNN (2017) [13] 3
Based Graph-level - sum pooling
GraphSage (2017) [25] 7 Node-level activation function softmax function
Node-level activation function softmax function
DCNN (2016) [47] 3
Graph-level - mean pooling
PATCHY-SAN (2016) [27] 3 Graph-level - mlp layer+softmax function
LGCN (2018) [28] 7 Node-level skip connections mlp layer+softmax function

result, the convolution of a graph signal x with the defined need to be symmetric. For example, it can be the form
filter gθ is of D−1 A. The main drawback of 1stChebNet is that the
K−1 computation cost increases exponentially with the increase
X
x ∗G gθ = U( θi Tk (Λ̃))UT x of the number of 1stChebNet layers during batch training.
i=0 Each node in the last layer has to expand its neighborhood
K−1
(4) recursively across previous layers. Chen et al. [48] assume
X
= θi Ti (L̃)x the rescaled adjacency matrix à in Equation 7 comes from a
i=0 sampling distribution. Under this assumption, the technique
of Monte Carlo and variance reduction techniques are used
where L̃ = 2L/λmax − IN .
to facilitate the training process. Chen et al. [49] reduce
From Equation 4, ChebNet implicitly avoids the compu-
the receptive field size of the graph convolution to an
tation of the graph Fourier basis, reducing the computation
arbitrary small scale by sampling neighborhoods and using
complexity from O(N 3 ) to O(KM ). Since Ti (L̃) is a polyno-
historical hidden representations. Huang et al. [57] propose
mial of L̃ of ith order, Ti (L̃)x operates locally on each node.
an adaptive layer-wise sampling approach to accelerate the
Therefore, the filters of ChebNet are localized in space.
training of 1stChebNet, where sampling for the lower layer
First order of ChebNet (1stChebNet 2 ) Kipf et al. [14] in- is conditioned on the top one. This method is also applicable
troduce a first-order approximation of ChebNet. Assuming for explicit variance reduction.
K = 1 and λmax = 2 , Equation 4 is simplified as
1 1
Adaptive Graph Convolution Network (AGCN). To ex-
x ∗G gθ = θ0 x − θ1 D− 2 AD− 2 x (5) plore hidden structural relations unspecified by the graph
To restrain the number of parameters and avoid over- Laplacian matrix, Li et al. [23] propose the adaptive graph
fitting, 1stChebNet further assumes θ = θ0 = −θ1 , leading convolution network (AGCN). AGCN augments a graph
to the following definition of graph convolution, with a so-called residual graph, which is constructed by
1 1
computing a pairwise distance of nodes. Despite being able
x ∗G gθ = θ(In + D− 2 AD− 2 )x (6) to capture complement relational information, AGCN incurs
expensive O(N 2 ) computation.
In order to incorporate multi-dimensional graph input
signals, 1stChebNet proposes a graph convolution layer
which modifies Equation 6,
Xk+1 = ÃXk Θ (7) 4.1.3 Summary
1 1
where à = IN + D− 2 AD− 2 . Spectral CNN [21] relies on the eigen-decomposition of the
The graph convolution defined by 1stChebNet is local- Laplacian matrix. It has three effects. First, any perturbation
ized in space. It bridges the gap between spectral-based to a graph results in a change of eigenbasis. Second, the
methods and spatial-based methods. Each row of the output learned filters are domain dependent, meaning they cannot
represents the latent representation of each node obtained be applied to a graph with a different structure. Third, eigen-
by a linear transformation of aggregated information from decomposition requires O(N 3 ) computation and O(N 2 )
the node itself and its neighboring nodes with weights memory. Filters defined by ChebNet [12] and 1stChebNet
specified by the row of Ã. From the perspective of spatial- [14] are localized in space. The learned weights can be
based methods, the adjacency matrix à not necessarily shared across different locations in a graph. However, a
common drawback of spectral methods is they need to
2. Due to its impressive performance in many node classification
tasks, 1stChebNet is simply termed as GCN and is considered as a load the whole graph into the memory to perform graph
strong baseline in the research community. convolution, which is not efficient in handling big graphs.
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 8

ℎ() ℎ$( ℎ(% ℎ(&*$ ℎ(&


rium is reached. To handle heterogeneous graphs, the spatial

!"#$ !"#$ !"#$
graph convolution of GNNs is defined as
htv = f (lv , lco [v], ht−1
ne [v], lne [v]) (8)
(a) Recurrent-based
where lv denotes the label attributes of node v , lco [v] denotes
the label attributes of corresponding edges of node v , htne [v]
ℎ() ℎ$( ℎ(% ℎ(&*$ ℎ(&
denotes the hidden representations of node v ’s neighbors at

!"#$ !"#% !"#&
time step t, and lne [v] denotes the label attributes of node
v ’s neighbors.
To ensure convergence, the recurrent function f (·) must
(b) Composition-based
be a contraction mapping, which shrinks the distance be-
tween two points after mapping. In the case of f (·) is a
Fig. 7: Recurrent-based v.s. Composition-based Spatial neural network, a penalty term has to be imposed on the
GCNs. Jacobian matrix of parameters. GNNs uses the Almeida-
Pineda algorithm [80], [81] to train its model. The core idea
is to run the propagation process to reach fixed points and
4.2 Spatial-based Graph Convolutional Networks then perform the backward procedure given the converged
Imitating the convolution operation of a conventional con- solution.
volution neural network on an image, spatial-based meth-
Gated Graph Neural Networks (GGNNs) employs gated
ods define graph convolution based on a node’s spatial
recurrent units(GRU) [82] as the recurrent function, reduc-
relations. To relate images with graphs, images can be
ing the recurrence to a fixed number of steps. The spatial
considered as a special form of a graph with each pixel
graph convolution of GGNNs is defined as
representing a node. As illustrated in Figure 1a, each pixel
is directly connected to its nearby pixels. With a 3 × 3
X
htv = GRU (hvt−1 , Whtu ) (9)
window, the neighborhood of each node is its surrounding u∈N (v)
eight pixels. The positions of these eight pixels indicate an
ordering of a node’s neighbors. A filter is then applied Different from GNNs, GGNNs use back-propagation
to this 3 × 3 patch by taking the weighted average of through time (BPTT) to learn the parameters. The advantage
pixel values of the central node and its neighbors across is that it no longer needs to constrain parameters to ensure
each channel. Due to the specific ordering of neighboring convergence. However, the downside of training by BPTT
nodes, the trainable weights are able to be shared across is that it sacrifices efficiency both in time and memory. This
different locations. Similarly, for a general graph, the spatial- is especially problematic for large graphs, as GGNNs need
based graph convolution takes the aggregation of the cen- to run the recurrent function multiple times over all nodes,
tral node representation and its neighbors representation requiring intermediate states of all nodes to be stored in
to get a new representation for this node, as depicted by memory.
Figure 1b. To explore the depth and breadth of a node’s Stochastic Steady-state Embedding (SSE). To improve the
receptive field, a common practice is to stack multiple learning efficiency, the SSE algorithm [20] updates the node
graph convolution layer together. According to the different latent representations stochastically in an asynchronous
approaches of stacking convolution layers, spatial-based fashion. As shown in Algorithm 1, SSE recursively estimates
GCNs can be further divided into two categories, recurrent- node latent representations and updates the parameters
based and composition-based spatial GCNs. Recurrent-based with sampled batch data. To ensure convergence to steady
methods apply the same graph convolution layer to update states, the recurrent function of SSE is defined as a weighted
hidden representations, while composition-based methods average of the historical states and new states,
apply a different graph convolution layer to update hidden X
representations. Figure 7 illustrates this difference. In the hv t = (1 − α)hv t−1 + αW1 σ(W2 [xv , [hu t−1 , xu ]])
following, we give an overview of these two branches. u∈N (v)
(10)
4.2.1 Recurrent-based Spatial GCNs Though summing neighborhood information implicitly con-
The main idea of recurrent-based methods is to update a siders node degree, it remains questionable whether the
node’s latent representation recursively until a stable fixed scale of this summation affects the stability of this algorithm.
point is reached. This is done by imposing constraints on
recurrent functions [18], employing gate recurrent unit ar- 4.2.2 Composition Based Spatial GCNs
chitectures [19], updating node latent representations asyn- Composition-based methods update the nodes’ representa-
chronously and stochastically [20]. In the following, we will tions by stacking multiple graph convolution layers.
introduce these three methods.
Message Passing Neural Networks (MPNNs). Gilmer et al.
Graph Neural Networks(GNNs) Being one of the earliest [13] generalize several existing graph convolution networks
works on graph neural networks, GNNs recursively update including [12], [14], [19], [21], [56], [83], [84] into a uni-
node latent representations until convergence [18]. In other fied framework named Message Passing Neural Networks
words, from the perspective of the diffusion process, each (MPNNs). The MPNNs consists of two phases, the message
node exchanges information with its neighbors until equilib- passing phase and the readout phase. The message passing
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 9

ALGORITHM 1: Learning with Stochastic Fixed Point


Iteration [20]

Initialize parameters,{h0v }v∈V


for k = 1 to K do
for t = 1 to T do
Sample n nodes from the whole node set V
Use Equation 10 to update hidden
representations of sampled n nodes
end
for p = 1 to P do Fig. 8: Learning Process of GraphSage [25]
Sample m nodes from the labeled node set V
Forward model according to Equation 10
Back-propagate gradients 4.2.3 Miscellaneous Variants of Spatial GCNs
end Diffusion Convolution Neural Networks (DCNN) [47]
end proposed a graph convolution network which encapsulates
the graph diffusion process. A hidden node representation
is obtained by independently convolving inputs with power
phase actually runs T -step spatial-based graph convolu- series of transition probability matrix. The diffusion convo-
tions. The graph convolution operation is defined through lution operation of DCNN is formulated as
a message function Mt (·) and an updating function Ut (·) Zm m m
i,j,: = f (Wj,: Pi,j,: Xi,: ) (14)
according to
X In Equation 14, zm
i,j,: denotes the hidden representation
htv = Ut (ht−1
v , Mt (ht−1 t−1
v , hw , evw )) (11) of node i for hop j in graph m, Pm :,j,: denotes the prob-
w∈N (v) ability transition matrix of hop j in graph m, and Xm i,:
The readout phase is actually a pooling operation which denote the input features of node i in graph m, where
produces a representation of the entire graph based on zm ∈ RNm ×H×F , W ∈ RH×F , Pm ∈ RNm ×H×Nm and
hidden representations of each individual node. It is defined Xm ∈ RNm ×F .
as Though covering a larger receptive field through higher
ŷ = R(hTv |v ∈ G) (12) orders of the transition matrix, the DCNN model needs
2
O(Nm H) memory, causing severe problems when applying
Through the output function R(·), the final representation ŷ it to large graphs.
is used to perform graph-level prediction tasks. The authors
PATCHY-SAN [27] uses a standard convolution neural net-
present that several other graph convolution networks fall
work (CNN) to solve graph classification tasks. To do this,
into their framework by assuming different forms of Ut (·)
it converts graph-structured data into grid-structured data.
and Mt (·).
First, it selects a fixed number of nodes for each graph using
GraphSage [25] introduces the notion of the aggregation a graph labeling procedure. A graph labeling procedure
function to define graph convolution. The aggregation func- essentially assigns a ranking to each node in the graph,
tion essentially assembles a node’s neighborhood informa- which can be based on node-degree, centrality, Weisfeiler-
tion. It must be invariant to permutations of node orderings Lehman color [85] [86] and etc. Second, as each node in a
such as mean, sum and max function. The graph convolu- graph can have a different number of neighbors, PATCHY-
tion operation is defined as, SAN selects and orders a fixed number of neighbors for
each node according to their graph labelings. Finally, after
htv = σ(Wt · aggregatet (ht−1 t−1
v , {hu , ∀u ∈ N (v)}) (13) the grid-structured data with fixed-size is formed, PATCHY-
SAN employed standard CNN to learn the graph hidden
Instead of updating states over all nodes, GraphSage
representations. Utilizing standard CNN in GCNs has the
proposes a batch-training algorithm, which improves scal-
advantage of keeping shift-invariance, which relies on the
ability for large graphs. The learning process of GraphSage
sorting function. As a result, the ranking criteria in the node
consists of three steps. First, it samples a node’s local k-hop
selection and ordering process is of paramount importance.
neighborhood with fixed-size. Second, it derives the central
In PATCHY-SAN, the ranking is based on graph labelings.
node’s final state by aggregating its neighbors feature in-
However, graph labelings only take graph structures into
formation. Finally, it uses the central node’s final state to
consideration, ignoring node feature information.
make predictions and backpropagate errors. This process is
illustrated in Figure 8. Large-scale Graph Convolution Networks (LGCN). In
Assuming the number of neighbors to be sampled at tth follow-up work, large-scale graph convolution networks
hop is st , the time complexity of GraphSage in one batch (LGCN) [28] proposes a ranking method based on node fea-
QT
is O( t=1 st ). Therefore the computation cost increases ex- ture information. Unlike PATCHY-SAN, LGCN uses stan-
ponentially with the increase of t. This prevents GraphSage dard CNN to generate node-level outputs. For each node,
from having a deep architecture. However, in practice, the LGCN assembles a feature matrix of its neighborhood and
authors find that with t = 2 GraphSage already achieves sorts this feature matrix along each column. The first k
high performance. rows of the sorted feature matrix are taken as the input
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 10

grid-data for the target node. In the end, LGCN applies 1D from original feature data. Mean/max/sum pooling is the
CNN on the resultant inputs to get the target node’s hidden most primitive and most effective way of implementing this
representation. While deriving graph labelings in PATCHY- since calculating the mean/max/sum value in the pooling
SAN requires complex pre-processing, sorting feature val- window is rapid.
ues in LGCN does not need a pre-processing step, making
it more efficient. To suit the scenario of large-scale graphs, hG = mean/max/sum(hT1 , hT2 , ..., hTn ) (15)
LGCN proposes a subgraph training strategy, which puts
the sampled subgraphs into a mini-batch. Henaff et al. [22] prove that performing a simple
max/mean pooling at the beginning of the network is espe-
Mixture Model Network (MoNet) [26] unifies standard cially important to reduce the dimensionality in the graph
CNN with convolutional architectures on non-Euclidean domain and mitigate the cost of the expensive graph Fourier
domains. While several spatial-based approaches ignore the transform operation.
relative positions between a node and its neighbors when Defferrard et al. optimize max/min pooling and devise
aggregating neighborhood feature information, MoNet in- an efficient pooling strategy in their approach ChebNet [12].
troduce pseudo-coordinates and weight functions to let the Input graphs are first processed by the coarsening process
weight of a node’s neighbor be determined by the relative described in Fig 5a . After coarsening, the vertices of the
position (pseudo-coordinates) between the node and its input graph and its coarsened versions are reformed in a
neighbor. Under such a framework, several approaches on balanced binary tree. Arbitrarily ordering the nodes at the
manifolds such as Geodesic CNN (GCNN) [87], Anisotropic coarsest level then propagating this ordering to the lower
CNN(ACNN) [88], Spline CNN [89], and on graphs such level in the balanced binary tree would finally produce a
as GCN [14], DCNN [47] can be generalized as special regular ordering in the finest level. Pooling such a rear-
instances of MoNet. These approaches under the framework ranged 1D signal is much more efficient than the original.
of MoNet have fixed weight functions. MoNet instead pro- Zhang et al. also propose a framework DGCNN [58]
poses a Gaussian kernel with learnable parameters to freely with a similar pooling strategy named SortPooling which
adjust the weight function. performs pooling by rearranging vertices to a meaningful
order. Different from ChebNet [12], DGCNN sorts vertices
4.2.4 Summary
according to their structural roles within the graph. The
Spatial-based methods define graph convolutions via ag- graph’s unordered vertex features from spatial graph con-
gregating feature information from neighbors. According to volutions are treated as continuous WL colors [85], and they
different ways of stacking graph convolution layers, spatial- are then used to sort vertices. In addition to sorting the
based methods are split into two groups, recurrent-based vertex features, it unifies the graph size to k by truncat-
and composition-based. While recurrent-based approaches ing/extending the graph’s feature tensor. The last n−k rows
try to obtain nodes’ steady states, composition-based ap- are deleted if n > k , otherwise k − n zero rows are added.
proaches try to incorporate higher orders of neighborhood This method enhances the pooling network to improve the
information. In each layer, both two groups have to update performance of GCNs by solving one challenge underlying
hidden states over all nodes during training. However, it is graph-structured tasks which is referred to as permutation
inefficient to store all the intermediate states into memory. invariant. Verma and Zhang propose graph capsule net-
To address this issue, several training strategies have been works [93] which further explore the permutation invariant
proposed, including sub-graph training for composition- problem for graph data.
based approaches such as GraphSage [25] and stochastically Recently a pooling module, DIFFPOOL [59], is proposed
asynchronous training for recurrent-based approaches such which can generate hierarchical representations of graphs
as SSE [20]. In addition, recent advances in spatial-based and can be combined with not only CNNs, but also var-
approaches tend to construct more complex network ar- ious graph neural network architectures in an end-to-end
chitectures. For examples, GeniePath [51] leverages gating fashion. Compared to all previous coarsening methods,
mechanisms to control the depth and breadth of a node’s DIFFPOOL does not simply cluster the nodes in one graph
neighborhood. DualGCN [52] devises two graph convolu- but provide a general solution to hierarchically pool nodes
tion networks to jointly embed the local consistency and the across a broad set of input graphs. This is done by learning
global consistency knowledge of a graph. Tran et al. [90] a cluster assignment matrix S at layer l referred to as
introduce a hyper-parameter to influence the receptive field S(l) ∈ Rnl ×nl +1 . Two separate GNNs with both input
size of a node. cluster node features X(l) and coarsened adjacency matrix
A(l) are being used to generate the assignment matrix S(l)
4.3 Graph Pooling Modules and embedding matrices Z(l) as follows:
When generalizing convolutional neural networks to graph-
Z(l) = GN Nl,embed (A(l) , X(l) ) (16)
structured data, another key component, graph pooling
module, is also of vital importance, particularly for graph- S(l) = sof tmax(GN Nl,pool (A(l) , X(l) )) (17)
level classification tasks [58], [59], [91]. According to Xu
et al. [92], pooling-assisted GCNs are as powerful as the Equation 16 and 17 can be implemented with any
Weisfeiler-Lehman test [85] in distinguishing graph struc- standard GNN module, which processes the same input
tures. Similar to the original pooling layer which comes data but has distinct parametrizations since the roles they
with CNNs, graph pooling module could easily reduce the play in the framework are different. The GN Nl,embed will
variance and computation complexity by down-sampling produce new embeddings while the GN Nl,pool generates a
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 11

probabilistic assignment of the input nodes to nl+1 clusters. inputs can be incorporated into the aggregation function
The Softmax function is applied in a row-wise fashion in (e.g. [13], [18], [54], [55], [56]).
Equation 17. As a result, each row of S(l) corresponds to one As a result, spatial models have attracted increasing
of the nl nodes(or clusters) at layer l, and each column of attention in recent years [26].
S(l) corresponds to one of the nl at the next layer. Once we
have Z(l) and S(l) , the pooling operation comes as follows:
5 B EYOND G RAPH C ONVOLUTIONAL N ETWORKS
(l+1) (l) T (l) nl+1 ×d In this section, we review other graph neural networks
X =S Z ∈R (18)
including graph attention neural networks, graph auto-
T
A(l+1) = S(l) A(l) S(l) ∈ Rnl+1 ×nl+1 (19) encoder, graph generative networks, and graph spatial-
temporal networks. In Table 4, we provide a summary of
Equation 18 takes the cluster embeddings Z(l) then the main approaches under each category.
aggregates these embeddings according to the cluster as-
signments S(l) to calculate embedding for each of the nl+1
5.1 Graph Attention Networks
clusters. Initial cluster embedding would be node repre-
sentation. Similarly, Equation 19 takes the adjacency matrix Attention mechanisms have almost become a standard in
A(l) as inputs and generates a coarsened adjacency matrix sequence-based tasks [94]. The virtue of attention mecha-
denoting the connectivity strength between each pair of the nisms is their ability to focus on the most important parts
clusters. of an object. This specialty has been proven to be useful
Overall, DIFFPOOL [59] redefines the graph pooling for many tasks, such as machine translation and natural
module by using two GNNs to cluster the nodes. Any language understanding. Thanks to the increased model
standard GCN module is able to combine with DIFFPOOL, capacity of attention mechanisms, graph neural networks
not only to achieve enhanced performance but also to speed also benefit from this by using attention during aggregation,
up the convolution operation. integrating outputs from multiple models, and generating
importance-oriented random walks. In this section, we will
discuss how attention mechanisms are being used in graph-
4.4 Comparison Between Spectral and Spatial Models structured data.
As the earliest convolutional networks for graph data,
spectral-based models have achieved impressive results in 5.1.1 Methods of Graph Attention Networks
many graph related analytics tasks. These models are ap- Graph Attention Network (GAT) [15] is a spatial-based
pealing in that they have a theoretical foundation in graph graph convolution network where the attention mechanism
signal processing. By designing new graph signal filters is involved in determining the weights of a node’s neighbors
[24], we can theoretically design new graph convolution when aggregating feature information. The graph convolu-
networks. However, there are several drawbacks to spectral- tion operation of GAT is defined as,
based models. We illustrate this in the following from three X
aspects, efficiency, generality and flexibility.
hti = σ( α(hit−1 , ht−1
j )W
t−1 t−1
hj ) (20)
j∈Ni
In terms of efficiency, the computational cost of spectral-
based models increases dramatically with the graph size where α(·) is an attention function which adaptively con-
because they either need to perform eigenvector compu- trols the contribution of a neighbor j to the node i. In order
tation [21] or handle the whole graph at the same time, to learn attention weights in different subspaces, GAT uses
which makes them difficult to parallel or scale to large multi-head attentions.
graphs. Spatial based models have the potential to handle X
large graphs as they directly perform the convolution in the
hti =kK
k=1 σ( αk (hit−1 , ht−1
j )Wk
t−1 t−1
hj ) (21)
j∈Ni
graph domain via aggregating the neighboring nodes. The
computation can be performed in a batch of nodes instead where k denotes concatenation.
of the whole graph. When the number of neighboring nodes
Gated Attention Network (GAAN) [29] also employs the
increases, sampling techniques [25], [28] can be developed
multi-head attention mechanism in updating a node’s hid-
to improve efficiency.
den state. However, rather than assigning equal weight to
In terms of generality, spectral-based models assumed
each head, GAAN introduces a self-attention mechanism
a fixed graph, making them generalize poorly to new or
which computes a different weight for each head. The
different graphs. Spatial-based models, on the other hand,
updating rule is defined as,
perform graph convolution locally on each node, where
weights can be easily shared across different locations and X
structures. hti = φo (xi ⊕ kK k
k=1 gi αk (ht−1
i , ht−1 t−1
j )φv (hj )) (22)
In terms of flexibility, spectral-based models are limited j∈Ni
to work on undirected graphs. There is no clear definition
where φo (·) and φv (·) denote feedforward neural networks
of the Laplacian matrix on directed graphs so that the only
and gik is the attention weight of the k th attention head.
way to apply spectral-based models to directed graphs is to
transfer directed graphs to undirected graphs. Spatial-based Graph Attention Model (GAM) [60] proposes a recur-
models are more flexible to deal with multi-source inputs rent neural network model to solve graph classification
such as edge features and edge directions because these problems, which processes informative parts of a graph
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 12

TABLE 4: Summary of Alternative Graph Neural Networks (Graph Convolutional Networks Excluded). We summarize
methods based on their inputs, outputs, targeted tasks, and whether a method is GCN-based. Inputs indicate whether a
method suits attributed graphs (A), directed graphs (D), and spatial-temporal graphs (S).
Inputs GCN
Category Approaches Outputs Tasks
A D S Based
Graph GAT (2017) [15] 3 3 7 node labels node classification 3
Attention GAAN (2018) [29] 3 3 7 node labels node classification 3
Networks GAM (2018) [60] 3 3 7 graph labels graph classification 7
Attention Walks (2018) [61] 7 7 7 node embedding network embedding 7
GAE (2016) [62] 3 7 7 reconstructed adajacency matrix network embedding 3
Graph ARGA (2018) [64] 3 7 7 reconstructed adajacency matrix network embedding 3
Auto-encoder reconstructed sequences of
NetRA (2018) [65] 7 7 7 network embedding 7
random walks
DNGR (2016) [42] 7 7 7 reconstructed PPMI matrix network embedding 7
SDNE (2016) [43] 7 3 7 reconstructed adajacency matrix network embedding 7
DNRE (2018) [66] 3 7 7 reconstructed node embedding network embedding 7
MolGAN (2018) [69] 3 7 7 new graphs graph generation 3
Graph
DGMG (2018) [68] 7 7 7 new graphs graph generation 3
Generative
GraphRNN (2018) [67] 7 7 7 new graphs graph generation 7
Networks
NetGAN (2018) [70] 7 7 7 new graphs graph generation 7
spatial-temporal
DCRNN (2018) [73] 7 7 3 node value vectors 3
Graph forecasting
Spatial-Temporal spatial-temporal
CNN-GCN (2017) [74] 7 7 3 node value vectors 3
Networks forecasting
spatial-temporal
ST-GCN (2018) [75] 7 7 3 graph labels 3
classification
spatial-temporal
Structural RNN (2016) [76] 7 7 3 node labels/value vectors 7
forecasting

by adaptively visiting a sequence of important nodes. The the umbrella of graph attention networks, they can also be
GAM model is defined as considered as spatial-based graph convolution networks at
the same time. The advantage of GAT [15] and GAAN [29]
ht = fh (fs (rt−1 , vt−1 , g; θs ), ht−1 ; θh ) (23) is that they can adaptively learn the importance weights of
neighbors as illustrated in Fig 6. However, the computa-
where fh (·) is an LSTM network, fs is the step network tion cost and memory consumption increase rapidly as the
which takes a step from the current node vt−1 to one of its attention weights between each pair of neighbors must be
neighbors ct , prioritizing those whose type have a higher computed.
rank in vt−1 which is generated by a policy network:

rt = fr (ht ; θr ) (24) 5.2 Graph Auto-encoders


where rt is a stochastic rank vector which indicates which Graph auto-encoders are one class of network embedding
node is more important and thus should be further explored approaches which aim at representing network vertices into
with high priority, ht contains the historical information that a low-dimensional vector space by using neural network
the agent has aggregated from the exploration of a graph, architectures. A typical solution is to leverage multi-layer
and is used to make a prediction for the graph label. perceptrons as the encoder to obtain node embeddings,
Attention Walks [61] learns node embeddings through where a decoder reconstructs a node’s neighborhood statis-
random walks. Unlike DeepWalk [41] using fixed apriori, tics such as positive pointwise mutual information (PPMI)
Attention Walks factorizes the co-occurrence matrix with [42] or the first and second order of proximities [43]. Re-
differentiable attention weights. cently, researchers have explored the use of GCN [14] as an
encoder, combining GCN [14] with GAN [95], or combining
C
X LSTM [7] with GAN [95] in designing a graph auto-encoder.
E[D] = P̃(0) ak (P)k (25)
We will first review GCN based autoencoder and then
k=1
summarize other variants in this category.
where D denotes the co-occurrence matrix, P̃(0) denotes
the initial position matrix, and P denotes the probability
5.2.1 GCN Based Auto-encoders
transition matrix.
Graph Auto-encoder (GAE) [62] firstly integrates GCN
5.1.2 Summary [14] into a graph auto encoder framework. The encoder is
Attention mechanisms contribute to graph neural networks defined as
in three different ways, namely assigning attention weights Z = GCN (X, A) (26)
to different neighbors when aggregating feature informa-
tion, ensembling multiple models according to attention while the decoder is defined as
weights, and using attention weights to guide random
walks. Despite categorizing GAT [15] and GAAN [29] under  = σ(ZZT ) (27)
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 13

The framework of GAE is also depicted in Fig 5b. The GAE of adjacent nodes close to each other as much as possible.
can be trained in a variational manner, i.e., to minimize the Specifically, the loss function L1st is defined as
variational lower bound L: n
(k) (k)
X
L1st = Ai,j ||hi − hj ||2 (30)
L = Eq(Z|X,A) [logp (A|Z)] − KL[q(Z|X, A)||p(Z)] (28) i,j=1

The second-order proximity is defined as the distance be-


Adversarially Regularized Graph Autoencoder (ARGA) tween a node’s input and its reconstructed inputs where
[64] employs the training scheme of generative adversarial the input is the corresponding row of the node in the
networks (GANs) [95] to regularize a graph auto-encoder. In adjacency matrix. The goal for the second-order proximity is
ARGA, an encoder encodes a node’s structural information to preserve a node’s neighborhood information. Concretely,
with its features into a hidden representation by GCN the loss function L2nd is defined as
[14], and a decoder reconstructs the adjacency matrix from n
X
the outputs of the encoder. The GANs play a min-max L2nd = ||(x̂i − xi ) bi ||2 (31)
game between a generator and a discriminator in training i=1
generative models. A generator generates “faked samples” The role of vector bi is to penalize non-zero elements more
as real as possible while a discriminator makes its best than zero elements since the inputs are highly sparse. In
to distinguish the “faked samples” from the real ones. A detail, bi,j = 1 if Ai,j = 0 and bi,j = β > 1 if Ai,j = 1.
GAN helps the ARGA to regularize the learned hidden Overall, the objective function is defined as
representations of nodes to follow a prior distribution. In
detail, the encoder, working as a generator, tries to make L = L2nd + αL1st + λLreg (32)
the learned node hidden representations indistinguishable
where Lreg is the L2 regularization term.
from a real prior distribution. A discriminator, on the other
side, tries to identify whether the learned node hidden Deep Recursive Network Embedding (DRNE) [66] di-
representations are generated from the encoder or from a rectedly reconstructs a node’s hidden state instead of the
real prior distribution. whole graph statistics. Using an aggregation function as the
encoder, DRNE designs the loss function as,
X
5.2.2 Miscellaneous Variants of Graph Auto-encoders L= ||hv − aggregate(hu |u ∈ N (v))||2 (33)
Network Representations with Adversarially Regularized v∈V

Autoencoders (NetRA) [65] is a graph auto-encoder frame- One innovation of DRNE is that it chooses an LSTM as
work which shares a similar idea with ARGA. It also the aggregation function where the neighbors’ sequence is
regularizes node hidden representations to comply with a ordered by their node degree.
prior distribution via adversarial training. Instead of recon-
structing the adjacency matrix, they recover node sequences 5.2.3 Summary
sampled from random walks by a sequence-to-sequence DNGR and SDNE learn node embeddings only given the
architecture [96]. topological structures, while GAE, ARGA, NetRA, DRNE
learn node embeddings when both topological information
Deep Neural Networks for Graph Representations
and node content features are available. One challenge of
(DNGR) [42] uses the stacked denoising autoencoder
graph auto-encoders is the sparsity of the adjacency matrix
[97] to reconstruct the pointwise mutual information ma-
trix(PPMI). The PPMI matrix intrinsically captures nodes
A, causing the number of positive entries of the decoder to
be far less than the negative ones. To tackle this issue, DNGR
co-occurrence information when a graph is serialized as
reconstructs a denser matrix namely the PPMI matrix, SDNE
sequences by random walks. Formally, the PPMI matrix is
imposes a penalty to zero entries of the adjacency matrix,
defined as
GAE reweights the terms in the adjacency matrix, and
count(v1 , v2 ) · |D| NetRA linearizes Graphs into sequences.
PPMIv1 ,v2 = max(log( ), 0) (29)
count(v1 )count(v2 )
P 5.3 Graph Generative Networks
where |D| = v1 ,v2 count(v1 , v2 ) and v1 , v2 ∈ V . The
The goal of graph generative networks is to generate graphs
stacked denoising autoencoder is able to learn highly non-
given an observed set of graphs. Many approaches to graph
linear regularity behind data. Different from conventional
generative networks are domain specific. For instance, in
neural autoencoders, it adds noise to inputs by randomly
molecular graph generation, some works model a string
switching entries of inputs to zero. The learned latent repre-
representation of molecular graphs called SMILES [98], [99],
sentation is more robust especially when there are missing
[100], [101]. In natural language processing, generating a se-
values present.
mantic or a knowledge graph is often conditioned on a given
Structural Deep Network Embedding (SDNE) [43] uses sentence [102], [103]. Recently, several general approaches
stacked auto encoder to preserve nodes first-order proximity have been proposed. Some work factor the generation pro-
and second-order proximity jointly. The first-order proxim- cess as forming nodes and edges alternatively [67], [68]
ity is defined as the distance between a node’s hidden rep- while others employ generative adversarial training [69],
resentation and its neighbor’s hidden representation. The [70]. The methods in this category either employ GCN as
goal for the first-order proximity is to drive representations building blocks or use different architectures.
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 14

Fig. 9: Framework of MolGAN [70]. A generator first samples an initial vector from a standard normal distribution. Passing
this initial vector through a neural network, the generator outputs a dense adjacency matrix A and a corresponding feature
matrix X . Next, the generator produces a sampled discrete à and X̃ from categorical distributions based on A and X .
Finally, GCN is used to derive a vector representation of the sampled graph. Feeding this graph representation to two
distinct neural networks, a discriminator and a reward network output a score between zero and one separately, which
will be used as feedback to update the model parameters.

5.3.1 GCN Based Graph Generative Networks NetGAN [70] combines LSTM [7] with Wasserstein GAN
Molecular Generative Adversarial Networks (MolGAN) [106] to generate graphs from a random-walk-based ap-
[69] integrates relational GCN [104], improved GAN [105] proach. The GAN framework consists of two modules, a
and reinforcement learning (RL) objective to generate generator and a discriminator. The generator makes its best
graphs with desired properties. The GAN consists of a effort to generate plausible random walks through an LSTM
generator and a discriminator, competing with each other network while the discriminator tries to distinguish faked
to improve the authenticity of the generator. In MolGAN, random walks from the real ones. After training, a new
the generator tries to propose a faked graph along with its graph is obtained by normalizing a co-occurrence matrix
feature matrix while the discriminator aims to distinguish of nodes which occur in a set of random walks.
the faked sample from the empirical data. Additionally,
a reward network is introduced in parallel with the dis- 5.3.3 Summary
criminator to encourage the generated graphs to possess Evaluating generated graphs remains a difficult problem.
certain properties according to an external evaluator. The Unlike synthesized images or audios, which can be di-
framework of MolGAN is described in Fig 9. rectly assessed by human experts, the quality of generated
graphs is difficult to inspect visually. MolGAN and DGMG
Deep Generative Models of Graphs (DGMG) [68] uti-
make use of external knowledge to evaluate the validity
lizes spatial-based graph convolution networks to obtain
of generated molecule graphs. GraphRNN and NetGAN
a hidden representation of an existing graph. The decision
evaluate generated graphs by graph statistics (e.g. node
process of generating nodes and edges is conditioned on the
degrees). Whereas DGMG and GraphRNN generate nodes
resultant graph representation. Briefly, DGMG recursively
and edges sequentially, MolGAN and NetGAN generate
proposes a node to a growing graph until a stopping crite-
nodes and edges jointly. According to [71], the disadvantage
rion is evoked. In each step after adding a new node, DGMG
of the former approaches is that when graphs become large,
repeatedly decides whether to add an edge to the added
modeling a long sequence is not realistic. The challenge
node until the decision turns to false. If the decision is true,
of the latter approaches is that the global properties of a
it evaluates the probability distribution of connecting the
graph are difficult to control. A recent approach [71] adopts
newly added node to all existing nodes and samples one
variational auto-encoder to generate a graph by proposing
node from the probability distribution. After a new node
the adjacency matrix, imposing penalty terms to address
and its connections are added to the existing graph, DGMG
validity constraints. However, as the output space of a graph
updates the graph representation again.
with n nodes is n2 , none of these methods is scalable to large
graphs.
5.3.2 Miscellaneous Graph Generative Networks
GraphRNN [67] exploits deep graph generative models
through two-level recurrent neural networks. The graph- 5.4 Graph Spatial-Temporal Networks
level RNN adds a new node each time to a node sequence Graph spatial-temporal networks capture spatial and tem-
while the edge level RNN produces a binary sequence poral dependencies of a spatial-temporal graph simultane-
indicating connections between the newly added node and ously. Spatial-temporal graphs have a global graph structure
previously generated nodes in the sequence. To linearize with inputs to each node which are changing across time.
a graph into a sequence of nodes for training the graph For instance, in traffic networks, each sensor taken as a
level RNN, GraphRNN adopts the breadth-first-search (BFS) node records the traffic speed of a certain road continuously
strategy. To model the binary sequence for training the edge- where the edges of the traffic network are determined by
level RNN, GraphRNN assumes multivariate Bernoulli or the distance between pairs of sensors. The goal of graph
conditional Bernoulli distribution. spatial-temporal networks can be forecasting future node
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 15

values or labels, or predicting spatial-temporal graph labels. graph according to the distance of the two related nodes.
Recent studies have explored the use of GCNs [75] solely, In this way, the adjacency matrix can be represented as a
a combination of GCNs with RNN [73] or CNN [74], and summation of K adjacency matrices where K is the number
a recurrent architecture tailored to graph structures [76]. In of labels. Then ST-GCN applies GCN [14] with a different
the following, we introduce these methods. weight matrix to each of the K adjacency matrix and sums
them. X −1 −1
5.4.1 GCN Based Graph Spatial-Temporal Networks fout = Λj 2 Aj Λj 2 fin Wj (37)
Diffusion Convolutional Recurrent Neural Network j

(DCRNN) [73] introduces diffusion convolution as graph


5.4.2 Miscellaneous Variants
convolution for capturing spatial dependency and uses
sequence-to-sequence architecture [96] with gated recurrent Structural-RNN [76] proposes a recurrent structured frame-
units (GRU) [82] to capture temporal dependency. work named Structural-RNN. The aim of Structural-RNN
Diffusion convolution models a truncated diffusion pro- is to predict node labels at each time step. In Structural-
cess with the forward and backward directions. Formally, RNN, it comprises of two kinds of RNNs, namely nodeRNN
the diffusion convolution is defined as and edgeRNN. The temporal information of each node and
each edge is passed through a nodeRNN and an edgeRNN
K−1
respectively. Since assuming different RNNs for different
X nodes and edges increases model complexity dramatically,
X:,p ?G f (θ) = (θk1 (D−1 k −1 T k
O A) + θk2 (DI A ) )X:,p
they instead split nodes and edges into semantic groups.
k=0
(34) For example, a human-object interaction graph consists of
where DO is the out-degree matrix and DI is the in- two groups of nodes, human nodes and object nodes, and
degree matrix. To allow multiple input and output channels, three groups of edges, human-human edges, object-object
DCRNN proposes a diffusion convolution layer, defined as edges, and human-object edges. Nodes or edges in a same
semantic group share the same RNN model. To incorporate
P
X the spatial information, a nodeRNN will take the outputs of
Z:,q = σ( X:,p ?G f (Θq,p,:,: )) (35)
edgeRNN as inputs.
p=1

where X ∈ RN ×P and Z ∈ RN ×Q , Θ ∈ RQ×P ×K×2 , Q 5.4.3 Summary


is the number of output channels and P is the number of The advantage of DCRNN is that it is able to handle long-
input channels. term dependencies because of the recurrent network archi-
To capture temporal dependency, DCRNN processes the tectures. Though simpler than DCRNN, CNN-GCN pro-
inputs of GRU using a diffusion convolution layer so that cesses spatial-temporal graphs more efficiently owing to the
the recurrent unit simultaneously receives history informa- fast implementation of 1D CNN. ST-GCN considers tempo-
tion from the last time step and neighborhood information ral flow as graph edges, resulting in the size of the adjacency
from graph convolution. The modified GRU in DCRNN is matrix growing quadratically. On the one hand, it increases
named as the diffusion convolutional gated recurrent Unit the computation cost of the graph convolution layer. On the
(DCGRU), other hand, to capture the long-term dependency, the graph
convolution layer has to be stacked many times. Structural-
r(t) = sigmoid(Θr ?G [X(t) , H(t−1) ] + br ) RNN improves model efficiency by sharing the same RNN
within the same semantic group. However, Structural-RNN
u(t) = sigmoid(Θu ?G [X(t) , H(t−1) ] + bu ) demands human prior knowledge to split the semantic
(36)
C(t) = tanh(ΘC ?G [X(t) , (r(t) H(t−1) )] + br ) groups.
H(t) = u(t) H(t−1) + (1 − u(t) ) C(t)
To meet the demands of multi-step forecasting, DCRNN 6 A PPLICATIONS
adopts sequence-to-sequence architecture [96] where the Graph neural networks have a wide variety of applications.
recurrent unit is replaced by DCGRU. In this section, we first summarize the benchmark datasets
frequently used in the literature. Then we report the bench-
CNN-GCN [74] interleaves 1D-CNN with GCN [14] to mark performance on four commonly used datasets and
learn spatial-temporal graph data. For an input tensor list the available open source implementations of graph
X ∈ RT ×N ×D , the 1D-CNN layer slides over X[:,i,:] along neural networks. Finally, we provide practical applications
the time axis to aggregate temporal information for each of graph neural networks in various domains.
node while the GCN layer operates on X[i,:,:] to aggregate
spatial information at each time step. The output layer is a
linear transformation, generating a prediction for each node. 6.1 Datasets
The framework of CNN-GCN is depicted in Fig 5c. In our survey, we count the frequency of each dataset which
occurs in the papers reviewed in this work, and report in
Spatial Temporal GCN (ST-GCN) [75] adopts a different
Table 5 the datasets which occur at least twice.
approach by extending the temporal flow as graph edges so
that spatial and temporal information can be extracted using Citation Networks consist of papers, authors and their
a unified GCN model at the same time. ST-GCN defines relationship such as citation, authorship, co-authorship. Al-
a labelling function to assign a label to each edge of the though citation networks are directed graphs, they are often
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 16

treated as undirected graphs in evaluating model perfor- NewsGroup dataset consists of around 20,000 News Group
mance with respect to node classification, link prediction, (NG) text documents categorized by 20 news types. The
and node clustering tasks. There are three popular datasets graph of the 20-NewsGroup is constructed by representing
for paper-citation networks, Cora, Citeseer and Pubmed. each document as a node and using the similarities between
The Cora dataset contains 2708 machine learning publi- nodes as edge weights.
cations grouped into seven classes. The Citeseer dataset
Others There are several other datasets worth mentioning.
contains 3327 scientific papers grouped into six classes. Each
The METR-LA is a traffic dataset collected from the high-
paper in Cora and Citeseer is represented by a one-hot
ways of Los Angeles County. The MovieLens-1M dataset
vector indicating the presence or absence of a word from
from the MovieLens website contains 1 million item rat-
a dictionary. The Pubmed dataset contains 19717 diabetes-
ings given by 6k users. It is a benchmark dataset for
related publications. Each paper in Pubmed is represented
recommender systems. The NELL dataset is a knowledge
by a term frequency-inverse document frequency (TF-IDF)
graph obtained from the Never-Ending Language Learning
vector. Furthermore, DBLP is a large citation dataset with
project. It consists of facts represented by a triplet which
millions of papers and authors which have been collected
involves two entities and their relation.
from computer science bibliographies. The raw dataset of
DBLP can be found on https://dblp.uni-trier.de. A pro-
cessed version of the DBLP paper-citation network is up- 6.2 Benchmarks & Open-source Implementations
dated continuously by https://aminer.org/citation. Of the datasets listed in Table 5, Cora, Pubmed, Citeseer,
and PPI are the most frequently used datasets. They are
Social Networks are formed by user interactions from
often tested to compare the performance of graph convo-
online services such as BlogCatalog, Reddit, and Epinions.
lution networks in node classification tasks. In Table 6, we
The BlogCatalog dataset is a social network which con-
report the benchmark performance of these four datasets,
sists of bloggers and their social relationships. The labels
all of which use standard data splits. Open-source imple-
of bloggers represent their personal interests. The Reddit
mentations facilitate the work of baseline experiments in
dataset is an undirected graph formed by posts collected
deep learning research. Due to the vast number of hyper-
from the Reddit discussion forum. Two posts are linked if
parameters, it is difficult to achieve the same results as
they contain comments by the same user. Each post has a
reported in the literature without using published codes.
label indicating the community to which it belongs. The
In Table 7, we provide the hyperlinks of open-source imple-
Epinions dataset is a multi-relation graph collected from an
mentations of the graph neural network models reviewed in
online product review website where commenters can have
Section 4-5. Noticeably, Fey et al. [89] published a geometric
more than one type of relation, such as trust, distrust, co-
learning library in PyTorch named PyTorch Geometric 3 ,
review, and co-rating.
which implements several graph neural networks including
Chemical/Biological Graphs Chemical molecules and com- ChebNet [12], 1stChebNet [14], GraphSage [25], MPNNs
pounds can be represented by chemical graphs with atoms [13], GAT [15] and SplineCNN [89]. Most recently, the Deep
as nodes and chemical bonds as edges. This category of Graph Library (DGL) 4 is released which provides a fast
graphs is often used to evaluate graph classification perfor- implementation of many graph neural networks with a set
mance. The NCI-1 and NCI-9 dataset contain 4110 and 4127 of functions on top of popular deep learning platforms such
chemical compounds respectively, labeled as to whether as PyTorch and MXNet.
they are active to hinder the growth of human cancer cell
lines. The MUTAG dataset contains 188 nitro compounds, 6.3 Practical Applications
labeled as to whether they are aromatic or heteroaromatic.
Graph neural networks have a wide range of applications
The D&D dataset contains 1178 protein structures, labeled
across different tasks and domains. Despite general tasks at
as to whether they are enzymes or non-enzymes. The QM9
which each category of GNNs is specialized, including node
dataset contains 133885 molecules labeled with 13 chemical
classification, node representation learning, graph classifi-
properties. The Tox21 dataset contains 12707 chemical com-
cation, graph generation, and spatial-temporal forecasting,
pounds labeled with 12 types of toxicity. Another important
GNNs can also be applied to node clustering, link predic-
dataset is the Protein-Protein Interaction network(PPI). It
tion [124], and graph partition [125]. In this section, we
contains 24 biological graphs with nodes represented by
mainly introduce practical applications according to general
proteins and edges represented by the interactions between
domains to which they belong.
proteins. In PPI, each graph is associated with one human
tissue. Each node is labeled with its biological states. Computer Vision One of the biggest application areas for
graph neural networks is computer vision. Researchers have
Unstructured Graphs To test the generalization of graph explored leveraging graph structures in scene graph gener-
neural networks to unstructured data, the k nearest neigh- ation, point clouds classification and segmentation, action
bor graph (k-NN graph) has been widely used. The MNIST recognition and many other directions.
dataset contains 70000 images of size 28×28 labeled with ten In scene graph generation, semantic relationships be-
digits. A typical way to convert an MNIST image to a graph tween objects facilitate the understanding of the semantic
is to construct an 8-NN graph based on its pixel locations. meaning behind a visual scene. Given an image, scene
The Wikipedia dataset is a word co-occurrence network ex-
tracted from the first million bytes of the Wikipedia dump. 3. https://github.com/rusty1s/pytorch geometric
Labels of words represent part-of-speech (POS) tags. The 20- 4. https://www.dgl.ai/
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 17

TABLE 5: Summary of Commonly Used Datasets


Category Dataset Source # Graphs # Nodes # Edges #Features # Labels Citation
[14], [15], [24], [26], [28]
[47], [48], [49], [52], [57]
Cora [107] 1 2708 5429 1433 7
[61], [62], [63], [64], [108]
Citation
[109], [110]
Networks
[14], [15], [28], [49], [52]
Citeseer [107] 1 3327 4732 3703 6 [57], [61], [62], [63], [64]
[109], [110]
[14], [15], [26], [28], [47]
Pubmed [107] 1 19717 44338 500 3 [48], [51], [52], [54], [57]
[62], [64], [70], [109], [110]
dblp.uni-trier.de
DBLP [111](aminer.org 1 - - - - [65], [70], [108], [112]
/citation)
BlogCatalog [113] 1 10312 333983 - 39 [43], [51], [65], [114]
Social
[25], [29], [48], [49], [57]
Networks Reddit [25] 1 232965 11606919 602 41
[109]
Epinions www.epinions.com 1 - - - - [53], [112]
[15], [20], [25], [28], [29]
PPI [115] 24 56944 818716 50 121
[49], [51], [65], [109]
[27], [47], [50], [55], [58]
Chemical/ NCI-1 [116] 4110 - - 37 2
[60], [90], [92], [93]
Biological
NCI-109 [116] 4127 - - 38 2 [27], [47], [55], [93]
Graphs
[27], [47], [55], [58], [90]
MUTAG [117] 188 - - 7 2
[92]
[27], [50], [55], [58], [59]
D&D [118] 1178 - - - 2
[90], [93]
QM9 [119] 133885 - - - 13 [13], [69]
tripod.nih.gov/
tox21 12707 - - - 12 [23], [56]
tox21/challenge/
yann.lecun.com
Unstruct- MNIST 70000 - - - 10 [12], [21], [24], [26], [55]
/exdb/mnist/
ured
www.mattmahoney
Graphs Wikipedia 1 4777 184812 - 40 [65], [114]
.net/dc/textdata
20NEWS [120] 1 18846 - - 20 [12], [42]
METR-LA [121] - - - - - [29], [73]
Others [122]
grouplens.org/
Movie-Lens1M 1 10000 1 Million - - [24], [114]
datasets/
movielens/1m/
Nell [123] 1 65755 266144 61278 210 [14], [49], [52]

TABLE 6: Benchmark performance of four most frequently unmanned vehicles. To identify objects depicted by point
used datasets. The listed methods use the same training, clouds, [130], [131], [132] convert point clouds into k-nearest
validation, and test data for evaluation. neighbor graphs or superpoint graphs, and use graph con-
Method Cora Citeseer Pubmed PPI volution networks to explore the topological structure.
1stChebnet (2016) [14] 81.5 70.3 79.0 - In action recognition, recognizing human actions con-
GraphSage (2017) [25] - - - 61.2 tained in videos facilitates a better understanding of video
GAT (2017) [15] 83.0±0.7 72.5±0.7 79.0±0.3 97.3±0.2
Cayleynets (2017) [24] 81.9±0.7 - - - content from a machine aspect. One group of solutions
StoGCN (2018) [49] 82.0±0.8 70.9±0.2 78.7±0.3 97.8±0.05 detects the locations of human joints in video clips. Human
DualGCN (2018) [52] 83.5 72.6 80.0 -
GAAN (2018) [29] - - - 98.71±0.02
joints which are linked by skeletons naturally form a graph.
GraphInfoMax (2018) [109] 82.3±0.6 71.8±0.7 76.8±0.6 63.8±0.2 Given the time series of human joint locations, [75], [76]
GeniePath (2018) [51] - - 78.5 97.9 applies spatial-temporal neural networks to learn human
LGCN (2018) [28] 83.3±0.5 73.0±0.6 79.5±0.2 77.2±0.2
SSE (2018) [20] - - - 83.6 action patterns.
In addition, the number of possible directions in which
to apply graph neural networks in computer vision is still
graph generation models detect and recognize objects and growing. This includes human-object interaction [133], few-
predict semantic relationships between pairs of objects [126], shot image classification [134], [135], semantic segmentation
[127], [128]. Another application inverses the process by [136], [137], visual reasoning [138] and question answering
generating realistic images given scene graphs [129]. As [139].
natural language can be parsed as semantic graphs where Recommender Systems Graph-based recommender sys-
each word represents an object, it is a promising solution to tems take items and users as nodes. By leveraging the
synthesize images given textual descriptions. relations between items and items, users and users, users
In point clouds classification and segmentation, a point and items, as well as content information, graph-based
cloud is a set of 3D points recorded by LiDAR scans. recommender systems are able to produce high-quality
Solutions for this task enable LiDAR devices to see the recommendations. The key to a recommender system is
surrounding environment, which is typically beneficial for to score the importance of an item to a user. As a result,
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 18

TABLE 7: A Summary of Open-source Implementations


Model Framework Github Link
ChebNet (2016) [12] tensorflow https://github.com/mdeff/cnn graph
1stChebNet (2017) [14] tensorflow https://github.com/tkipf/gcn
GGNNs (2015) [19] lua https://github.com/yujiali/ggnn
SSE (2018) [20] c https://github.com/Hanjun-Dai/steady state embedding
GraphSage (2017) [25] tensorflow https://github.com/williamleif/GraphSAGE
LGCN (2018) [28] tensorflow https://github.com/divelab/lgcn/
SplineCNN (2018) [89] pytorch https://github.com/rusty1s/pytorch geometric
GAT (2017) [15] tensorflow https://github.com/PetarV-/GAT
GAE (2016) [62] tensorflow https://github.com/limaosen0/Variational-Graph-Auto-Encoders
ARGA (2018) [64] tensorflow https://github.com/Ruiqi-Hu/ARGA
DNGR (2016) [42] matlab https://github.com/ShelsonCao/DNGR
SDNE (2016) [43] python https://github.com/suanrong/SDNE
DRNE (2016) [66] tensorflow https://github.com/tadpole/DRNE
GraphRNN (2018) [67] tensorflow https://github.com/snap-stanford/GraphRNN
DCRNN (2018) [73] tensorflow https://github.com/liyaguang/DCRNN
CNN-GCN (2017) [74] tensorflow https://github.com/VeritasYin/STGCN IJCAI-18
ST-GCN (2018) [75] pytorch https://github.com/yysijie/st-gcn
Structural RNN (2016) [76] theano https://github.com/asheshjain399/RNNexp

it can be cast as a link prediction problem. The goal is 7 F UTURE D IRECTIONS


to predict the missing links between users and items. To Though graph neural networks have proven their power
address this problem, Van et al. [9] and Ying et al. [11] et al. in learning graph data, challenges still exist due to the
propose a GCN-based graph auto-encoder. Monti et al. [10] complexity of graphs. In this section, we provide four future
combine GCN and RNN to learn the underlying process that directions of graph neural networks.
generates the known ratings.
Go Deep The success of deep learning lies in deep neu-
Traffic Traffic congestion has become a hot social issue in ral architectures. In image classification, for example, an
modern cities. Accurately forecasting traffic speed, volume outstanding model named ResNet [153] has 152 layers.
or the density of roads in traffic networks is fundamentally However, when it comes to graphs, experimental studies
important in route planning and flow control. [29], [73], [74], have shown that with the increase in the number of layers,
[140] adopt a graph-based approach with spatial-temporal the model performance drops dramatically [110]. According
neural networks. The input to their models is a spatial- to [110], this is due to the effect of graph convolutions in
temporal graph. In this spatial-temporal graph, nodes are that it essentially pushes representations of adjacent nodes
represented by sensors placed on roads, edges are repre- closer to each other so that, in theory, with an infinite times
sented by the distance of pair-wise nodes above a threshold of convolutions, all nodes’ representations will converge to a
and each node contains a time series as features. The goal is single point. This raises the question of whether going deep
to forecast the average speed of a road within a time inter- is still a good strategy for learning graph-structured data.
val. Another interesting application is the taxi-demand pre-
Receptive Field The receptive field of a node refers to a
diction. This greatly helps intelligent transportation systems
set of nodes including the central node and its neighbors.
make use of resources and save energy effectively. Given
The number of neighbors of a node follows a power law
historical taxi demands, location information, weather data,
distribution. Some nodes may only have one neighbor,
and event features, Yao et al. [141] incorporate LSTM, CNN
while other nodes may neighbors as many as thousands.
and node embeddings trained by LINE [142] to form a joint
Though sampling strategies have been adopted [25], [27],
representation for each location to predict the number of
[28], how to select a representative receptive field of a node
taxis demanded for a location within a time interval.
remains to be explored.
Chemistry In chemistry, researchers apply graph neural Scalability Most graph neural networks do not scale well
networks to study the graph structure of molecules. In a for large graphs. The main reason for this is when stacking
molecular graph, atoms function as nodes and chemical multiple layers of a graph convolution, a node’s final state
bonds function as edges. Node classification, graph classi- involves a large number of its neighbors’ hidden states,
fication and graph generation are three main tasks targeting leading to the high complexity of backpropagation. While
at molecular graphs in order to learn molecular fingerprints several approaches try to improve their model efficiency by
[56], [83], to predict molecular properties [13], to infer pro- fast sampling [48], [49] and sub-graph training [25], [28],
tein interfaces [143], and to synthesize chemical compounds they are still not scalable enough to handle deep architec-
[68], [69], [144]. tures with large graphs.
Others There have been initial explorations into applying Dynamics and Heterogeneity The majority of current graph
GNNs to other problems such as program verification [19], neural networks tackle with static homogeneous graphs. On
program reasoning [145], social influence prediction [146], the one hand, graph structures are assumed to be fixed.
adversarial attacks prevention [147], electrical health records On the other hand, nodes and edges from a graph are
modeling [148], [149], brain networks [150], event detection assumed to come from a single source. However, these two
[151] and combinatorial optimization [152]. assumptions are not realistic in many scenarios. In a social
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 19

network, a new person may enter into a network at any time [11] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and
and an existing person may quit the network as well. In J. Leskovec, “Graph convolutional neural networks for web-
scale recommender systems,” in Proceedings of the ACM SIGKDD
a recommender system, products may have different types International Conference on Knowledge Discovery and Data Mining.
where their inputs may have different forms such as texts ACM, 2018, pp. 974–983.
or images. Therefore, new methods should be developed to [12] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional
handle dynamic and heterogeneous graph structures. neural networks on graphs with fast localized spectral filtering,”
in Advances in Neural Information Processing Systems, 2016, pp.
3844–3852.
8 C ONCLUSION [13] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl,
“Neural message passing for quantum chemistry,” in Proceedings
In this survey, we conduct a comprehensive overview of of the International Conference on Machine Learning, 2017, pp. 1263–
graph neural networks. We provide a taxonomy which 1272.
[14] T. N. Kipf and M. Welling, “Semi-supervised classification with
groups graph neural networks into five categories: graph graph convolutional networks,” in Proceedings of the International
convolutional networks, graph attention networks, graph Conference on Learning Representations, 2017.
autoencoders, graph generative networks and graph spatial- [15] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio,
temporal networks. We provide a thorough review, com- and Y. Bengio, “Graph attention networks,” in Proceedings of the
International Conference on Learning Representations, 2017.
parisons, and summarizations of the methods within or [16] M. Gori, G. Monfardini, and F. Scarselli, “A new model for
between categories. Then we introduce a wide range of ap- learning in graph domains,” in Proceedings of the International Joint
plications of graph neural networks. Datasets, open source Conference on Neural Networks, vol. 2. IEEE, 2005, pp. 729–734.
[17] A. Micheli, “Neural network for graphs: A contextual construc-
codes, and benchmarks for graph neural networks are sum-
tive approach,” IEEE Transactions on Neural Networks, vol. 20,
marized. Finally, we suggest four future directions for graph no. 3, pp. 498–511, 2009.
neural networks. [18] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Mon-
fardini, “The graph neural network model,” IEEE Transactions on
Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
ACKNOWLEDGMENT [19] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph
sequence neural networks,” in Proceedings of the International
This research was funded by the Australian Government Conference on Learning Representations, 2015.
through the Australian Research Council (ARC) under [20] H. Dai, Z. Kozareva, B. Dai, A. Smola, and L. Song, “Learning
grants 1) LP160100630 partnership with Australia Govern- steady-states of iterative algorithms over graphs,” in Proceedings
ment Department of Health and 2) LP150100671 partnership of the International Conference on Machine Learning, 2018, pp. 1114–
1122.
with Australia Research Alliance for Children and Youth [21] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral net-
(ARACY) and Global Business College Australia (GBCA). works and locally connected networks on graphs,” in Proceedings
We acknowledge the support of NVIDIA Corporation and of International Conference on Learning Representations, 2014.
[22] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks
MakeMagic Australia with the donation of GPU used for on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
this research. [23] R. Li, S. Wang, F. Zhu, and J. Huang, “Adaptive graph convolu-
tional neural networks,” in Proceedings of the AAAI Conference on
Artificial Intelligence, 2018, pp. 3546–3553.
R EFERENCES [24] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “Cayleynets:
[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only Graph convolutional neural networks with complex rational
look once: Unified, real-time object detection,” in Proceedings of spectral filters,” IEEE Transactions on Signal Processing, vol. 67,
the IEEE conference on computer vision and pattern recognition, 2016, no. 1, pp. 97–109, 2017.
pp. 779–788. [25] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation
[2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards learning on large graphs,” in Advances in Neural Information
real-time object detection with region proposal networks,” in Processing Systems, 2017, pp. 1024–1034.
Advances in neural information processing systems, 2015, pp. 91–99. [26] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M.
[3] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches Bronstein, “Geometric deep learning on graphs and manifolds
to attention-based neural machine translation,” in Proceedings of using mixture model cnns,” in Proceedings of the IEEE Conference
the Conference on Empirical Methods in Natural Language Processing, on Computer Vision and Pattern Recognition, 2017, pp. 5115–5124.
2015, pp. 1412–1421. [27] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional
[4] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, neural networks for graphs,” in Proceedings of the International
M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural Conference on Machine Learning, 2016, pp. 2014–2023.
machine translation system: Bridging the gap between human [28] H. Gao, Z. Wang, and S. Ji, “Large-scale learnable graph convolu-
and machine translation,” arXiv preprint arXiv:1609.08144, 2016. tional networks,” in Proceedings of the ACM SIGKDD International
[5] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, Conference on Knowledge Discovery & Data Mining. ACM, 2018,
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep pp. 1416–1424.
neural networks for acoustic modeling in speech recognition: [29] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, “Gaan:
The shared views of four research groups,” IEEE Signal processing Gated attention networks for learning on large and spatiotem-
magazine, vol. 29, no. 6, pp. 82–97, 2012. poral graphs,” in Proceedings of the Uncertainty in Artificial Intelli-
[6] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, gence, 2018.
speech, and time series,” The handbook of brain theory and neural [30] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez,
networks, vol. 3361, no. 10, p. 1995, 1995. V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro,
[7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” R. Faulkner et al., “Relational inductive biases, deep learning, and
Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997. graph networks,” arXiv preprint arXiv:1806.01261, 2018.
[8] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Van- [31] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh, “Attention
dergheynst, “Geometric deep learning: going beyond euclidean models in graphs: A survey,” arXiv preprint arXiv:1807.07984,
data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2018.
2017. [32] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A
[9] R. van den Berg, T. N. Kipf, and M. Welling, “Graph convolu- survey,” arXiv preprint arXiv:1812.04202, 2018.
tional matrix completion,” stat, vol. 1050, p. 7, 2017. [33] P. Cui, X. Wang, J. Pei, and W. Zhu, “A survey on network em-
[10] F. Monti, M. Bronstein, and X. Bresson, “Geometric matrix com- bedding,” IEEE Transactions on Knowledge and Data Engineering,
pletion with recurrent multi-graph neural networks,” in Advances 2017.
in Neural Information Processing Systems, 2017, pp. 3697–3707.
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 20

[34] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learn- [57] W. Huang, T. Zhang, Y. Rong, and J. Huang, “Adaptive sampling
ing on graphs: Methods and applications,” in Advances in Neural towards fast graph representation learning,” in Advances in Neu-
Information Processing Systems, 2017, pp. 1024–1034. ral Information Processing Systems, 2018, pp. 4563–4572.
[35] D. Zhang, J. Yin, X. Zhu, and C. Zhang, “Network representation [58] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end
learning: A survey,” IEEE Transactions on Big Data, 2018. deep learning architecture for graph classification,” in Proceedings
[36] H. Cai, V. W. Zheng, and K. Chang, “A comprehensive survey of of the AAAI Conference on Artificial Intelligence, 2018.
graph embedding: problems, techniques and applications,” IEEE [59] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec,
Transactions on Knowledge and Data Engineering, 2018. “Hierarchical graph representation learning with differentiable
[37] P. Goyal and E. Ferrara, “Graph embedding techniques, applica- pooling,” in Advances in Neural Information Processing Systems,
tions, and performance: A survey,” Knowledge-Based Systems, vol. 2018, pp. 4801–4811.
151, pp. 78–94, 2018. [60] J. B. Lee, R. Rossi, and X. Kong, “Graph classification using struc-
[38] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, “Tri-party deep tural attention,” in Proceedings of the ACM SIGKDD International
network representation,” in Proceedings of the International Joint Conference on Knowledge Discovery & Data Mining. ACM, 2018,
Conference on Artificial Intelligence. AAAI Press, 2016, pp. 1895– pp. 1666–1674.
1901. [61] S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. A. Alemi, “Watch
[39] X. Shen, S. Pan, W. Liu, Y.-S. Ong, and Q.-S. Sun, “Discrete your step: Learning node embeddings via graph attention,” in
network embedding,” in Proceedings of the International Joint Con- Advances in Neural Information Processing Systems, 2018, pp. 9197–
ference on Artificial Intelligence, 7 2018, pp. 3549–3555. 9207.
[40] H. Yang, S. Pan, P. Zhang, L. Chen, D. Lian, and C. Zhang, [62] T. N. Kipf and M. Welling, “Variational graph auto-encoders,”
“Binarized attributed network embedding,” in IEEE International arXiv preprint arXiv:1611.07308, 2016.
Conference on Data Mining. IEEE, 2018. [63] C. Wang, S. Pan, G. Long, X. Zhu, and J. Jiang, “Mgae: Marginal-
[41] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning ized graph autoencoder for graph clustering,” in Proceedings of
of social representations,” in Proceedings of the ACM SIGKDD the ACM on Conference on Information and Knowledge Management.
international conference on Knowledge discovery and data mining. ACM, 2017, pp. 889–898.
ACM, 2014, pp. 701–710. [64] S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, and C. Zhang, “Adver-
[42] S. Cao, W. Lu, and Q. Xu, “Deep neural networks for learning sarially regularized graph autoencoder for graph embedding.”
graph representations,” in Proceedings of the AAAI Conference on in Proceedings of the International Joint Conference on Artificial
Artificial Intelligence, 2016, pp. 1145–1152. Intelligence, 2018, pp. 2609–2615.
[43] D. Wang, P. Cui, and W. Zhu, “Structural deep network embed- [65] W. Yu, C. Zheng, W. Cheng, C. C. Aggarwal, D. Song, B. Zong,
ding,” in Proceedings of the ACM SIGKDD International Conference H. Chen, and W. Wang, “Learning deep network representations
on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1225– with adversarially regularized autoencoders,” in Proceedings of
1234. the ACM SIGKDD International Conference on Knowledge Discovery
[44] A. Sandryhaila and J. M. Moura, “Discrete signal processing on & Data Mining. ACM, 2018, pp. 2663–2671.
graphs,” IEEE transactions on signal processing, vol. 61, no. 7, pp. [66] K. Tu, P. Cui, X. Wang, P. S. Yu, and W. Zhu, “Deep recursive
1644–1656, 2013. network embedding with regular equivalence,” in Proceedings of
[45] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević, “Discrete the ACM SIGKDD International Conference on Knowledge Discovery
signal processing on graphs: Sampling theory,” IEEE Transactions and Data Mining. ACM, 2018, pp. 2357–2366.
on Signal Processing, vol. 63, no. 24, pp. 6510–6523, 2015. [67] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec,
[46] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst, “Graphrnn: A deep generative model for graphs,” Proceedings of
“Accelerated filtering on graphs using lanczos method,” arXiv International Conference on Machine Learning, 2018.
preprint arXiv:1509.04537, 2015. [68] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia, “Learning
[47] J. Atwood and D. Towsley, “Diffusion-convolutional neural net- deep generative models of graphs,” in Proceedings of the Interna-
works,” in Advances in Neural Information Processing Systems, 2016, tional Conference on Machine Learning, 2018.
pp. 1993–2001. [69] N. De Cao and T. Kipf, “Molgan: An implicit generative model
[48] J. Chen, T. Ma, and C. Xiao, “Fastgcn: fast learning with graph for small molecular graphs,” arXiv preprint arXiv:1805.11973,
convolutional networks via importance sampling,” in Proceedings 2018.
of the International Conference on Learning Representations, 2018. [70] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann, “Net-
[49] J. Chen, J. Zhu, and L. Song, “Stochastic training of graph gan: Generating graphs via random walks,” in Proceedings of the
convolutional networks with variance reduction,” in Proceedings International Conference on Machine Learning, 2018.
of the International Conference on Machine Learning, 2018, pp. 941– [71] T. Ma, J. Chen, and C. Xiao, “Constrained generation of semanti-
949. cally valid graphs via regularizing variational autoencoders,” in
[50] F. P. Such, S. Sah, M. A. Dominguez, S. Pillai, C. Zhang, Advances in Neural Information Processing Systems, 2018, pp. 7110–
A. Michael, N. D. Cahill, and R. Ptucha, “Robust spatial filter- 7121.
ing with graph convolutional neural networks,” IEEE Journal of [72] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, “Struc-
Selected Topics in Signal Processing, vol. 11, no. 6, pp. 884–896, 2017. tured sequence modeling with graph convolutional recurrent
[51] Z. Liu, C. Chen, L. Li, J. Zhou, X. Li, and L. Song, “Geniepath: networks,” arXiv preprint arXiv:1612.07659, 2016.
Graph neural networks with adaptive receptive paths,” arXiv [73] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional
preprint arXiv:1802.00910, 2018. recurrent neural network: Data-driven traffic forecasting,” in
[52] C. Zhuang and Q. Ma, “Dual graph convolutional networks for Proceedings of International Conference on Learning Representations,
graph-based semi-supervised classification,” in Proceedings of the 2018.
World Wide Web Conference on World Wide Web. International [74] B. Yu, H. Yin, and Z. Zhu, “Spatio-temporal graph convolutional
World Wide Web Conferences Steering Committee, 2018, pp. 499– networks: A deep learning framework for traffic forecasting,”
508. in Proceedings of the International Joint Conference on Artificial
[53] T. Derr, Y. Ma, and J. Tang, “Signed graph convolutional net- Intelligence, 2018, pp. 3634–3640.
work,” arXiv preprint arXiv:1808.06354, 2018. [75] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph con-
[54] T. Pham, T. Tran, D. Q. Phung, and S. Venkatesh, “Column volutional networks for skeleton-based action recognition,” in
networks for collective classification,” in Proceedings of the AAAI Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
Conference on Artificial Intelligence, 2017, pp. 2485–2491. [76] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn:
[55] M. Simonovsky and N. Komodakis, “Dynamic edgeconditioned Deep learning on spatio-temporal graphs,” in Proceedings of the
filters in convolutional neural networks on graphs,” in Proceed- IEEE Conference on Computer Vision and Pattern Recognition, 2016,
ings of the IEEE conference on computer vision and pattern recognition, pp. 5308–5317.
2017. [77] S. Pan, J. Wu, X. Zhu, C. Zhang, and P. S. Yu, “Joint structure
[56] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley, feature exploration and regularization for multi-task graph clas-
“Molecular graph convolutions: moving beyond fingerprints,” sification,” IEEE Transactions on Knowledge and Data Engineering,
Journal of computer-aided molecular design, vol. 30, no. 8, pp. 595– vol. 28, no. 3, pp. 715–728, 2016.
608, 2016. [78] S. Pan, J. Wu, X. Zhu, G. Long, and C. Zhang, “Task sensitive fea-
ture exploration and learning for multitask graph classification,”
IEEE transactions on cybernetics, vol. 47, no. 3, pp. 744–758, 2017.
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 21

[79] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Van- [101] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-
dergheynst, “The emerging field of signal processing on graphs: Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-
Extending high-dimensional data analysis to networks and other Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik,
irregular domains,” IEEE Signal Processing Magazine, vol. 30, “Automatic chemical design using a data-driven continuous
no. 3, pp. 83–98, 2013. representation of molecules,” ACS central science, vol. 4, no. 2,
[80] L. B. Almeida, “A learning rule for asynchronous perceptrons pp. 268–276, 2018.
with feedback in a combinatorial environment.” in Proceedings of [102] B. Chen, L. Sun, and X. Han, “Sequence-to-action: End-to-end
the International Conference on Neural Networks, vol. 2. IEEE, 1987, semantic graph generation for semantic parsing,” in Proceedings of
pp. 609–618. the Annual Meeting of the Association for Computational Linguistics,
[81] F. J. Pineda, “Generalization of back-propagation to recurrent 2018, pp. 766–777.
neural networks,” Physical review letters, vol. 59, no. 19, p. 2229, [103] D. D. Johnson, “Learning graphical state transitions,” in Proceed-
1987. ings of the International Conference on Learning Representations, 2016.
[82] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, [104] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov,
F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase rep- and M. Welling, “Modeling relational data with graph convolu-
resentations using rnn encoder-decoder for statistical machine tional networks,” in European Semantic Web Conference. Springer,
translation,” in Proceedings of the Conference on Empirical Methods 2018, pp. 593–607.
in Natural Language Processing, 2014, pp. 1724–1734. [105] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C.
[83] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, Courville, “Improved training of wasserstein gans,” in Advances
T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional in Neural Information Processing Systems, 2017, pp. 5767–5777.
networks on graphs for learning molecular fingerprints,” in [106] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv
Advances in Neural Information Processing Systems, 2015, pp. 2224– preprint arXiv:1701.07875, 2017.
2232. [107] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-
[84] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and Rad, “Collective classification in network data,” AI magazine,
A. Tkatchenko, “Quantum-chemical insights from deep tensor vol. 29, no. 3, p. 93, 2008.
neural networks,” Nature communications, vol. 8, p. 13890, 2017. [108] X. Zhang, Y. Li, D. Shen, and L. Carin, “Diffusion maps for textual
[85] B. Weisfeiler and A. Lehman, “A reduction of a graph to a network embedding,” in Advances in Neural Information Processing
canonical form and an algebra arising during this reduction,” Systems, 2018.
Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16, 1968. [109] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Ben-
[86] B. L. Douglas, “The weisfeiler-lehman method and graph isomor- gio, and R. D. Hjelm, “Deep graph infomax,” arXiv preprint
phism testing,” arXiv preprint arXiv:1101.5211, 2011. arXiv:1809.10341, 2018.
[87] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst, [110] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convo-
“Geodesic convolutional neural networks on riemannian mani- lutional networks for semi-supervised learning,” in Proceedings of
folds,” in Proceedings of the IEEE International Conference on Com- the AAAI Conference on Artificial Intelligence, 2018.
puter Vision Workshops, 2015, pp. 37–45. [111] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer:
[88] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein, “Learning extraction and mining of academic social networks,” in Proceed-
shape correspondence with anisotropic convolutional neural net- ings of the ACM SIGKDD International Conference on Knowledge
works,” in Advances in Neural Information Processing Systems, 2016, Discovery and Data Mining. ACM, 2008, pp. 990–998.
pp. 3189–3197. [112] Y. Ma, S. Wang, C. C. Aggarwal, D. Yin, and J. Tang, “Multi-
[89] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller, “Splinecnn: Fast dimensional graph convolutional networks,” arXiv preprint
geometric deep learning with continuous b-spline kernels,” in arXiv:1808.06099, 2018.
Proceedings of the IEEE Conference on Computer Vision and Pattern [113] L. Tang and H. Liu, “Relational learning via latent social dimen-
Recognition, 2018, pp. 869–877. sions,” in Proceedings of the ACM SIGKDD International Conference
[90] D. V. Tran, A. Sperduti et al., “On filter size in graph convolu- on Knowledge Ciscovery and Data Mining. ACM, 2009, pp. 817–
tional networks,” in 2018 IEEE Symposium Series on Computational 826.
Intelligence (SSCI). IEEE, 2018, pp. 1534–1541. [114] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie,
[91] S. Pan, J. Wu, and X. Zhu, “Cogboost: Boosting for fast cost- and M. Guo, “Graphgan: Graph representation learning with
sensitive graph classification,” IEEE Transactions on Knowledge & generative adversarial nets,” in Proceedings of the AAAI Conference
Data Engineering, no. 1, pp. 1–1, 2015. on Artificial Intelligence, 2017.
[92] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are [115] M. Zitnik and J. Leskovec, “Predicting multicellular function
graph neural networks,” arXiv preprint arXiv:1810.00826, 2018. through multi-layer tissue networks,” Bioinformatics, vol. 33,
[93] S. Verma and Z.-L. Zhang, “Graph capsule convolutional neural no. 14, pp. i190–i198, 2017.
networks,” arXiv preprint arXiv:1805.08090, 2018. [116] N. Wale, I. A. Watson, and G. Karypis, “Comparison of descrip-
[94] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. tor spaces for chemical compound retrieval and classification,”
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Knowledge and Information Systems, vol. 14, no. 3, pp. 347–375,
in Advances in Neural Information Processing Systems, 2017, pp. 2008.
5998–6008. [117] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J.
[95] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- Shusterman, and C. Hansch, “Structure-activity relationship of
Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver- mutagenic aromatic and heteroaromatic nitro compounds. cor-
sarial nets,” in Advances in neural information processing systems, relation with molecular orbital energies and hydrophobicity,”
2014, pp. 2672–2680. Journal of medicinal chemistry, vol. 34, no. 2, pp. 786–797, 1991.
[96] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence [118] P. D. Dobson and A. J. Doig, “Distinguishing enzyme structures
learning with neural networks,” in Advances in Neural Information from non-enzymes without alignments,” Journal of molecular biol-
Processing Systems, 2014, pp. 3104–3112. ogy, vol. 330, no. 4, pp. 771–783, 2003.
[97] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Ex- [119] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilien-
tracting and composing robust features with denoising autoen- feld, “Quantum chemistry structures and properties of 134 kilo
coders,” in Proceedings of the international conference on Machine molecules,” Scientific data, vol. 1, p. 140022, 2014.
learning. ACM, 2008, pp. 1096–1103. [120] T. Joachims, “A probabilistic analysis of the rocchio algorithm
[98] G. L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. C. with tfidf for text categorization.” Carnegie-mellon univ pitts-
Farias, and A. Aspuru-Guzik, “Objective-reinforced generative burgh pa dept of computer science, Tech. Rep., 1996.
adversarial networks (organ) for sequence generation models,” [121] H. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M.
arXiv preprint arXiv:1705.10843, 2017. Patel, R. Ramakrishnan, and C. Shahabi, “Big data and its tech-
[99] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato, “Grammar nical challenges,” Communications of the ACM, vol. 57, no. 7, pp.
variational autoencoder,” arXiv preprint arXiv:1703.01925, 2017. 86–94, 2014.
[100] H. Dai, Y. Tian, B. Dai, S. Skiena, and L. Song, “Syntax-directed [122] B. N. Miller, I. Albert, S. K. Lam, J. A. Konstan, and J. Riedl,
variational autoencoder for molecule generation,” in Proceedings “Movielens unplugged: experiences with an occasionally con-
of the International Conference on Learning Representations, 2018. nected recommender system,” in Proceedings of the international
conference on Intelligent user interfaces. ACM, 2003, pp. 263–266.
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 22

[123] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, question answering,” in Advances in Neural Information Processing
and T. M. Mitchell, “Toward an architecture for never-ending Systems, 2018, pp. 2655–2666.
language learning.” in Proceedings of the AAAI Conference on [140] Z. Cui, K. Henrickson, R. Ke, and Y. Wang, “High-order graph
Artificial Intelligence, 2010, pp. 1306–1313. convolutional recurrent neural network: a deep learning frame-
[124] M. Zhang and Y. Chen, “Link prediction based on graph neural work for network-scale traffic learning and forecasting,” arXiv
networks,” in Advances in Neural Information Processing Systems, preprint arXiv:1802.07007, 2018.
2018. [141] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye,
[125] T. Kawamoto, M. Tsubaki, and T. Obuchi, “Mean-field theory and Z. Li, “Deep multi-view spatial-temporal network for taxi
of graph neural networks in graph partitioning,” in Advances in demand prediction,” in Proceedings of the AAAI Conference on
Neural Information Processing Systems, 2018, pp. 4362–4372. Artificial Intelligence, 2018, pp. 2588–2595.
[126] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation [142] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line:
by iterative message passing,” in Proceedings of the IEEE Confer- Large-scale information network embedding,” in Proceedings of
ence on Computer Vision and Pattern Recognition, vol. 2, 2017. the International Conference on World Wide Web. International
[127] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn World Wide Web Conferences Steering Committee, 2015, pp.
for scene graph generation,” in European Conference on Computer 1067–1077.
Vision. Springer, 2018, pp. 690–706. [143] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur, “Protein interface
[128] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Fac- prediction using graph convolutional networks,” in Advances in
torizable net: an efficient subgraph-based framework for scene Neural Information Processing Systems, 2017, pp. 6530–6539.
graph generation,” in European Conference on Computer Vision. [144] J. You, B. Liu, R. Ying, V. Pande, and J. Leskovec, “Graph
Springer, 2018, pp. 346–363. convolutional policy network for goal-directed molecular graph
[129] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from generation,” in Advances in Neural Information Processing Systems,
scene graphs,” arXiv preprint, 2018. 2018.
[130] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. [145] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to
Solomon, “Dynamic graph cnn for learning on point clouds,” represent programs with graphs,” in Proceedings of the Interna-
arXiv preprint arXiv:1801.07829, 2018. tional Conference on Learning Representations, 2017.
[131] L. Landrieu and M. Simonovsky, “Large-scale point cloud seman- [146] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang, “Deepinf:
tic segmentation with superpoint graphs,” in Proceedings of the Social influence prediction with deep learning,” in Proceedings of
IEEE Conference on Computer Vision and Pattern Recognition, 2018. the ACM SIGKDD International Conference on Knowledge Discovery
[132] G. Te, W. Hu, Z. Guo, and A. Zheng, “Rgcnn: Regular- & Data Mining. ACM, 2018, pp. 2110–2119.
ized graph cnn for point cloud segmentation,” arXiv preprint [147] D. Zügner, A. Akbarnejad, and S. Günnemann, “Adversarial
arXiv:1806.02952, 2018. attacks on neural networks for graph data,” in Proceedings of the
[133] S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu, “Learning human- ACM SIGKDD International Conference on Knowledge Discovery and
object interactions by graph parsing neural networks,” in Proceed- Data Mining. ACM, 2018, pp. 2847–2856.
ings of the European Conference on Computer Vision (ECCV), 2018, [148] E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun, “Gram:
pp. 401–417. graph-based attention model for healthcare representation learn-
[134] V. G. Satorras and J. B. Estrach, “Few-shot learning with graph ing,” in Proceedings of the ACM SIGKDD International Conference on
neural networks,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 787–795.
Learning Representations, 2018. [149] E. Choi, C. Xiao, W. Stewart, and J. Sun, “Mime: Multilevel
[135] M. Guo, E. Chou, D.-A. Huang, S. Song, S. Yeung, and L. Fei- medical embedding of electronic health records for predictive
Fei, “Neural graph matching networks for fewshot 3d action healthcare,” in Advances in Neural Information Processing Systems,
recognition,” in European Conference on Computer Vision. Springer, 2018, pp. 4548–4558.
2018, pp. 673–689. [150] J. Kawahara, C. J. Brown, S. P. Miller, B. G. Booth, V. Chau, R. E.
[136] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3d graph neural Grunau, J. G. Zwicker, and G. Hamarneh, “Brainnetcnn: convo-
networks for rgbd semantic segmentation,” in Proceedings of the lutional neural networks for brain networks; towards predicting
IEEE Conference on Computer Vision and Pattern Recognition, 2017, neurodevelopment,” NeuroImage, vol. 146, pp. 1038–1049, 2017.
pp. 5199–5208. [151] T. H. Nguyen and R. Grishman, “Graph convolutional networks
[137] L. Yi, H. Su, X. Guo, and L. J. Guibas, “Syncspeccnn: Synchro- with argument-aware pooling for event detection,” in Proceedings
nized spectral cnn for 3d shape segmentation.” in Proceedings of of the AAAI Conference on Artificial Intelligence, 2018, pp. 5900–
the IEEE Conference on Computer Vision and Pattern Recognition, 5907.
2017, pp. 6584–6592. [152] Z. Li, Q. Chen, and V. Koltun, “Combinatorial optimization
[138] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta, “Iterative visual reason- with graph convolutional networks and guided tree search,” in
ing beyond convolutions,” in Proceedings of the IEEE Conference on Advances in Neural Information Processing Systems, 2018, pp. 536–
Computer Vision and Pattern Recognition, 2018. 545.
[139] M. Narasimhan, S. Lazebnik, and A. Schwing, “Out of the [153] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
box: Reasoning with graph convolution nets for factual visual for image recognition,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2016, pp. 770–778.

You might also like