Automated Unsupervised Graph Representation Learning

Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO.

3, MARCH 2023 2285

Automated Unsupervised Graph


Representation Learning
Zhenyu Hou , Yukuo Cen , Yuxiao Dong , Jie Zhang, and Jie Tang , Fellow, IEEE

Abstract—Graph data mining has largely benefited from the recent developments of graph representation learning. Most attempts to improve
graph representations have thus far focused on designing new network embedding or graph neural network (GNN) architectures. Inspired by
the SGC and ProNE models, we instead focus on enhancing any existing or learned graph representations by further smoothing them via
graph filters. In this paper, we introduce an automated framework AutoProNE to achieve this. Specifically, AutoProNE automatically searches
for a unique optimal set of graph filters for any input dataset, and its existing representations are then smoothed via the selected filters. To make
AutoProNE more general, we adopt self-supervised loss functions to guide the optimization of the automated search process. Extensive
experiments on eight commonly used datasets demonstrate that the AutoProNE framework can consistently improve the expressive power of
graph representations learned by existing network embedding and GNN methods by up to 44%. AutoProNE is also implemented in CogDL, an
open source graph learning library, to help boost more algorithms.

Index Terms—Representation learning, graph embedding, graph filter

1 INTRODUCTION and node2vec [6] as well as matrix factorization based meth-


ods such as HOPE [7], and NetMF [8].
RAPHS have been commonly used for modeling struc-
G tured and relational data, such as social networks,
knowledge graphs, and biological networks. Over the past
Graph neural networks (GNNs) on the other hand usually
take both node/edge features and graph structures as the
input and iteratively aggregate neighbors’ features to update
few years, mining and learning with graph data have been
each node’s representation. Most early GNN models are end-
shifted from structural feature engineering to graph repre-
to-end (semi) supervised frameworks with the label informa-
sentation learning, which offers promising results in various
tion for optimization, such as the graph convolutional network
applications, including node classification [1], recom-
(GCN) [1] and graph attention network (GAT) [9]. Recently,
mender system [2], and machine cognitive reasoning [3].
self-supervised graph neural networks gain significant atten-
Broadly, research on graph representation learning can be
tion due to their enormous potential in shaping the future of
grouped into two categories: unsupervised network embed-
graph mining and learning. For example, the GraphSage
ding methods and graph neural networks (GNNs). The net-
model [10] with the unsupervised loss can be considered as a
work embedding methods aim to project the structure of a
self-supervised framework. DGI [11] and GCC [12] leverage
network into a latent low-dimensional space such as its struc-
the idea of contrastive learning [13], [14] to pre-train graph
tural properties are encoded in the latent vectors. Usually, the
neural networks via self-supervised signals from the unlabeled
input to them contains only the graph topology without input
input data.
features. Representative network embedding models include
Among the exciting progress in graph representation
skip-gram based methods such as DeepWalk [4], LINE [5],
learning, two recent developments are particularly attrac-
tive. First, Wu et al. [15] discover that the non-linearity
between GCN layers can be simplified, based on which the
SGC model without non-linearity is proposed. By removing
 Zhenyu Hou, Yukuo Cen, and Jie Zhang are with the Department of
Computer Science and Technology, Tsinghua University, Beijing 100084,
the non-linearity, it is easy to decouple the feature transfor-
China. E-mail: {houzy21, cyk20, j-z16}@mails.tsinghua.edu.cn. mation and propagation steps in GCNs’ neighborhood
 Yuxiao Dong is with the Facebook AI, Seattle, WA 98109 USA. aggregation process, enabling us to design these two steps
E-mail: [email protected]. separately. The second one is the ProNE model, in which
 Jie Tang is with the Tsinghua National Laboratory for Information Science
and Technology (TNList), Department of Computer Science and Technology, Zhang et al. [16] show that using a modulated Gaussian fil-
Tsinghua University, Beijing 100084, China. E-mail: [email protected]. ter to propagate/smooth the node embeddings can signifi-
cn. cantly improve the expressive power of the embeddings.
Manuscript received 17 January 2021; revised 23 August 2021; accepted 9 Sep- Inspired by these two works, we propose to study
tember 2021. Date of publication 24 September 2021; date of current version 3 whether we can improve graph representation learning by
February 2023.
The work was supported in part by the National Key R&D Program of China focusing on the propagation step, that is, given the input
under Grant 2018YFB1402600, in part by the NSFC for Distinguished Young embeddings such as learned by DeepWalk or DGI, the goal
Scholar under Grant 61825602, and in part by the Tsinghua University Initiative is to further enhance their expressive power by propagat-
Scientific Research Program under Grant 20181080300. ing/smoothing the embeddings. First, instead of relying on
(Corresponding author: Jie Tang.)
Recommended for acceptance by P. Tsaparas. label information, we are interested in techniques that can
Digital Object Identifier no. 10.1109/TKDE.2021.3115017 benefit graph representation learning in an unsupervised or
1041-4347 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
2286 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 3, MARCH 2023

Fig. 1. The overview of the AutoProNE framework. The input is the node embeddings learned by any other network embedding or GNN methods and
the output is the smoothed embeddings with enhanced expressive power. AutoProNE includes two components: (1) Parameter selection: the AutoML
parameter generator automatically select graph filters and corresponding hyperparameters. (2) Embedding propagation: once the graph filters are
selected, they are used to smooth the input embeddings. In the example, the PPR and Gaussian graph filters as well as their parameters are selected
for the specific input graph. And as a result, each node pays attention to its indirect neighbors.

self-supervised manner. Second, rather than a fixed Gaussian and CogDL1 [17] , a open source graph learning library, to
filter, we would like to answer the question of whether there make it more convenient to collaborate with existing graph
exist better graph filters for propagating existing embeddings. representation algorithms.
Finally, the natural need is to avoid the manual design or Contribution. The main contributions of this work
choice of graph filters for each dataset, that is, can we automati- include:
cally find the best filters for each specific graph?
Solutions. To address these issues, we present the  We investigate the role of graph filters on unsuper-
AutoProNE framework to automatically find the appro- vised graph representation learning and provide
priate graph filters to smooth existing graph representa- insights into various graph filters.
tions in a self-supervised manner. Different from ProNE  We propose a comprehensive framework Auto-
that uses a fixed Gaussian kernel for all graphs, the pro- ProNE that integrates automatic machine learning
posed framework leverages the idea of AutoML to and graph representation learning.
search for the best filter(s) for the given graph, such as  We conduct extensive experiments to evaluate our
low-pass, band-pass, and signal-rescaling filters. More- framework. The results demonstrate the effective-
over, rather than using supervised signals to guide the ness of AutoProNE where our framework achieves
AutoML process, the optimization of AutoProNE is built significant improvements over various methods and
upon the recent advancements in self-supervised graph datasets.
representation learning, that is, AutoProNE adopts self-
supervised loss to direct the AutoML optimization.
2 PRELIMINARIES
Fig. 1 illustrates the framework of AutoProNE.
We conduct extensive experiments on eight publicly 2.1 Concepts
available datasets. By using both network embedding meth- Graph Notations. Let G = ðV; E; XÞ denote an undirected graph
ods without input features (e.g., DeepWalk) and graph neu- with V as the set of its n nodes and E as its edges, and X 2
ral networks with node features (e.g., DGI), we show that Rnd as the feature matrix of V. Given G, we have its (sym-
the AutoProNE framework can consistently and signifi- metric) adjacency matrix as A ¼ ðaij Þ with aij = 1 if and only
cantly enhance the expressive power of graph representa- if there exists an edge between vi and vj , as well P as its
tions learned by these base methods. For example, the degree matrix D ¼ diagðd1 ; d2 ; . . . ; dn Þ with di ¼ j aij . In
simple, automated, and unsupervised AutoProNE can boost practice, the row-normalized adjacency matrix A ^ rw ¼ D1 A
the node classification performance of LINE’s representa- and symmetric normalized adjacency matrix A ^ sym ¼
1=2 1=2
tions by 34–44% on the PPI dataset. Furthermore, we show D AD are more commonly used. In this work, we
that the graph filters automatically decided by the AutoML simplify A^ rw and A^ sym as A.
^
process of AutoProNE are always among the best options Network Embedding. The problem of network embed-
across all different datasets. Finally, we find that Auto- ding aims to learn a mapping function f : V ! Rd that
ProNE requires only 2–4% of the running time of DeepWalk projects each node to a d-dimensional space (d  jVj).
to offer outperformance by up to 8%, and also demonstrate Network embedding methods mainly focus on neighbor-
that its scalability is linear to the number of nodes in the hood similarity and capture the structural properties of
graph, enabling it for handling large-scale data. The code is
available at https://github.com/THINK2TRY/AutoProNE. 1. https://github.com/THUDM/cogdl

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
HOU ET AL.: AUTOMATED UNSUPERVISED GRAPH REPRESENTATION LEARNING 2287

P
the network. Extensive studies have shown that the where D ~ ii ¼ ~
j Aij . Furthermore, multiple layers are
learned node representations can benefit various graph stacked to recover a rich class of convolutional filter func-
mining tasks, such as node classification and link predic- tions and each layer is followed by a linear transformation
tion. DeepWalk, LINE, Node2Vec and NetMF are all net- and a point-wise non-linearity. Then the one-layer graph
work embedding methods. convolution becomes as follows:
Graph Signal Processing. In this part, we introduce a
recent formulation [18] of graph signal processing [19], ~ ðiÞ QðiÞ Þ;
Hðiþ1Þ ¼ sðAH (5)
[20] on irregular graphs. The Laplacian matrix of graph
G is defined as L ¼ D  A. L ^ ¼ In  A
^ is the augmented where QðiÞ is a layer-specific trainable matrix, HðiÞ 2 Rnd is
normalized Laplacian. A vector x 2 Rn defined on the the d-dimensional hidden node representation in the ith
nodes of G is called a graph signal. A useful property of layer.
graph Laplacian L is that its quadratic form measures
the smoothness of the graph signal. It is easy to verify 2.2 ProNE
that In this part, we give a brief introduction to ProNE. ProNE is
X X a fast network embedding method based on matrix factori-

x T Lx Aij ðxi  xj Þ2 ¼ ðxi  xj Þ2 : (1)
zation. First, it formulates network embedding as a sparse
i;j ði;jÞ2E matrix factorization for efficient representation transforma-
tion. Second, it utilizes a Gaussian graph filter for represen-
Let U 2 Rnn ¼ ½u u1 ; . . . ; u n T be the eigenvectors of L
^ and tation smoothing to improve the representation.
we have L ¼ ULU , where L ¼ diag½1 ; . . . ; n  is the eigen-
^ T
Network Embedding as Sparse Matrix Factorization. ProNE
values of L ^ corresponding to U. The graph Fourier transform leverages an edge to represent a node-context pair. Let
F : Rn ! Rn is defined by F x ¼ x ~ ¼ UT x , and the inverse ri ; ci 2 Rd be the node embedding and context vectors of
1
graph Fourier transform F is F 1 x
~ ¼ x ¼ U~x because node vi respectively. The concurrence of context vj given vi is
1
F F ¼ U U ¼ In .
T

Graph filter is defined based on the graph Fourier p^i;j ¼ sðrTi cj Þ; (6)
g
transform. Let g: R ! R be a function mapping x ! y.
In frequency domain, the graph filter specified by g is sðDÞ is the sigmoid function. To maximize the concurrence
defined by the relation: y~ ¼ gðLÞ~ x , that is, y~ði Þ ¼ probability and avoid trivial solution (ri ¼ ci &^ pi;j ¼ 1) for
gði Þ~
x~ði Þ. In the spatial domain, the above equation is each pair, the objective can be expressed as the weighted
equivalent to y ¼ gðLÞx ^ x. sum of log loss over all edges accompanied with negative
For multi-dimensional cases, the graph Fourier transform sampling
~ ¼ UT X and the inverse form is F 1 X X
is F X ¼ X ~ ¼ X ¼ UX. ~
Loss ¼  ½pi;j ln sðrTi cj Þ þ tPE;j ln sðsðrTi cj ÞÞ;
Then the graph filter is represented as ði;jÞ2E

(7)
Y ¼ gðLÞX
^ ¼ UgðLÞUT X: (2)
where pi;j ¼ Ai;j= Di;i indicates the weight
P of pair (vi ; vj ), t is
Generally, for computational efficiency, Taylor or Cheby- the ratio negative sampling, PE;j / ð i:ði;jÞ2E pði; jÞÞa with
shev approximation is applied to avoid the eigendecompo- a ¼ 1 or 0.75. To minimize the objective equals to let the the
^ Without loss of generality, let T 2 Rnn be the
sition of L. partial derivative with respect to rTi cj be zero. And hence
transition matrix and ui be the weight coefficients. The k- we get
order expansion is represented as
rTi cj ¼ ln pi;j  lnðtPE;j Þ: (8)
X
k
^ 
gðLÞ ui Ti 2 Rnn : (3) ProNE proposes to define a proximity matrix M with
i¼0 each entry as rTi cj , which represents the similarity between
embedding of vi and context embedding of vj
Graph Convolution. In spectral graph convolution net-
works, the graph convolution operation is fast approxi- 
ln pi;j  lnðtPE;j Þ; ðvi ; vj Þ 2 E
mated with the layer-wise propagation rule [1]. GCN M¼ :
simplifies Eq. (3) by only keeping the first two items with 0; ðvi ; vj Þ 2
=E
u0 ¼ u1
Now that the relationship matrix is built, the objective is
^ ¼ u0 In þ u1 A
gðLÞ ^ transformed into matrix factorization. ProNE uses a fast
(4)
randomized tSVD to achieve fast sparse matrix factorization
¼ uðIn þ AÞ:
^ ^ 2 RNd as node embedding matrix.
and finally gets X
In is an identity matrix. The eigenvalues of I þ A
^ is in the
Spectral Propagation. To address the representation
range [0, 2]. To alleviate the problem of numerical instability smoothing problem, ProNE proposes to leverage a Gaussian
and exploding/vanishing gradients when used in deep ^ ProNE designs
graph filter as the smoothing function fðAÞ.
neural networks, the following renormalization trick is intro- 1 2
the graph filter as gðÞ ¼ e½2ðmÞ 1u and formulates the
duced transformation as the following rule:

In þ A ~ 1=2 ðI þ AÞD
^ !D ~ 1=2 ¼ D
~ 1=2 A
~D~ 1=2 ; ^ ¼ D1 AðI
fðAÞ ^ n  LÞ;
~ (9)

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
2288 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 3, MARCH 2023

~ is defined as the Lapla-


where In is the identity matrix and L 3.2 The AutoProNE Framework
cian filter as Inspired by ProNE, we propose that graph signal process-
ing can be a general and powerful method to address the
~ ¼ Udiagð½gð1 Þ; . . . ; gðn ÞÞUT :
L (10) representation smoothing problem. Graph filters, such as
the low-pass filter and band-pass filter in the frequency
Eq. (9) can be viewed as two steps. It first modulates the domain, or adjusting the structure importance of nodes in
spectral property with modulator In  L
~ and then re-propa- the spatial domain can be used to design the smoothing
gates information between neighbors. ^
function fðAÞ.
ProNE uses a fixed Gaussian filter to tackle all cases.
3 AUTOMATED UNSUPERVISED GRAPH However, intuitively, each dataset may require unique
REPRESENTATION LEARNING graph filters to help “pass” true features and “filter out”
noises. However, the input dataset is often considered as a
3.1 Problem
black box and thus it is computationally expensive to ana-
One GCN layer consists of three steps: neighborhood feature
lyze its spatial and spectral features. Therefore, it is infeasi-
aggregation, feature transformation,and nonlinear transi-
ble to manually select or design appropriate graph filters
tion. By stacking multiple layers, GCNs enable the propaga-
for each given graph dataset.
tion to reach high-order neighbors. Recently, Wu et al. [15]
In light of these challenges, we propose the automated
suggest that the non-linearity between GCN layers is not crit-
AutoProNE graph representation learning framework to
ical and thus propose the simplified graph convolution net-
automatically select graph filters in an unsupervised man-
work (SGC) by removing non-linearity and having only one
ner for different input graphs. Fig. 1 shows the overall archi-
step of feature propagation and transformation, i.e.,
tecture of AutoProNE, which first utilizes the idea of
AutoML to select suitable graph filters for modulating
^ K XQ1    QK ¼ A
HSGC ¼ A ^ K XQ: (11)
graph signals (parameter selection) and then smooth node
representations based on the selected filters (propagation).
Inspired by SGC, we can further abstract graph convolu- In parameter selection, the parameter controller automati-
tion as cally selects graph filters from the designated graph filter
set together with their hyper-parameters (Cf next section for
H ¼ AXQ
^ ¼ fðAÞhðX;
^ QÞ; (12) details). We employ AutoML to implement this process.
Briefly, AutoML is designed to automatically build machine
^ and hðX; QÞ are two independent steps for fea-
where fðAÞ learning applications [21]. It involves two parts—model
ture representation learning. To make Eq. (12) generalize to generation and model evaluation.
unsupervised graph embedding methods, such as Deep- First, model generation can be divided into search space
Walk and node2vec, we have hðÞ as hðA; X; QÞ. In these and optimization methods, which are usually classified into
methods, the input feature matrix X is empty and the node hyperparameter optimization (HPO) and architecture opti-
representations are learned only based on the graph topol- mization (AO). In AutoProNE, we mainly focus on AO,
ogy A. Therefore we can have which indicates the model-related parameters, e.g., the
selection of graph filters.
H ¼ fðAÞhðA;
^ X; QÞ: (13) Second, for model evaluation, indicators like accuracy on
the validation set are often used as measures. However, our
In other words, graph representation learning, including problem is formalized under the unsupervised setting, that
both GNNs and unsupervised methods, can be simplified is, there exist no supervised indicators to help select the
and abstracted as two independent processes: representa- model parameters. To address this issue, we leverage the
tion transformation hðA; X; QÞ and propagation (or smooth- idea of contrastive learning to maximize the mutual infor-
^
ing) fðAÞ. mation of different embeddings to evaluate the model.
^ ¼ hðA; X; QÞ, the node
In representation transformation X In propagation, the node representations X^ are propagated
^
representations X are either learned solely from graph struc- with selected filters. Under the hypothesis that each graph
tures without input features X (such as by DeepWalk or filter would differently amplify true features and attenuate
NetMF) or transformed with X by learnable parameters noises, we concatenate the results of each filter together to
(such as MLP or GNNs). preserve the information instead of averaging them, if mul-
^ step, the representa-
In the representation smoothing fðAÞ tiple filters are selected. In order to keep the embedding
tions initialized in the previous step are propagated to dimension the same as the input, it is then followed by the
(high-order) neighbors. Note that in traditional unsuper- SVD operation on the concatenated representations.
vised methods, this step is ignored. In SGC, the features are Finally, the loss is collected to help optimize the search.
smoothed via the high order propagation, that is, A^ K. In AutoProNE, we directly utilize Optuna [22] as the
Problem. The focus of this work is on the smoothing step AutoML framework. As the designed search space is small,
of graph representation learning. Specifically, given the the algorithm will converge quickly. Algorithm 1 illustrates
node representations of one graph dataset learned by exist- the process of AutoProNE described above.
ing methods, such as DeepWalk, the goal is to automatically Note that AutoNE [23] also combines network embed-
design a smoothing function fðAÞ^ for this graph to effectively ding (NE) with AutoML and attempts to decide the hyper-
propagate them over its specific structure in an unsupervised parameters of a given NE algorithm using meta-learning.
manner. Differently, AutoProNE aims to obtain the optimal graph

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
HOU ET AL.: AUTOMATED UNSUPERVISED GRAPH REPRESENTATION LEARNING 2289

TABLE 1 particular temperature. When applied to graphs, Heat ker-


Graph Filters nel is defined as fh ðÞ ¼ eu , in which u is a scaling hyper-
parameter. We denote fh ðLÞ as fh ðLÞ ¼ diagðeu1 ; . . . ;
Name Graph Filter Parameters
eun Þ. In fact, heat kernel works as a low-pass filter in the
PPR H ¼ Ap X^ a frequency domain and the heat graph filter is defined based
Heat Kernel H ¼ Lh X
^ u on the Laplacian matrix as
Gaussian Kernel ^
H ¼ Lg X m, u
Signal Rescaling H ¼ As X^ -
Lh ¼ Ufh ðLÞUT ¼ Udiagðeu1 ; . . . ; eun ÞUT : (15)

filters for further improving the embeddings learned by any


Gaussian Kernel [28]. Gaussian filter is a widely used
existing NE algorithms, instead of searching for hyperpara-
band-pass filter in signal processing. Gaussian kernel in
meters to have better NE algorithms. In other words, Auto- 1 2
graph is defined as fðÞ ¼ e½2ðmÞ 1u with m and u as the
ProNE runs on the embedding results and is not related to
scaling hyperparameters. Gaussian kernel works as a band-
the learning process of the NE algorithms.
pass filter with different hyperparameters. We have fg ðLÞ ¼
diagðeui ; . . .; eui Þ if we denote i ¼ 12 ði  mÞ2  1. Then
 

Algorithm 1. The AutoProNE Algorithm the Gaussian graph filter is similar to the heat graph filter
^ Input embeddings X;
Input: Normalized adj acency matrix A; ^
Lg ¼ Ufg ðLÞUT ¼ Udiagðeu1 ; . . . ; eun ÞUT :
 
Search iterations N. (16)
Output: Enhanced embeddings H with the same dimension
^
size as X. Signal Rescaling. In addition to the three existing filters,
1: for i ¼ 1 to N do we also propose a signal rescaling filter. The intuition is that
2: Select filters GFs from fPPR; Heat; Gaussian; SRg
structural information plays a central role in networks. An
3: Let Z be an empty list: Z ¼ ½
intuitive example is that nodes of a high degree may be
4: for each gf 2 GFs do
^ gfÞ more important and influential in a network. To capture
5: Hgf ¼ PropagationðA;
^ X;
6: Append Hgf into Z
this phenomenon, we can attenuate node signals (and per-
7: end for form renormalization) before propagation, that is
8: Concatenate smoothed embeddings in Z to have Zr 1
9: Conduct dimension reduction on Zr to get H As ¼ NormalizeðAsðD
^ ÞÞ; (17)
10: Calculate the corresponding loss and optimize
where sðxÞ ¼ 1þe1x .
parameters.
Chebyshev Expansion for Efficiency. For both the heat kernel
11: end for
and Gaussian kernel, we utilize the Chebyshev expansion
and Bessel function [29] to avoid explicit eigendecomposi-
tion. Chebyshev polynomials of the first kind are defined as
3.3 Automated Graph Filter Selection
Tiþ1 ðxÞ ¼ 2xTi ðxÞ  Ti1 ðxÞ with T0 ðxÞ ¼ 1; T1 ðxÞ ¼ x. Then,
We introduce the design choices of the core components in the expression of a general graph filter can be expanded as
AutoProNE: Search Space Design and Unsupervised Loss
Function. X
k X
k
U
L ~ T ¼
ci ðuÞTi ðLÞU ci ðuÞTi ðLÞ;
~ (18)
3.3.1 Search Space Design i¼0 i¼0

In graph signal processing, different graph filters have been where specifically L ~ ¼ L, L~¼L ^ for heat kernel and L~¼
designed for modulating graph signals. We adopt three 2 ðL
1
 mIÞ2 , L ^  mIÞ2 for Gaussian kernel. The coeffi-
~ ¼ 1 ðL
2
existing simple and efficient graph filters—PPR, Heat kernel, cient ci ðuÞ can be obtained with the Bessel function
and Gaussian kernel—that have been widely used in graph
Z
representation learning [16], [24], [25] and also propose a b Ti ðxÞexu
ci ðuÞ ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffi dx ¼ bðÞi Bi ðuÞ; (19)
new filter based on signal rescaling. The four graph filters p 1  x2
are summarized in Table 1.
PPR [26]. Personalized PageRank(PPR) is defined based where b = 1 if i = 0 otherwise b=2 and Bi ðuÞ is the modified
on random walk with restart and reflects the probability of Bessel function of the first kind [29]. Then the truncated
a random walk starting from a node to another. PPR is expansion of the filter becomes as follows:
defined as pðx xÞ ¼ ð1  aÞApðx
^ xÞ þ ax x, and its closed-form
matrix is Ap ¼ aðIn  ð1  aÞAÞ.
^ To avoid the high computa- X
k1
L  B0 ðuÞT0 ðLÞ
~ þ2 Bi ðuÞTi ðLÞ:
~ (20)
tional cost of matrix inversion, we use Taylor expansion to i¼1
approximate Ap , that is
This truncated Chebyshev expansion provides a good
X
k approximation for eu with a fast convergence rate.
Ap  uðiÞ
p A;
^ where uðiÞ
p ¼ að1  aÞ :
i
(14) Taylor expansion and Chebyshev expansion are common
i¼0
approximation methods. [30] adopts another local spectral
Heat Kernel [27]. The heat kernel is often used in heat con- embedding method which utilizes random vectors and
dition and diffusion, and represents the evolution of tem- Gauss-Seidel iteration to avoid explicit eigendecomposition
perature in a region whose boundary is held fixed at a of the adjacency matrix.

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
2290 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 3, MARCH 2023

3.3.2 Unsupervised Loss Function discrimination as


Different from most AutoML settings where supervised X
N
hTi x
expðh ^ i =tÞ
information is used, we aim to guide the AutoML process Qf ¼  log PK ; (22)
i¼1 j¼0 expðhhTi x^ j =tÞ
for graph filter selection in an unsupervised or self-super-
vised fashion. The goal is to derive supervised signals from where t is a temperature hyperparameter.
the input graph data itself for “supervising” the selection of
graph filters. Algorithm 2. Propagation
An additional principle is that the self-supervised loss Input: Normalized ajacency matrix A; ^
^ Initial embedding X;
function Qf of AutoML in the smoothing phase should be Type of graph filter gf;
different from the one utilized in getting the representations Output: Enhanced embedding H
in the transformation phase. The reason lies in that the rep- 1: Order of Expansion k = 5
resentations to be smoothed have already achieved the best 2: Laplacian matrix L ^ ¼IA ^X ^ ð0Þ ¼ X
^
loss in the transformation step, and a different Qf can help 3: if gf 2 [Heat kernel, Gaussian] then
further improve the node representations. 4: Xð1Þ ¼ LX
In AutoProNE, our strategy for automatic graph filter 5: H ¼ Bð0ÞX ^ ð0Þ  2Bð1ÞX ^ ð1Þ
search is built upon the recent development of self-super- 6: for i ¼ 2 to k do
vised learning techniques, specifically contrastive learning 7: XðiÞ ¼ 2L^X^ ði1Þ  X ^ ði2Þ
based methods [11], [12], [31]. Note that AutoProNE is gen- 8: H ¼ H þ ð1Þ BðiÞX i ^ ðiÞ
erally designed for improving node representations learned 9: end for
by both network embedding methods (e.g., DeepWalk) and 10: else if gf is PPR then
GNNs. To accommodate their unique properties, we 11: H ¼ aXð0Þ
employ the mutual information maximization and instance 12: for i ¼ 1 to k do
13: ^ ðiÞ ¼ ð1  aÞA
X ^X^ ði1Þ
discrimination as the loss function for both sets of methods, ðiÞ
14: H¼HþX ^
respectively.
15: end for
For unsupervised network embedding methods, such
16: else
as skip-gram or matrix factorization based methods, we 1
17: H ¼ NormalizeðAsðD ^ ÞÞX
use the local-global mutual information maximization
18: end if
loss proposed in [11] to guide the search of graph filters 19: return H
for each graph. Let Z ~ be the row-wise shuffling of X. ^ X^
~
and Z are both propagated with selected graph filters,
with the results as H ¼ PropðXÞ ^ and H ~ ¼ PropðZÞ
^ respec-
~
tively. As H is derived from the propagation based on 3.3.3 Automatic Searching
shuffled features, it is “corrupted” and should be less Now that search space and unsupervised loss functions
similar to initial embeddings X. ^ We leverage the mean- have been defined, we employ automatic machine learning
readout function to obtain a graph level representation: to search for the best combination of graph filters and
P
s ¼ N1 h 2H h . In such case, H is considered as “positive” hyperparameters. As the loss functions are often non-con-
samples for s , H~ as “negative” samples for H ~ is derived vex and it is difficult to get the derivative and gradients,
from shuffled features. The goal is to maximize the hyperparameter optimization in machine learning is gener-
mutual information between node features H and global ally considered as a problem of black-box optimization.
features s in the mean time minimizing the mutual infor- Bayesian optimization [32] is a common and powerful
mation between s and negative node features H. ^ Then method to tackle such cases. The basic idea of bayesian opti-
the loss is formalized as follows: mization is to use the bayesian rule to estimate the posterior
distribution of the objective function based on dataset, and
 N
1 X then select the next sampling hyperparameter combination
Qf ¼  E ^ ^ ½log sðh
hTi sÞ according to the distribution. It aims to find the parameters
2N i¼1 ðX;AÞ
that achieve the most improvement by optimizing the objec-
X
N 
tive function, or loss function. In the implementation, we
þ EðZ; ~T
^ ½log sð1  h j s Þ :
~ AÞ (21) employ Optuna with bayesian optimization as the backend
j¼1
AutoML framework for automatic searching. As shown in
For graph convolution based methods, InfoNCE pro- Algorithm 1, AutoProNE explores better filters and parame-
posed in [14] is utilized as the optimization target. From the ters based on previous attempts and losses.
perspective of instance discrimination, the contrastive loss
is a function whose value is low when a query q is similar to 4 ANALYSIS AND DISCUSSIONS
its positive key kþ and dissimilar to all other keys (negative
keys). In most cases, contrastive learning is often used in We provide analyses of the impact of graph filters on the
conjunction with data augmentation, but data augmentation expressiveness of graph representations.
in graph is still a open problem. In this paper, we simply
take input features and the smoothed result as positive 4.1 Insight Into Graph Filters
pairs. For each node hi 2 H, x ^ is the only positive key
^i 2 X 4.1.1 PPR as a Low-Pass Filter
and all the other x^ j 2 X;
^ i 6¼ j are negative keys. Therefore, PPR is defined in the spatial domain. Intuitively, through
we can have the InfoNCE loss with respect to instance random walk, a node can capture the information of

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
HOU ET AL.: AUTOMATED UNSUPERVISED GRAPH REPRESENTATION LEARNING 2291

both direct and distant neighbors. The probability distri-


bution of random walk with restart converges to a sta-
tionary distribution

Ap ¼ aðIn  ð1  aÞAD1 Þ1


^ 1
¼ aðaIn  ð1  aÞLÞ
a
¼ Udiagð ÞUT : (23)
ð1  aÞi þ a
In the frequency domain, PPR kernel can be written as
a
fp ðÞ ¼ ð1aÞþa and is a low-pass filter. This equals to our
intuition that random walk usually assigns higher weights
to low-order neighbors.

4.1.2 Spectral Properties of Graph Filters


Generally, for the convenience of writing and without loss
of generality, we denote a general graph filter matrix with
L, where Ap ; Lh and Lq are special cases of L. Let i be the
ith eigenvalue of L, adjacency matrix A ¼ In  L ¼ In
Udiagð1 ; . . . ; n ÞUT .
GCN in fact works like a low-pass filter from the spectral
domain. The propagation of node features by multiplying
the (augmented) adjacency matrix A ^ corresponds to apply-
ing graph filter gðÞ ¼ 1  . The eigenvalues of A lie on the
interval [0, 2] and it has been proved in [15] that max , the
largest eigenvalue of A and ^max , the largest eigenvalue of
^ satisfies
A,
0 < ^max < max 2: (24)

Thus gðÞ ¼ 1   resembles a band-stop filter. The


graph filter of GCN can be described as gði Þ ¼ ð1  i ÞK ,
where K ¼ 1 in one GCN layer and can be any positive inte-
ger in Simplified GCN. The spectrum of A ^ K actually works
as a low-pass-type filter for K > 1 [33].
PPR and Heat kernel work as a low-pass filter. a and u
adjust the pass rate. Gaussian kernel can be generally con-
sidered as a band-pass filter. As m determines the only
peak, by adjusting m and u, we can implement different
types of filter that passes frequencies within a certain range:
m 0 as low-pass; m ^max as high-pass; 0 < m < ^max as
band-pass [28].
In the spectral domain, the graph filter extracts features
in a certain band of the network. Generally, the true features
are concentrated in a certain frequency band and the noises
conform to uniform or Gaussian distribution. Thus graph
filter helps preserve the intrinsic information and reduce
Fig. 2. Visualization of adjacency matrix A, row normalized adjacency
the noise. In the spatial domain, the graph filters aggregate matrix with self-loops A^ rw , PPR matrix AP , Gaussian filter matrix Lg ,
the information of both direct and distant neighborhoods Heat kernel filter matrix Lh and signal rescaling matrix Ar on dataset
into the node embedding. Therefore, the features are prop- Karate Clubs. The darker the color means the higher the weight of the
erly smoothed and the learned embedding can be more node. The weight is between [0, 1].
expressive.
its neighbors of a high degree, but neighbors with a lower
4.1.3 Spatial Property of SR degree. As a result, attenuating node signals of a high degree
The signal rescaling function is inspired by the fact that helps improve embedding performance in some structure
many graph convolutional networks are actually performing information dominated networks.
signal rescaling. In many networks, there exist nodes of very
high degrees. For example, the average degree of nodes in
BlogCatalog is 32, but the maximum degree is 1,996; DBLP’s 4.1.4 Visualization of Graph Filters
average node degree is only 2.5, but the maximum degree is Fig. 2 exhibits the visualization results of the functions of
also up to 90. The class of most nodes is not determined by different filters on Karate Clubs dataset. Compared to

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
2292 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 3, MARCH 2023

merely row normalization, as we can see from the figures, the graph and benefiting node classification. In practical
PPR, HeatKernel and Gaussian kernel enlarge the receptive applications, different types of graph filters can be selected
field and assign different weights to neighbors and also pay to fit the characteristics of graphs.
attention to nonneighbor nodes.They do show different
forms when capturing the characteristics of higher-order
4.3 Complexity
neighbors. In addition, the three filters assign high weights
The computation of graph filters in Table 1 can be efficiently
to the node itself even without adding self-loops. Signal
executed in a recurrent manner. Let n ¼ jVj; m ¼ jEj. With
rescaling adjusts weights according to the node degrees of
Taylor and Chebyshev expansion, we only apply repeated
neighbors.
Sparse Matrix-Matrix multiplication (SPMM) between a
Fig. 2 shows the impact of the hyperparameters on filters.
sparse n  n matrix and a dense n  d feature matrix with
And it can be observed that PPR, HeatKernel and Gaussian
time complexity OðmdÞ and avoid explicit eigendecomposi-
Kernel are sensitive to hyperparameters, which mainly
tion with complexity Oðn3 Þ. In addition, Intel Math Kernel
determine how a node aggregates the features of its neigh-
Library (MKL) provides efficient SPMM operators which
bors. For PPR, the bigger restart probability a, the more
can handle graph of millions or even billions of nodes [36].
attention a node pays to neighbor nodes. Gaussian and heat
This makes it possible for AutoProNE to handle large-scale
kernel are more complex in the spatial domain.
graph data.
For Taylor expansion, we denote XðiÞ ¼ ui X ^ ðiÞ ; X
^ ðiÞ ¼
4.2 Connection With Graph Partition ^X^ ði1Þ ð0Þ
^ ¼ X. ^ For Chebyshev expansion, we denote
A with X
Graph partition aims to reduce a graph into a set of smaller ^ ðiÞ ; X
Xðiþ1Þ ¼ ci ðuÞX ^ ðiÞ ¼ 2L
^X^ ði1Þ  X ^ ð0Þ ¼ X.
^ ði2Þ with X ^ As L ^
graphs by partitioning its set of nodes into mutually exclu- ^
and A are both sparse matrix, the complexity of Propagation
sive groups. In many cases, graphs have good locality and a
is OðkdmÞ. The dimension reduction (SVD) on small matrix
node often tends to have similar features and be in the same
is Oðnd2 Þ. And the complexity of computing Infomax and
class with its neighbors. The assumption is also the basis of
InfoNCE loss is also Oðnd2 Þ as it only involves the element-
many algorithms like DeepWalk, LINE and GCN. Therefore,
wise multiplication of matrix. All together, the complexity
the locality and connectivity of a graph are highly related to
of AutoProNE is Oðnd2 þ kdmÞ.
the effectiveness of the Propagation in our model. Luckily,
high-order Cheeger’s inequality[34], [35] suggests that
eigenvalues of Laplacian matrix are closely associated with 5 EXPERIMENTS
a graph’s locality and connectivity. We conduct extensive experiments to evaluate the effective-
The effect of graph partition can be measured by graph ness and efficiency of the proposed AutoProNE framework.
conductance(a.k.a. Cheeger constant).
P For a node subset S Specifically, we examine to what extent AutoProNE can
V, the volume of S is volðSÞ ¼ x2S dðxÞ, dðxÞ is the degree improve both shallow network embedding methods (with-
of node x. The edge boundary of S is eðSÞ ¼ fðx; yÞ 2 Ejx 2 out input feature matrix) and graph neural networks (with
S; y 2
= Sg. Then the conductance of S is defined to be input feature matrix).
jeðSÞj
FðSÞ ¼  :
minðvolðSÞ; volðSÞÞ 5.1 Experiment Setup
To measure the conductance of the whole graph G when 5.1.1 Datasets
partitioned into k parts, let S1 ; . . . ; Sk V and Si \ Sj ¼ ; if We use eight datasets that are commonly used for evaluat-
i 6¼ j. The k-way Cheeger constant is defined as ing graph representation learning, 6 of which are relatively
small scale but are widely used in graph representation
rG ðkÞ ¼ minfmaxfFðSi Þ; i ¼ 1; . . . ; kgg; learning literature, including BlogCatalog, PPI, Wiki, Cora,
rG ðkÞ is positively correlated to graph connectivity. if a Citeseer and Pubmed. DBLP and Youtube are relatively
graph has higher conductance(bigger rG ðkÞ), it is better con- large scale networks. Table 2 lists their statistics.
nected. High-order Cheeger equality bridges the gap We use the following five datasets without input node
between graph partition and graph spectrum and shows features.
that the eigenvalues of Laplacian matrix control the bounds  BlogCatalog [37] is a network of social relationships of
of k-way Cheeger constant as follows: online bloggers. The vertex labels represent the inter-
pffiffiffiffiffi ests of the bloggers.
k
rG ðkÞ Oðk2 Þ k : (25)  PPI [38] is a subgraph of the PPI network for Homo
2
Sapiens. The vertex labels are obtained from the hall-
In spectral graph theory, the number of connected compo- mark gene sets and represent biological states.
nents in an undirected graph is equal to the multiplicity of  DBLP [39] is an academic citation network where
the eigenvalue zero in graph Laplacian. authors are treated as nodes and their dominant con-
Eq. (25) indicates that we can increase(decrease) the ferences as labels.
graph conductance rG ðkÞ by increasing(decreasing) the  Wiki [40] is a word co-occurrence network in part of
lower bound k . For example, low-pass filters increases k the Wikipedia dump. Node labels are the Part-of-
for small k and decreases eigenvalues for big k. As a result, Speech tags.
different parts become more isolated when the graph is par-  Youtube [37] is a video-sharing website that allows
titioned into different groups, thus improving the locality of users to upload, view, rate, share, add to their

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
HOU ET AL.: AUTOMATED UNSUPERVISED GRAPH REPRESENTATION LEARNING 2293

TABLE 2  GCN [1] Graph convolution network (GCN) simpli-


Statistics of Datasets fies graph convolutions by restricting the graph fil-
ters to operate in an 1-step neighborhood around
Dataset Nodes Edges Classes Features Degree
each node.
BlogCatalog 10,312 333,983 39 - 32.4  GAT [9] Graph attention networks adopt attention
PPI 3,890 76,584 50 - 19.7
mechanisms to learn the relative weights between
Wiki 4,777 184,812 40 - 38.7
DBLP 51,264 127,968 60 - 2.5 two connected nodes.
Youtube 1,138,499 2,990,443 47 - 2.6 Stacked with the proposed graph filters block, we evalu-
Cora 2,708 5,429 7 1,433 2.0 ate how the algorithm performance can be improved.
Citeseer 3,327 4,732 6 3,703 1.4
Pubmed 19,717 44,338 3 500 2.2
5.1.3 Implementation Details
For a fair comparison, we set the dimension of embedding
d = 128 for all network embedding methods. For all the
favorites, report, comment on videos. The labels rep- other parameters, we follow the authors’ original setup and
resent groups of viewers that enjoy common video have the following settings: For DeepWalk, windows size m
genres. = 10, #walks per node r = 80, walk length t = 40; For Node2-
We consider the following three datasets with input node Vec, window size m = 10, #walks per node r = 80, walk
features. length t = 40, p, q are searched over {0.25, 0.50, 1, 2, 4}. For
LINE, #negative-samples k = 5 and total sampling budget
 Cora [41] is a paper citation network. Each publica- T ¼ r  t  jV j. For HOPE, b is set to be 0.01. For NetMF,
tion is associated with a word vector indicating the window size r = 5, rank = 256, #negative-samples k = 5. For
absence/presence. DeepWalk, LINE and Node2Vec, we use the official code
 Citeseer [41] is also a paper citation network. Each provided by the original authors and for NetMF and HOPE,
node has a human-annotated topic as its label and we use the code implemented in CogDL.
content-based features. To further validate the effectiveness of AutoProNE, we
 Pubmed [42] contains 19717 diabetes-related publica- also implement AutoProNE on unsupervised convolution-
tions. Each paper in Pubmed is represented by a based methods DGI [11] and unsupervised GraphSAGE
term frequency-inverse document frequency vector. [10], and compare with semi-supervised GCN [1] and GAT
[9]. All parameters are set the same as in the authors’ origi-
nal settings. We use the official code provided by the origi-
5.1.2 Baselines nal authors of DGI and unsupervisd GraphSAGE code
We compare with several state-of-the-art methods including: implemented in CogDL.
For AutoProNE, each filter can only be selected once.
 DeepWalk [4] DeepWalk generalizes language model The term number of Taylor and Chebyshev expansion k
to graph learning. For each vertex, truncated random is set to be 5. For PPR, a is searched between [0.1, 0.9].
walks starting from the vertex are used to obtain the For Heat kernel, u in [0.1, 0.9]. For Gaussian Kernel, m in
contextual information. [0, 2], u in [0.2, 1.5].
 LINE [5] LINE preserves the first-order and second- Evaluation. For non-convolution based methods, we follow
order proximity between vertexes. And we use LINE the same experimental settings used in baseline works [4],
with the second order proximity. [5], [6], [16] . We randomly sample different percentages of
 Node2Vec [6] Node2vec designs a second order ran- labeled nodes to train a liblinear classifier and use the remain-
dom walk strategy to sample the neighborhood ing for testing. The training ratio for small datasets(PPI, Wiki,
nodes, which interpolates between breadth-first BlogCatalog) is 0.1/0.5/0.9, and 0.01/0.05/0.09 for relatively
sampling and depth-first sampling. big datasets(DBLP and Youtube). The remaining is for pre-
 HOPE [7] HOPE preserves high-order proximities of dicting. We repeat the training and predicting 10 times and
graphs and capable of capturing the asymmetric report the average Micro-F1 for all methods. Analogous
transitivity using matrix factorization. results also hold for Macro-F1, which are not shown due to
 NetMF [8] NetMF unifies skip-gram based methods space constraints. For unsupervised convolutional methods,
as matrix factorization. we train a multi-layer perception with a fixed train/valid/
 ProNE [16] ProNE is a fast network embedding test data splitting the same as in [41]. We repeat it 50 times
method including sparse matrix factorization and and report the average accuracy for all methods.
spectral propagation. And we use ProNE(SMF) to rep-
resent ProNE only with sparse matrix factorization.
 GraphSAGE [10] GraphSAGE is an inductive GNN 5.2 Results
framework generating node embeddings by sam- 5.2.1 Overall Performance
pling and aggregating features from a node’s local Table 3 lists the node classification performance based on
neighborhood. the embeddings learned by different network embedding
 DGI [11] Deep Graph Infomax (DGI) maximizes algorithms with and without AutoProNE. We test different
mutual information and classifies local-global pairs ratios (0.1, 0.5, 0.9) of labeled data for node classification fol-
and negative-sampled counterparts. lowing existing work [4], [5], [6] to train a liblinear classifier

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
2294 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 3, MARCH 2023

TABLE 3
Micro-F1 Results of Node Classification by Different Algorithms w/ and w/o AutoProNE

Datasets Ratio DeepWalk ProDeepWalk LINE ProLINE node2vec ProNode2vec NetMF ProNetMF HOPE ProHOPE

BlogCatalog 0.1 35.6 36.2 (+1.7%) 30.3 32.3 (+6.6%) 35.7 35.7 (0.0%) 35.6 36.8 (+3.4%) 30.4 33.5 (+10.2%)
0.5 40.5 41.8 (+3.2%) 37.1 38.3 (+3.2%) 40.6 41.5 (+2.2%) 41.1 41.9 (+2.0%) 34.0 37.4 (+10.0%)
0.9 42.3 44.2 (+4.5%) 39.6 40.5 (+2.3%) 41.8 43.9 (+5.0%) 41.6 42.3 (+1.7%) 35.3 38.3 (+8.5%)
PPI 0.1 16.7 17.6 (+5.4%) 12.5 17.0 (+34.9%) 16.1 17.5 (+8.7%) 17.9 18.0 (+0.6%) 16.0 17.3 (+8.1%)
0.5 21.6 24.7 (+14.3%) 16.4 23.7 (+44.5%) 20.6 24.1 (+17.0%) 23.1 24.6 (+6.5%) 21.0 23.3 (+10.9%)
0.9 24.0 27.0 (+12.5%) 19.5 26.2 (+34.4%) 23.1 25.7 (+11.2%) 25.5 26.5 (+3.9%) 23.2 24.9 (+7.3%)
Wiki 0.1 43.3 44.5 (+2.8%) 41.8 45.8 (+9.6%) 44.8 44.2 (-1.3%) 45.7 45.9 (+0.5%) 48.8 48.0 (+2.4%)
0.5 49.2 50.0 (+1.6%) 52.5 53.2 (+1.3%) 51.1 50.9 (-0.3%) 50.1 50.9 (+1.6%) 53.1 52.8 (-0.5%)
0.9 50.0 51.4 (+2.8%) 54.7 55.0 (+0.5%) 52.8 52.5 (-0.5%) 50.7 51.9 (+2.4%) 53.1 54.3 (+0.4%)
DBLP 0.01 51.5 55.8 (+8.3%) 49.7 52.4 (+5.4%) 53.4 57.8 (+8.2%) 51.5 52.9 (+2.7%) - -
0.05 58.1 59.0 (+1.5%) 54.9 56.2 (+2.4%) 58.3 60.0 (+2.9%) 57.1 59.5 (+4.2%) - -
0.09 59.4 59.9 (+0.9%) 56.3 57.0 (+1.2%) 59.5 60.6 (+1.8%) 57.9 60.2 (+4.0%) - -
Youtube 0.01 38.2 39.2 (+2.6%) 33.2 39.8 (+19.8%) 38.2 39.7 (+3.9%) - - - -
0.05 41.6 44.7 (+6.0%) 36.2 43.5 (+20.1%) 40.0 45.3 (+12.2%) - - - -
0.09 42.8 46.2 (+7.9%) 38.3 45.9 (+19.8%) 43.0 47.1 (+9.5%) - - - -

Ratio indicates the percentage of labeled data. Numbers in the brackets indicate performance improvements and all are statistically significant (p-value0.01;
t-test).

and repeat the training and predicting for ten times and Gaussian kernel for Cora, Citeseer, and Pubmed. We leave
report the average Micro-F1 for all methods. the reason behind this difference for future work.
We observe that the performance of all the base algo-
rithms can be significantly improved by the AutoProNE
framework and also the improvements are statistically sig- 5.2.2 The Role of AutoML
nificant. For DeepWalk, LINE, node2vec, NetMF, and Tables 5 and 6 report the performance of base methods with
HOPE, the improvements brought by AutoProNE are up to each of the four graph filters in AutoProNE’s search space.
14.3%, 44.5%, 17%, 6.5%, and 10.9%, respectively. On aver- We can observe that different filters have unique impacts on
age, LINE benefits more from AutoProNE as it is in nature the performance across datasets. For example, Signal
an embedding method without incorporating high-order
structural information. TABLE 5
Table 4 reports the results of unsupervised GraphSAGE Results of Base Embedding Methods With Different
Graph Filters
and DGI as well as the AutoProNE version of them. As a
reference point, we also list the performance of two popular Dataset Type DeepWalk LINE node2vec NetMF HOPE
semi-supervised GNNs—GCN and GAT. We observe that
the unsupervised AutoProNE framework can help improve BlogCatalog Heat 41.4 38.3 41.0 41.6 34.6
the performance of both DGI and GraphSAGE in most PPR 41.7 38.8 41.3 41.8 35.6
Gaussian 41.4 38.4 41.1 41.3 37.5
cases. This suggests that by automated usage of graph fil-
SR 41.9 38.7 40.9 41.6 34.5
ters, the simple AutoProNE strategy is capable of enhancing AutoProNE 41.8 38.3 41.5 41.9 37.4
graph representation learning with or without input node
PPI Heat 23.4 21.1 22.8 24.0 21.7
features. With the help of AutoProNE, DGI can even yield PPR 23.9 22.7 23.6 24.3 22.3
better performance than the end-to-end semi-supervised Gaussian 23.2 21.3 22.8 23.7 21.1
GCN and GAT models on Pubmed. SR 24.5 23.0 24.8 25.0 23.0
Finally, we observe that the AutoProNE prefers the AutoProNE 24.7 23.7 24.1 24.6 23.3
newly proposed signal rescaling (SR) filter for BlogCatalog, Wiki Heat 48.2 52.8 50.5 47.4 53.1
PPI, Wiki, DBLP and Youtube, while it tends to favor PPR 48.4 52.6 49.9 48.1 53.2
Gaussian 48.2 52.6 50.3 47.2 53.0
SR 49.0 54.1 52.2 50.7 53.0
AutoProNE 50.0 53.2 50.9 50.9 52.8
TABLE 4
Accuracy Results of Node Classification by Unsupervised GNNs DBLP Heat 58.8 54.7 59.4 58.8 -
PPR 59.0 55.4 59.7 59.0 -
Dataset Cora Pubmed Citeseer Gaussian 58.9 54.7 59.7 58.5 -
SR 58.7 54.1 59.7 58.9 -
Semi-supervised GCN 81.5 79.0 70.3 AutoProNE 59.0 56.2 60.0 59.5 -
GAT 83.0 0.7 79.0 0.3 72.5 0.7 Youtube Heat 44.5 42.7 44.7 - -
Unsupervised DGI 82.0 0.1 77.1 0.1 71.7 0.2 PPR 44.6 43.5 45.1 - -
ProDGI 82.9 0.2 81.0 0.1 70.8 0.2 Gaussian 44.3 40.5 44.3 - -
GraphSAGE 77.2 0.1 78.0 0.1 61.2 0.1 SR 44.5 43.0 45.1 - -
ProSAGE 78.1 0.2 79.5 0.1 62.1 0.2 AutoProNE 44.7 43.5 45.3 - -

Significant test (p-value0.01; t-test). Ratio=0.5 for BlogCatalog, PPI, and Wiki; 0.05 for DBLP and Youtube.

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
HOU ET AL.: AUTOMATED UNSUPERVISED GRAPH REPRESENTATION LEARNING 2295

TABLE 6 TABLE 9
Results of Base GNNs With Different Graph Filters Efficiency of AutoProNE (Seconds)

Dataset Type Cora Citeseer Pubmed Dataset DGI GraphSAGE AutoProNE


DGI Heat 82.0 71.8 77.5 Cora 341 118 +15 (4.1%)
PPR 64.2 70,1 77.4 Citeseer 490 131 +21 (4.2%)
Gaussian 82.9 71.4 80.7 Pubmed 4,863 1,254 +183 (3.7%)
SR 13.1 70.1 77.2
AutoProNE 82.9 70.8 81.0
Gaussian kernel perform better if high-order neighbor infor-
GraphSAGE Heat 76.5 61.7 77.4
PPR 62.1 62.0 77.6 mation matters such as Cora, because SR only aggregates
Gaussian 77.8 62.3 76.6 1st-order neighbors. Besides, low-order methods (like LINE)
SR 16.5 62.7 77.1 may benefit more because the original embeddings fail to
AutoProNE 78.1 62.2 79.5 incorporate abundant neighborhood information.

Rescaling(SR) and PPR perform relatively better than other 5.2.3 Efficiency and Scalability
filters in dataset PPI, BlogCatalog and Wiki, but for Cora, We follow the common practice for efficiency evaluation by
Citeseer and Pubmed, Gaussian filter exhibits better perfor- the wall-clock time and AutoProNE ’s scalability is ana-
mance. And the tables show that the performance of Auto- lyzed by the time cost in multiple-scale networks [5].
ProNE is equal to the best result of one single filter, which Tables 8 and 9 report the extra running time (seconds)
may imply that AutoProNE finally picks this filter, or our when stacked with AutoProNE for 100 searching iterations
model yields better performance than any single filter, which in 10 threads. Note that AutoProNE is a dataset specific and
means AutoProNE learns a better combination of graph fil- base agnostic framework. The percentage of extra running
ters. These all suggest the need for AutoML to select the best time in terms of DeepWalk and DGI is also reported inside
filter(s) for each dataset. The proposed AutoProNE strategy the brackets, respectively.
consistently and automatically picks the optimal filter(s) and We can observe that AutoProNE is very efficient com-
parameters in an unsupervised manner. pared to base methods. Overall, AutoProNE costs only 2.3–
The graph filter used for the embedding smoothing in 4.7% of DeepWalk’s or DGI’s running time. Take the You-
ProNE [16] is a modified 2nd-order Gaussian kernel. For sim- tube graph as an example, it takes DeepWalk 68,272 seconds
plicity, the 1st-order Gaussian kernel is covered in ( 19 hours) for generating the embeddings of its 1,000,000+
AutoProNE’s search space. Table 7 reports the performance for nodes. However, AutoProNE only needs 4.7% of its time to
ProNE(SMF) enhanced by ProNE graph filter and AutoProNE offer 2.6–7.9% performance improvements (Cf. Table 3).
respectively. We can see that the automated search strategy For convolutional methods, GNNs are usually less efficient
empowers AutoProNE to generate better performance than for large scale datasets. This suggests that AutoProNE will
ProNE in most cases, further demonstrating the effectiveness take less of the additional percentage of time to achieve
of using AutoML for smoothing graph representations. improvement.
From the experiment results of AutoML searching, we We use synthetic networks to demonstrate the scalability
also have some interesting findings. PPR, Heat kernel and of AutoProNE. We generate random regular graphs with a

TABLE 7
Results of ProNE and AutoProNE

Method BlogCatalog PPI Wiki DBLP Youtube


Ratio 0.1 0.5 0.9 0.1 0.5 0.9 0.1 0.5 0.9 0.01 0.05 0.09 0.01 0.05 0.09
ProNE(SMF) 33.0 37.8 39.1 15.1 22.0 24.5 47.8 54.5 55.6 46.7 54.0 55.6 36.4 41.4 42.5
With ProNE 36.4 40.9 42.2 17.5 24.0 26.5 48.0 54.6 56.0 46.7 55.2 56.3 37.6 42.9 43.9
With AutoProNE 35.8 41.0 42.2 17.7 24.3 26.5 48.6 55.5 56.8 50.0 56.1 57.5 38.7 43.7 44.6

With ProNE and With AutoProNE mean the result of ProNE(SMF) improved by the graph filter of ProNE and AutoProNE respectively.

TABLE 8
Efficiency of AutoProNE (Seconds)

Dataset DeepWalk LINE node2vec SMF NetMF HOPE AutoProNE


PPI 272 70 716 2 10 12 +11 (4.0%)
Wiki 494 87 819 4 23 17 +12 (2.4%)
BlogCatalog 1,231 185 2,809 12 144 136 +29 (2.3%)
DBLP 3,825 1,204 4,749 15 186 - +100 (2.6%)
Youtube 68,272 5,890 30,218 302 - - +3,213 (4.7%)

SMF stands for ProNE(SMF).

Authorized licensed use limited to: Lanzhou University. Downloaded on April 17,2023 at 14:06:07 UTC from IEEE Xplore. Restrictions apply.
2296 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 3, MARCH 2023

Fig. 3. AutoProNE's scalability w.r.t. network volume: running time as the number of nodes grows with the node degree fixed to 10. As the network size increases, the time cost of both the graph filters and the loss computation grows linearly.

We generate random regular graphs with a fixed node degree of 10 and the number of nodes ranging from 1,000 to 10,000,000. In addition, we also report the running time of the heat kernel and of the corresponding loss computation on each dataset. The results are shown in Fig. 3.

AutoProNE is well suited to parallel implementation. The computation of our model is mainly spent on iteratively selecting different filters and hyperparameters and evaluating their effectiveness. Therefore, running multiple processes speeds up the AutoML search and achieves great efficiency.

Table 8 shows that ProNE(SMF) is a fast network embedding method. With ProNE(SMF) as the base embedding method, AutoProNE works as a general network embedding method with high efficiency and also achieves comparable performance, as shown in Table 7.

6 RELATED WORK

6.1 Network Embedding
Network embedding has been extensively studied by the machine learning community in the past few years; it aims to train low-dimensional vectors that are useful for a wide range of downstream tasks.

The recent emergence of network embedding research began when the skip-gram model [43], [44], originally used in word representation learning, was applied to network mining to derive the embeddings of nodes in networks. DeepWalk [4] and node2vec [6] employ random walks to explore the network structure, and LINE [5] takes a similar idea with an explicit objective function by setting the walk length to one. These random walk based methods can be understood as implicit matrix factorization [8].

Explicit matrix factorization based network embedding methods have also been proposed. GraRep [45] directly factorizes different k-order proximity matrices, and HOPE [7] uses generalized SVD to preserve the asymmetric transitivity in directed networks. [46] further proposes a framework to unify the aforementioned methods. ProNE [16] formalizes network embedding as sparse matrix factorization that preserves the distributional similarity, and enhances the factorization result with spectral propagation, which modulates the adjacency matrix mainly to incorporate global properties in the spatial domain. AutoProNE generalizes this operation so that it can be stacked with existing unsupervised representation learning algorithms in an automated way and improves their performance.

6.2 Graph Convolution Networks
Recently, semi-supervised graph learning with graph neural networks, such as graph convolution networks (GCNs) [1], [47], [48], has drawn considerable attention, and various variants [49], [50] have been designed to boost the performance. In GCNs, the convolution operation is defined in the spectral space and parametric filters are learned via back-propagation. [24], [51], [52], [53] replace the adjacency matrix in GCNs with graph filters (PPR and Heat kernel), which they call the graph diffusion matrix, to combine the strengths of both spatial and spectral methods, and the performance improves. They apply graph filters to semi-supervised learning, in which the graph filters are entangled with the model's training process. AutoProNE proposes a more flexible combination of graph filters in both the spectral and spatial domains, which can filter frequencies in any specific band, better extracts the intrinsic information, and is model-agnostic.

Besides, self-supervised learning [54] has emerged recently, and contrastive learning shows great potential. Contrastive methods may employ a scoring function to evaluate the similarity of pairs of data. [14], [55] employ InfoNCE to maximize the gap between positive and negative pairs; one possible intuition behind this is that InfoNCE shortens the distance between positive pairs and makes the distance between negative pairs as large as possible. [13] is based on classifying local-global pairs and negative-sampled pairs, and its ultimate goal is to maximize the local-global mutual information. [11], [56], [57], [58], [59], [60] apply contrastive methods to graph representation learning and achieve promising results. [56] contrasts the diffusion results of two graph filters, which also indicates that different filters capture different views of the graph. [12] employs contrastive coding to pre-train graph neural networks. AutoProNE also maximizes InfoNCE and local-global mutual information in its AutoML loss for optimization.
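For concreteness, the following is a minimal sketch of an InfoNCE-style loss over two embedding views of the same nodes (for example, embeddings smoothed by two different graph filters). The temperature value and the way positive pairs are formed are illustrative assumptions, not the exact objectives of the methods cited above.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """(z1[i], z2[i]) is the positive pair for node i; all other rows of z2
    act as negatives. z1, z2: [num_nodes, dim]."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature              # scaled cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)          # -log softmax of the diagonal (positive) entries

# Illustrative usage: contrast two filtered views of the same node embeddings.
z_ppr, z_heat = torch.randn(128, 64), torch.randn(128, 64)
loss = info_nce(z_ppr, z_heat)
```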


6.3 Automated Machine Learning
With the rise of AutoML [61], [62], several works have also applied it to graph representation learning. GraphNAS [63] and AutoGNN [64] aim to generate neural architectures of GNNs in the spatial domain with a recurrent neural network generator, using reinforcement learning for neural architecture search [65]. Both of them suffer from the high computation cost of finding the best model architecture for a given dataset. In contrast, AutoProNE is a plug-and-play framework that can be applied to any graph node embeddings efficiently. Besides, Bayesian optimization [32], with the Gaussian process [66] as the underlying surrogate model, is a popular technique for finding the globally optimal solution of an optimization problem, and it is widely used especially in hyperparameter optimization. As AutoProNE mainly aims to search for the best hyperparameters of the graph filters, Bayesian optimization is the principle behind the search. AutoProNE is an unsupervised and task-independent model that aims to pre-train general embeddings.


7 CONCLUSION
In this paper, we investigate the role of graph filters and propose an automated and unsupervised framework, AutoProNE, to generate improved graph embeddings for unsupervised graph representation learning. AutoProNE comprises four graph filters (PPR, Heat kernel, Gaussian kernel, and Signal Rescaling) and automatically searches for a combination of graph filters and corresponding hyperparameters for the given dataset. Specifically, AutoProNE operates on the adjacency matrix to enrich the context information of each node. It is a flexible and adaptive framework, as the graph filter set can simulate many kinds of graph filtering functions. It is also very efficient and costs only a little extra time to obtain the improvement in performance.

The developed method is model-agnostic and can be easily stacked with any unsupervised graph representation learning method, such as DeepWalk, LINE, node2vec, NetMF, and HOPE. On eight publicly available datasets, AutoProNE helps significantly improve the performance of various algorithms. In addition, AutoProNE can also enhance the performance of self-supervised/unsupervised GNN methods, e.g., DGI and GraphSage. We show that the self-supervised DGI model with the unsupervised AutoProNE can generate comparable or even better performance than semi-supervised end-to-end GNN methods such as GCN and GAT. We have implemented AutoProNE in CogDL, an open-source graph learning library, to help improve more graph representation learning methods.

REFERENCES
[1] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. 5th Int. Conf. Learn. Representations, 2017.
[2] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, "Graph convolutional neural networks for web-scale recommender systems," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2018, pp. 974–983.
[3] M. Ding, C. Zhou, Q. Chen, H. Yang, and J. Tang, "Cognitive graph for multi-hop reading comprehension at scale," in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 2694–2703.
[4] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2014, pp. 701–710.
[5] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, "LINE: Large-scale information network embedding," in Proc. 24th Int. Conf. World Wide Web, 2015, pp. 1067–1077.
[6] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2016, pp. 855–864.
[7] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, "Asymmetric transitivity preserving graph embedding," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2016, pp. 1105–1114.
[8] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, "Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec," in Proc. 11th ACM Int. Conf. Web Search Data Mining, 2018, pp. 459–467.
[9] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in Proc. 6th Int. Conf. Learn. Representations, 2018.
[10] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 1025–1035.
[11] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, "Deep graph infomax," in Proc. 7th Int. Conf. Learn. Representations, 2019.
[12] J. Qiu et al., "GCC: Graph contrastive coding for graph neural network pre-training," in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2020, pp. 1150–1160.
[13] R. D. Hjelm et al., "Learning deep representations by mutual information estimation and maximization," in Proc. 7th Int. Conf. Learn. Representations, 2019.
[14] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9726–9735.
[15] F. Wu, A. H. S. Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger, "Simplifying graph convolutional networks," in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 6861–6871.
[16] J. Zhang, Y. Dong, Y. Wang, J. Tang, and M. Ding, "ProNE: Fast and scalable network representation learning," in Proc. 28th Int. Joint Conf. Artif. Intell., 2019, pp. 4278–4284.
[17] Y. Cen et al., "CogDL: An extensive toolkit for deep learning on graphs," 2021, arXiv:2103.00959.
[18] B. Girault, A. Ortega, and S. S. Narayanan, "Irregularity-aware graph Fourier transforms," IEEE Trans. Signal Process., vol. 66, no. 21, pp. 5746–5761, Nov. 2018.
[19] A. Sandryhaila and J. M. Moura, "Discrete signal processing on graphs," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, Apr. 2013.
[20] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, May 2013.
[21] X. He, K. Zhao, and X. Chu, "AutoML: A survey of the state-of-the-art," 2019, arXiv:1908.00709.
[22] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, "Optuna: A next-generation hyperparameter optimization framework," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2019, pp. 2623–2631.
[23] K. Tu, J. Ma, P. Cui, J. Pei, and W. Zhu, "AutoNE: Hyperparameter optimization for massive network embedding," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2019, pp. 216–225.
[24] J. Klicpera, S. Weißenberger, and S. Günnemann, "Diffusion improves graph learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 13333–13345.
[25] B. Xu, H. Shen, Q. Cao, K. Cen, and X. Cheng, "Graph convolutional networks using heat kernel for semi-supervised learning," in Proc. 28th Int. Joint Conf. Artif. Intell., 2019, pp. 1928–1934.
[26] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Stanford InfoLab, 1999.
[27] R. I. Kondor and J. Lafferty, "Diffusion kernels on graphs and other discrete structures," in Proc. 19th Int. Conf. Mach. Learn., 2002, pp. 315–322.
[28] D. I. Shuman, B. Ricaud, and P. Vandergheynst, "Vertex-frequency analysis on graphs," Appl. Comput. Harmon. Anal., vol. 40, no. 2, pp. 260–291, 2016.
[29] L. C. Andrews, Special Functions of Mathematics for Engineers, vol. 49. Bellingham, WA, USA: SPIE, 1998.
[30] C. Deng, Z. Zhao, Y. Wang, Z. Zhang, and Z. Feng, "GraphZoom: A multi-level spectral approach for accurate and scalable graph embedding," in Proc. 8th Int. Conf. Learn. Representations, 2020.
[31] F.-Y. Sun, J. Hoffmann, V. Verma, and J. Tang, "InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization," in Proc. Int. Conf. Learn. Representations, 2020.
[32] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Proc. 25th Int. Conf. Neural Inf. Process. Syst., 2012, pp. 2951–2959.
[33] H. NT and T. Maehara, "Revisiting graph neural networks: All we have is low-pass filters," 2019, arXiv:1905.09550.
[34] J. R. Lee, S. O. Gharan, and L. Trevisan, "Multiway spectral partitioning and higher-order Cheeger inequalities," J. ACM, vol. 61, no. 6, 2014, Art. no. 37.
[35] A. S. Bandeira, A. Singer, and D. A. Spielman, "A Cheeger inequality for the graph connection Laplacian," SIAM J. Matrix Anal. Appl., vol. 34, no. 4, pp. 1611–1630, 2013.
[36] J. Qiu, L. Dhulipala, J. Tang, R. Peng, and C. Wang, "LightNE: A lightweight graph processing system for network embedding," in Proc. Int. Conf. Manage. Data, 2021, pp. 2281–2289.
[37] R. Zafarani and H. Liu, "Social computing data repository at ASU," 2009.
[38] B.-J. Breitkreutz et al., "The BioGRID interaction database: 2008 update," Nucleic Acids Res., vol. 36, no. suppl_1, pp. D637–D640, 2007.
[39] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "ArnetMiner: Extraction and mining of academic social networks," in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2008, pp. 990–998.
[40] M. Mahoney, "Large text compression benchmark," 2009. [Online]. Available: http://www.mattmahoney.net/text/text.html
[41] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, "Automating the construction of internet portals with machine learning," Inf. Retrieval, vol. 3, no. 2, pp. 127–163, 2000.
[42] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, "Collective classification in network data," AI Mag., vol. 29, no. 3, pp. 93–93, 2008.
[43] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, arXiv:1301.3781.
[44] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. 26th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[45] S. Cao, W. Lu, and Q. Xu, "GraRep: Learning graph representations with global structural information," in Proc. 24th ACM Int. Conf. Inf. Knowl. Manage., 2015, pp. 891–900.
[46] T. Chen and Y. Sun, "Task-guided and path-augmented heterogeneous network embedding for author identification," in Proc. 10th ACM Int. Conf. Web Search Data Mining, 2017, pp. 295–304.
[47] M. Henaff, J. Bruna, and Y. LeCun, "Deep convolutional networks on graph-structured data," 2015, arXiv:1506.05163.
[48] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 3844–3852.
[49] H. Chen, H. Yin, X. Sun, T. Chen, B. Gabrys, and K. Musial, "Multi-level graph convolutional networks for cross-platform anchor link prediction," in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2020, pp. 1503–1511.
[50] H. Chen, H. Yin, T. Chen, Q. V. H. Nguyen, W.-C. Peng, and X. Li, "Exploiting centrality information with graph convolutions for network representation learning," in Proc. IEEE 35th Int. Conf. Data Eng., 2019, pp. 590–601.
[51] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-I. Kawarabayashi, and S. Jegelka, "Representation learning on graphs with jumping knowledge networks," in Proc. 35th Int. Conf. Mach. Learn., 2018, pp. 5449–5458.
[52] J. Klicpera, A. Bojchevski, and S. Günnemann, "Predict then propagate: Graph neural networks meet personalized PageRank," in Proc. 7th Int. Conf. Learn. Representations, 2019.
[53] B. Jiang, D. Lin, J. Tang, and B. Luo, "Data representation and learning with graph diffusion-embedding networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 10406–10415.
[54] X. Liu et al., "Self-supervised learning: Generative or contrastive," IEEE Trans. Knowl. Data Eng., early access, Jun. 22, 2021, doi: 10.1109/TKDE.2021.3090866.
[55] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," 2020, arXiv:2002.05709.
[56] A. K. Sankararaman, S. De, Z. Xu, R. W. Huang, and T. Goldstein, "Contrastive multi-view representation learning on graphs," in Proc. Int. Conf. Mach. Learn., 2020.
[57] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, "Deep graph contrastive representation learning," 2020, arXiv:2006.04131.
[58] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, "Graph contrastive learning with augmentations," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 5812–5823.
[59] Y. You, T. Chen, Y. Shen, and Z. Wang, "Graph contrastive learning automated," in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 12121–12132.
[60] Y. Jiao, Y. Xiong, J. Zhang, Y. Zhang, T. Zhang, and Y. Zhu, "Sub-graph contrast for scalable self-supervised graph representation learning," in Proc. IEEE Int. Conf. Data Mining, 2020, pp. 222–231.
[61] J. Bergstra, D. Yamins, and D. Cox, "Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures," in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 115–123.
[62] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, and F. Hutter, "Towards automatically-tuned neural networks," in Proc. Workshop Autom. Mach. Learn., 2016, pp. 58–65.
[63] Y. Gao, H. Yang, P. Zhang, C. Zhou, and Y. Hu, "GraphNAS: Graph neural architecture search with reinforcement learning," 2019, arXiv:1904.09981.
[64] K. Zhou, Q. Song, X. Huang, and X. Hu, "Auto-GNN: Neural architecture search of graph neural networks," 2019, arXiv:1909.03184.
[65] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," 2016, arXiv:1611.01578.
[66] J. Mockus, "On Bayesian methods for seeking the extremum," in Proc. IFIP Tech. Conf. Optim. Techn., 1975, pp. 400–404.

Zhenyu Hou received the undergraduate degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His main research interests include graph neural networks and representation learning.

Yukuo Cen received the bachelor's degree in computer science and technology from Tsinghua University, Beijing, China. He is currently working toward the PhD degree in the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His research interests include social influence and graph embedding.

Yuxiao Dong received the PhD degree in computer science from the University of Notre Dame, Notre Dame, Indiana, in 2017. He is a research scientist with Facebook AI, Seattle, and was previously a senior researcher with Microsoft Research, Redmond. His research focuses on data mining, representation learning, and networks, with an emphasis on developing machine learning models to address problems in large-scale graph systems.

Jie Zhang received the bachelor's degree in mathematics and physics from the Department of Physics, Tsinghua University, Beijing, China, in 2016, and the master's degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 2019. His research interests include graph neural networks and data mining on graphs.

Jie Tang (Fellow, IEEE) received the PhD degree from Tsinghua University, Beijing, China. He is a professor with the Department of Computer Science and Technology, Tsinghua University. His main research interests include data mining, social networks, and machine learning. He has published more than 200 research papers in top international journals and conferences.
