04 GNNBasic
1
Network Science: An Overview
[Figure: levels of analysis, from the whole network (e.g., patterns, laws, connectivity) down to individual nodes/links; we are (mostly) here]
• Level 1: diameter, connectivity, graph-level classification, graph-level embedding, graph kernel, graph structure learning, graph generator,…
• Level 2: frequent subgraphs, clustering, community detection, motif, teams, dense subgraphs, subgraph matching, NetFair, …
• Level 3: node proximity, node classification, link prediction, anomaly detection, node embedding, network alignment, NetFair, …
• Beyond: network of X, …
2
Graph Neural Networks
❑ Overview
❑ Summary
This Lecture: deep graph learning = graph neural networks = graph representation learning
3
Deep Graph Learning (Graph Neural Networks)
• Node-level representations
• Graph-level representations
This lecture: deep graph learning = graph neural networks = graph representation learning
4
GNNs Example #1: Context + Learning
• Step 1: Generate a Context Matrix
• Step 2: Learn Embedding by Preserving the Context
• e.g., minimizing an objective w.r.t. the context
• Examples: LINE, DeepWalk, node2vec, and many more
Random walk generation (generate node contexts through random search); predict the nearby nodes in the random walks
• Bryan Perozzi, Rami Al-Rfou, Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD’14
• Jian Tang, Meng Qu, Mingzhe Wang, Jun Yan, Ming Zhang and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. WWW’15
5
GNNs Example #2: Propagate + Transform
• Step 1: Propagate/Aggregate the current embedding
• Step 2: Transform the aggregated embedding
• An Example: GCN
$\mathbf{Z} = \sigma\big(\tilde{\mathbf{D}}^{-1/2}\, \tilde{\mathbf{A}}\, \tilde{\mathbf{D}}^{-1/2}\, \mathbf{X} \boldsymbol{\Theta}\big)$
• Related GNNs
• Spatial GNN
• GraphSage
• GAT
6
Graph Neural Networks
❑ Overview
❑ Summary
This Lecture: deep graph learning = graph neural networks = graph representation learning
7
Problem Definition: Node Embedding
• Given a network/graph G=(V, E, W), where V is the set of nodes, E is
the set of edges between the nodes, and W is the set of weights of
the edges,
• Represent each node i with a vector that preserves the structure of the
network.
9
LINE: Large-scale Information Network Embedding
• Arbitrary types of networks
• Directed, undirected, and/or weighted
• Clear objective function
• Preserve the first-order and second-order proximity
• Scalable
• Asynchronous stochastic gradient descent
• Millions of nodes and billions of edges: a couple of hours on a single machine
• Jian Tang, Meng Qu, Mingzhe Wang, Jun Yan, Ming Zhang and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. WWW’15
10
First-order Proximity
11
Second-order Proximity
“The degree of overlap of two people’s friendship networks correlates
with the strength of ties between them” --Mark Granovetter
“You shall know a word by the company it keeps” --John Rupert Firth
12
Preserving the First-order Proximity
(LINE 1st)
• Distributions (defined on the undirected edge i–j):
Empirical distribution of first-order proximity: $\hat{p}_1(v_i, v_j) = \dfrac{w_{ij}}{\sum_{(m,n)\in E} w_{mn}}$
Model distribution of first-order proximity ($\mathbf{u}_i$: embedding of node i): $p_1(v_i, v_j) = \dfrac{1}{1 + \exp(-\mathbf{u}_i^\top \mathbf{u}_j)}$
• Objective: $O_1 = \mathrm{KL}(\hat{p}_1, p_1) \propto -\sum_{(i,j)\in E} w_{ij} \log p_1(v_i, v_j)$
13
Preserving the Second-order Proximity
(LINE 2nd)
• Distributions (defined on the directed edge i → j):
Empirical distribution of neighborhood structure: $\hat{p}_2(v_j \mid v_i) = \dfrac{w_{ij}}{\sum_{k \in V} w_{ik}}$
Model distribution of neighborhood structure: $p_2(v_j \mid v_i) = \dfrac{\exp(\mathbf{u}_j'^{\top} \mathbf{u}_i)}{\sum_{k \in V} \exp(\mathbf{u}_k'^{\top} \mathbf{u}_i)}$
• $\mathbf{u}_i$: embedding of node i as the central node
• $\mathbf{u}'_i$: embedding of node i when treated as context
• Objective: $O_2 = \sum_i \mathrm{KL}\big(\hat{p}_2(\cdot \mid v_i),\, p_2(\cdot \mid v_i)\big) = -\sum_{(i,j)\in E} w_{ij} \log p_2(v_j \mid v_i)$
14
Optimization Tricks
• Stochastic gradient descent + Negative Sampling
• Randomly sample an edge and multiple negative edges
15
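To make the optimization trick concrete, here is a minimal NumPy sketch (not the authors' code) of one SGD update for the second-order LINE objective with negative sampling; the embedding arrays, learning rate, and sampled inputs are illustrative assumptions.

```python
# Hedged sketch: one SGD step of LINE (2nd-order) with negative sampling.
# `emb` holds central-node vectors u_i, `ctx` holds context vectors u'_i.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def line2_sgd_step(emb, ctx, i, j, neg_samples, lr=0.025):
    """Update for one observed edge (i -> j) plus K sampled negative nodes."""
    grad_i = np.zeros_like(emb[i])
    # positive sample: increase sigma(u_i . u'_j)
    g = 1.0 - sigmoid(emb[i] @ ctx[j])
    grad_i += g * ctx[j]
    ctx[j] += lr * g * emb[i]
    # negative samples: decrease sigma(u_i . u'_k)
    for k in neg_samples:
        g = -sigmoid(emb[i] @ ctx[k])
        grad_i += g * ctx[k]
        ctx[k] += lr * g * emb[i]
    emb[i] += lr * grad_i

# toy usage: 6 nodes, 4-d embeddings, one edge (0 -> 2), negatives {3, 5}
rng = np.random.default_rng(0)
emb, ctx = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
line2_sgd_step(emb, ctx, i=0, j=2, neg_samples=[3, 5])
```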
Discussions
• Embed nodes with few neighbors
• Expand the neighbors by adding higher-order neighbors
• Breadth-first search (BFS)
• Adding only second-order neighbors works well in most cases
• Embed new nodes
• Fix the embeddings of existing nodes
• Optimize the objective w.r.t. the embeddings of new nodes
16
DeepWalk (Perozzi et al. 2014)
• Learning node representations with the technique for learning word
representations, i.e., Skipgram
• Treat random walks on networks as sentences
Bryan Perozzi, Rami Al-Rfou, Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD’14
17
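A hedged sketch of DeepWalk-style context generation: truncated random walks are treated as "sentences" and later fed to any Skip-gram implementation. The adjacency-list format and parameter values below are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch of DeepWalk's random-walk ("sentence") generation.
import random

def generate_walks(adj, walks_per_node=10, walk_length=40, seed=42):
    """adj: dict mapping node -> list of neighbors. Returns a list of walks."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adj)
        rng.shuffle(nodes)                       # new node order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

# toy usage: a 4-node path graph 0-1-2-3; feed `walks` to a Skip-gram model
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
walks = generate_walks(adj, walks_per_node=2, walk_length=5)
```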
Node2Vec (Grover and Leskovec, 2016)
Aditya Grover and Jure Leskovec. node2vec: Scalable Feature Learning for Networks. KDD’16
18
Expand Node Contexts with Biased Random Walk
20
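A hedged sketch of node2vec's biased second-order random walk step: the return parameter p and in-out parameter q bias the walk toward BFS-like or DFS-like exploration of a node's context. Function names and defaults are illustrative, and the graph is assumed unweighted.

```python
# Hedged sketch of one biased (2nd-order) random walk step as in node2vec.
import random

def node2vec_step(adj, prev, cur, p=1.0, q=1.0, rng=random):
    """Sample the next node given the previous node `prev` and current node `cur`."""
    neighbors = adj[cur]
    weights = []
    for nxt in neighbors:
        if nxt == prev:              # return to where we came from
            weights.append(1.0 / p)
        elif nxt in adj[prev]:       # stays close to prev (BFS-like)
            weights.append(1.0)
        else:                        # moves farther from prev (DFS-like)
            weights.append(1.0 / q)
    return rng.choices(neighbors, weights=weights, k=1)[0]
```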
Applications
• Node classification (Perozzi et al. 2014, Tang et al. 2015a, Grover and Leskovec 2016)
• Node visualization (Tang et al. 2015a)
• Link prediction (Grover and Leskovec 2016)
• Recommendation (Zhao et al. 2016)
• Text representation (Tang et al. 2015a, Tang et al. 2015b)
•…
21
Many Extensions …
• Leverage global structural information (Cao et al. 2015)
• Non-linear methods based on autoencoders (Wang et al. 2016)
• Matrix-factorization based approaches (Qiu et al. 2018)
• Directed network embedding (Ou et al. 2016)
• Signed network embedding (Wang et al. 2017)
• Multi-view networks ( Qu and Tang et al. 2017)
• Networks with node attributes (Yang et al. 2015)
• Heterogeneous networks (Chang et al. 2015)
• Task-specific network embedding (Chen et al. 2017)
22
Graph Neural Networks
❑ Overview
❑ Summary
This Lecture: deep graph learning = graph neural networks = graph representation learning
23
New power source for AI
preliminaries
24
preliminaries
Inspired by a biological neuron
Image credit: http://cs231n.github.io/neural-networks-1/
25
preliminaries
How to model a single neuron?
Inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$: compute the weighted sum $w_1 x_1 + w_2 x_2 + w_3 x_3$ (then apply an activation function)
26
preliminaries
Perceptron: Predecessor of a Neural Network
$\hat{y} = f(\mathbf{W}^\top \mathbf{x}) = f\big(\textstyle\sum_i W_i x_i + b\big)$
Multilayer Network
Loss 𝐸
Update Parameters
❑ The gradient of $w_{ij}$ in the $l$-th layer (the weight connecting unit i in layer l-1 to unit j in layer l) is a function of
❑ All 'error' terms $\delta_k^{(l+1)}$ from layer l+1 (an auxiliary term for computation, not to be confused with gradients)
❑ The output from unit i in layer l-1 (the input to unit j in layer l), which can be stored during the feed-forward phase of computation
(See the formula sketch after this slide.)
31
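As a compact restatement of the two bullets above (a sketch using the usual backpropagation notation, where $\mathrm{net}_j^{(l)}$ denotes the weighted input to unit j and $f$ its activation; these symbols are assumptions, not taken from the slide):

$$\frac{\partial E}{\partial w_{ij}^{(l)}} = \delta_j^{(l)}\, o_i^{(l-1)}, \qquad \delta_j^{(l)} = f'\big(\mathrm{net}_j^{(l)}\big) \sum_k \delta_k^{(l+1)}\, w_{jk}^{(l+1)}$$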
Gradient Computation: Backpropagation
[Diagram: unit i in layer l-1 sends its output $o_i$ to unit j in layer l; the error term $\delta_j^{(l)}$ of unit j aggregates the error terms $\delta_1^{(l+1)}, \delta_2^{(l+1)}, \delta_3^{(l+1)}$ of the units in layer l+1 that unit j connects to]
32
From Neural Networks to Deep Learning
❑ Deep Learning refers to training (deep) neural networks with many layers
❑ More neurons, more layers
❑ More complex ways to connect layers
❑ Deep Learning has some major advantages, making it popular
❑ Tremendous improvement of performance in many tasks
❑ Image recognition, natural language processing, AI game playing…
❑ Requires no (or less) feature engineering, making end-to-end models possible
❑ Several factors lead to deep learning’s success
❑ Very large data sets
❑ Massive amounts of computation power (GPU acceleration)
❑ Advanced neural network structures and tricks
❑ Convolutional neural networks, recurrent neural networks, graph convolution network …
❑ Dropout, ReLU, residual connection, …
33
Overview of Typical Deep Learning Architectures
34
Techniques to Improve Deep Learning Training
❑ Key Challenges for Training Deep Learning Models
❑ Dropout
❑ Pretraining
❑ Cross Entropy
35
Convolutional Neural Networks (CNN)
preliminaries
❑ What is convolution?
1D Convolution 2D Convolution
❑ The outputs are computed by sliding a kernel (of weights) over the inputs and computing a weighted sum locally
36
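A minimal NumPy sketch of the sliding-kernel computation described above ("valid" 2D cross-correlation, which is what CNN layers actually compute); shapes and values are toy examples.

```python
# Slide a 2D kernel over a 2D input and take local weighted sums.
import numpy as np

def conv2d(x, k):
    """x: 2D input (H x W); k: 2D kernel (Kh x Kw). Returns the 'valid' output."""
    H, W = x.shape
    Kh, Kw = k.shape
    out = np.zeros((H - Kh + 1, W - Kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # weighted sum over the local patch covered by the kernel
            out[i, j] = np.sum(x[i:i + Kh, j:j + Kw] * k)
    return out

# toy usage: 5x5 input, 3x3 averaging kernel -> 3x3 output
x = np.arange(25, dtype=float).reshape(5, 5)
y = conv2d(x, np.ones((3, 3)) / 9.0)
```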
CNN Motivation
preliminaries
37
CNN Motivation: Sparse Interactions
preliminaries
MLP
CNN
38
CNN Motivation: Parameter Sharing
preliminaries
Fully connected layer vs. locally connected (sparse interaction) vs. convolution (sparse + sharing)
39
CNN Motivation: Translation Equivariance
preliminaries
❑ Translation equivariance
❑ Intuition: the same input pattern at a different location gives the same output (at the correspondingly shifted location)
❑ Math details:
❑ Examples
❑ #1: a cat at the upper right corner and at the lower left corner of an image, will
produce the same outputs
❑ #2: “University of Illinois” at the start of the sentence and at the end of the
sentence produce the same outputs
40
CNNs: Translation Equivariance
preliminaries
Image $\mathbf{I}$ ($N \times N$), filter $\mathbf{K}$ ($M \times M$). If the image moves right by $l$ (giving image $\mathbf{I}'$, still $N \times N$), the convolution output
$$\mathbf{C}_{i',j'} = \sum_{p,q} \mathbf{I}_{i-p,\,j-l-q}\, \mathbf{K}_{p,q}$$
also moves right by $l$ (i.e., $i' = i$, $j' = j - l$).
41
CNN: Pooling Layer
preliminaries
❑ Pooling (Subsampling)
❑ Pool hidden units in the same neighborhood
❑ Introduces invariance to local translations
❑ Reduces the number of hidden units in hidden layer
42
CNN: Convolutional Layer
preliminaries
43
Convolutional Neural Networks
preliminaries
44
Importance of # of layers
preliminaries
❑ As models go deeper (the # of layers increases), CNNs tend to learn “better” representations
❑ Why? Higher layers are able to learn from a larger receptive field.
❑ Examples
❑ In the 1990s: LeNet had about 8 layers
❑ Now: it is common to have 100+ layers (DenseNet, ResNet)
45
Convolutional Neural Networks for Image Recognition: Example
46
Graph Neural Networks
❑ Overview
❑ Summary
This Lecture: deep graph learning = graph neural networks = graph representation learning
47
Representation Learning on Graphs/Networks
❑Node-level representations
❑ To learn low-dimensional node features
48
Representation Learning on Graphs/Networks
❑Graph-level representations
❑ To learn low-dimensional graph features
49
Applications
❑ Node classification
❑ Predict node categories
❑ Link prediction
❑ Predict edges in the network
❑ Graph classification
❑ Predict graph categories
❑ Many more
❑ Recommendation
❑ Multiple network mining
❑ …
50
GNN #2: Propagation + Transformation
❑ Graph Convolutional Neural Networks
❑ GraphSage
❑ Graph Attention Networks
❑ Graph GAN
❑ Graph Recurrent Neural Networks
❑ Many Many More!
51
Classic CNN on Graphs
❑ Can we directly apply CNN?
No! A fixed-size kernel cannot be directly defined on the irregular neighborhood structure of a graph.
52
Challenges
❑Irregular graph structure (non-Euclidean)
❑ Unfixed size of node neighborhoods
❑Permutation invariance
❑ Node ordering does not matter
53
Graph Convolution
❑ Convolution in spectral domain
54
Graph Conv in Spectral Domain
❑ Graph signal processing
❑Graph signal:
❑ a vector $\mathbf{x} \in \mathbb{R}^n$, where $\mathbf{x}[i]$ is the signal value for node $i$
[Figure: a 4-node graph together with its node attribute matrix]
55
Graph Conv in Spectral Domain
❑ Convolution in spectral domain (due to convolution theorem)
❑ (1) Fourier transform the signal and the filter: $\hat{f} = \mathrm{FT}(f)$, $\hat{g} = \mathrm{FT}(g)$; (2) pointwise multiplication: $\hat{r} = \hat{f} \cdot \hat{g}$; (3) inverse Fourier transform: $r = \mathrm{FT}^{-1}(\hat{r})$
❑ Graph spectral convolution: How to read this table? By analogy!
Classic signal | Graph signal
Laplace operator: $\Delta(e^{2\pi i \omega t})$ | $\mathbf{L} = \mathbf{D} - \mathbf{A}$ (Laplacian matrix)
Frequency: $2\pi\omega$ | Eigenvalues of the Laplacian $\{\lambda_l\}$
Eigenfunction: $e^{2\pi i \omega t}$ | Eigenvectors of the Laplacian $\mathbf{u}_l$
Fourier transform: $\hat{f}(\omega) = \langle f, e^{2\pi i \omega t} \rangle$ | $\hat{\mathbf{x}}(\lambda_l) = \langle \mathbf{x}, \mathbf{u}_l \rangle = \sum_i \mathbf{x}(i)\, \mathbf{u}_l^{*}(i)$
Convolution: $(f * g)(t) = \int f(t')\, g(t - t')\, dt'$ | $(\mathbf{x} *_g \mathbf{y})(i) = \sum_{l=1}^{N} \hat{\mathbf{x}}(\lambda_l)\, \hat{\mathbf{y}}(\lambda_l)\, \mathbf{u}_l(i)$
56
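A hedged NumPy sketch of the graph-signal column of the table: the graph Fourier transform via the Laplacian eigenvectors, and spectral convolution as pointwise multiplication in that basis. The toy graph, signal, and filter values are illustrative assumptions.

```python
# Graph Fourier transform and spectral convolution on a toy 4-node cycle graph.
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A                 # L = D - A
lam, U = np.linalg.eigh(L)                     # eigenvalues = "frequencies",
                                               # eigenvectors = Fourier basis

x = np.array([1.0, 0.0, 0.0, 0.0])             # a graph signal
y = np.array([0.5, 0.3, 0.1, 0.1])             # a filter, also a graph signal

x_hat = U.T @ x                                # graph Fourier transform of x
y_hat = U.T @ y                                # graph Fourier transform of y
conv = U @ (x_hat * y_hat)                     # (x *_g y) = U (x_hat ⊙ y_hat)
```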
Spectral-Based Model (GCN)
❑ Spectral graph convolution
$(\mathbf{x} *_g \mathbf{y})(i) = \sum_{l=1}^{N} \hat{\mathbf{x}}(\lambda_l)\, \hat{\mathbf{y}}(\lambda_l)\, \mathbf{u}_l(i)$
where $\mathbf{y}$ is the filter and $\hat{\mathbf{y}}(\lambda_l)$ is the filter in the spectral domain
Thomas N. Kipf, Max Welling: Semi-Supervised Classification with Graph Convolutional Networks. 2016
57
Spectral-Based Model (GCN)
❑ Approximate convolution (with Chebyshev polynomials)
$\mathbf{x} *_\theta \mathbf{y} \approx \theta\, \tilde{\mathbf{D}}^{-1/2}\, \tilde{\mathbf{A}}\, \tilde{\mathbf{D}}^{-1/2}\, \mathbf{x}, \qquad \tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_n$
58
Spectral-Based Model (GCN)
❑ A two-layer architecture for node classification
$\mathbf{Y} = \mathrm{softmax}\big(\hat{\mathbf{A}}\, \sigma(\hat{\mathbf{A}} \mathbf{X} \boldsymbol{\Theta}_1)\, \boldsymbol{\Theta}_2\big), \qquad \hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2}\, \tilde{\mathbf{A}}\, \tilde{\mathbf{D}}^{-1/2}$
❑ Transductive learning
❑ Spatial interpretation: each layer computes $\mathbf{Z} = \tilde{\mathbf{D}}^{-1/2}\, \tilde{\mathbf{A}}\, \tilde{\mathbf{D}}^{-1/2}\, \mathbf{X} \boldsymbol{\Theta}$
59
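A hedged NumPy sketch of the two-layer forward pass above, $\mathbf{Y} = \mathrm{softmax}(\hat{\mathbf{A}}\, \sigma(\hat{\mathbf{A}} \mathbf{X} \boldsymbol{\Theta}_1)\, \boldsymbol{\Theta}_2)$; the toy graph, shapes, and random weights are assumptions for illustration only.

```python
# Two-layer GCN forward pass with symmetric normalization.
import numpy as np

def normalize_adj(A):
    A_tilde = A + np.eye(A.shape[0])                 # Ã = A + I
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # D̃^{-1/2} (diagonal)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_two_layer(A, X, Theta1, Theta2):
    A_hat = normalize_adj(A)
    H = np.maximum(0, A_hat @ X @ Theta1)            # hidden layer, ReLU as σ
    logits = A_hat @ H @ Theta2
    Z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)          # row-wise softmax

# toy usage: 4 nodes, 5 input features, 8 hidden units, 3 classes
rng = np.random.default_rng(0)
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
Y = gcn_two_layer(A, rng.normal(size=(4, 5)),
                  rng.normal(size=(5, 8)), rng.normal(size=(8, 3)))
```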
Graph Conv in Spatial Domain
❑ Local aggregation over the local neighborhood
60
Spatial-Based Model
❑Key idea: message passing
❑ Defines how to aggregate node representations
❑ Key components:
❑ Messages: $\phi(\mathbf{x}_i, \mathbf{x}_j)$, computed by differentiable functions
❑ Neighborhood aggregation: $\mathrm{Aggregate}\big(\{\phi(\mathbf{x}_i, \mathbf{x}_j) : j \in \mathcal{N}(i)\}\big)$
❑ Update by differentiable functions: $\mathbf{x}_i \leftarrow \gamma\big(\mathbf{x}_i,\ \mathrm{Aggregate}(\{\phi(\mathbf{x}_i, \mathbf{x}_j) : j \in \mathcal{N}(i)\})\big)$ (see the sketch after this slide)
61
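A hedged sketch of the generic message-passing scheme above: per-edge messages $\phi(\mathbf{x}_i, \mathbf{x}_j)$, a permutation-invariant Aggregate (a sum here), and an update $\gamma$. The concrete choices of $\phi$, Aggregate, and $\gamma$ below are illustrative assumptions.

```python
# One generic message-passing layer over an adjacency list.
import numpy as np

def message_passing_layer(adj, X, phi, gamma):
    """adj: dict node -> neighbor list; X: (n, d) features. Returns updated X."""
    X_new = np.zeros_like(X)
    for i in range(X.shape[0]):
        msgs = [phi(X[i], X[j]) for j in adj[i]]         # messages from neighbors
        agg = np.sum(msgs, axis=0) if msgs else np.zeros_like(X[i])
        X_new[i] = gamma(X[i], agg)                      # update own embedding
    return X_new

# toy usage with simple differentiable choices of phi and gamma
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
X = np.random.randn(4, 3)
phi = lambda xi, xj: xj                                  # message = neighbor feature
gamma = lambda xi, agg: 0.5 * xi + 0.5 * agg             # mix self and aggregate
X1 = message_passing_layer(adj, X, phi, gamma)
```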
Spatial-Based Model (GraphSage)
❑ Message passing:
62
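A hedged sketch of one GraphSAGE layer with the mean aggregator: sample a fixed-size neighborhood, average the neighbors' embeddings, concatenate with the node's own embedding, then apply a linear transform and a nonlinearity. The weights, sizes, and sampling budget are toy assumptions.

```python
# GraphSAGE-style mean aggregation (one layer, toy parameters).
import numpy as np

def graphsage_mean_layer(adj, X, W, num_samples=5, seed=0):
    """W has shape (2d, d_out); returns updated (n, d_out) embeddings."""
    rng = np.random.default_rng(seed)
    out = np.zeros((X.shape[0], W.shape[1]))
    for i in range(X.shape[0]):
        nbrs = adj[i]
        if len(nbrs) > num_samples:                       # sample a fixed-size neighborhood
            nbrs = rng.choice(nbrs, size=num_samples, replace=False)
        h_nbr = X[list(nbrs)].mean(axis=0) if len(nbrs) else np.zeros(X.shape[1])
        h_cat = np.concatenate([X[i], h_nbr])             # concat(self, aggregated)
        out[i] = np.maximum(0, h_cat @ W)                 # ReLU(W^T [h_i || h_N(i)])
    return out

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
X = np.random.randn(4, 6)
H = graphsage_mean_layer(adj, X, np.random.randn(12, 4))
```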
Spatial-Based Model (GAT)
❑ Message passing: spatial convolution with attention weights
$\mathbf{x}'_i = \sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, \boldsymbol{\Theta} \mathbf{x}_j$, where the attention coefficient $\alpha_{ij}$ is computed from $\boldsymbol{\Theta}\mathbf{x}_i$ and $\boldsymbol{\Theta}\mathbf{x}_j$
63
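A hedged sketch of GAT-style attention for a single node i: score each neighbor j with $\mathbf{a}^\top[\boldsymbol{\Theta}\mathbf{x}_i \,\|\, \boldsymbol{\Theta}\mathbf{x}_j]$, apply LeakyReLU, softmax over the neighborhood, and take the weighted sum of $\boldsymbol{\Theta}\mathbf{x}_j$. The parameters Theta and a below are random toy values, not trained weights.

```python
# Single-head, single-node GAT-style update (toy parameters).
import numpy as np

def leaky_relu(z, alpha=0.2):
    return np.where(z > 0, z, alpha * z)

def gat_node_update(i, adj, X, Theta, a):
    nbrs = adj[i] + [i]                                  # include a self-loop
    h = X @ Theta.T                                      # Θx for all nodes
    scores = np.array([leaky_relu(a @ np.concatenate([h[i], h[j]])) for j in nbrs])
    alpha_ij = np.exp(scores - scores.max())
    alpha_ij = alpha_ij / alpha_ij.sum()                 # softmax over the neighborhood
    return sum(alpha_ij[k] * h[j] for k, j in enumerate(nbrs))

adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
X = np.random.randn(4, 5)
Theta = np.random.randn(8, 5)                            # maps 5-d features to 8-d
a = np.random.randn(16)                                  # attention vector on [Θx_i || Θx_j]
x0_new = gat_node_update(0, adj, X, Theta, a)
```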
Graph Neural Networks
❑ Overview (node-level vs. graph-level embedding)
64
Open Challenges & Future Directions (optional)
❑ F4: GNN+X
https://www.researchgate.net/publication/281550777_Upscaling_In_Situ_Soil_Moisture_Observations_To_Pixel_Averages_With_Spatio-Temporal_Geostatistics/figures?lo=1&utm_source=google&utm_medium=organic
https://medium.com/technicity/worlds-100-largest-companies-by-revenue-in-2019-d6d53dd1851d
GNN+X:
Net3: Overview
• Problem: given the 𝜔 historical snapshots, predict the next 𝜏 snapshots.
• Challenge 1: Explicit Relations - TGCN
• Challenge 2: Implicit Relations - TRNN
• Baoyu Jing, Hanghang Tong and Yada Zhu: Network of Tensor Time Series. TheWebConf 2021
Missing value recovery & Future value prediction
• The red arrows (or the left-most bars) point to NET3.
• NET3 performs the best: lowest RMSE.
• Baoyu Jing, Hanghang Tong and Yada Zhu: Network of Tensor Time Series. TheWebConf 2021
68
GNN+X:
Networked Time Series Imputation with Graph Enhanced VAE
• Dynamic graph embedding: nodes with time series to describe their dynamics.
• Methods purely based on time series information: BRITS, rGAIN.
• Methods that utilize graphs modeling the underlying relationships among entities: NET3, GRIN, SPIN.
• Network time series imputation (missing components exist in time series and graph edges):
• Dynamics of social networks: graph that models retweet activities of different users evolves with time.
• Evolution of a pandemic: individuals like human beings or animals may move around and thus the graph that
models the spread of disease is dynamic.
• D. Wang, Y. Yan, R. Qiu, Y. Zhu, K. Guan, A. J Margenot, H. Tong: Networked Time Series Imputation via Position-aware Graph Enhanced VAE.
69
GNN+X:
Networked Time Series Imputation with Graph Enhanced VAE
• NTS imputation problem from the perspective of multi-task learning and model design:
• Key insight: we should learn a good generative model that can capture the distribution of
observed multivariate node feature time series and observed graph structures.
• Objective from the perspective of information bottleneck in unsupervised learning:
• Based on the objective, we can further decompose the first term as:
• Note that when and are independent from each other (even given the latent representation z), then we
have:
• D. Wang, Y. Yan, R. Qiu, Y. Zhu, K. Guan, A. J Margenot, H. Tong: Networked Time Series Imputation via Position-aware Graph Enhanced VAE.
70
GNN+X:
Networked Time Series Imputation with Graph Enhanced VAE
• Experiments of the proposed PoGeVon method over various real-world datasets:
• Datasets: three real-world dataset
• COVID-19: infection cases of 50 states in USA from 01/21/2020 to 12/31/2020.
• AQ36: air pollutants AQI values of 36 monitoring stations in Beijing.
• PeMS: traffic flow from 01/01/2022 to 03/31/2022 in Bay Area, Los Angeles and San Diego.
• Metrics: MAE, MSE, MRE
• Sensitivity analysis of PoGeVon over PeMS dataset by changing the mask rates.
• D. Wang, Y. Yan, R. Qiu, Y. Zhu, K. Guan, A. J Margenot, H. Tong: Networked Time Series Imputation via Position-aware Graph Enhanced VAE.
71
Temporal Graph Learning
• Temporal network structure:
[Architecture: time-encoding function; link-encoder (MLP); node-encoder (MLP); link-classifier (MLP)]
We don’t need a complicated method if we could select the “right” input data
GraphMixer: Experiments
• GraphMixer achieves outstanding performance
• Two variants of GraphMixer:
• GraphMixer-L: only use link-encoder + link-classifier
• GraphMixer-N: only use node-encoder + link-classifier
• Larger average time-gaps
• Larger average node-degree
• Larger maximum timestamps
SENSEI:
Reconciling Competing Sampling Strategies
❑ Typical sampling strategy
o Close nodes as positive samples and distant nodes as negative samples
❑ Competing sampling strategies
• Yuchen Yan, et al. Reconciling competing sampling strategies in network embedding. NeurIPS 2023.
- 76 -
SENSEI: Algorithm
Theoretical analysis: it is impossible to capture the discrimination and monotonicity properties for all node pairs (association)
❑ Key ideas
o Partial monotonicity for top-K recommendation list
o First fulfill the discrimination property (Step 1) then satisfy the partial monotonicity (Step 2)
❑ Two models
o SENSEI-P for plain network
o SENSEI-M for multi-layered network
• Within-layer association
• Cross-layer association
- 77 -
SENSEI: Experimental Results
AUC-ROC and AUC-PR on the link prediction task; Hit@10 on the node recommendation task
❑ Observations:
o SENSEI-P generally outperforms baselines on the link prediction task.
• Fulfill the discrimination property
o SENSEI-P beats baselines on the node recommendation task.
• Fulfill the partial monotonicity property
[1] Aditya Grover, et al. Node2vec: scalable feature learning for networks. KDD’2016.
[2] Thomas N Kipf, et al. Variational graph auto-encoders. arXiv’2016.
[3] Petar Veličković, et al. Graph attention networks. arXiv’2017.
[4] Shirui Pan, et al. Adversarially regularized graph autoencoder for graph embedding. arXiv’2018.
[5] Zexi Huang, et al. A broader picture of random-walk based graph embedding. KDD’2021.
- 78 -
GNNs Design #1: Context + Learning
• Step 1: Generate a Context Matrix
• Step 2: Learn Embedding by Preserving the Context
• e.g., minimizing an objective w.r.t. the context
• Examples: LINE, DeepWalk, node2vec, and many more
Random walk generation (generate node contexts through random search); predict the nearby nodes in the random walks
• Bryan Perozzi, Rami Al-Rfou, Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD’14
• Jian Tang, Meng Qu, Mingzhe Wang, Jun Yan, Ming Zhang and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. WWW’15
80
GNNs Design #2: Propagate + Transform
• Step 1: Propagate/Aggregate the current embedding
• Step 2: Transform the aggregated embedding
• An Example: GCN
$\mathbf{Z} = \sigma\big(\tilde{\mathbf{D}}^{-1/2}\, \tilde{\mathbf{A}}\, \tilde{\mathbf{D}}^{-1/2}\, \mathbf{X} \boldsymbol{\Theta}\big)$
• Related GNNs
• Spatial GNN
• GraphSage
• GAT
81
GNNs Alternative Design:
Position + Transform
• Current work
– Step 1: Propagate/Aggregate the current embedding
– Step 2: Transform the aggregated feature
– An Example: GCN  $\mathbf{Z} = \sigma\big(\tilde{\mathbf{D}}^{-1/2}\, \tilde{\mathbf{A}}\, \tilde{\mathbf{D}}^{-1/2}\, \mathbf{X} \boldsymbol{\Theta}\big)$
• Our work
– Step 1: Position the nodes wrt anchor nodes/node-pairs
– Step 2: Transform the position vectors (e.g., MLP)
– An Example: Bright (PGNN for Network Alignment)
• Random walk with restart for position, from anchor links
• Linear layer for transformations
• Margin-based loss for training
• Yuchen Yan, Si Zhang and Hanghang Tong: BRIGHT: A Bridging Algorithm for Network Alignment. TheWebConf 2021.
• Yuchen Yan, Lihui Liu, Yikun Ban, Baoyu Jing and Hanghang Tong: Dynamic Knowledge Graph Alignment. AAAI 2021
• Jiaxuan You, Rex Ying, Jure Leskovec: Position-aware Graph Neural Networks. ICML 2019: 7134-7143
82
GNNs Alternative Design:
Position + Transform
• BRIGHT: A Bridging Algorithm for Network Alignment
• Simultaneously Addressing Limitations of
– Consistency Optimization Based Methods: Same steps, same weights
– Embedding Based Methods: relaxation of consistency based methods
• Empirical Results
• Yuchen Yan, Si Zhang and Hanghang Tong: BRIGHT: A Bridging Algorithm for Network Alignment. TheWebConf 2021.
• Yuchen Yan, Lihui Liu, Yikun Ban, Baoyu Jing and Hanghang Tong: Dynamic Knowledge Graph Alignment. AAAI 2021
• Jiaxuan You, Rex Ying, Jure Leskovec: Position-aware Graph Neural Networks. ICML 2019: 7134-7143
83
Semi-Supervised Node Classification
[Example graph with two node classes: Data Mining and Computer Vision]
• Given:
• the graph topology 𝐀
• node features 𝐗
• a part of the node labels 𝐲𝐭𝐫𝐚𝐢𝐧
• Predict:
• labels of test nodes 𝐲𝐭𝐞𝐬𝐭
[1] Yang, Zhilin, William Cohen, and Ruslan Salakhudinov. "Revisiting semi-supervised learning with graph embeddings." ICML 2016.
84
Existing Low-Pass Filter GNNs
• Based on homophily assumption (e.g., [1,2]): same-label nodes tend to
be connected → node embeddings should be smooth → low-pass filter
Graph signal processing [3] is an analogue of classic signal processing.
[1] Yang, Zhilin, William Cohen, and Ruslan Salakhudinov. "Revisiting semi-supervised learning with graph embeddings." ICML 2016.
[2] Zhu, Jiong, et al. "Beyond homophily in graph neural networks: Current limitations and effective designs." NeurIPS 2020.
[3] Shuman, David I., et al. "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains." IEEE Signal Processing Magazine 30.3 (2013): 83-98.
85
[4] https://www.thorlabs.com/newgrouppage9.cfm?objectgroup_ID=8613
Homophily Assumption is not Always True
• Many model-centric methods [2, 3]:
• multi-hop aggregation
• GNN with adaptive filters
• Low-pass filters smooth the node features (for homophily)
• High-pass filters differentiate the node features (for heterophily)
• …
[Framework overview: node features are fed to MLPs to compute offsets for edge reweighting; the reweighted graph is passed to a GNN, whose aggregated node representations are used for classification]
Our framework can handle
• different degrees of heterophily locally, and
• a broad range of GNNs
• Zhe Xu, Yuzhong Chen, Qinghai Zhou, Yuhang Wu, Menghai Pan, Hao Yang, Hanghang Tong: Node Classification Beyond Homophily: Towards a General Solution. KDD 2023: 2862-2873
87
Empirical Evaluation: Heterophilic Graphs
88
TEDGCN: Negative Depth for Heterophily
❑ Graph Homophily
o Close nodes tend to have similar attributes/labels.
❑ Graph Heterophily
o Measurement: $h(G) = \dfrac{\sum_{i,j:\,\mathbf{A}[i,j]=1} \mathbb{1}(y[i] \neq y[j])}{\sum_{i,j} \mathbb{1}(\mathbf{A}[i,j]=1)} \in [0,1]$ (a short computation sketch follows below)
[Example: two small graphs with nodes labeled USA, Russia, and Ukraine; graph heterophily $h(G) = 0.33$ (left) vs. $h(G) = 1$ (right)]
[1] Hongbin Pei, et al. Geom-gcn: Geometric graph convolutional networks. ICLR’2020.
[2] Yuchen Yan, et al. Trainable negative depth is all you need for graph heterophily. NeurIPS 2023.
- 89 -
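A hedged NumPy sketch of the edge heterophily ratio $h(G)$ defined above: the fraction of edges whose endpoints carry different labels. The toy adjacency matrix and labels are illustrative.

```python
# Edge heterophily ratio: share of edges connecting nodes with different labels.
import numpy as np

def heterophily_ratio(A, y):
    """A: (n, n) adjacency matrix; y: length-n label array."""
    src, dst = np.nonzero(A)                 # all (i, j) with A[i, j] = 1
    return float(np.mean(y[src] != y[dst]))  # share of label-disagreeing edges

# toy usage: a triangle plus a pendant node
A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]])
y = np.array([0, 0, 1, 1])
h = heterophily_ratio(A, y)                  # 0.5: half the edges connect different labels
```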
Key Idea: Depth of GCN & Graph Heterophily
❑ Use depth of GCN (𝑑) to control GCN’s expressive power of graph heterophily.
o Graph Laplacian & $d$: $d$ controls the weights of the different eigen-graphs ($\sum_{i=1}^{n} (1 - \tilde{\lambda}_i)^d\, \mathbf{u}_i \mathbf{u}_i^\top$).
o Graph Laplacian & graph heterophily: eigen-graphs w.r.t. large $\tilde{\lambda}_i$ represent heterophilic components in the graph $G$.
Can we directly make $d$ trainable in GCN, so that the weights of $\mathbf{u}_i \mathbf{u}_i^\top$ w.r.t. large $\tilde{\lambda}_i$ become small for homophilic graphs and large for heterophilic graphs?
• Yuchen Yan, et al. Trainable negative depth is all you need for graph heterophily. NeurIPS 2023
- 90 -
Redefine the Depth of GCN
❑ Select $g(\lambda_i) = 1 - \tfrac{1}{2}\lambda_i$
❑ Define $\mathbf{S} = g(\mathbf{L}_{sym}) = \mathbf{U}\,(\mathbf{I} - 0.5\,\boldsymbol{\Lambda})\,\mathbf{U}^\top$
❑ Required properties
o $g(\lambda_i) = 1 - \tfrac{1}{2}\lambda_i > 0$ for $\lambda_i \in [0, 2)$
o When $d > 0$, $g(\lambda_i)^d$ monotonically decreases in $\lambda_i$ (homophily).
o When $d < 0$, $g(\lambda_i)^d$ monotonically increases in $\lambda_i$ (heterophily).
o Geometric interpretability compared to $\tilde{\mathbf{S}}$ in GCN
• $\tilde{\mathbf{S}}$ in GCN: (1) add self-loops; (2) normalization.
• $\mathbf{S}$: (1) normalization; (2) half normalization + half self-loops.
[1] Shuman, et al. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing
magazine’2013..
- 91 -
Experimental Results
Homophilic datasets
Heterophilic datasets
- 92 -
Key Papers
Core Papers
• Jian Tang, Meng Qu, Mingzhe Wang, Jun Yan, Ming Zhang and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. WWW’15
• Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
Further Reading
• Graph convolutional networks: a comprehensive review. Si Zhang, Hanghang Tong, Jiejun Xu, Ross Maciejewski Computational Social Networks,
2019, PDF: https://sizhang2.web.illinois.edu/resources/gcn_survey_journal.pdf
• Inductive Representation Learning on Large Graphs. W.L. Hamilton, R. Ying, and J. Leskovec arXiv:1706.02216 [cs.SI], 2017.
• Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka: How Powerful are Graph Neural Networks? ICLR 2019
• Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, Jie Tang: Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE,
PTE, and node2vec. WSDM 2018: 459-467
• Yuxiao Dong, Nitesh V. Chawla, Ananthram Swami: metapath2vec: Scalable Representation Learning for Heterogeneous Networks. KDD 2017: 135-
144
• Jiaxuan You, Rex Ying, Jure Leskovec: Position-aware Graph Neural Networks. ICML 2019: 7134-7143
• Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, Martin Grohe: Weisfeiler and Leman Go
Neural: Higher-Order Graph Neural Networks. AAAI 2019: 4602-4609
• Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, Nitesh V. Chawla: Heterogeneous Graph Neural Network. KDD 2019: 793-803
• Keith Henderson, Brian Gallagher, Lei Li, Leman Akoglu, Tina Eliassi-Rad, Hanghang Tong, Christos Faloutsos: It's who you know: graph mining
using recursive structural features. KDD 2011: 663-671
• Veličković, Petar, et al. "Graph attention networks." arXiv preprint arXiv:1710.10903 (2017).
• Bryan Perozzi, Rami Al-Rfou, Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD’14
• Shuman, David I., et al. "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular
domains." IEEE signal processing magazine 30.3 (2013): 83-98.
• Z. Xu, Y. Chen, Q. Zhou, Y. Wu, M. Pan, H. Yang, H. Tong: Node Classification Beyond Homophily: Towards a General Solution. KDD 2023
94
More on GNNs (not covered)
❑ Graph convolutional networks: a comprehensive review [pdf] Si Zhang,
Hanghang Tong, Jiejun Xu, Ross Maciejewski Computational Social
Networks, 2019
95
Spatial-Based Model (HGCN)
❑Hyperbolic graph convolutional networks
❑ Key idea: message passing in hyperbolic space
❑ Why: better capture graph hierarchical structure
96
Other Aspects in Graph Conv Nets
❑ Pooling methods
❑ To better capture graph hierarchical structure.
❑ Deterministic methods:
❑ Multi-level graph coarsening/clustering, etc.
❑ Adaptive methods:
❑ Neural pooling layer
❑ Training methods:
❑ Stochastic training – minibatch sampling
❑ Uniform sampling, importance sampling, etc.
97
Self-Attention Pooling
❑ Key idea: self-attention based importance
$\mathbf{Z} = \tilde{\mathbf{D}}^{-1/2}\, \tilde{\mathbf{A}}\, \tilde{\mathbf{D}}^{-1/2}\, \mathbf{X}\, \boldsymbol{\theta}_{att}$
❑Tasks:
❑ Graph representation
❑ Node representation (need a good architecture)
98
Graph Conv Nets for Recommendation
❑ User-item interactions → Bipartite graph
99
Graph Conv Nets for Recommendation
❑ Message construction
$m_{u \leftarrow i} = f(\mathbf{x}_i, \mathbf{x}_u, p_{ui})$
$m_{u \leftarrow i} = \dfrac{1}{\sqrt{|\mathcal{N}_u|\,|\mathcal{N}_i|}}\big(\mathbf{W}_1 \mathbf{x}_i + \mathbf{W}_2 (\mathbf{x}_i \odot \mathbf{x}_u)\big)$
❑ Message aggregation
$\mathbf{x}_u^{(l)} = \mathrm{LeakyReLU}\Big(m_{u \leftarrow u}^{(l)} + \sum_{i \in \mathcal{N}_u} m_{u \leftarrow i}^{(l)}\Big)$
101
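A hedged NumPy sketch of the message construction and aggregation above for one user u and its interacted items $\mathcal{N}(u)$; the weight matrices, embeddings, and item degrees are toy assumptions.

```python
# Recommendation-style message passing for a single user node.
import numpy as np

def user_message_passing(u, items, X_user, X_item, deg_item, W1, W2):
    """Returns the updated embedding of user u from its interacted items."""
    x_u = X_user[u]
    m_self = W1 @ x_u                                   # m_{u<-u}
    m_total = np.zeros_like(m_self)
    for i in items:
        p_ui = 1.0 / np.sqrt(len(items) * deg_item[i])  # 1 / sqrt(|N_u| |N_i|)
        m_total += p_ui * (W1 @ X_item[i] + W2 @ (X_item[i] * x_u))
    z = m_self + m_total
    return np.where(z > 0, z, 0.2 * z)                  # LeakyReLU

# toy usage: user 0 interacted with items 1 and 3
X_user, X_item = np.random.randn(2, 4), np.random.randn(5, 4)
W1, W2 = np.random.randn(4, 4), np.random.randn(4, 4)
deg_item = {1: 3, 3: 2}                                 # |N_i| for each item
x0 = user_message_passing(0, [1, 3], X_user, X_item, deg_item, W1, W2)
```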
Graph Conv Nets for Network Alignment
❑ Key idea: allow message passing across networks
❑ Cross-network aggregation via alignment
102
Key Challenges
❑ Optimization Problem in Deep Learning
❑ Minimize the (approximated) training error E
❑ (Stochastic) gradient descent to find the model parameter
❑ Challenge 1: Optimization
❑ E is non-convex in general
❑ How to find a high-quality local optimum
❑ Challenge 2: Generalization
❑ What we do: minimize (approximated) training error
❑ What we really want: minimize the generalization error
❑ How to mitigate over-fitting
103
Responsive Activation Function
❑ Saturation of the Sigmoid Activation Function
❑ The output: $O = \sigma(I) = 1/(1 + e^{-I})$
❑ The derivative: $\sigma'(I) = \sigma(I)\,(1 - \sigma(I))$, which decays toward 0 as $|I|$ grows
❑ The error of an output unit
❑ Saturation of the Sigmoid Activation Function
❑ If $I \gg 0$ or $I \ll 0$, both the derivative and the error will be close to 0
❑ Further exacerbated due to backpropagation → gradient vanishing → B.P. is stuck or
takes long time to terminate
❑ A More Responsive Activation Function: Rectified Linear Unit (RELU)
❑ The output: O = f(I) = I if I > 0, O = 0 otherwise
❑ The gradient: 1 if I > 0, 0 otherwise
❑ The error: 0 if the unit is inactive (I<0), otherwise, aggregate all error terms from the units
in the next higher layer the unit is connected to, w/o decaying → avoid gradient vanishing
104
Activation Functions
105
Adaptive Learning Rate
❑ Stochastic gradient descent to find a local optimum
❑ Default choice for 𝜂: a fixed, small positive constant
❑ Problems: slow progress, or jump over ‘gradient cliff’, or oscillation
❑ Adaptive learning rate: let 𝜂 change over epoch
❑ Strategy #1:
❑ Strategy #2:
❑ Intuitions: smaller adjustment as algorithm progresses
❑ 𝑒. 𝑔., 𝜂0 = 0.9; 𝜂∞ = 10−9
❑ Strategy #3 (AdaGrad):
❑ Intuition: The magnitude of gradient gt : indicator of the overall progress
❑ 𝑒. 𝑔. , 𝜌 = 10−8
❑ Strategy #4 (RMSProp): exponential decaying weighted sum of squared historical gradients
106
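A hedged sketch of the two adaptive strategies named above (AdaGrad and RMSProp): AdaGrad divides by the accumulated squared gradients, while RMSProp uses an exponentially decaying average instead. The hyperparameter values are illustrative, not the slide's settings.

```python
# Adaptive learning-rate updates for a parameter vector w given gradient g.
import numpy as np

def adagrad_update(w, g, accum, eta=0.01, rho=1e-8):
    accum = accum + g ** 2                         # accumulate squared gradients
    w = w - eta * g / (np.sqrt(accum) + rho)
    return w, accum

def rmsprop_update(w, g, accum, eta=0.01, decay=0.9, rho=1e-8):
    accum = decay * accum + (1 - decay) * g ** 2   # exponentially decaying average
    w = w - eta * g / (np.sqrt(accum) + rho)
    return w, accum
```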
Dropout
❑ The Purpose of Dropout: to Prevent Overfitting
❑ How does it work?
❑ At each epoch (𝜌 dropout rate, e.g., 𝜌 = 0.5)
❑ randomly dropout some non-output units
❑ Perform B.P. on the dropout network
❑ Scale the final model parameters
108
Cross Entropy
❑ Measure the disagreement between the actual (T) and predicted (O) target values
❑ For regression: mean-squared error
❑ For (binary) classification: cross-entropy
❑ Mean-squared error vs. Cross-entropy
109
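A hedged sketch contrasting the two disagreement measures named above, for actual targets T and predicted values O in the binary classification case; array values are toy examples.

```python
# Mean-squared error vs. binary cross-entropy between targets T and predictions O.
import numpy as np

def mean_squared_error(T, O):
    return float(np.mean((T - O) ** 2))

def binary_cross_entropy(T, O, eps=1e-12):
    O = np.clip(O, eps, 1 - eps)             # avoid log(0)
    return float(np.mean(-(T * np.log(O) + (1 - T) * np.log(1 - O))))

T = np.array([1.0, 0.0, 1.0])
O = np.array([0.9, 0.2, 0.6])
mse, ce = mean_squared_error(T, O), binary_cross_entropy(T, O)
```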