04 GNNBasic


CS 514 Advanced Topics in Network Science

Lecture 4: Graph Neural Networks (Basics)


Hanghang Tong, Computer Science, Univ. Illinois at Urbana-Champaign, 2024

1
Network Science: An Overview

network (e.g., patterns, laws, connectivity, etc.)
subgraph (e.g., clusters, communities, dense subgraphs, etc.)
node/link (e.g., ranking, link prediction, embedding, etc.)
We are (mostly) here
• Level 1: diameter, connectivity, graph-level classification, graph-level embedding, graph kernel, graph structure learning, graph generator,…
• Level 2: frequent subgraphs, clustering, community detection, motif, teams, dense subgraphs, subgraph matching, NetFair, …
• Level 3: node proximity, node classification, link prediction, anomaly detection, node embedding, network alignment, NetFair, …
• Beyond: network of X, …

2
Graph Neural Networks

❑ Overview

❑ GNN #1: Context + Learning

❑ GNN #2: Propagation + Transformation

❑ Summary

This Lecture: deep graph learning = graph neural networks = graph representation learning
3
Deep Graph Learning (Graph Neural Networks)
• Node-level representations

• Graph-level representations

Input graphs → Graph classification

This lecture: deep graph learning = graph neural networks = graph representation learning 4
GNNs Example #1: Context + Learning
• Step 1: Generate a Context Matrix
• Step 2: Learn Embedding by Preserving the Context
• e.g., minimizing an objective w.r.t. the context
• Examples: LINE, Deepwalk, node2vec and many more

Random walk generation (generate node contexts through random search) → Predict the nearby nodes in the random walks

• Bryan Perozzi, Rami Al-Rfou, Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD’14
• Jian Tang, Meng Qu, Mingzhe Wang, Jun Yan, Ming Zhang and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. WWW’15 5
GNNs Example #2: Propagate + Transform
• Step 1: Propagate/Aggregate the current embedding
• Step 2: Transform the aggregated embedding
• An Example: GCN
$\mathbf{Z} = \sigma(\tilde{\mathbf{D}}^{-1/2}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-1/2}\,\mathbf{X}\boldsymbol{\Theta})$

• Related GNNs
• Spatial GNN
• GraphSage
• GAT

6
Graph Neural Networks

❑ Overview

❑ GNN #1: Context + Learning

❑ GNN #2: Propagation + Transformation

❑ Summary

This Lecture: deep graph learning = graph neural networks = graph representation learning
7
Problem Definition: Node Embedding
• Given a network/graph G=(V, E, W), where V is the set of nodes, E is
the set of edges between the nodes, and W is the set of weights of
the edges,
• Represent each node i with a vector , which preserves the
structure of networks.

Networks → Node representations

8 Slides credit: Jian Tang HEC @Montreal


Related Work
• Classical graph embedding algorithms
• MDS, IsoMap, LLE, Laplacian Eigenmap, …
• Hard to scale up
• Graph factorization (Ahmed et al. 2013)
• Not specifically designed for network representation
• Some are for Undirected graphs only
• Neural word embeddings (Bengio et al. 2003)
• Neural language model
• word2vec (skipgram), paragraph vectors, etc.

9
LINE: Large-scale Information Network Embedding
• Arbitrary types of networks
• Directed, undirected, and/or weighted
• Clear objective function
• Preserve the first-order and second-order proximity
• Scalable
• Asynchronous stochastic gradient descent
• Millions of nodes and billions of edges: a couple of hours on a single machine

10 • Jian Tang, Meng Qu, Mingzhe Wang, Jun Yan, Ming Zhang and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. WWW’15
First-order Proximity

• The local pairwise proximity between the nodes


• However, many links between the nodes are not observed
• Not sufficient for preserving the entire network structure

11
Second-order Proximity
“The degree of overlap of two people’s friendship networks correlates
with the strength of ties between them” --Mark Granovetter

“You shall know a word by the company it keeps” --John Rupert Firth

• Proximity between the neighborhood structures of the nodes

12
Preserving the First-order Proximity
(LINE 1st)
• Distributions (defined on the undirected edge i - j):
  • Empirical distribution of first-order proximity: $\hat{p}_1(v_i, v_j) = \frac{w_{ij}}{\sum_{(m,n)\in E} w_{mn}}$
  • Model distribution of first-order proximity, defined on the node embeddings ($\mathbf{u}_i$: embedding of node $i$)

• Objective:

  $O_1 = KL(\hat{p}_1, p_1) = -\sum_{(i,j)\in E} w_{ij}\log p_1(v_i, v_j)$
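As a concrete reference, here is a minimal NumPy sketch of the first-order objective, assuming LINE's usual model distribution $p_1(v_i, v_j) = \sigma(\mathbf{u}_i^\top\mathbf{u}_j)$ (a sigmoid of the embedding inner product); the toy graph, weights, and embeddings are placeholders.

```python
# Minimal sketch of the LINE (1st-order) objective O1; toy data only.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                                   # 5 nodes, 8-dimensional embeddings
U = rng.normal(scale=0.1, size=(n, d))        # node embeddings u_i
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (3, 4, 3.0)]   # (i, j, w_ij)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def first_order_objective(U, edges):
    # O1 = -sum_{(i,j) in E} w_ij * log p1(v_i, v_j), with p1 = sigmoid(u_i . u_j)
    return -sum(w * np.log(sigmoid(U[i] @ U[j])) for i, j, w in edges)

print(first_order_objective(U, edges))
```

Minimizing $O_1$ over the embeddings pulls the endpoints of heavily weighted edges closer together.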

13
Preserving the Second-order Proximity
(LINE 2nd)
• Distributions (defined on the directed edge i -> j):
  • Empirical distribution of the neighborhood structure: $\hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{\sum_{k\in V} w_{ik}}$
  • Model distribution of the neighborhood structure ($\mathbf{u}_i$: node $i$ as the central node; $\mathbf{u}'_i$: node $i$ treated as context)

• Objective:

  $O_2 = \sum_i KL(\hat{p}_2(\cdot \mid v_i), p_2(\cdot \mid v_i)) = -\sum_{(i,j)\in E} w_{ij}\log p_2(v_j \mid v_i)$

14
Optimization Tricks
• Stochastic gradient descent + Negative Sampling
• Randomly sample an edge and multiple negative edges

• The gradient w.r.t. the embedding for edge (i, j)

• Problematic when the variances of the edge weights are large

• The variance of the gradients is large
• Solution: edge sampling (see the sketch after this list)
• Sample the edges according to their weights and treat the edges as binary
• Complexity: O(d*K*|E|)
• Linear to the dimensionality d, the number of negative samples K, and the number of edges
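Below is a minimal sketch of one way to combine edge sampling with negative sampling for the second-order objective, assuming center embeddings U and context embeddings U'; the graph, learning rate, and update rule are simplified placeholders (for instance, negatives are drawn uniformly rather than from a noise distribution).

```python
# Sketch: edge sampling (proportional to weight) + negative sampling SGD updates.
import numpy as np

rng = np.random.default_rng(0)
n, d, K, eta = 6, 8, 3, 0.025                  # #nodes, dim, #negative samples, learning rate
U = rng.normal(scale=0.1, size=(n, d))         # center embeddings u_i
Uc = rng.normal(scale=0.1, size=(n, d))        # context embeddings u'_j
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
weights = np.array([1.0, 4.0, 2.0, 1.0, 2.0])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(1000):
    # Edge sampling: draw one edge with probability proportional to its weight, treat it as binary.
    i, j = edges[rng.choice(len(edges), p=weights / weights.sum())]
    # Positive pair: push sigmoid(u_i . u'_j) toward 1.
    g = 1.0 - sigmoid(U[i] @ Uc[j])
    U[i], Uc[j] = U[i] + eta * g * Uc[j], Uc[j] + eta * g * U[i]
    # Negative sampling: push sigmoid(u_i . u'_k) toward 0 for K random nodes k.
    for k in rng.integers(0, n, size=K):
        g = -sigmoid(U[i] @ Uc[k])
        U[i], Uc[k] = U[i] + eta * g * Uc[k], Uc[k] + eta * g * U[i]
```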

15
Discussions
• Embed nodes with few neighbors
• Expand the neighbors by adding higher-order neighbors
• Breadth-first search (BFS)
• Adding only second-order neighbors works well in most cases
• Embed new nodes
• Fix the embeddings of existing nodes
• Optimize the objective w.r.t. the embeddings of new nodes

16
DeepWalk (Perozzi et al. 2014)
• Learning node representations with the technique for learning word
representations, i.e., Skipgram
• Treat random walks on networks as sentences

Random walk generation (generate node contexts through random search) → Predict the nearby nodes in the random walks

17 Bryan Perozzi, Rami Al-Rfou, Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD’14
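A minimal sketch of the random-walk generation step, assuming a toy adjacency list; the actual Skip-gram training on the generated "sentences" is delegated to any word2vec implementation (e.g., gensim) and omitted here.

```python
# Sketch: generate truncated random walks and treat them as "sentences" for Skip-gram.
import random

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}   # toy adjacency list

def random_walk(adj, start, length):
    walk = [start]
    while len(walk) < length and adj[walk[-1]]:
        walk.append(random.choice(adj[walk[-1]]))
    return walk

def build_corpus(adj, walks_per_node=10, walk_length=40):
    corpus, nodes = [], list(adj)
    for _ in range(walks_per_node):
        random.shuffle(nodes)                     # DeepWalk shuffles the node order each pass
        corpus.extend(random_walk(adj, v, walk_length) for v in nodes)
    return corpus

corpus = build_corpus(adj)
# Each walk is a "sentence" of node IDs; feeding the corpus to a Skip-gram trainer
# (predicting nearby nodes within a context window) yields the node embeddings.
print(corpus[0])
```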
Node2Vec (Grover and Leskovec, 2016)

• Find the node context with a hybrid strategy of


• Breadth-first Sampling (BFS): homophily
• Depth-first Sampling (DFS): structural equivalence

18 Aditya Grover and Jure Leskovec. node2vec: Scalable Feature Learning for Networks. KDD’16
Expand Node Contexts with Biased Random Walk

Un-normalized transition probability of (v, x) (see the sketch after this list):

• Biased random walk with two parameters p and q


• Assumption: random walk has just traversed from t to v
• d_tx ∈ {0, 1, 2}: the shortest path distance between t and x
• w_vx: the edge weight
• p: controls the probability of revisiting a node in the walk [larger p, less likely to revisit t]
• q: controls the probability of exploring “outward” nodes [larger q, less likely to go outward]
• Find optimal p and q through cross-validation on labeled data
• Intuition: p and q approximately interpolate between BFS and DFS
• Optimized through similar objective as LINE with first-order proximity
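The sketch below illustrates the biased transition rule, assuming the standard node2vec search bias $\alpha_{pq}(t, x)$ (equal to 1/p if $d_{tx}=0$, 1 if $d_{tx}=1$, and 1/q if $d_{tx}=2$) multiplied by the edge weight $w_{vx}$; the toy graph and parameter values are placeholders.

```python
# Sketch: one biased random-walk step in node2vec, given the walk just moved from t to v.
import random

w = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0,
     (0, 2): 1.0, (2, 0): 1.0, (2, 3): 1.0, (3, 2): 1.0}        # toy edge weights w_vx
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def d_tx(t, x, adj):
    # Truncated shortest-path distance: 0 if x == t, 1 if x is a neighbor of t, else 2.
    return 0 if x == t else (1 if x in adj[t] else 2)

def next_step(t, v, adj, w, p, q):
    # Un-normalized transition probability pi(v, x) = alpha_pq(t, x) * w_vx.
    bias = {0: 1.0 / p, 1: 1.0, 2: 1.0 / q}
    candidates = adj[v]
    probs = [bias[d_tx(t, x, adj)] * w[(v, x)] for x in candidates]
    return random.choices(candidates, weights=probs)[0]

# Larger p: less likely to revisit t; smaller q: more likely to walk "outward" (DFS-like).
print(next_step(t=0, v=2, adj=adj, w=w, p=4.0, q=0.5))
```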
19
Comparison between LINE, DeepWalk, and Node2Vec

Algorithm | Neighbor Expansion | Proximity  | Optimization         | Validation Data
LINE      | BFS                | 1st or 2nd | Negative Sampling    | No
DeepWalk  | Random             | 2nd        | Hierarchical Softmax | No
Node2Vec  | BFS + DFS          | 1st        | Negative Sampling    | Yes

20
Applications
• Node classification (Perozzi et al. 2014, Tang et al. 2015a, Grover et al.
2015 )
• Node visualization (Tang et al. 2015a)
• Link prediction (Grover et al. 2015)
• Recommendation (Zhao et al. 2016)
• Text representation (Tang et al. 2015a, Tang et al. 2015b)
•…

21
Many Extensions …
• Leverage global structural information (Cao et al. 2015)
• Non-linear methods based on autoencoders (Wang et al. 2016)
• Matrix-factorization based approaches (Qiu et al. 2018)
• Directed network embedding (Ou et al. 2016)
• Signed network embedding (Wang et al. 2017)
• Multi-view networks ( Qu and Tang et al. 2017)
• Networks with node attributes (Yang et al. 2015)
• Heterogeneous networks (Chang et al. 2015)
• Task-specific network embedding (Chen et al. 2017)

22
Graph Neural Networks
❑ Overview

❑ GNN #1: Context + Learning

❑ GNN #2: Propagation + Transformation

❑ Preliminaries (MLP, BP, CNN)

❑ Graph Convolutional Networks

❑ Summary

This Lecture: deep graph learning = graph neural networks = graph representation learning
23
New power source for AI
preliminaries

Big(ger) data + Better model & (learning) alg. + High performance computing

24
preliminaries
Inspired by a biological neuron

Image credit: http://cs231n.github.io/neural-networks-1/
25
preliminaries
How to model a single neuron?

[Figure: a single neuron with inputs x1, x2, x3 and weights w1, w2, w3 computes the weighted sum w1x1 + w2x2 + w3x3.]

26
preliminaries
Perceptron: Predecessor of a Neural Network

$\hat{y} = f(\mathbf{W}^T\mathbf{x}) = f(\sum_i W_i x_i + b)$

Activation Function: adds non-linearity to the model
Weights
Bias: a measure of how easy it is to get the perceptron to output a 1

❑ Computes a weighted sum of inputs


❑ Invented in 1957 by Frank Rosenblatt. The original perceptron model does not have a non-linear activation function
27
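A minimal sketch of the computation above, using a step activation as in the classic perceptron; the inputs, weights, and bias are toy values.

```python
# Sketch of a single perceptron: y_hat = f(W.x + b) with a step activation.
import numpy as np

def perceptron(x, W, b, f=lambda s: (s > 0).astype(float)):
    return f(W @ x + b)               # weighted sum of inputs, then activation

x = np.array([1.0, 0.0, 1.0])         # inputs x1, x2, x3
W = np.array([0.5, -0.2, 0.3])        # weights w1, w2, w3
print(perceptron(x, W, b=-0.4))       # 1.0, since 0.5 + 0.3 - 0.4 > 0
```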
preliminaries
Multilayer Perceptron (MLP)

❑ Stacking multiple layers of perceptrons (adding hidden layers) makes a multilayer


perceptron (MLP)
❑ MLP can engage in sophisticated decision making, where perceptrons fail
❑ E.g. XOR problem
❑ Try it: http://playground.tensorflow.org
28
preliminaries
Feed-forward Neural Networks

❑ Why multiple layers?

❑ Automatic feature/representation learning
❑ Learn complicated (nonlinear) mapping functions
29
preliminaries
Learning NN Parameters
[Figure: input $\mathbf{x}_i$ → Multilayer Network → predicted output $\hat{\mathbf{y}}_i$; the loss (error) function $J_\theta(\cdot)$ compares $\hat{\mathbf{y}}_i$ against the ground truth $\mathbf{y}_i$, producing the loss $E$ used to update the parameters.]

❑ Gradient Descent Algorithm

❑ Input: training sample $\mathbf{x}_i$ and its label $\mathbf{y}_i$
1. Feed Forward: get the prediction $\hat{\mathbf{y}}_i = \mathrm{MLP}(\mathbf{x}_i)$ and the loss $E = J(\hat{\mathbf{y}}_i, \mathbf{y}_i)$
2. Compute Gradient: for each parameter $\theta_j$ (weights, bias), compute its gradient $\frac{\partial}{\partial\theta_j} J_\theta$ (explained later)
3. Update Parameter: $\theta_j = \theta_j - \eta \cdot \frac{\partial}{\partial\theta_j} J_\theta$
30
Gradient Computation: Backpropagation
[Figure: unit i in layer l-1 (output $o_i$) feeds unit j in layer l (error term $\delta_j^{(l)}$), which feeds units in layer l+1 with error terms $\delta_1^{(l+1)}, \delta_2^{(l+1)}, \delta_3^{(l+1)}$.]

❑ The gradient of $w_{ij}$ in the $l$th layer (corresponding to unit j in layer l, connected to unit i in layer l-1) is a function of
❑ All 'error' terms $\delta_k^{(l+1)}$ from layer l+1 -- an auxiliary term for computation, not to be confused with gradients
❑ The output from unit i in layer l-1 (the input to unit j in layer l) -- can be stored during the feed-forward phase of computation
31
Gradient Computation: Backpropagation
[Figure: unit i in layer l-1 (output $o_i$) feeds unit j in layer l (error term $\delta_j^{(l)}$), which feeds units in layer l+1 with error terms $\delta_1^{(l+1)}, \delta_2^{(l+1)}, \delta_3^{(l+1)}$.]

❑ The 'error' term $\delta_j^{(l)}$ is a function of
❑ All $\delta_k^{(l+1)}$ in layer l+1, if layer l is a hidden layer
❑ The overall loss function value, if layer l is the output layer
❑ We can compute the error at the output, and distribute it backwards throughout the network's layers (backpropagation)

32
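To make the recursion concrete, here is a minimal NumPy sketch of backpropagation for a one-hidden-layer MLP with sigmoid units and squared-error loss, trained on the XOR toy problem mentioned earlier; the layer sizes, seed, and learning rate are arbitrary choices.

```python
# Sketch: manual backpropagation (delta recursion) for a 1-hidden-layer MLP on XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)    # hidden layer (layer l)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)    # output layer (layer l+1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = 0.5

for _ in range(10000):
    # Feed-forward: store each layer's output for reuse in the backward pass.
    h = sigmoid(X @ W1 + b1)
    o = sigmoid(h @ W2 + b2)
    # Output-layer error terms: loss derivative times activation derivative.
    delta2 = (o - Y) * o * (1 - o)
    # Hidden-layer error terms: backpropagate delta2 through W2.
    delta1 = (delta2 @ W2.T) * h * (1 - h)
    # Gradient of each weight = (output of the unit below) x (error term of the unit above).
    W2 -= eta * h.T @ delta2;  b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1;  b1 -= eta * delta1.sum(axis=0)

print(o.round(2).ravel())   # typically approaches [0, 1, 1, 0] (depends on the random seed)
```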
From Neural Networks to Deep Learning
❑ Deep Learning refers to training (deep) neural networks with many layers
❑ More neurons, more layers
❑ More complex ways to connect layers
❑ Deep Learning has some major advantages, making it popular
❑ Tremendous improvement of performance in many tasks
❑ Image recognition, natural language processing, AI game playing…
❑ Requires no (or less) feature engineering, making end-to-end models possible
❑ Several factors lead to deep learning’s success
❑ Very large data sets
❑ Massive amounts of computation power (GPU acceleration)
❑ Advanced neural network structures and tricks
❑ Convolutional neural networks, recurrent neural networks, graph convolution networks, …
❑ Dropout, ReLU, residual connection, …
33
Overview of Typical Deep Learning Architectures

34
Techniques to Improve Deep Learning Training
❑ Key Challenges for Training Deep Learning Models

❑ Responsive Activation Functions

❑ Adaptive Learning Rate

❑ Dropout

❑ Pretraining

❑ Cross Entropy
35
Convolutional Neural Networks (CNN)
preliminaries

❑ What is convolution?

1D Convolution 2D Convolution

❑ The outputs are computed by sliding a kernel (of weights) over the inputs and computing a weighted sum locally
36
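A minimal sketch of 1D convolution as a sliding weighted sum (the cross-correlation form typically used in deep learning, without kernel flipping); the input and kernel are toy values.

```python
# Sketch: 1D convolution as a sliding local weighted sum ("valid" output size).
import numpy as np

def conv1d(x, k):
    # Slide the kernel over the input; each output entry is a local weighted sum.
    n, m = len(x), len(k)
    return np.array([x[i:i + m] @ k for i in range(n - m + 1)])

x = np.array([3., 1., 4., 1., 5., 9., 2.])
k = np.array([1., 0., -1.])                       # a simple difference (edge-detecting) kernel
print(conv1d(x, k))                               # [-1.  0. -1. -8.  3.]
```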
CNN Motivation
preliminaries

❑ Why not deep MLP (or feed-forward neural network)?


❑ Deep multilayer perceptrons are computationally expensive, and hard to train.
❑ Long training time, slow convergence, local minima
❑ Motivations of convolution
❑ Sparse interactions
❑ Parameter sharing
❑ Translational equivalence
❑ The properties of CNNs are well aligned with properties of many forms of data (e.g.
images, text), making them very successful

37
CNN Motivation: Sparse Interactions
preliminaries

❑ E.g. 1D convolution with kernel size 3


❑ Units in deeper layers still connect to a wide range of inputs

MLP
CNN

38
CNN Motivation: Parameter Sharing
preliminaries

❑ Each kernel is used on all locations of input


❑ Reduce #parameters
❑ An Example
❑ Given image of size 320 × 280 × 3,
❑ Generate a feature map of size 320×280.
❑ Option #1: use a kernel of size 5×5×3 → 5 × 5 × 3 = 75 parameters.
❑ Option #2: use a fully-connected layer: 320 × 280 × 3 × 320 × 280 = 24,084,480,000 (more than 24 billion!) parameters.

39 [Figure: fully connected layer vs. locally connected (sparse interaction) vs. convolution (sparse + sharing)]
CNN Motivation: Translation Equivalence
preliminaries

❑Translational equivalence
❑ Intuitions: Same input at different location gives same output
❑ Math details:
❑ Examples
❑ #1: a cat at the upper right corner and at the lower left corner of an image, will
produce the same outputs
❑ #2: “University of Illinois” at the start of the sentence and at the end of the
sentence produce the same outputs

40
CNNs: Translation Equivalence
preliminaries

$\mathbf{C}_{i,j} = \sum_{p,q}\mathbf{I}_{i-p,\,j-q}\,\mathbf{K}_{p,q}$, for an image $\mathbf{I}$ ($N\times N$) and a filter $\mathbf{K}$ ($M\times M$)

If the image moves right by $l$ (image $\mathbf{I}'$, $N\times N$):

$\mathbf{C}_{i',j'} = \sum_{p,q}\mathbf{I}_{i-p,\,j-l-q}\,\mathbf{K}_{p,q}$ -- the output also moves right by $l$ (i.e., $i' = i$, $j' = j - l$)

41
CNN: Pooling Layer
preliminaries

❑ Pooling (Subsampling)
❑ Pool hidden units in the same neighborhood
❑ Introduces invariance to local translations
❑ Reduces the number of hidden units in hidden layer

42
CNN: Convolutional Layer
preliminaries

❑ Multiple kernels (organized in a tensor)


❑ Applied separately to the input volume
❑ Full in depth: producing a 2D feature map
❑ Multiple feature maps are organized in a 3D tensor
❑ Followed by
❑ nonlinear activation
❑ pooling

43
Convolutional Neural Networks
preliminaries

❑ CNN = a neural network with at least one convolutional layer

44
Importance of # of layers
preliminaries

❑ As models go deeper (# of layers increases), CNNs tend to learn “better” representations
❑ Why? The higher layer is able to learn from a larger receptive field.
❑ Examples
❑ In 1990s: LeNet has 8 layers
❑ Now: it is common to have 100+ layers (DenseNet, resNet)

45
Convolutional NN for Image Recognition: Example

An example CNN for hand written digit recognition

46
Graph Neural Networks
❑ Overview

❑ GNN #1: Context + Learning

❑ GNN #2: Propagation + Transformation

❑ Preliminaries (MLP, BP, CNN)

❑ Graph Convolutional Networks

❑ Summary

This Lecture: deep graph learning = graph neural networks = graph representation learning
47
Representation Learning on Graphs/Networks
❑Node-level representations
❑ To learn low-dimensional node features

48
Representation Learning on Graphs/Networks
❑Graph-level representations
❑ To learn low-dimensional graph features

Input graphs → Graph classification

49
Applications
❑ Node classification
❑ Predict node categories
❑ Link prediction
❑ Predict edges in the network
❑ Graph classification
❑ Predict graph categories
❑ Many more
❑ Recommendation
❑ Multiple network mining
❑ …

50
GNN #2: Propagation + Transformation
❑ Graph Convolutional Neural Networks
❑ GraphSage
❑ Graph Attention Networks
❑ Graph GAN
❑ Graph Recurrent Neural Networks
❑ Many Many More!

51
Classic CNN on Graphs
❑ Can we directly apply CNN?

No! A fixed-size kernel is not well-defined on an irregular graph.
52
Challenges
❑Irregular graph structure (non-Euclidean)
❑ Unfixed size of node neighborhoods

❑Permutation invariance
❑ Node ordering does not matter

❑ Undefined convolution computation

$\mathbf{C}_{i,j} = \sum_{p,q}\mathbf{I}_{i-p,\,j-q}\,\mathbf{K}_{p,q}$ -- what are translations on graphs?

53
Graph Convolution
❑ Convolution in spectral domain

❑ Convolution in spatial domain

54
Graph Conv in Spectral Domain
❑ Graph signal processing

❑ Graph signal:
  ❑ a vector $\mathbf{x} \in \mathbb{R}^n$, where $\mathbf{x}(i)$ is the signal value at node $i$

[Figure: a 4-node graph with its node attribute matrix; each column of the matrix is one graph signal (e.g., graph signal-1).]

55
Graph Conv in Spectral Domain
❑ Convolution in spectral domain (due to the convolution theorem)
  ❑ (1) Fourier transform the signal and the filter: $\hat{f} = \mathrm{FT}(f)$, $\hat{g} = \mathrm{FT}(g)$; (2) pointwise multiplication: $\hat{r} = \hat{f}\cdot\hat{g}$; (3) inverse Fourier transform: $r = \mathrm{FT}^{-1}(\hat{r})$
❑ Graph spectral convolution: How to read this table? By analogy!

Concept | Classic signal | Graph signal
Operator | Laplace $\Delta(e^{2\pi i\omega t})$ | $\mathbf{L} = \mathbf{D} - \mathbf{A}$ (Laplacian matrix)
Frequency | $2\pi\omega$ | eigenvalues of the Laplacian $\{\lambda_l\}$
Eigenfunction | $e^{2\pi i\omega t}$ | eigenvectors of the Laplacian $\mathbf{u}_l$
Fourier transform | $\hat{f}(\omega) = \langle f, e^{2\pi i\omega t}\rangle$ | $\hat{\mathbf{x}}(\lambda_l) = \langle \mathbf{x}, \mathbf{u}_l\rangle = \sum_i \mathbf{x}(i)\,\mathbf{u}_l^*(i)$
Convolution | $(f * g)(t) = \int f(t')\,g(t - t')\,dt'$ | $(\mathbf{x} *_g \mathbf{y})(i) = \sum_{l=1}^{N}\hat{\mathbf{x}}(\lambda_l)\,\hat{\mathbf{y}}(\lambda_l)\,\mathbf{u}_l(i)$
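To make the analogy concrete, here is a minimal NumPy sketch of the graph Fourier transform built from the eigenvectors of the (unnormalized) Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{A}$; the graph and signal are toy placeholders.

```python
# Sketch: graph Fourier transform via the eigenvectors of the graph Laplacian.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)    # toy undirected graph on 4 nodes
L = np.diag(A.sum(axis=1)) - A                # unnormalized Laplacian L = D - A
lam, U = np.linalg.eigh(L)                    # eigenvalues = "frequencies", eigenvectors = basis

x = np.array([1.0, 2.0, 0.5, -1.0])           # a graph signal: one value per node
x_hat = U.T @ x                               # graph Fourier transform: x_hat(lambda_l) = <x, u_l>
x_rec = U @ x_hat                             # inverse transform recovers the signal
print(np.allclose(x, x_rec))                  # True
```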
56
Spectral-Based Model (GCN)
❑ Spectral graph convolution:

  $(\mathbf{x} *_g \mathbf{y})(i) = \sum_{l=1}^{N}\hat{\mathbf{x}}(\lambda_l)\,\hat{\mathbf{y}}(\lambda_l)\,\mathbf{u}_l(i)$, where $\mathbf{y}$ is the filter and $\hat{\mathbf{y}}$ is the filter in the spectral domain

❑ Filter choice: K-polynomial filter (K=2)

  $\hat{\mathbf{y}}(\lambda_l) = \sum_{k=1}^{K}\theta_k\,\lambda_l^{k-1}$; with $\theta_1 = -\theta_2 = \theta$, this gives $\hat{\mathbf{y}}(\lambda_l) = \theta - \theta\lambda_l$

57 Thomas N. Kipf, Max Welling: Semi-Supervised Classification with Graph Convolutional Networks. 2016
Spectral-Based Model (GCN)
❑ Approximate convolution (with Chebyshev polynomials)

  $\mathbf{x} *_\theta \mathbf{y} \approx \theta\,\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}\,\mathbf{x}$, where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_n$

❑ For node features $\mathbf{X}\in\mathbb{R}^{n\times C}$ (i.e., $C$ signals)

  $\mathbf{Z} = \tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}\,\mathbf{X}\boldsymbol{\Theta}$

  ($\tilde{\mathbf{A}}$: adjacency matrix with self-loops; $\mathbf{X}$: input features; $\boldsymbol{\Theta}$: parameters; $\mathbf{Z}$: output features)
58
Spectral-Based Model (GCN)
❑ A two-layer architecture for node classification
1 1
෩ = softmax(𝑨
𝒀 ෡𝜎 𝑨
෡ 𝑿𝚯1 𝚯2 ), ෩ −2 ෩ ෩ −2
෡ = 𝑫 𝑨𝑫
𝑨

❑ Transductive learning
❑ Spatial interpretations
1 1

෩ 2𝑨 ෩𝑫 −
෩ 2 𝐗𝚯
𝒁= 𝑫

❑ Linear transformation on input features (= a Feed-forward layer)


❑ Propagations along edges
❑ Optionally followed up by a point-wise non-linear activation (RELU, Sigmod)

59
Graph Conv in Spatial Domain
❑ Local aggregations over the local neighborhood

❑ Spatial graph convolution:

  $\mathbf{x}_{out}(i) = w_{i,i}\,\mathbf{x}(i) + \sum_{j\in\mathcal{N}(i,K)} w_{i,j}\,\mathbf{x}(j)$, where $\mathcal{N}(i,K)$ is the K-hop neighborhood of node $i$

60
Spatial-Based Model
❑Key idea: message passing
❑ Defines how to aggregate node representations

❑ Key components:
  ❑ Messages: $\phi(\mathbf{x}_i, \mathbf{x}_j)$, given by differentiable functions
  ❑ Neighborhood aggregation: $\mathrm{Aggregate}(\{\phi(\mathbf{x}_i, \mathbf{x}_j),\ j\in\mathcal{N}(i)\})$
  ❑ Update by differentiable functions: $\mathbf{x}_i \leftarrow \gamma(\mathbf{x}_i,\ \mathrm{Aggregate}(\{\phi(\mathbf{x}_i, \mathbf{x}_j),\ j\in\mathcal{N}(i)\}))$
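A minimal sketch of this template with simple choices for $\phi$, Aggregate, and $\gamma$; in a real GNN these would be learned (e.g., linear layers or MLPs), and the aggregation must be permutation-invariant.

```python
# Sketch: generic message passing (message phi, Aggregate, update gamma).
import numpy as np

def message_passing(X, neighbors, phi, aggregate, gamma):
    out = []
    for i, x_i in enumerate(X):
        msgs = [phi(x_i, X[j]) for j in neighbors[i]]            # messages phi(x_i, x_j)
        agg = aggregate(msgs) if msgs else np.zeros_like(x_i)    # neighborhood aggregation
        out.append(gamma(x_i, agg))                              # update gamma(x_i, Aggregate(...))
    return np.array(out)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}
X_new = message_passing(
    X, neighbors,
    phi=lambda xi, xj: xj,                        # message: the neighbor's embedding
    aggregate=lambda msgs: np.mean(msgs, axis=0), # permutation-invariant aggregation
    gamma=lambda xi, agg: 0.5 * xi + 0.5 * agg,   # update: mix self and aggregated messages
)
print(X_new)
```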

61
Spatial-Based Model (GraphSage)
❑ Message passing:

❑ Aggregation: mean, max pooling, LSTM
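A minimal sketch of a GraphSAGE-style layer with the mean aggregator, roughly $\mathbf{h}_i = \mathrm{ReLU}(\mathbf{W}\,[\mathbf{x}_i \,\|\, \mathrm{mean}_{j\in\mathcal{N}(i)}\mathbf{x}_j])$ followed by L2 normalization; the toy graph, the weights, and the omission of neighbor sampling are simplifications.

```python
# Sketch: GraphSAGE-style message passing with the mean aggregator.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                      # current node embeddings
neighbors = {0: [1, 3], 1: [0, 2], 2: [1], 3: [0]}
W = rng.normal(size=(6, 2))                      # maps [self || aggregated] (2*3) to 2 dims

def graphsage_mean_layer(X, neighbors, W):
    H = []
    for i in range(len(X)):
        agg = X[neighbors[i]].mean(axis=0)       # mean over (sampled) neighbors
        H.append(np.maximum(np.concatenate([X[i], agg]) @ W, 0))   # ReLU(W [x_i || agg])
    H = np.array(H)
    return H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)  # l2-normalize each row

print(graphsage_mean_layer(X, neighbors, W))
```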

62
Spatial-Based Model (GAT)
❑ Message Passing: spatial convolution

  $\mathbf{x}_{out}(i) = w_{i,i}\,\mathbf{x}(i) + \sum_{j\in\mathcal{N}(i,K)} w_{i,j}\,\mathbf{x}(j)$

  GAT update: $\mathbf{x}_i \leftarrow \alpha_{i,i}\,\boldsymbol{\Theta}\mathbf{x}_i + \sum_{j\in\mathcal{N}(i)} \alpha_{i,j}\,\boldsymbol{\Theta}\mathbf{x}_j$

❑ Attention coefficients can be viewed as importance

  $\alpha_{ij} = \frac{\exp(\mathbf{a}^T[\boldsymbol{\Theta}\mathbf{x}_i \,\|\, \boldsymbol{\Theta}\mathbf{x}_j])}{\sum_{k\in\mathcal{N}(i)}\exp(\mathbf{a}^T[\boldsymbol{\Theta}\mathbf{x}_i \,\|\, \boldsymbol{\Theta}\mathbf{x}_k])}$
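A minimal sketch of the attention coefficients and the attention-weighted update, assuming the concatenation-based scoring $\mathbf{a}^\top[\boldsymbol{\Theta}\mathbf{x}_i \,\|\, \boldsymbol{\Theta}\mathbf{x}_j]$ shown above; the LeakyReLU and the multi-head attention used in the full GAT are omitted for brevity, and all data are toy values.

```python
# Sketch: GAT-style attention over each node's neighborhood (single head, no LeakyReLU).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                     # node features
Theta = rng.normal(size=(3, 2))                 # shared linear transformation
a = rng.normal(size=(4,))                       # attention vector over the concatenated pair
neighbors = {0: [0, 1, 2], 1: [0, 1], 2: [0, 2, 3], 3: [2, 3]}  # neighborhoods incl. self-loops

def gat_layer(X, neighbors, Theta, a):
    Z = X @ Theta                               # Theta x_i for every node
    out = np.zeros_like(Z)
    for i, nbrs in neighbors.items():
        scores = np.array([a @ np.concatenate([Z[i], Z[j]]) for j in nbrs])
        alpha = np.exp(scores - scores.max())   # softmax over the neighborhood of i
        alpha /= alpha.sum()
        out[i] = sum(a_ij * Z[j] for a_ij, j in zip(alpha, nbrs))
    return out

print(gat_layer(X, neighbors, Theta, a))
```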
63
Graph Neural Networks
❑ Overview (node-level vs. graph-level embedding)

❑ GNN #1: Context + Learning

❑ LINE, Deepwalk, Node2vec and many more

❑ GNN #2: Propagation + Transformation

❑ GCN, GAT, GraphSage and many more

64
Open Challenges & Future Directions optional

❑ F1: GNNs Design

❑ F2: Sampling in GNNs

❑ F3: GNNs with Heterophily

❑ F4: GNN+X

❑ F5: Inverse Problem of GNNs

❑ Many many more…


65
GNN+X:
Network of Tensor Time Series
• Co-evolving time series is ubiquitous.
– Each time series is related to each other.

Environmental Monitoring Financial Analysis Smart Transportation

https://www.researchgate.net/publication/281550777_Upscaling_In_Situ_Soil_Moisture_Observations_To_Pixel_Averages_With_Spatio-
Temporal_Geostatistics/figures?lo=1&utm_source=google&utm_medium=organic
https://medium.com/technicity/worlds-100-largest-companies-by-revenue-in-2019-d6d53dd1851d
GNN+X:
Net3: Overview
• Problem: given the 𝜔 historical snapshots, predict the next 𝜏 snapshots.
• Challenge 1: Explicit Relations - TGCN
• Challenge 2: Implicit Relations - TRNN

• Baoyu Jing, Hanghang Tong and Yada Zhu: Network of Tensor Time Series. TheWebConf 2021
Missing value recovery & Future value prediction
• The red arrows (or the left-most bars) point to NET3.
• NET3 performs the best: lowest RMSE.

• Baoyu Jing, Hanghang Tong and Yada Zhu: Network of Tensor Time Series. TheWebConf 2021
68
GNN+X:
Networked Time Series Imputation with Graph Enhanced VAE
• Dynamic graph embedding: nodes with time series to describe their dynamics.
• Methods purely based on time series information: BRITS, rGAIN.
• Methods that utilize graphs modeling underlying relationships within entities: NET3, GRIN, SPIN.

• Network time series imputation (missing components exist in time series and graph edges):
• Dynamics of social networks: graph that models retweet activities of different users evolves with time.
• Evolution of a pandemic: individuals like human beings or animals may move around and thus the graph that
models the spread of disease is dynamic.

• D. Wang, Y. Yan, R. Qiu, Y. Zhu, K. Guan, A. J Margenot, H. Tong: Networked Time Series Imputation via Position-aware Graph Enhanced 69
GNN+X:
Networked Time Series Imputation with Graph Enhanced VAE
• NTS imputation problem from the perspective of multi-task learning and model design:
• Key insight: we should learn a good generative model that can capture the distribution of
observed multivariate node feature time series and observed graph structures.
• Objective from the perspective of information bottleneck in unsupervised learning:

• Based on the objective, we can further decompose the first term as:
• Note that when the node feature time series and the graph structure are independent from each other (even given the latent representation z), then we have:

which means that the model can learn them independently.

• Real-world datasets demonstrate that they are not independent.
• This motivates us to design a VAE-based model to solve the problem, since we have:

which allows us to predict them in a reciprocal way.

• D. Wang, Y. Yan, R. Qiu, Y. Zhu, K. Guan, A. J Margenot, H. Tong: Networked Time Series Imputation via Position-aware Graph Enhanced 70
GNN+X:
Networked Time Series Imputation with Graph Enhanced VAE
• Experiments of the proposed PoGeVon method over various real-world datasets:
• Datasets: three real-world datasets
• COVID-19: infection cases of 50 states in USA from 01/21/2020 to 12/31/2020.
• AQ36: air pollutants AQI values of 36 monitoring stations in Beijing.
• PeMS: traffic flow from 01/01/2022 to 03/31/2022 in Bay Area, Los Angeles and San Diego.
• Metrics: MAE, MSE, MRE

• Sensitivity analysis of PoGeVon over PeMS dataset by changing the mask rates.

• D. Wang, Y. Yan, R. Qiu, Y. Zhu, K. Guan, A. J Margenot, H. Tong: Networked Time Series Imputation via Position-aware Graph Enhanced 71
Temporal Graph Learning
• Temporal network structure:

• Goal: predict whether $v_i$ and $v_j$ interact at $t_0$ based on all the available temporal graph information during $\{t_1, \dots, t_6\}$
• Application: recommender system, traffic prediction
• Weilin Cong, Si Zhang, Jian Kang, Baichuan Yuan, Hao Wu, Xin Zhou, Hanghang Tong, Mehrdad Mahdavi: Do We Really Need Complicated Model Architectures For
Temporal Networks? ICLR 2023
• Weilin Cong, Jian Kang, Hanghang Tong, Mehrdad Mahdavi: On the Generalization of Temporal Graph Learning: Theoretical Insights and Simpler Algorithms
Motivation
• Existing works: RNN and self-attention mechanism (SAM) are the de facto
standard for temporal graph learning, e.g.,
• JODIE: Inputs → RNN + Memory blocks
• TGAT: Inputs → Self-attention mechanism
• TGN: Inputs → RNN + Memory blocks → Self-attention mechanism

• Indeed, such architecture design matches our intuition

• It has the following drawbacks:


• These methods are conceptually and technically complicated, with advanced model architectures that are hard to implement
• Hard to understand which parts of the model truly contribute to its success,
and whether these components are indispensable
Rethinking Temporal Graph Learning
• Q: Are RNN and self-attention indispensable for TGL?
• A: Not really…
• We propose GraphMixer, which is based entirely on MLPs and neighbor mean-pooling
• GraphMixer achieves SOTA performance with even smaller computation cost and number of parameters
[Figure: GraphMixer architecture. Node features → node-encoder (MLP); link features + timestamps → time-encoding function → link-encoder (MLP); both feed a link-classifier (MLP).]

We don't need a complicated method if we could select the "right" input data.
GraphMixer: Experiments
• GraphMixer achieves outstanding performance
• Two variants of GraphMixer:
  • GraphMixer-L: only use the link-encoder + link-classifier
  • GraphMixer-N: only use the node-encoder + link-classifier
[Figure annotations: larger average time-gaps, larger average node-degree, larger maximum timestamps]
SENSEI:
Reconciling Competing Sampling Strategies
❑ Typical sampling strategy
  o Close nodes as positive samples and distant nodes as negative samples

❑ Competing sampling strategies
  o First category: distant nodes as positive samples
  o Second category: close nodes as negative samples

❑ Two required properties
  o Discrimination property
    • Similarity score $s(\cdot,\cdot)$: close >> distant
  o Monotonicity property
    • Similarity score $s(\cdot,\cdot)$: close > intermediate > distant

• Yuchen Yan, et al. Reconciling competing sampling strategies in network embedding. NeurIPS 2023. - 76 -
SENSEI: Algorithm
Theoretical analysis: It is impossible to capture the discrimination and monotonicity properties for all node pairs (association)

❑ Key ideas
o Partial monotonicity for top-K recommendation list
o First fulfill the discrimination property (Step 1) then satisfy the partial monotonicity (Step 2)

❑ Two models
o SENSEI-P for plain network
o SENSEI-M for multi-layered network
• Within-layer association
• Cross-layer association

- 77 -
SENSEI: Experimental Results

[Tables: AUC-ROC and AUC-PR on the link prediction task; Hit@10 on the node recommendation task]

❑ Observations:
o SENSEI-P generally outperforms baselines on the link prediction task.
• Fulfill the discrimination property
o SENSEI-P beats baselines on the node recommendation task.
• Fulfill the partial monotonicity property

[1] Aditya Grover, et al. Node2vec: scalable feature learning for networks. KDD’2016.
[2] Thomas N Kipf, et al. Variational graph auto-encoders. arXiv’2016.
[3] Petar Veličković, et al. Graph attention networks. arXiv’2017.
[4] Shirui Pan, et al. Adversarially regularized graph autoencoder for graph embedding. arXiv’2018.
[5] Zexi Huang, et al. A broader picture of random-walk based graph embedding. KDD’2021.
- 78 -
GNNs Design #1: Context + Learning
• Step 1: Generate a Context Matrix
• Step 2: Learn Embedding by Preserving the Context
• e.g., minimizing an objective w.r.t. the context
• Examples: LINE, Deepwalk, node2vec and many more

Random walk generation (generate node contexts through random search) → Predict the nearby nodes in the random walks

• Bryan Perozzi, Rami Al-Rfou, Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD’14
• Jian Tang, Meng Qu, Mingzhe Wang, Jun Yan, Ming Zhang and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. WWW’15 80
GNNs Design #2: Propagate + Transform
• Step 1: Propagate/Aggregate the current embedding
• Step 2: Transform the aggregated embedding
• An Example: GCN
$\mathbf{Z} = \sigma(\tilde{\mathbf{D}}^{-1/2}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-1/2}\,\mathbf{X}\boldsymbol{\Theta})$

• Related GNNs
• Spatial GNN
• GraphSage
• GAT

81
GNNs Alternative Design:
Position + Transform
• Current work
– Step 1: Propagate/Aggregate the current embedding
– Step 2: Transform the aggregated feature
– An Example: GCN: $\mathbf{Z} = \sigma(\tilde{\mathbf{D}}^{-1/2}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-1/2}\,\mathbf{X}\boldsymbol{\Theta})$

• Our work
– Step 1: Position the nodes wrt anchor nodes/node-pairs
– Step 2: Transform the position vectors (e.g., MLP)
– An Example: Bright (PGNN for Network Alignment)
• Random walk with restart for position, from anchor links
• Linear layer for transformations
• Margin-based loss for training

• Yuchen Yan, Si Zhang and Hanghang Tong: BRIGHT: A Bridging Algorithm for Network Alignment. TheWebConf 2021.
• Yuchen Yan, Lihui Liu, Yikun Ban, Baoyu Jing and Hanghang Tong:Dynamic Knowledge Graph Alignment. AAAI 2021
• Jiaxuan You, Rex Ying, Jure Leskovec: Position-aware Graph Neural Networks. ICML 2019: 7134-7143 82
GNNs Alternative Design:
Position + Transform
• BRIGHT: A Bridging Algorithm for Network Alignment
• Simultaneously Addressing Limitations of
– Consistency Optimization Based Methods: Same steps, same weights
– Embedding Based Methods: relaxation of consistency based methods

• Overview of Proposed Method

• Empirical Results

• Yuchen Yan, Si Zhang and Hanghang Tong: BRIGHT: A Bridging Algorithm for Network Alignment. TheWebConf 2021.
• Yuchen Yan, Lihui Liu, Yikun Ban, Baoyu Jing and Hanghang Tong:Dynamic Knowledge Graph Alignment. AAAI 2021
• Jiaxuan You, Rex Ying, Jure Leskovec: Position-aware Graph Neural Networks. ICML 2019: 7134-7143 83
Semi-Supervised Node Classification
Data Mining Computer Vision
• Given:
• the graph topology 𝐀
• node features 𝐗
• a part of the node labels 𝐲𝐭𝐫𝐚𝐢𝐧
• Predict:
• labels of test nodes 𝐲𝐭𝐞𝐬𝐭
?

Citation network [1]

[1] Yang, Zhilin, William Cohen, and Ruslan Salakhudinov. "Revisiting semi-supervised learning with graph embeddings." ICML 2016.

84
Existing Low-Pass Filter GNNs
• Based on homophily assumption (e.g., [1,2]): same-label nodes tend to
be connected → node embeddings should be smooth → low-pass filter

Data Mining Computer Vision

Figure credit: [4]

Graph signal processing [3] is an analogy of classic signal processing.
[1] Yang, Zhilin, William Cohen, and Ruslan Salakhudinov. "Revisiting semi-supervised learning with graph embeddings." ICML 2016.
[2] Zhu, Jiong, et al. "Beyond homophily in graph neural networks: Current limitations and effective designs." NeurIPS 2020.
[3] Shuman, David I., et al. "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains." IEEE signal processing magazine 30.3
(2013): 83-98. 85
[4] https://www.thorlabs.com/newgrouppage9.cfm?objectgroup_ID=8613
Homophily Assumption is not Always True
• Many model-centric methods [2, 3]:
• multi-hop aggregation
• GNN with adaptive filters
• Low-pass filters smooth the node features (for homophily)
• High-pass filters differentiate the node features (for heterophily)
• …
?

Dating network [1]


Any general data-centric solution to benefit broad GNNs for
node classification on homophilic/heterophilic graphs?
[1] Zhu, Jiong, et al. "Beyond homophily in graph neural networks: Current limitations and effective designs." NeurIPS 2020.
[2] Zheng, Xin, et al. "Graph neural networks for graphs with heterophily: A survey." arXiv preprint 2022.
[3] Bo, Deyu, et al. "Beyond low-frequency information in graph convolutional networks." AAAI 2021. 86
ALT-local: Overview
[Figure: ALT-local overview. An augmenter (edge reweighting on the original graph plus an offset MLP on the node features) decomposes the graph into $\mathbf{W}\odot\mathbf{A}$ and $(\mathbf{1}-\mathbf{W})\odot\mathbf{A}$; a backbone of dual GNNs produces node embeddings that are aggregated into node representations for classification.]

Our framework can handle
• different degrees of heterophily locally, and
• a broad range of GNNs.

• $\mathbf{H}_1 = \mathrm{GNN}(\mathbf{W}\odot\mathbf{A}, \mathbf{X}, \theta_1)$ and $\mathbf{H}_2 = \mathrm{GNN}((\mathbf{1}-\mathbf{W})\odot\mathbf{A}, \mathbf{X}, \theta_2)$

• Parameterizing $\mathbf{W}$ to lower the number of parameters:
  • $\mathbf{W}[i,j] = \mathrm{sigmoid}(\mathrm{MLP}(\mathbf{H}[i,:]\,\|\,\mathbf{H}[j,:],\ \varphi_1))$
  • $\mathbf{H} = \mathrm{GNN}_{aug}(\mathbf{A}, \mathbf{X}, \varphi_2) = (\epsilon\mathbf{I} - \mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}})\,\mathrm{MLP}(\mathbf{X}, \varphi_2)$ /* high-pass filter */

• Zhe Xu, Yuzhong Chen, Qinghai Zhou, Yuhang Wu, Menghai Pan, Hao Yang, Hanghang Tong: Node 87
Classification Beyond Homophily: Towards a General Solution. KDD 2023: 2862-2873
Empirical Evaluation: Heterophilic Graphs

With ALT, GNNs’ performance on heterophilic graphs improves significantly.

88
TEDGCN: Negative Depth for Heterophily
❑ Graph Homophily
  o Close nodes tend to own similar attributes/labels.
❑ Graph Heterophily
  o Close nodes tend to own disparate attributes/labels.
❑ Measurement: $h(G) = \frac{\sum_{i,j:\,\mathbf{A}[i,j]=1}\mathbb{1}(y[i]\neq y[j])}{\sum_{i,j}\mathbb{1}(\mathbf{A}[i,j]=1)} \in [0,1]$

[Figure: two example belief networks with nodes labeled USA / Russia / Ukraine; graph heterophily $h(G) = 0.33$ vs. $h(G) = 1$.]

[1] Hongbin Pei, et al. Geom-gcn: Geometric graph convolutional networks. ICLR’2020.
[2] Yuchen Yan, et al. Trainable negative depth is all you need for graph heterophily. NeurIPS 2023. - 89 -
Key Idea: Depth of GCN & Graph Heterophily
❑ Use the depth of GCN ($d$) to control GCN's expressive power for graph heterophily.
  o Graph Laplacian & $d$: $d$ controls the weights of different eigen-graphs, $\sum_{i=1}^{n}(1-\tilde{\lambda}_i)^d\,\tilde{\mathbf{u}}_i\tilde{\mathbf{u}}_i^\top$.
  o Graph Laplacian & graph heterophily: eigen-graphs w.r.t. large $\tilde{\lambda}_i$ represent heterophilic components in the graph $G$.

❑ Can we directly make $d$ trainable in GCN, so that the weights of $\tilde{\mathbf{u}}_i\tilde{\mathbf{u}}_i^\top$ w.r.t. large $\tilde{\lambda}_i$ become small/large for homophilic/heterophilic graphs?
  (1) $(1-\tilde{\lambda}_i)^d$ is not monotonic!
  (2) If $\tilde{\lambda}_i = 1$, then $1-\tilde{\lambda}_i = 0$, and $0^0$ is meaningless!

• Yuchen Yan, et al. Trainable negative depth is all you need for graph heterophily. NeurIPS 2023
- 90 -
Redefine the Depth of GCN
❑ Select $g(\lambda_i) = 1 - \frac{1}{2}\lambda_i$
❑ Define $\hat{\mathbf{S}} = g(\mathbf{L}_{sym}) = \mathbf{U}(\mathbf{I} - 0.5\boldsymbol{\Lambda})\mathbf{U}^\top$
❑ Required properties
  o $g(\lambda_i) = 1 - \frac{1}{2}\lambda_i > 0$ for $\lambda_i \in [0, 2)$
  o When $d > 0$, $g(\lambda_i)^d$ monotonically decreases in $\lambda_i$ (homophily).
  o When $d < 0$, $g(\lambda_i)^d$ monotonically increases in $\lambda_i$ (heterophily).
  o Geometric interpretability compared to $\tilde{\mathbf{S}}$ in GCN
    • $\tilde{\mathbf{S}}$ in GCN: (1) add self-loops; (2) normalization.
    • $\hat{\mathbf{S}}$: (1) normalization; (2) half normalization + half self-loops.
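A minimal sketch of the construction above: build $\hat{\mathbf{S}} = \mathbf{U}(\mathbf{I}-0.5\boldsymbol{\Lambda})\mathbf{U}^\top$ from the symmetric normalized Laplacian and raise it to a real-valued (possibly negative) depth $d$ through the eigendecomposition. This is only an illustration of the idea, not the authors' implementation; the toy graph is chosen non-bipartite so that every $g(\lambda_i) > 0$.

```python
# Sketch: S_hat = g(L_sym) = U (I - 0.5*Lambda) U^T, raised to a real-valued depth d.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)         # non-bipartite toy graph, so all g(lambda_i) > 0
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

lam, U = np.linalg.eigh(L_sym)                    # eigenvalues lie in [0, 2)
g = 1.0 - 0.5 * lam                               # g(lambda_i) = 1 - lambda_i / 2 > 0

def S_hat_power(depth):
    # Positive depth suppresses high-frequency (heterophilic) eigen-graphs;
    # negative depth amplifies them.
    return U @ np.diag(g ** depth) @ U.T

X = np.eye(4)                                     # toy node features
print((S_hat_power(2.0) @ X).round(3))            # homophily-friendly filtering
print((S_hat_power(-1.5) @ X).round(3))           # heterophily-friendly filtering
```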

[1] Shuman, et al. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing
magazine’2013..
- 91 -
Experimental Results

Homophilic datasets

Heterophilic datasets

❑ Observation: TEDGCN generally outperforms other GCNs.

- 92 -
Key Papers
Core Papers
• Jian Tang, Meng Qu, Mingzhe Wang, Jun Yan, Ming Zhang and Qiaozhu Mei. LINE: Large-scale Information Network Embedding. WWW’15
• Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).

Further Reading
• Graph convolutional networks: a comprehensive review. Si Zhang, Hanghang Tong, Jiejun Xu, Ross Maciejewski Computational Social Networks,
2019, PDF: https://sizhang2.web.illinois.edu/resources/gcn_survey_journal.pdf
• Inductive Representation Learning on Large Graphs. W.L. Hamilton, R. Ying, and J. Leskovec arXiv:1706.02216 [cs.SI], 2017.
• Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka: How Powerful are Graph Neural Networks? ICLR 2019
• Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, Jie Tang: Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE,
PTE, and node2vec. WSDM 2018: 459-467
• Yuxiao Dong, Nitesh V. Chawla, Ananthram Swami: metapath2vec: Scalable Representation Learning for Heterogeneous Networks. KDD 2017: 135-
144
• Jiaxuan You, Rex Ying, Jure Leskovec: Position-aware Graph Neural Networks. ICML 2019: 7134-7143
• Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, Martin Grohe: Weisfeiler and Leman Go
Neural: Higher-Order Graph Neural Networks. AAAI 2019: 4602-4609
• Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, Nitesh V. Chawla: Heterogeneous Graph Neural Network. KDD 2019: 793-803
• Keith Henderson, Brian Gallagher, Lei Li, Leman Akoglu, Tina Eliassi-Rad, Hanghang Tong, Christos Faloutsos: It's who you know: graph mining
using recursive structural features. KDD 2011: 663-671
• Veličković, Petar, et al. "Graph attention networks." arXiv preprint arXiv:1710.10903 (2017).
• Bryan Perozzi, Rami Al-Rfou, Steven Skiena. DeepWalk: Online Learning of Social Representations. KDD’14
• Shuman, David I., et al. "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular
domains." IEEE signal processing magazine 30.3 (2013): 83-98.
• Z. Xu, Y. Chen, Q. Zhou, Y. Wu, M. Pan, H. Yang, H. Tong: Node Classification Beyond Homophily: Towards a General Solution. KDD 2023
94
More on GNNs (not covered)
❑ Graph convolutional networks: a comprehensive review [pdf] Si Zhang,
Hanghang Tong, Jiejun Xu, Ross Maciejewski Computational Social
Networks, 2019

95
Spatial-Based Model (HGCN)
❑Hyperbolic graph convolutional networks
❑ Key idea: message passing in hyperbolic space
❑ Why: better capture graph hierarchical structure

96
Other Aspects in Graph Conv Nets
❑ Pooling methods
❑ To better capture graph hierarchical structure.
❑ Deterministic methods:
❑ Multi-level graph coarsening/clustering, etc.
❑ Adaptive methods:
❑ Neural pooling layer
❑ Training methods:
❑ Stochastic training – minibatch sampling
❑ Uniform sampling, importance sampling, etc.

97
Self-Attention Pooling
❑ Key idea: self-attention based importance
  $\mathbf{Z} = \tilde{\mathbf{D}}^{-1/2}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-1/2}\,\mathbf{X}\boldsymbol{\theta}_{att}$

❑Tasks:
❑ Graph representation
❑ Node representation (need a good architecture)

98
Graph Conv Nets for Recommendation
❑ User-item interactions → Bipartite graph

❑ Key idea: define message passing between users &


items

99
Graph Conv Nets for Recommendation
❑ Message construction
  $m_{u\leftarrow i} = f(\mathbf{x}_i, \mathbf{x}_u, p_{ui})$

  $m_{u\leftarrow i} = \frac{1}{\sqrt{|\mathcal{N}_u|\,|\mathcal{N}_i|}}\left(\mathbf{W}_1\mathbf{x}_i + \mathbf{W}_2(\mathbf{x}_i\odot\mathbf{x}_u)\right)$

❑ Message Aggregation

  $\mathbf{x}_u^{(l)} = \mathrm{LeakyReLU}\big(m^{(l)}_{u\leftarrow u} + \sum_{i\in\mathcal{N}_u} m^{(l)}_{u\leftarrow i}\big)$

❑ Training: ranking loss + minibatch training


100
Graph Conv Nets for Network Alignment
❑Network alignment
❑ To find node correspondence across networks

101
Graph Conv Nets for Network Alignment
❑ Key idea: allow message passing across networks
❑ Cross-network aggregation via alignment

❑ Non-rigid point set alignment


❑ To address space difference

102
Key Challenges
❑ Optimization Problem in Deep Learning
❑ Minimize the (approximated) training error E
❑ (Stochastic) gradient descent to find the model parameter
❑ Challenge 1: Optimization
❑ E is non-convex in general
❑ How to find a high-quality local optimum

❑ Challenge 2: Generalization
❑ What we do: minimize (approximated) training error
❑ What we really want: minimize the generalization error
❑ How to mitigate over-fitting
103
Responsive Activation Function
❑ Saturation of Sigmoid Activation Function
❑ The output
❑ The derivative decaying
❑ The error of an output unit
❑ Saturation of Sigmoid Activation Function
❑ If , , both derivative and error will be close to 0
❑ Further exacerbated due to backpropagation → gradient vanishing → B.P. is stuck or
takes long time to terminate
❑ A More Responsive Activation Function: Rectified Linear Unit (RELU)
❑ The output O=f(I) = I if I>0, O=0 otherwise
❑ The gradient:
❑ The error: 0 if the unit is inactive (I<0), otherwise, aggregate all error terms from the units
in the next higher layer the unit is connected to, w/o decaying → avoid gradient vanishing
104
Activation Functions

105
Adaptive Learning Rate
❑ Stochastic gradient descent to find a local optimum
❑ Default choice for 𝜂: a fixed, small positive constant
❑ Problems: slow progress, or jump over ‘gradient cliff’, or oscillation
❑ Adaptive learning rate: let 𝜂 change over epoch
❑ Strategy #1:
❑ Strategy #2:
❑ Intuitions: smaller adjustment as algorithm progresses
❑ 𝑒. 𝑔., 𝜂0 = 0.9; 𝜂∞ = 10−9
❑ Strategy #3 (AdaGrad):
❑ Intuition: The magnitude of gradient gt : indicator of the overall progress
❑ 𝑒. 𝑔. , 𝜌 = 10−8
❑ Strategy #4 (RMSProp): exponential decaying weighted sum of squared historical gradients
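As one concrete instance, here is a minimal sketch of an AdaGrad-style update (Strategy #3), where each parameter's effective learning rate shrinks with its accumulated squared gradient; the toy objective and constants are placeholders. RMSProp (Strategy #4) would replace the running sum by an exponentially decaying weighted sum.

```python
# Sketch: AdaGrad-style adaptive learning rate (rho is the small stabilizer from the slide).
import numpy as np

def adagrad_update(theta, grad, accum, eta=0.1, rho=1e-8):
    accum += grad ** 2                            # running sum of squared gradients
    theta -= eta / (np.sqrt(accum) + rho) * grad  # per-parameter effective learning rate
    return theta, accum

# Toy objective f(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta = np.array([5.0, -3.0])
accum = np.zeros_like(theta)
for _ in range(200):
    theta, accum = adagrad_update(theta, grad=theta.copy(), accum=accum)
print(theta)                                      # both coordinates shrink toward 0
```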
106
Dropout
❑ The Purpose of Dropout: to Prevent Overfitting
❑ How does it work?
❑ At each epoch (𝜌 dropout rate, e.g., 𝜌 = 0.5)
❑ randomly dropout some non-output units
❑ Perform B.P. on the dropout network
❑ Scale the final model parameters

❑ Why does Dropout Work?


❑ Dropout can be viewed as a regularization technique
❑ Force the model to be more robust to noise, and to learn more generalizable features
❑ Dropout vs. Bagging
❑ Bagging: Each base model is trained *independently* on a bootstrap sample
❑ Dropout: the model parameters of the current dropout network are updated based on that of the previous dropout
network(s)
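A minimal sketch of dropout on a vector of hidden activations, using the "inverted dropout" convention (rescale by $1/(1-\rho)$ at training time, so no extra scaling of the final parameters is needed at test time); the activations are toy values.

```python
# Sketch: (inverted) dropout on a hidden activation vector.
import numpy as np

def dropout(h, rho=0.5, train=True, rng=None):
    if not train:
        return h                                  # keep all units at test time
    rng = rng or np.random.default_rng(0)
    mask = rng.random(h.shape) >= rho             # drop each non-output unit with probability rho
    return h * mask / (1.0 - rho)                 # rescale so the expected activation is unchanged

h = np.array([0.2, 1.5, -0.7, 0.9, 0.4, -1.1])
print(dropout(h))                                 # roughly half of the entries are zeroed out
```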
107
Pretraining
❑ Pretraining: the process of initializing the model in a suitable region
❑ Greedy supervised pretraining
❑ pre-set the model parameters layer-by-layer in a greedy way
❑ Start with simple model, add one additional layer at a time,
❑ pretrain the parameters of the newly added layer, while fixing for other layers
❑ Each time, equivalent to training a two-layered network
❑ Followed up by a fine-tuning process
❑ Other Pretraining Strategies
❑ Unsupervised pretraining: based on auto-encoder
❑ Hybrid strategy
❑ Pretraining for transfer learning

108
Cross Entropy
❑ Measure the disagreement between the actual (T) and predicted (O) target values
❑ For regression: mean-squared error
❑ For (binary) classification: cross-entropy
❑ Mean-squared error vs. Cross-entropy

❑ Cross-entropy for Multiclass Problem


❑ Actual target
Assume positive example (T=1)
❑ Predicted output
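A minimal sketch of the binary and multiclass cross-entropy between the actual target T and the predicted output O; the clipping constant and toy values are placeholders.

```python
# Sketch: cross-entropy as a measure of disagreement between targets and predictions.
import numpy as np

def binary_cross_entropy(t, o, eps=1e-12):
    o = np.clip(o, eps, 1.0 - eps)                # avoid log(0)
    return -(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))

def multiclass_cross_entropy(t_onehot, o_probs, eps=1e-12):
    return -np.sum(t_onehot * np.log(np.clip(o_probs, eps, 1.0)), axis=-1)

# For a positive example (T = 1), the loss grows as the predicted probability drops.
print(binary_cross_entropy(1.0, 0.9))             # ~0.105
print(binary_cross_entropy(1.0, 0.1))             # ~2.303
print(multiclass_cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))   # ~0.357
```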

109
