First- and High-Order Bipartite Embeddings

Justin Sybrandt ([email protected]) and Ilya Safro ([email protected]), Clemson University, Clemson, SC

arXiv:1905.10953v2 [cs.LG] 23 Jul 2020

ABSTRACT
Typical graph embeddings may not capture type-specific bipartite graph features that arise in such areas as recommender systems, data visualization, and drug discovery. Machine learning methods utilized in these applications would be better served with specialized embedding techniques. We propose two embeddings for bipartite graphs that decompose edges into sets of indirect relationships between node neighborhoods. When sampling higher-order relationships, we reinforce similarities through algebraic distance on graphs. We also introduce ensemble embeddings to combine both into a "best of both worlds" embedding. The proposed methods are evaluated on link prediction and recommendation tasks and compared with other state-of-the-art embeddings. Our embeddings are found to perform better on recommendation tasks and to be equally competitive in link prediction. Although all considered embeddings are beneficial in particular applications, we demonstrate that none of those considered is clearly superior (in contrast to what is claimed in many papers). Therefore, we discuss the trade-offs among them, noting that the methods proposed here are robust for applications relying on same-typed comparisons.

Reproducibility: Our code, data sets, and results are all publicly available online at: https://sybrandt.com/2020/fobe_hobe/.

CCS CONCEPTS
· Computing methodologies → Learning latent representations; · Mathematics of computing → Hypergraphs; · Information systems → Social recommendation; Social networks; Recommender systems; · Networks → Network structure; · Human-centered computing → Social networks; Social network analysis; · Theory of computation → Unsupervised learning and clustering.

KEYWORDS
bipartite graphs, hypergraphs, graph embedding, algebraic distance on graphs, recommendation, link prediction

ACM Reference Format:
Justin Sybrandt and Ilya Safro. 2020. First- and High-Order Bipartite Embeddings. In Proceedings of MLG 2020: 16th International Workshop on Mining and Learning with Graphs (MLG'20). ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/1122445.1122456

Justin Sybrandt is now at Google Brain. Contact via [email protected].

1 INTRODUCTION
Graph embedding methods place nodes into a continuous vector space in order to capture structural properties that enable machine learning tasks [9]. While many have made significant progress embedding general graphs [10, 21, 26, 27], we find that bipartite graphs have received less study [8], and that the field is far from settled on this interesting case.
There exist a variety of special algorithmic cases for bipartite graphs, which are utilized in applications such as user-product or user-group recommender systems [30], hypergraph-based load balancing and mapping [20], gene-disease relationships [2], and drug-to-drug targets [29]. We define a simple, undirected, and unweighted bipartite graph to be G = (V, E), where V = {v_1, v_2, ..., v_{n+m}} is composed of the disjoint subsets A = {α_1, ..., α_n} and B = {β_1, ..., β_m} (V = A ∪ B). Here, A and B represent the two halves of the network, and are sometimes called "types." We use v_i to indicate any node in V, α_i for nodes in A, and β_i for those in B. In a bipartite graph, edges only occur across types, and E ⊆ A × B indicates those connections within G. A single edge is notated as α_i β_j ∈ E, and because our graph is undirected, α_i β_j = β_j α_i. The neighborhood of a node is indicated by the function Γ(·). If α_i ∈ A then Γ(α_i) = {β_j | α_i β_j ∈ E}, and vice-versa for nodes in B. In order to sample an element from a set, such as selecting a random α_i from A with uniform probability, we write α_i ∼ A.

The problem of graph embedding is to determine a representation of the nodes in G in a vector space of r dimensions such that r ≪ |V| and such that a selected node-similarity measure defined on V is encoded by these vectors [27]. We notate this embedding as the function ϵ(·) : V → R^r that maps each node to an embedding.

We propose two methods for embedding bipartite graphs. These methods fit embeddings by optimizing nodes of each type separately, which we find can lead to higher-quality type-specific latent features. Our first method, First-Order Bipartite Embedding (FOBE), samples for the existence of direct and first-order similarities within the bipartite structure. This approach maintains the separation of types by reformulating edges in E into indirect same-typed observations. For instance, the connection α_i β_j ∈ E decomposes into a set of observed pairs (α_i, α_k ∼ Γ(β_j)) and (β_j, β_k ∼ Γ(α_i)).

Our second method, High-Order Bipartite Embedding (HOBE), samples direct, first-, and second-order relationships, and weighs samples using algebraic distance on bipartite graphs [5]. Again, we represent sampled relationships between nodes of different types by decomposing them into collections of same-typed relationships. While this sampling approach is similar to FOBE, algebraic distance allows us to improve embedding quality by accounting for broader graph-wide trends. Algebraic distance on bipartite graphs has the effect of capturing strong local similarities between nodes, and reduces the effect of less meaningful relationships. This behavior is beneficial in many applications, such as shopping, where two users are likely more similar if they both purchase a niche hobby product, and may not be similar even if they both purchase a generic cleaning product.
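To make the neighborhood notation and the FOBE edge decomposition above concrete, the following is a minimal Python sketch. The toy viewer/movie graph and all names are illustrative assumptions, not data or code from the paper.

```python
import random

# Toy bipartite graph stored as neighbor sets; A-nodes are viewers, B-nodes are movies.
neighbors = {
    "viewer_1": {"movie_a", "movie_b"},
    "viewer_2": {"movie_a"},
    "movie_a": {"viewer_1", "viewer_2"},
    "movie_b": {"viewer_1"},
}

def decompose_edge(alpha, beta):
    """Turn one cross-type edge (alpha, beta) into two same-typed
    observations by sampling from each endpoint's neighborhood."""
    alpha_peer = random.choice(sorted(neighbors[beta]))  # alpha_k ~ Gamma(beta)
    beta_peer = random.choice(sorted(neighbors[alpha]))  # beta_k  ~ Gamma(alpha)
    return (alpha, alpha_peer), (beta, beta_peer)

print(decompose_edge("viewer_1", "movie_a"))
```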
Because FOBE and HOBE each make different prior assumptions about the relevance of bipartite relationships, we propose a method for combining bipartite embeddings to get "best of both worlds" performance. This ensemble approach learns a joint representation from multiple pre-trained embeddings. The "direct" combination method fits a non-linear transformation of the original embeddings into a fixed-size hidden layer in accordance with sampled similarities. The "auto-regularized" combination extends the direct method by introducing a denoising-autoencoder layer in order to regularize the learned joint embedding [28]. The architecture of both approaches maintains a separation between nodes of different types, which allows for type-specific embeddings without the constraint of a shared global structure. Evaluation of all proposed embeddings is performed on link prediction, reinforced with holdout experiments, and on recommender system tasks.

1.1 Related Work
Low-rank embeddings project high-order data into a compressed real-valued space, often for the purpose of facilitating machine learning models. Inspired by the skip-gram approach [18], Perozzi et al. demonstrate that a similar method can capture latent structural features of traditional graphs [21]. An alternative approach, LINE by Tang et al., models first- and second-order node relationships explicitly [26]. Node2Vec blends the intuitions behind both LINE and DeepWalk by combining homophilic and structural similarities through a biased random walk [10]. Our proposed methods are certainly influenced by LINE's approach, but differ in a few key areas. Firstly, we split our model in order to only make same-typed comparisons. Furthermore, we introduce terms that compare nodes with relevant neighborhoods, and we can weigh different samples with algebraic distance [5].

While the three previously listed embedding approaches are designed for traditional graphs, Metapath2Vec++ by Dong et al. presents a heterogeneous approach using an extended type-sensitive skip-gram model [6]. Our method differs from Dong et al.'s in a number of ways. Again, we do not apply random walks or the skip-gram model. Furthermore, the Metapath2Vec++ model implicitly asserts that output type-specific embeddings be a linear combination of the same hidden layer. In contrast, we create entirely separate embedding spaces for the nodes of different types.

BiNE by Gao et al. focuses directly on the bipartite case [8]. This approach uses the biased random walks described in Node2Vec, and samples these walks in proportion to each node's HITS centrality [13]. While our methods differ, again, in the use of skip-gram, BiNE also fundamentally differs from our proposed approaches by enforcing global structure through cross-type similarities.

Tsitsulin et al. present VERSE, a versatile graph embedding method that allows multiple different node-similarity measures to be captured by the same overarching embedding technique [27]. This method requires that the user specify a node-similarity measure that will be encoded in the dot product of resulting embeddings. A key difference between the methods presented here and those presented in VERSE comes from differences in the objective values used when training embeddings. VERSE uses a range of methods to sample node pairs, from direct sampling to Noise Contrastive Estimation [11], and updates embeddings according to their observed similarity or dissimilarity (in the case of negative samples). However, the optimization method proposed here enforces only same-typed comparisons.

Our contributions in summary: (1) We introduce First- and High-Order Bipartite Embeddings, which learn representations of bipartite structure that retain type-specific semantic information. (2) We present the direct and the auto-regularized methods, which leverage multiple pre-trained graph embeddings to produce a "best of both worlds" embedding. (3) We discuss the strengths and weaknesses of our proposed methods as they compare to a range of graph embedding techniques. We identify certain graph properties that favor different embedding methods, and report that none of the proposed embeddings is clearly superior. However, we find that applications wanting to make many same-typed comparisons are often best suited by a type-sensitive embedding.

2 METHODS AND TECHNICAL SOLUTIONS
We present two sibling strategies for learning bipartite embeddings. First-Order Bipartite Embedding (FOBE) samples direct links from E and first-order relationships between nodes sharing common neighbors. We then fit embeddings to minimize the KL-divergence between our observations and our embedding-based estimations. The second method, High-Order Bipartite Embedding (HOBE), begins by computing algebraic similarity estimates for each edge [5, 23]. Using these heuristic weights, HOBE samples direct, first-, and second-order relationships, to which we fit embeddings using mean-squared error.

At a high level, both embedding methods begin by observing structural relationships within a graph G and then fitting an embedding ϵ in order to encode structural features via dot products of embeddings. We combine three types of observations for a single graph. These observations are represented through the functions S_A(·,·), S_B(·,·), and S_V(·,·). Each function maps two nodes to an observed similarity: V × V → R. The result of S_A is nonzero only if both arguments are in A; S_B is similarly nonzero only if both arguments are in B. In this manner, these functions capture type-specific similarities. The S_V function, in contrast, captures cross-typed observations, and is nonzero only if its arguments are of different types. We define a reciprocal set of functions to model these similarities: S̃_A(·,·), S̃_B(·,·), and S̃_V(·,·). These functions are defined in terms of ϵ(·), and each method must select an embedding that minimizes the difference between each corresponding S, S̃ pair.

Because we estimate similarities within type-specific subsets of ϵ separately, we can better preserve typed latent features. This is important for many applications. Consider an embedding of the bipartite graph of viewers and movies, often used for applications such as video recommendations. Within "movie space" one would expect to uncover latent features such as genre, budget, or the presence of high-profile actors. These features are undefined within "viewer space," wherein one would expect to observe latent features corresponding to demographics and viewing preferences. Clearly these two spaces are correlated in a number of ways, such as the alignment between viewer tastes and movie genres. However, we find that methods which enforce direct comparisons between viewer and movie embeddings can erode type-specific features, which can lead to lower downstream performance. In contrast, the methods proposed here do not encode cross-type relationships as a linear transformation of embeddings, and instead capture cross-typed relationships through the aggregate behavior of node neighborhoods within same-typed subspaces.

2.1 First-Order Bipartite Embedding
The goal of FOBE is to model direct and first-order relationships from the original structure. This very simple method only detects the existence of a relationship between two nodes, and therefore does not distinguish two nodes that share only one neighbor from two nodes that share many.
However, we find that this simplicity enables scalability at little cost to quality. Here, a direct relationship is any edge from the original bipartite graph, while a first-order relationship is defined as {(α_i, α_j) | Γ(α_i) ∩ Γ(α_j) ≠ ∅}. Note that nodes in a first-order relationship share the same type. We define observations corresponding with each relationship. Direct observations simply detect the presence of an edge, while first-order observations similarly detect a common neighbor. Formally:

  S_A(α_i, α_j) = 1 if α_i, α_j ∈ A and Γ(α_i) ∩ Γ(α_j) ≠ ∅; 0 otherwise.  (1)
  S_B(β_i, β_j) = 1 if β_i, β_j ∈ B and Γ(β_i) ∩ Γ(β_j) ≠ ∅; 0 otherwise.  (2)
  S_V(α_i, β_j) = 1 if α_i β_j ∈ E; 0 otherwise.  (3)

By sampling γ neighbors, we allow our later embedding model to approximate the effects of Γ, similar to the k-ary set sampling in [19]. Note also that each sample contains one nonzero S value. By fitting all three observations simultaneously, we implicitly generate two negative samples for each positive sample. Furthermore, we generate a fixed number of samples for each node's direct and first-order relationships.

Given these observations S_A, S_B, and S_V, we fit the embedding ϵ according to corresponding estimation functions S̃_A, S̃_B, and S̃_V. To estimate a first-order relationship (S̃_A and S̃_B) we calculate the sigmoid of the dot product of embeddings, where σ(x) = (1 + e^{−x})^{−1} (4):

  S̃_A(α_i, α_j) = σ(ϵ(α_i)^⊤ ϵ(α_j))  (5)
  S̃_B(β_i, β_j) = σ(ϵ(β_i)^⊤ ϵ(β_j))  (6)

Building from this, we train embeddings based on direct relationships by composing relevant first-order relationships. Specifically, if α_i β_j ∈ E then we would expect α_i to be similar to α_k ∈ Γ(β_j), and vice-versa. Intuitively, a viewer has a higher chance of watching a movie if they are similar to others that have. We formulate our direct-relationship estimate to be the product of each node's average first-order estimate to the other's neighborhood. Formally:

  S̃_V(α_i, β_j) = E_{α_k ∈ Γ(β_j)}[ S̃_A(α_i, α_k) ] · E_{β_k ∈ Γ(α_i)}[ S̃_B(β_j, β_k) ]  (7)

In order to train our embedding function ϵ for the FOBE method, we minimize the KL-divergence [14] between our observed similarities S and our estimated similarities S̃. We minimize each simultaneously, for both direct and first-order similarities, using the Adagrad optimizer [7]; namely, we solve:

  min_ϵ Σ_{(v_i, v_j) ∈ V×V} [ S_A(v_i, v_j) log( S_A(v_i, v_j) / S̃_A(v_i, v_j) ) + S_B(v_i, v_j) log( S_B(v_i, v_j) / S̃_B(v_i, v_j) ) + S_V(v_i, v_j) log( S_V(v_i, v_j) / S̃_V(v_i, v_j) ) ]  (8)
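As a hedged illustration of Eqs. (4)-(8), the sketch below evaluates the FOBE estimates and one KL-style term for a single sample with NumPy. The embedding table, the toy node names, and the epsilon guard inside the logarithm are assumptions for readability, not details from the paper.

```python
import numpy as np

def sigmoid(x):
    # Eq. 4: sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def same_type_estimate(emb, u, v):
    # Eqs. 5-6: sigmoid of the dot product of two same-typed embeddings.
    return sigmoid(emb[u] @ emb[v])

def cross_type_estimate(emb, alpha, beta, gamma_alpha, gamma_beta):
    # Eq. 7: product of mean same-typed estimates against the sampled
    # neighborhoods gamma_alpha ~ Gamma(beta) and gamma_beta ~ Gamma(alpha).
    mean_a = np.mean([same_type_estimate(emb, alpha, a_k) for a_k in gamma_alpha])
    mean_b = np.mean([same_type_estimate(emb, beta, b_k) for b_k in gamma_beta])
    return mean_a * mean_b

def kl_term(observed, estimated, eps=1e-9):
    # One S * log(S / S~) summand of Eq. 8; treated as zero when S = 0.
    return 0.0 if observed == 0 else observed * np.log(observed / (estimated + eps))

# Toy usage with random 8-dimensional embeddings.
rng = np.random.default_rng(0)
emb = {name: rng.normal(size=8) for name in ["a1", "a2", "b1", "b2"]}
print(kl_term(1, cross_type_estimate(emb, "a1", "b1", ["a2"], ["b2"])))
```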
2.2 High-Order Bipartite Embedding
The goal of HOBE is to capture distant relationships between nodes that are related but may not share an edge or a neighborhood. In order to differentiate the meaningful distant connections from those that are spurious, we turn to algebraic distance on graphs [23]. This method is fast to calculate and provides a strong signal for local similarity. For example, algebraic distance can tell us which neighbor of a high-degree node is most similar to the root. As a result, we can utilize this signal to estimate which multi-hop connections are the most important to preserve in our embedding.

Algebraic distance is a measure of dependence between variables popularized in algebraic multigrid (AMG) [4, 17, 22]. It has since been shown to be a reliable and fast way to capture implicit similarities between nodes in graphs [12, 16] and in hypergraphs represented as bipartite graphs [23] (which is leveraged in this paper), taking distant neighborhoods into account. Technically, it is a process of relaxing randomly initialized test vectors using stationary iterative relaxation applied to the homogeneous system of equations defined by the graph Laplacian, where in the end the algebraic distance between the system's variables x_i and x_j (which correspond to the linear system's rows i and j) is defined as the maximum absolute difference between the ith and jth components of the test vectors (or, depending on the application, as their sum or sum of squares). In our context, a variable is a node, and we apply K iterations of Jacobi over-relaxation (JOR) on the bipartite graph Laplacian as in [22] (K = 20 typically ensures good stabilization, as we do not need full convergence; see Theorem 4.2 in [5]). Initially, each node's coordinate is assigned a random value, but on each iteration a node's coordinate is updated to move it closer to its neighbors' average. Weights corresponding to each neighbor are inversely proportional to that neighbor's degree in order to increase the "pull" of small communities. Intuitively, this acknowledges that two viewers who both watch a niche new-wave movie are more likely similar than two viewers who watched a popular blockbuster. We run JOR over R independent trials (called test vectors in AMG works; convergence is proven in [5]). Formally, for the rth test vector a_r, the JOR update step is performed as follows, where a_r^{(t)}(v_i) represents node v_i's algebraic coordinate on iteration t ∈ {1, ..., K}, and λ is a damping factor (λ = 0.5 is suggested in [23]):

  a_r^{(t+1)}(v_i) = λ a_r^{(t)}(v_i) + (1 − λ) · ( Σ_{v_j ∈ Γ(v_i)} a_r^{(t)}(v_j) |Γ(v_j)|^{−1} ) / ( Σ_{v_j ∈ Γ(v_i)} |Γ(v_j)|^{−1} )  (9)

We use the ℓ2-norm to summarize the algebraic distance of two nodes across R trials with different random initializations. As a result, two nodes will be close in our distance calculation if they remain nearby across many trials, which lessens the effect of overly slow convergence in any single trial. For our purposes we select R = 10. Additionally, we define "algebraic similarity," s(i, j), as a closeness across trials. We subtract the distance between two nodes from the maximum distance in our space, and rescale the result to the unit interval. Because we know that the maximum distance between any two coordinates in the same trial is 1, we can compute this in constant time:

  d(v_i, v_j) = sqrt( Σ_{r=1}^{R} ( a_r^{(K)}(v_i) − a_r^{(K)}(v_j) )^2 )  (10)
  s(v_i, v_j) = ( √R − d(v_i, v_j) ) / √R  (11)

After calculating algebraic similarities for the node pairs of all edges, we begin to sample direct, first-order, and second-order similarities from the bipartite structure. Here, a second-order connection is one wherein α_i and β_j share a neighbor that shares a neighbor: α_i ∈ Γ(Γ(Γ(β_j))). Note that the set of second-order relationships is a superset of the direct relationships. We can extend to these higher-order connections in HOBE, as opposed to FOBE, because of the information provided by algebraic distances. Many graphs contain a small number of high-degree nodes, which creates a very dense second-order graph. Algebraic distances are therefore needed to distinguish which of the sampled second-order connections are meaningful, especially when the relaxation in (9) is normalized by |Γ(v_i)|^{−1}.
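The following sketch implements the JOR relaxation of Eq. (9) and the distance/similarity conversion of Eqs. (10)-(11) for a graph given as a dictionary of neighbor sets. Parameter defaults mirror the values suggested above (K = 20, R = 10, λ = 0.5); it assumes the graph has no isolated nodes and is a simplified illustration rather than the authors' implementation.

```python
import random
import math

def algebraic_similarities(neighbors, n_trials=10, n_iters=20, damping=0.5, seed=0):
    """Run R randomly initialized JOR trials (Eq. 9), then expose a similarity
    function built from the cross-trial l2 distance (Eqs. 10-11)."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    coords = []  # one dict per trial: node -> stabilized coordinate in [0, 1]
    for _ in range(n_trials):
        a = {v: rng.random() for v in nodes}
        for _ in range(n_iters):
            new_a = {}
            for v in nodes:
                # Each neighbor is weighted inversely to its own degree (Eq. 9).
                num = sum(a[u] / len(neighbors[u]) for u in neighbors[v])
                den = sum(1.0 / len(neighbors[u]) for u in neighbors[v])
                new_a[v] = damping * a[v] + (1 - damping) * num / den
            a = new_a
        coords.append(a)

    def similarity(u, v):
        d = math.sqrt(sum((c[u] - c[v]) ** 2 for c in coords))   # Eq. 10
        return (math.sqrt(n_trials) - d) / math.sqrt(n_trials)    # Eq. 11

    return similarity
```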
We formulate our first-order observations to be equal to the strongest shared bridge between two nodes. This indicates that both nodes are closely related to something that is mutually representative, such as two viewers who watch new-wave cinema. Formally:

  S′_A(α_i, α_j) = max_{β_k ∈ Γ(α_i) ∩ Γ(α_j)} min( s(α_i, β_k), s(α_j, β_k) ) if α_i, α_j ∈ A; 0 otherwise.  (12)
  S′_B(β_i, β_j) = max_{α_k ∈ Γ(β_i) ∩ Γ(β_j)} min( s(α_k, β_i), s(α_k, β_j) ) if β_i, β_j ∈ B; 0 otherwise.  (13)

When observing second-order relationships between nodes α_i and β_j of different types, we again construct a measurement from shared first-order relationships. Specifically, we are looking for the strongest first-order connection between α_i and β_j's neighborhood, and vice-versa. In the context of viewers and movies this represents the similarity between a viewer and a movie watched by a friend. Formally:

  S′_V(α_i, β_j) = max( max_{α_k ∈ Γ(β_j)} S′_A(α_i, α_k), max_{β_k ∈ Γ(α_i)} S′_B(β_j, β_k) )  (14)

We again collect a fixed number of samples for each relationship type: direct, first-, and second-order. We then train embeddings using cosine similarities; however, we select the ReLU activation function to replace the sigmoid in order to capture the weighted relationships. We optimize for all three observations simultaneously, which again has the effect of creating negative samples for non-observed phenomena. Our estimated similarities are defined as follows:

  S̃′_A(α_i, α_j) = max( 0, ϵ(α_i)^⊤ ϵ(α_j) )  (15)
  S̃′_B(β_i, β_j) = max( 0, ϵ(β_i)^⊤ ϵ(β_j) )  (16)
  S̃′_V(α_i, β_j) = E_{α_k ∈ Γ(β_j)}[ S̃′_A(α_i, α_k) ] · E_{β_k ∈ Γ(α_i)}[ S̃′_B(β_j, β_k) ]  (17)

We use the same model as FOBE to train HOBE, but with our new estimation functions and a new objective. We now optimize the mean-squared error between our observed and estimated samples, as KL-divergence is ill-defined for the weighted samples we collect. Formally, we minimize:

  min_ϵ E_{(v_i, v_j) ∈ V×V} [ (S′_A(v_i, v_j) − S̃′_A(v_i, v_j))^2 + (S′_B(v_i, v_j) − S̃′_B(v_i, v_j))^2 + (S′_V(v_i, v_j) − S̃′_V(v_i, v_j))^2 ]  (18)
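A small sketch of the HOBE observations in Eqs. (12)-(14), assuming `neighbors` maps each node to a set of its neighbors and `sim` is an edge-level algebraic similarity function such as the one sketched above. Enforcing that both arguments share a type is left to the caller; function names are illustrative.

```python
def first_order_obs(u, v, neighbors, sim):
    """S'_A / S'_B (Eqs. 12-13): strength of the strongest shared bridge.
    Assumes u and v are of the same type."""
    shared = neighbors[u] & neighbors[v]
    if not shared:
        return 0.0
    return max(min(sim(u, k), sim(v, k)) for k in shared)

def second_order_obs(alpha, beta, neighbors, sim):
    """S'_V (Eq. 14): best first-order connection into the other node's
    neighborhood, taken from either side."""
    from_a = max((first_order_obs(alpha, a_k, neighbors, sim)
                  for a_k in neighbors[beta]), default=0.0)
    from_b = max((first_order_obs(beta, b_k, neighbors, sim)
                  for b_k in neighbors[alpha]), default=0.0)
    return max(from_a, from_b)
```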
2.3 Combination Bipartite Embedding
In order to unify our proposed approaches, we present a method to create a joint embedding from multiple pre-trained bipartite embeddings. This combination method maintains our initial assertion that nodes of different types ought to participate in different global embedding structures. We fit a non-linear projection of the input embeddings such that an intermediate embedding can accurately uncover direct relationships. This raises the question of whether it is better to create an intermediate embedding that succeeds in this training task, or whether it is better to fully encode the input embeddings. To address this concern we propose two flavors of our combination method: the "direct" approach maximizes performance on the training task, while the "auto-regularized" approach enforces a full encoding of the input embeddings.

We begin by taking the edge list of the original bipartite graph E as our set of positive samples. We then generate five negative samples for each node by selecting random pairs α_i β_j ∉ E. For each sample, we create an input vector by concatenating each of the e′ pre-trained embeddings:

  In(v_i) = [ϵ_1(v_i) ϵ_2(v_i) ... ϵ_{e′}(v_i)]  (19)

After generating In(α_i) and In(β_j), our models apply 50% dropout to these input vectors [24]. We do so in the auto-regularized case so that we follow the pattern of denoising autoencoders, which have shown high performance in robust dimensionality reduction [28]. However, we find that this dropout increases performance in the direct combination model as well. This is because, in either case, we anticipate both redundant and noisy signals to be present across the concatenated embeddings. This is especially necessary for larger values of k and e′, where the risk of overfitting increases.

We then project In(α_i) and In(β_j) separately onto two hidden layers of size (d(In) + k′)/2, where d(·) indicates the dimensionality of the input and k′ represents the desired dimensionality of the combined embeddings. By separating these hidden layers, we only allow signals from within embeddings of the same node to affect its combination. We then project down to two combination embeddings of size k′, which act as input to both the joint link-prediction model and the optional autoencoder layers.

In the direct case, we simply minimize the mean-squared error between the predicted links and the observed links. Formally, let S″(α_i, β_j) → {0, 1} equal the sampled value, and let S̃″(α_i, β_j) → R be the combination estimate. In the auto-regularized case we introduce a factor to enforce that the original (pre-dropout) embeddings can be recovered from the combined embedding. We weight these factors so that they are half as important as performing the link-prediction training task. The neural architecture used to learn these combination embeddings is depicted in the supplemental information. If Θ is the set of free parameters of our neural network model, N is the set of negative samples, and Out(v_i) is the output of the autoencoder corresponding to In(v_i), then we optimize the following (direct followed by auto-regularized):

  min_Θ E_{(α_i, β_j) ∈ E ∪ N} ( S″(α_i, β_j) − S̃″(α_i, β_j) )^2  (20)

  min_Θ E_{(α_i, β_j) ∈ E ∪ N} [ 4 ( S″(α_i, β_j) − S̃″(α_i, β_j) )^2 + ‖In(α_i) − Out(α_i)‖^2 + ‖In(β_j) − Out(β_j)‖^2 ]  (21)
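To make the combination architecture more tangible, here is a hedged Keras sketch of the "direct" variant under Eq. (20). The paper only depicts the exact architecture in its supplement, so the joint prediction head (concatenation followed by a sigmoid unit) and the tower activations are assumptions; only the separate per-type towers, the 50% dropout, the (d(In) + k′)/2 hidden size, and the MSE/Adagrad training follow the text above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_direct_combination(d_in, k_prime):
    # Concatenated pre-trained embeddings for one alpha node and one beta node.
    in_a = layers.Input(shape=(d_in,), name="concat_embeddings_alpha")
    in_b = layers.Input(shape=(d_in,), name="concat_embeddings_beta")

    def tower(x, name):
        x = layers.Dropout(0.5)(x)                                # denoising-style dropout
        x = layers.Dense((d_in + k_prime) // 2, activation="relu")(x)
        return layers.Dense(k_prime, activation="relu", name=name)(x)

    emb_a = tower(in_a, "combined_alpha")                          # type-specific towers
    emb_b = tower(in_b, "combined_beta")

    joint = layers.Concatenate()([emb_a, emb_b])
    pred = layers.Dense(1, activation="sigmoid")(joint)            # link prediction head

    model = tf.keras.Model([in_a, in_b], pred)
    model.compile(optimizer="adagrad", loss="mse")                 # Eq. 20
    return model
```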
3 ALGORITHMIC ANALYSIS
In order to efficiently compute FOBE and HOBE, we collect a fixed number of samples per node for each of the observation functions S. As later explored in Table 4, we find that the performance of our proposed methods does not significantly increase beyond a relatively small, fixed sampling rate s_r, where s_r ≪ |V|. Using this observation, we can efficiently minimize the FOBE and HOBE objective values by approximating the expensive O(|V|^2) set of comparisons ((v_i, v_j) ∈ V × V) with a linear number of samples (specifically O(|V| s_r)). Furthermore, we can estimate the effect of each node's neighborhood in observations S_V and S′_V by following a similar approach. Instead of considering each node's total O(|V|)-sized neighborhood, we can randomly sample s_γ neighboring nodes with replacement. These specifically sampled nodes are recorded during the sampling procedure so that they may be referenced during training. Procedure 1 describes the sampling algorithm formally.

Procedure 1: FOBE/HOBE Sampling. Unobserved values per sample are recorded as either zero or empty.

  function SameTypeSample(v_i, s_r, S)
    v_j ∼ Γ(Γ(v_i))
    Record v_i, v_j, and S(v_i, v_j)

  function DiffTypeSample(v_i, s_r, s_γ, G, S)
    v_j ∼ G(v_i)
    Let γ_α and γ_β be sets of size s_γ sampled with replacement from the neighborhoods Γ(v_i) and Γ(v_j) according to the types of v_i and v_j
    Record v_i, v_j, γ_α, γ_β, and S(v_i, v_j)

  function FobeSampling(G, s_r, s_γ)
    for all v_i ∈ V do
      for s_r samples do
        SameTypeSample(v_i, s_r, S_A)
        SameTypeSample(v_i, s_r, S_B)
        DiffTypeSample(v_i, s_r, s_γ, Γ(·), S_V)

  function HobeSampling(G, s_r, s_γ)
    for all v_i ∈ V do
      for s_r samples do
        SameTypeSample(v_i, s_r, S′_A)
        SameTypeSample(v_i, s_r, S′_B)
        DiffTypeSample(v_i, s_r, s_γ, Γ(Γ(Γ(·))), S′_V)

4 EMPIRICAL EVALUATION
Link Prediction. We evaluate the performance of our proposed embeddings across three link prediction tasks and a range of training-test splits. When removing edges, we visit each in random order and remove it with probability h, provided the removal does not disconnect the graph. This additional check ensures all nodes appear in all experimental embeddings. The result is the subgraph G′ = (V, E′, h). Deleted edges form the positive test-set examples, and we generate a set of negative samples (edges not present in the original graph) of equal size. These samples are used to train three sets of link-prediction models: the A-personalized, the B-personalized (where A and B are the parts of V), and the unified models.

The A-personalized model is a support vector machine trained on the neighborhood of a particular node. A model personalized to i ∈ A learns to identify a region in B-space corresponding to its neighborhood in G′. We use support vector machines with the radial basis kernel (C = 1, γ = 0.1) because we find these models result in robust performance given limited training data, and because the chosen kernel function allows for non-spherical decision boundaries. We additionally generate five negative samples for each positive sample (a neighbor of i in G′). In doing so we evaluate the ability to capture type-specific latent features, as each personalized model only considers one type's embeddings. While the personalized task may not be typical for production link-prediction systems, it is an important measure of the latent features found in each space. In many bipartite applications, such as the six we have selected for evaluation, |A| and |B| may be drastically different. For instance, there are typically more viewers than movies, or more buyers than products. Therefore it becomes important to understand the differences in quality between the latent spaces of each type, which we evaluate through these personalized models.

The unified link-prediction model, in contrast, learns to associate α_i β_j ∈ E′ with a combination of ϵ(α_i) and ϵ(β_j). This model attempts to quantify global trends across embedding spaces. We use a hidden layer of size k with the ReLU activation function, and a single output with the sigmoid activation. We fit this model against mean-squared error using the Adagrad optimizer [7].
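The edge-holdout step just described can be sketched as follows. For brevity this version only checks that neither endpoint becomes isolated (the paper additionally guards against disconnecting the graph), and the data layout (neighbor sets keyed by node name) is an assumption. The removed edges become positive test examples; the surviving graph is then used to train the personalized models, e.g. with scikit-learn's `SVC(kernel="rbf", C=1, gamma=0.1)` for the RBF-kernel SVM described above, and the unified model.

```python
import random

def holdout_edges(neighbors, h, seed=0):
    """Visit edges in random order; remove each with probability h unless the
    removal would leave an endpoint with no remaining neighbors."""
    rng = random.Random(seed)
    edges = sorted({tuple(sorted((a, b))) for a in neighbors for b in neighbors[a]})
    rng.shuffle(edges)
    removed = []
    for a, b in edges:
        if rng.random() < h and len(neighbors[a]) > 1 and len(neighbors[b]) > 1:
            neighbors[a].discard(b)
            neighbors[b].discard(a)
            removed.append((a, b))
    return removed  # positive test-set examples
```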
Datasets. We evaluate each embedding across six datasets. The Amazon, YouTube, DBLP, Friendster, and LiveJournal graphs are all taken from the Stanford Large Network Dataset Collection (SNAP) [15]. We select the distribution of each under the listing "Networks with Ground-Truth Communities." Furthermore, we collect the MadGrades graph from an online source provided by the University of Wisconsin at Madison [1]. This graph consists of teachers and course codes, wherein an edge signifies that teacher α_i has taught course code β_j. We clean this dataset by iteratively deleting any instructor or course with degree 1 until none remain.

Experimental Parameters. We evaluate the performance of our proposed methods, FOBE and HOBE, as well as our two combination approaches: Direct and Auto-Regularized Combination Bipartite Embedding. We compare against all methods described in Section 1.1. Note that we limit our comparison to other embedding-based techniques, as prior work [10] establishes that they considerably outperform alternative heuristic methods. We evaluate each across the six above graphs and nine training-test splits h = 0.1, 0.2, ..., 0.9. For all embeddings we select dimensionality k = 100. For DeepWalk, we select a walk length of 10, a window size of 5, and 100 walks per node. For LINE we apply the model that combines both first- and second-order relationships, selecting 10,000 samples total and 5 negative samples per node. For Node2Vec we select 10 walks per node, a walk length of 7, and a window size of 3. Furthermore, we select default parameters for BiNE and Metapath2Vec++. For the latter, we supply the metapath of alternating A−B−A nodes, the only metapath in our bipartite case. For FOBE and HOBE we generate 200 samples per node, and when sampling neighborhoods we select 5 nodes with replacement upon each observation. After training both methods, we fit the Direct and Auto-Regularized Combination methods, each trained using only the results of FOBE and HOBE.

Recommendation. We follow the procedure originally described by Gao et al. and evaluate our proposed embeddings through the task of recommendation [8]. Recommendation systems propose products to users in order to maximize the overall interaction rate. These systems fit the bipartite graph model because they are defined on the set of user-product interactions. While many such systems could be reformulated as operations on bipartite networks, methods such as matrix factorization and user-user nearest neighbors do not capture granular local features to the same extent as modern graph embeddings [3, 8]. In contrast, bipartite graph embedding provides a framework that often learns richer latent representations for both users and products. These representations can then be used directly through simple similarity measures, or added to existing solution archetypes such as k-nearest neighbors, which often provides significant quality benefits. While there are many similarities between recommendation and link prediction, the key difference is the introduction of weighted connections. As a result, recommendation systems are evaluated based on their ability to rank products in accordance with held-out, user-supplied rankings. This is quantified through a number of metrics defined on the top-k system-supplied recommendations for each user. When using embeddings to make a comparison, Gao et al. rank products by their embedding's dot product with a given user. However, our proposed methods relax the constraint that products and users be directly comparable. As a result, when ranking products for a particular user with our proposed embeddings, we must first define a product-space representation. For each user we collect the set of known product ratings, and calculate a product centroid weighted by those ratings.
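A minimal sketch of the product-space ranking step just described: the user is represented by the rating-weighted centroid of products they have already rated, and candidate products are ranked by dot product with that centroid. The function and variable names are illustrative, not taken from the released code.

```python
import numpy as np

def rank_products_for_user(user_ratings, product_emb, candidates, top_k=10):
    """user_ratings: dict product -> rating; product_emb: dict product -> vector.
    Rank candidates by dot product with the rating-weighted product centroid."""
    rated = list(user_ratings)
    weights = np.array([user_ratings[p] for p in rated], dtype=float)
    centroid = np.average([product_emb[p] for p in rated], axis=0, weights=weights)
    scored = sorted(candidates, key=lambda p: product_emb[p] @ centroid, reverse=True)
    return scored[:top_k]
```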
Experimental Procedure. We present a comparison between our proposed methods and all previously discussed embeddings across the DBLP (https://github.com/clhchtcjj/BiNE/tree/master/data/dblp) and LastFM (https://grouplens.org/datasets/hetrec-2011/) datasets. Note that this distribution of DBLP is the bipartite graph of authors and venues, and is different from the community-based version distributed by SNAP. The LastFM dataset consists of listeners and musicians, where an edge indicates listen count, which we log-scale to improve convergence for all methods. We start by splitting each rating set into training and test sets with a 40% holdout. In the case of DBLP we use the same split as Gao et al. We use embeddings from the training bipartite graph to perform link prediction. We then compare the ranked list of training-set recommendations for each user, truncated to 10 items, to the test-set rankings. We calculate 128-dimensional embeddings for each method, and report F1, Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR).
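For reference, the sketch below shows standard formulations of two of the reported ranking metrics (binary-relevance NDCG@k and MRR). The exact evaluation scripts, e.g. those distributed with BiNE, may differ in details such as graded relevance from the held-out ratings.

```python
import math

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG@k for one user: DCG of the top-k ranked items
    normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant item in the ranked list."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0
```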
5 SIGNIFICANCE AND IMPACT
In contrast to what is typically claimed in papers, we observe that the link prediction data (Table 1) demonstrates that different graphs lead to very different performance results for the existing state-of-the-art and proposed embeddings. Moreover, their behavior changes across holdouts as the size of the training set shrinks. For instance, our methods are above the state of the art on the YouTube and MadGrades graphs, but Metapath2Vec++, Node2Vec, and LINE each have scenarios wherein they outperform the field. Additionally, while there are scenarios where the combination methods perform as expected, such as in the YouTube, MadGrades, and DBLP B-personalized cases, we observe that variability in the other proposed embeddings can disrupt this performance gain.

When comparing the A- and B-personalized results, it is important to keep in mind that for all considered graphs there are more A nodes (|A| > |B|), and therefore these nodes tend to have fewer neighbors (E[|Γ(α)|] < E[|Γ(β)|]). For this reason, we find that different embedding methods can exhibit significantly different behavior across the two personalized tasks. Intuitively, performing well on the A-personalized set indicates an ability to extrapolate connections between elements with significantly sparser attachments, such as selecting a new movie given a viewer's limited history. In contrast, performance on the B-personalized set indicates an ability to uncover trends among relatively larger sets of connections, such as determining what patterns are common across all the viewers of a particular movie. While these two tasks are certainly related, we observe that the B-personalized evaluation appears to be significantly more challenging for a number of embedding methods, such as Node2Vec on LiveJournal and YouTube. In contrast, HOBE succeeds in this evaluation for both of those cases, as well as for Friendster and MadGrades. Metapath2Vec++ additionally is superior on LiveJournal and Friendster, but falls behind on DBLP, MadGrades, and YouTube.

[Table 1: Link Prediction Accuracy vs. Training-Test Ratio. One panel per graph (Amazon, DBLP, MadGrades, LiveJournal, Friendster, YouTube), each reporting A-personalized, B-personalized, and unified accuracy. Dashed lines indicate prior work (DeepWalk, LINE, Node2Vec, Metapath2Vec++, BiNE), while solid lines indicate methods proposed here (FOBE, HOBE, D.Comb., A.R.Comb.). Plot values are not recoverable from the extracted text.]

In the recommendation results (Tables 2 and 3), our methods improve upon the state of the art. This is likely due to the behavior of aggregate neighborhood-based comparisons present within FOBE and HOBE, which has the effect of grouping clusters of nodes within one type's embedding space. Our biggest increase is in MRR for DBLP, indicating that the first few suggestions from our embeddings are often more relevant. The performance of HOBE demonstrates the ability of algebraic distance to estimate useful local similarity measures. Interestingly, in the LastFM dataset, FOBE outperforms HOBE. One reason for this is that LastFM contains significantly more artists per user than DBLP contains venues per author. As a result, the amount of information present when estimating algebraic similarities differs across datasets, and is insufficient to boost HOBE above FOBE.

Table 2: DBLP Recommendation. Note: result numbers from prior works are reproduced from [8].
  Metric@10 | DeepWalk | LINE  | Node2Vec | MP2V++ | BiNE  | FOBE  | HOBE  | D.Comb. | A.R.Comb.
  F1        | .0850    | .0899 | .0854    | .0865  | .1137 | .1108 | .1003 | .0753   | .0667
  NDCG      | .2414    | .1441 | .2389    | .2514  | .2619 | .3771 | .4054 | .2973   | .2359
  MAP       | .1971    | .0962 | .1944    | .1906  | .2047 | .2382 | .3156 | .2362   | .1730
  MRR       | .3153    | .1713 | .3111    | .3197  | .3336 | .4491 | .6276 | .5996   | .5080

Table 3: LastFM Recommendations.
  Metric@10 | DeepWalk | LINE  | Node2Vec | MP2V++ | BiNE  | FOBE  | HOBE  | D.Comb. | A.R.Comb.
  F1        | .0027    | .0067 | .0279    | .0024  | .0227 | .0729 | .0195 | .0243   | .0388
  NDCG      | .0153    | .0435 | .1261    | .0153  | .1551 | .3085 | .1352 | .1285   | .1927
  MAP       | .0069    | .0229 | .0645    | .0088  | .0982 | .1997 | .0789 | .0795   | .1249
  MRR       | .1844    | .2477 | .2047    | .2677  | .3539 | .3778 | .3400 | .3520   | .3915

When looking at both the link prediction and recommendation tasks, we observe highly variable performance of the combination methods. In some cases, such as the MadGrades and YouTube link prediction tasks, as well as the LastFM recommendation task, these combinations are capable of learning a joint representation from FOBE and HOBE that improves overall performance. However, in other cases, such as the Amazon link prediction task, the combination method appears to significantly decrease performance. This effect is due to the increased number of hyperparameters introduced by the combination approach, which are determined not by the complexity of a given dataset, but instead by the number and size of input embeddings. In the Amazon dataset, these free parameters lead to overfitting in the combination embeddings.

6 SENSITIVITY STUDY
We select the MadGrades network to demonstrate how our proposed methods are affected by the sampling rate. We run ten trials for each experimental sampling rate, consisting of powers of 2 from 1 to 1024. Each trial represents an independent 50% holdout experiment. We present the minimum, mean, and maximum observed link prediction accuracy. Continuing the comparison of FOBE and HOBE, it would appear that higher-order sampling is often able to produce better results, but that the algebraic distance heuristic introduces added variability that occasionally reduces overall performance. In some applications this variability appears manageable, as seen in our DBLP recommendation results. However, in the case of link prediction on Amazon communities, it caused an unintentional drop while FOBE remained more consistent. Overall, FOBE and HOBE are fast methods that broaden the array of embedding techniques available for bipartite graphs. While no method is clearly superior in every case, there exist a range of graphs and applications that are better suited by these methods.
Looking at the sensitivity study (Table 4), we see that the variability of HOBE is significantly larger for small sampling rates. However, we observe that after approximately 32 samples per node, in the case of MadGrades, this effect is reduced. Still, considering that FOBE does not exhibit this same quality, it is likely the variability of the algebraic similarity measure that ultimately leads to otherwise unexpected reductions in HOBE's performance.

[Table 4: Link Prediction Accuracy vs. Sampling Rate. Depicts the effect of increasing s_r from 2 to 1024 on the MadGrades dataset, running 10 trials of the 50% holdout experiment per value of s_r. Panels report per-A, per-B, and unified accuracy for FOBE and HOBE, with min, mean, and max curves. Plot values are not recoverable from the extracted text.]

7 CONCLUSIONS
In this work we present FOBE and HOBE, two strategies for modeling bipartite networks that are designed to capture type-specific structural properties. FOBE, which captures first-order relationships, samples nodes in small local neighborhoods. HOBE, in contrast, captures higher-order relationships that are prioritized by a heuristic signal provided by algebraic distance on graphs. In addition, we present two variants of an approach to learn joint representations that are designed to identify a "best of both worlds" embedding. We evaluate these methods against the state of the art via a set of link prediction and recommendation tasks.

The most novel component of FOBE and HOBE is that these methods do not encode cross-typed relationships through a linear transformation, but instead model these relationships through the aggregate behavior of node neighborhoods. For this reason, we find that our proposed methods perform well in the context of recommendation, where identifying local clusters of similar nodes is important (see an example of a partitioning application in [25]). In the case of link prediction, where the goal is to identify specific attachments between two particular nodes, we find that the methods perform at a level similar to those considered in the benchmark, and only exceed the state of the art on particular graphs. While our personalized classification tasks demonstrate the ability of FOBE and HOBE to capture type-specific latent features, additional work is necessary to study the specific qualities these methods encode.

ACKNOWLEDGMENTS
This work was supported by NSF awards MRI #1725573 and NRT #1633608.

REFERENCES
[1] MadGrades - UW Madison Grade Distributions. https://madgrades.com. Accessed: 2018-10-25.
[2] Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo. 2011. Network medicine: a network-based approach to human disease. Nature Reviews Genetics 12, 1 (2011), 56.
[3] Jesús Bobadilla, Fernando Ortega, Antonio Hernando, and Abraham Gutiérrez. 2013. Recommender systems survey. Knowledge-Based Systems 46 (2013), 109–132.
[4] Achi Brandt, James J. Brannick, Karsten Kahl, and Irene Livshits. 2011. Bootstrap AMG. SIAM Journal on Scientific Computing 33, 2 (2011), 612–632.
[5] Jie Chen and Ilya Safro. 2011. Algebraic distance on graphs. SIAM Journal on Scientific Computing 33, 6 (2011), 3468–3490.
[6] Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 135–144.
[7] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (2011), 2121–2159.
[8] Ming Gao, Leihui Chen, Xiangnan He, and Aoying Zhou. 2018. BiNE: Bipartite Network Embedding. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '18). ACM, New York, NY, USA, 715–724. https://doi.org/10.1145/3209978.3209987
[9] Palash Goyal and Emilio Ferrara. 2018. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems 151 (2018), 78–94.
[10] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 855–864.
[11] Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297–304.
[12] Emmanuel John and Ilya Safro. 2016. Single- and multi-level network sparsification by algebraic distance. Journal of Complex Networks 5, 3 (2016), 352–388.
[13] Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM) 46, 5 (1999), 604–632.
[14] Solomon Kullback and Richard A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (1951), 79–86.
[15] Jure Leskovec and Andrej Krevl. 2015. SNAP Datasets: Stanford Large Network Dataset Collection.
[16] Sven Leyffer and Ilya Safro. 2013. Fast response to infection spread and cyber attacks on large-scale networks. Journal of Complex Networks 1, 2 (2013), 183–199.
[17] Oren E. Livne and Achi Brandt. 2012. Lean algebraic multigrid (LAMG): Fast graph Laplacian linear solver. SIAM Journal on Scientific Computing 34, 4 (2012), B499–B522.
[18] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[19] Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. 2019. Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs. In International Conference on Learning Representations. https://openreview.net/forum?id=BJluy2RcFm
[20] Uwe Naumann and Olaf Schenk. 2012. Combinatorial Scientific Computing. CRC Press.
[21] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 701–710.
[22] Dorit Ron, Ilya Safro, and Achi Brandt. 2011. Relaxation-based coarsening and multiscale graph organization. Multiscale Modeling & Simulation 9, 1 (2011), 407–423.
[23] Ruslan Shaydulin, Jie Chen, and Ilya Safro. 2019. Relaxation-Based Coarsening for Multilevel Hypergraph Partitioning. SIAM Multiscale Modeling and Simulation 17, 1 (2019), 482–506.
[24] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[25] Justin Sybrandt, Ruslan Shaydulin, and Ilya Safro. 2019. Hypergraph Partitioning With Embeddings. arXiv preprint arXiv:1909.04016 (2019).
[26] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1067–1077.
[27] Anton Tsitsulin, Davide Mottin, Panagiotis Karras, and Emmanuel Müller. 2018. VERSE: Versatile Graph Embeddings from Similarity Measures. In Proceedings of the 2018 World Wide Web Conference (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 539–548. https://doi.org/10.1145/3178876.3186120
[28] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning. ACM, 1096–1103.
[29] Muhammed A. Yıldırım, Kwang-Il Goh, Michael E. Cusick, Albert-László Barabási, and Marc Vidal. 2007. Drug–target network. Nature Biotechnology 25, 10 (2007), 1119.
[30] Chenzi Zhang, Shuguang Hu, Zhihao Gavin Tang, and T-H. Hubert Chan. 2017. Re-revisiting Learning on Hypergraphs: Confidence Interval and Subgradient Method. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 70. PMLR, 4026–4034. http://proceedings.mlr.press/v70/zhang17d.html