
FGMAC: Frequent subGraph Mining with Arc Consistency

2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), 2011


FGMAC: Frequent subGraph Mining with Arc Consistency

Brahim Douar and Michel Liquiere
LIRMM, 161 rue Ada, 34392 Montpellier, France
{douar,liquiere}@lirmm.fr

Chiraz Latiri and Yahya Slimani
URPAH Team, Faculty of Sciences of Tunis
[email protected], [email protected]

Abstract—With the rapidly growing need to analyze large amounts of structured data such as chemical compounds, protein structures, and XML documents, to cite but a few, graph mining has become an attractive track and a real challenge in the data mining field. Among the various kinds of graph patterns, frequent subgraphs seem to be relevant in characterizing graph sets, discriminating different groups of sets, and classifying and clustering graphs. Because of the NP-completeness of the subgraph isomorphism test as well as the huge search space, fragment miners are exponential in runtime and/or memory consumption. In this paper we study a new polynomial projection operator named AC-projection, based on a key technique of constraint programming, namely Arc Consistency (AC). It is intended to replace the use of the exponential subgraph isomorphism test. We study the relevance of frequent AC-reduced graph patterns for classification, and we show that an important performance gain can be achieved with no, or a non-significant, loss in the quality of the discovered patterns.

Index Terms—Graph mining; AC-projection; Graph classification

I. INTRODUCTION

Faced with the urgent need to analyze large amounts of structured data such as chemical compounds, protein structures, and XML documents, graph mining has become a compelling issue in the data mining field. Discovering frequent subgraphs, i.e., subgraphs which occur frequently enough over the entire set of graphs, is a real challenge due to their exponential number: by the APRIORI principle [1], a frequent n-edge graph may contain up to 2^n frequent subgraphs. This raises a serious problem related to the exponential search space as well as to the counting of complete sub-patterns, all the more since the kernel operation of frequent subgraph mining, the subgraph isomorphism test, has been proved NP-complete [2].

In this paper, we study an innovative projection operator intended to replace the costly subgraph isomorphism. In the second section we give a brief literature review of the subgraph mining field. Then we present the AC-projection operator initially introduced in [3], together with its very interesting properties, and we propose an efficient graph mining algorithm using it. Finally, we study the relevance of the AC-reduced patterns for supervised graph classification.

II. FREQUENT SUBGRAPH MINING

Given a database consisting of small graphs, for example molecular graphs, the problem of mining frequent subgraphs is to find all subgraphs that are subgraph isomorphic with a large number of example graphs in the database. In this section we define preliminary concepts and briefly review the literature related to frequent subgraph mining.

A. Preliminary Concepts

Definition II.1 (Labeled Graph) A labeled graph can be represented by a 4-tuple G = (V, A, L, l), where
• V is a set of vertices,
• A ⊆ V × V is a set of edges,
• L is a set of labels,
• l : V ∪ A → L is a function assigning labels to the vertices and the edges.
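To make the later algorithmic discussion concrete, here is a minimal Python encoding of Definition II.1. It is an illustrative helper of our own (the paper prescribes no particular data structure), and the sketches in the following sections reuse it.

    class LabeledGraph:
        """Labeled graph G = (V, A, L, l) of Definition II.1 (illustrative)."""
        def __init__(self, vlabel, arcs, alabel=None):
            self.vlabel = vlabel        # dict: vertex -> label (l on V)
            self.arcs = set(arcs)       # set of (x, y) pairs, A a subset of V x V
            self.alabel = alabel or {}  # dict: (x, y) -> label (l on A)

        def vertices(self):
            return set(self.vlabel)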
Definition II.2 (Isomorphism, Subgraph Isomorphism) Given two graphs G1 and G2, an isomorphism is a bijective function f : V(G1) → V(G2) such that ∀x ∈ V(G1), l(x) = l(f(x)), and ∀(x, y) ∈ A(G1), (f(x), f(y)) ∈ A(G2) and l(x, y) = l(f(x), f(y)). A subgraph isomorphism from G1 to G2 is an isomorphism from G1 to a subgraph of G2.

Definition II.3 (Frequent Subgraph Mining) Given a graph dataset GS = {Gi | i = 0, ..., n} and a minimal support minSup, let

ς(g, G) = 1 if there is a projection from g to G, and 0 otherwise;
σ(g, GS) = Σ_{Gi ∈ GS} ς(g, Gi).

σ(g, GS) denotes the occurrence frequency of g in GS, i.e., the support of g in GS. Frequent subgraph mining is the task of finding every graph g such that σ(g, GS) is greater than or equal to minSup. Known frequent subgraph miners are based on this definition and deal with the special case where the projection operator is subgraph isomorphism.

B. Related Works

Algorithms for frequent subgraph mining are based on two pattern-discovery paradigms, namely breadth-first search and depth-first search. They aim to find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. Most of these algorithms use different methods for determining the number of edge-disjoint embeddings of a subgraph, and employ different ways of generating candidates and counting supports. An interesting quantitative comparison of the most cited subgraph miners is given in [4]. The novel graph mining approach that we present in this paper builds on a breadth-first approach intensively cited in the literature, named FSG [5], which the following section presents.

C. The FSG Algorithm

Principal breadth-first approaches take advantage of the APRIORI [1] levelwise approach, and the FSG algorithm finds frequent subgraphs using the same level-by-level expansion. FSG uses a sparse graph representation which minimizes both storage and computation, and it increases the size of frequent subgraphs by adding one edge at a time, which allows candidates to be generated efficiently. Various optimizations have been proposed for candidate generation and counting which allow it to scale to large graph databases. For problems in which a moderately large number of different types of vertices and edges exist, FSG is able to achieve good performance and to scale linearly with the database size. For problems where the number of edge and vertex labels is small, its performance is worse, as the exponential complexity of graph isomorphism dominates the overall cost. In this paper, we are particularly interested in the FSG algorithm: we propose a basic subgraph mining approach which is a modified FSG version using a novel operator for the support counting process.

D. Critical Discussion

Developing algorithms that discover all frequently occurring subgraphs in a large graph database is computationally expensive, as graph and subgraph isomorphism tests play a key role throughout the computation. Since subgraph isomorphism testing is a hard problem, fragment miners are exponential in runtime. Many frequent subgraph miners have tried to avoid the NP-completeness of the subgraph isomorphism problem by storing all embeddings in embedding lists, which map the vertices and edges of a fragment to the corresponding vertices and edges of the graph it occurs in. With this trick, excessive subgraph isomorphism tests can be avoided when counting fragment supports, and thereby exponential runtime. However, these approaches face exponential memory consumption instead: they merely trade time against storage, and can cause problems if not enough memory is available or if the memory throughput is not high enough. The authors in [4], after an extensive experimental study of different subgraph miners, conclude that embedding lists do not considerably speed up the search for frequent fragments; even though GSPAN [6] does not use them, it is competitive with GASTON [7] and FFSM [8], at least for not too big fragments. So it seems that a better way to avoid exponential runtime and/or memory consumption is to use another projection operator instead of subgraph isomorphism, one with polynomial time complexity as well as polynomial memory consumption. In [3], the author introduced an interesting projection operator named AC-projection, which has good properties and ensures polynomial time and memory consumption. The forthcoming sections present this operator with its many interesting properties and give an optimized algorithm for computing it.

III. AC-PROJECTION

The approach suggested in [3] advocates a projection operator based on the arc consistency algorithm. This projection method has the required properties: polynomiality, local validation, parallelization, and structural interpretation.

A. AC-projection And Arc Consistency

Definition III.1 (Labeling) Let G1 and G2 be two graphs. We call labeling from G1 into G2 a mapping I : V(G1) → 2^V(G2) such that ∀x ∈ V(G1), ∀y ∈ I(x), l(x) = l(y). Thus, for a vertex x ∈ V(G1), I(x) is a set of vertices of G2 with the same label l(x): the set of "possible images" of the vertex x in G2. This first labeling is trivial but can be refined using the neighborhood relations between vertices.

Definition III.2 (AC-compatible ⌣) Let G be a graph and V1 ⊆ V(G), V2 ⊆ V(G). V1 is AC-compatible with V2, noted V1 ⌣ V2, iff
1) ∀xk ∈ V1, ∃yp ∈ V2 such that (xk, yp) ∈ A(G);
2) ∀yq ∈ V2, ∃xm ∈ V1 such that (xm, yq) ∈ A(G).

Definition III.3 (Consistency for one arc) Let G1 and G2 be two graphs. A labeling I from G1 into G2 is consistent with an arc (x, y) ∈ A(G1) iff I(x) ⌣ I(y).

Definition III.4 (AC-labeling) Let G1 and G2 be two graphs. A labeling I from G1 into G2 is an AC-labeling iff I is consistent with all the arcs e ∈ A(G1).

Definition III.5 (AC-projection ⇁) Let G1 and G2 be two graphs. An AC-labeling I from G1 into G2 is an AC-projection iff for every AC-labeling I′ from G1 into G2 and ∀x ∈ V(G1), I′(x) ⊆ I(x). We note G1 ⇁ G2.
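As a minimal sketch of Definitions III.2 to III.4 (hypothetical helper names, building on the LabeledGraph class introduced earlier), the arc-consistency condition on a labeling can be checked as follows.

    def ac_compatible(v1, v2, g):
        # Definition III.2: every vertex of v1 has an out-neighbor in v2,
        # and every vertex of v2 has an in-neighbor in v1, via arcs of g.
        return (all(any((x, y) in g.arcs for y in v2) for x in v1)
                and all(any((x, y) in g.arcs for x in v1) for y in v2))

    def is_ac_labeling(I, g1, g2):
        # Definitions III.3 and III.4: I is an AC-labeling iff it is
        # consistent with every arc (x, y) of g1, i.e. I(x) and I(y) are
        # AC-compatible in g2.
        return all(ac_compatible(I[x], I[y], g2) for (x, y) in g1.arcs)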
B. AC-projection: Improved Algorithm

We give an improved AC-projection algorithm for graphs, based on the AC3 algorithm [9]. The AC-projection algorithm takes two graphs G1 and G2 and tests whether there is an AC-projection from G1 into G2 (see Algorithm 1). It begins by creating a first rough labeling I, and reduces, for each vertex x, the list I(x) to a consistent list using the function ReviseArc. The consistency check fails if some I(x) becomes empty; otherwise it succeeds and the algorithm returns the labeling I, which is an AC-projection G1 ⇁ G2. Like the AC3 algorithm, the AC-projection algorithm has a worst-case time complexity of O(e × d³) and a space complexity of O(e), where e is the number of arcs and d is the size of the largest domain. In our case, the size of the largest domain is the size of the largest subset of vertices with the same label.

Algorithm 1: AC-projection
Input: Two graphs G1 and G2
Output: An AC-projection I from G1 into G2 if there is one, otherwise an empty set

Function ReviseArc
  Input: A graph G2, a labeling I from G1 into G2, an arc (x, y) ∈ A(G1)
  Output: A new labeling I′ from G1 into G2
  I′ ← I;
  I′(x) ← I(x) \ {x′ ∈ V(G2) | ∄ y′ ∈ I(y) with (x′, y′) ∈ A(G2)};
  I′(y) ← I(y) \ {y′ ∈ V(G2) | ∄ x′ ∈ I(x) with (x′, y′) ∈ A(G2)};

// Initialisation
foreach x ∈ V(G1) do I(x) ← {y ∈ V(G2) | l(x) = l(y)};
S ← A(G1); P ← ∅;
while S ≠ ∅ do
  Choose an arc (x, y) from S;  // in general the first element of S
  I′ ← ReviseArc((x, y), I, G2);
  // If for one vertex x ∈ V(G1) we have I′(x) = ∅, then there is no arc consistency.
  if (I′(x) = ∅) or (I′(y) = ∅) then return ∅;
  // I′ is now consistent with the arc (x, y), but it can be inconsistent with some
  // previously tested arcs, so we have to verify and restore (if necessary) the
  // consistency of all these arcs.
  if I(x) ≠ I′(x) then
    R ← {(x′, y′) ∈ P | x′ = x or y′ = x};
    S ← S ∪ R; P ← P \ R;
  if I(y) ≠ I′(y) then
    R ← {(x′, y′) ∈ P | x′ = y or y′ = y};
    S ← S ∪ R; P ← P \ R;
  S ← S \ {(x, y)}; P ← P ∪ {(x, y)};
  I ← I′;
return I′;
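The following Python sketch transcribes Algorithm 1 under the illustrative representation above (hypothetical names; it favors readability over the bookkeeping that gives AC3 its O(e × d³) bound).

    def ac_projection(g1, g2):
        # Initialisation: I(x) = all vertices of g2 carrying the label of x.
        I = {x: {y for y in g2.vertices() if g2.vlabel[y] == g1.vlabel[x]}
             for x in g1.vertices()}
        if any(not dom for dom in I.values()):
            return None              # guard for vertices with no possible image
        S = set(g1.arcs)             # arcs still to (re)check
        P = set()                    # arcs currently known consistent
        while S:
            (x, y) = S.pop()
            # ReviseArc: keep only images having a partner image across the arc.
            Ix = {xp for xp in I[x] if any((xp, yp) in g2.arcs for yp in I[y])}
            Iy = {yp for yp in I[y] if any((xp, yp) in g2.arcs for xp in I[x])}
            if not Ix or not Iy:
                return None          # some I(v) emptied: no AC-projection
            # A shrunken I(v) can invalidate already-checked arcs touching v.
            for v, new in ((x, Ix), (y, Iy)):
                if I[v] != new:
                    I[v] = new
                    touched = {a for a in P if v in a}
                    S |= touched
                    P -= touched
            P.add((x, y))
        return I                     # the AC-projection g1 into g2

On this reading, ac_projection(g1, g2) returning a labeling corresponds to G1 ⇁ G2, so support counting amounts to counting the transactions for which the call does not return None.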
C. AC-projection And Reduction

The following definition introduces an equivalence relation between graphs w.r.t. AC-projection.

Definition III.6 (AC-equivalent graphs) Two graphs G1 and G2 are AC-equivalent iff both G1 ⇁ G2 and G2 ⇁ G1 hold. We note G1 ⇌ G2.

We thus have an equivalence relation between graphs using the AC-projection. In this section we study the properties of this operation and search for a reduced element in an equivalence class of graphs. This element will be the unique representative of its equivalence class, to which we give the name "AC-reduced graph".

Fig. 1. AC-equivalent graphs and the associated AC-reduced one (extreme right).

1) Auto AC-projection And AC-reduction: We study the auto AC-projection operation (G ⇁ G), which we will use to find the minimal graph of an equivalence class of graphs; we prove in the following that the obtained graph is minimal.

Proposition III.7 Given an AC-projection I : G ⇁ G′, x′ ∈ I(x) iff for each tree T(VT, AT) (with VT the set of vertices of T and AT its set of arcs) and each t ∈ VT: if there is a morphism from T to G which associates t to x, then there is a morphism from T to G′ which associates t to x′ [10].

Proposition III.8 (Order relation on I) For an AC-projection I : G ⇁ G, if xi ∈ I(x) then I(xi) ⊆ I(x).

Proof: If xi ∈ I(x), then for every tree T having a morphism into G which associates t to x, there is also a morphism into G which associates t to xi (Proposition III.7). We call T(x,t) this set of trees. Let us see whether we can have xj ∈ I(xi) and xj ∉ I(x). If xj ∈ I(xi) then, again by Proposition III.7, every tree of T(x,t), which associates t to xi, also admits a morphism associating t to xj; hence xj ∈ I(x). We conclude that we cannot have xj ∉ I(x) when xj ∈ I(xi), so I(xi) ⊆ I(x).

Proposition III.9 Given a graph G, an AC-projection I : G ⇁ G, and a vertex x ∈ V(G) with |I(x)| > 1: if xi ∈ I(x), the graph G′ formed by merging x and xi is AC-equivalent to G.

Proof: To prove that G ⇌ G′ we have to prove both G ⇁ G′ and G′ ⇁ G. Since G′ ⇁ G by construction, we only have to prove G ⇁ G′. We construct this AC-projection by replacing x by xi in the auto AC-projection G ⇁ G; since I(xi) ⊆ I(x), this is indeed an AC-projection. We conclude that G ⇁ G′.

Now we want to find the smallest element of an equivalence class of graphs. For two AC-equivalent graphs G and G′, we consider that G < G′ iff |V(G)| < |V(G′)|.

Proposition III.10 (Minimality) A graph G is minimal in its equivalence class iff for the AC-projection I : G ⇁ G, ∀x ∈ V(G), I(x) = {x}.

Proof: By Proposition III.9, if there were a vertex x such that |I(x)| > 1, then we would be able to do another reduction. Now, can a graph G′ = G \ x be AC-equivalent to G? If this were true, then we would have an AC-projection from G to G′; x would then have another image x′ in G′, so x′ would be in I(x), which contradicts the initial hypothesis.

2) AC-reduce Algorithm: The AC-reduce algorithm is based on the properties given above, which allow the AC-reduced graph of any graph G to be constructed: we simply perform an auto AC-projection G ⇁ G and then make the necessary merges. This algorithm is very simple and has polynomial complexity, since the AC-projection's complexity is polynomial.

Algorithm 3: AC-reduce
Input: A graph G
Output: G′, the AC-reduced version of G
G′ ← G;
I ← AC-projection(G, G);
Q ← V(G);
Sort Q such that x comes before y if |I(x)| < |I(y)|;
foreach v in Q do
  foreach i in I(v) do
    if i ≠ v then
      N(v) ← N(v) ∪ N(i);  // if v and i are neighbors, this creates a reflexive arc
      Q ← Q \ {i};
      V(G′) ← V(G′) \ {i};
return G′;
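Below is a sketch of Algorithm 3 under the same illustrative representation (edge labels are ignored for brevity; the merge follows Proposition III.9, and the graph is modified in place).

    def ac_reduce(g):
        I = ac_projection(g, g)      # auto AC-projection: always succeeds
        # Process vertices by increasing |I(v)|, as in Algorithm 3.
        queue = sorted(g.vertices(), key=lambda v: len(I[v]))
        removed = set()
        for v in queue:
            if v in removed:
                continue
            for i in I[v] - {v} - removed:
                # Merge i into v: v inherits the arcs of i (a reflexive arc
                # appears if v and i were neighbors, as noted in Algorithm 3).
                for (a, b) in list(g.arcs):
                    if i in (a, b):
                        g.arcs.discard((a, b))
                        g.arcs.add((v if a == i else a, v if b == i else b))
                removed.add(i)
                del g.vlabel[i]
        return g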
IV. FGMAC: FREQUENT SUBGRAPH MINING WITH ARC CONSISTENCY

In this section we present FGMAC, a modified version of the FSG algorithm [5] based on the AC-projection operator. In this version we have changed the support counting part: instead of subgraph isomorphism, AC-projection is used to verify whether a candidate graph appears in a transaction or not.

A. The Algorithm

The FGMAC algorithm initially enumerates all the frequent single- and double-edge graphs. Then, based on those two sets, it starts the main computational loop. During each iteration it first generates candidate subgraphs whose size is one edge greater than that of the previous frequent ones (Algorithm 4, line 5). Next, it counts the frequency of each of these candidates and prunes the subgraphs that do not satisfy the support constraint (Algorithm 4, lines 6-11). Discovered frequent subgraphs satisfy the downward closure property of the support condition, which allows us to effectively prune the lattice of frequent subgraphs. FGMAC's particularity is to return only frequent AC-reduced graphs (Algorithm 4, line 11), which form a subset of the whole frequent isomorphic pattern set. In the following we present the three key steps of the FGMAC main process.

Algorithm 4: FGMAC
Input: A graph dataset D, minimal support σ
Output: The set F of frequent subgraphs
F1 ← detect all frequent (1-edge)-subgraphs in D;
F2 ← detect all frequent (2-edge)-subgraphs in D;
k ← 3;
while Fk−1 ≠ ∅ do
  Ck ← fsg-gen(Fk−1);
  foreach candidate gk ∈ Ck do
    gk.count ← 0;
    foreach transaction t ∈ D do
      if gk ⇁ t then gk.count ← gk.count + 1;
  Fk ← {AC-reduce(gk) | gk ∈ Ck and gk.count ≥ σ|D|};
  k ← k + 1;
return F;

B. Candidate Generation

This step is carried out by the same fsg-gen function (see Algorithm 4, line 5) used in the FSG algorithm. This function uses a precise joining operator (fsg-join) which generates (k+1)-edge subgraphs by joining two frequent k-edge subgraphs. In order for two such frequent k-edge subgraphs to be eligible for joining, they must contain the same (k−1)-edge subgraph, named their core. The complete description of these functions as well as their detailed algorithms is given in [11].

C. Support Calculation

The key operator of this step is the AC-projection previously described: to verify whether a pattern appears in a transaction or not, FGMAC computes in polynomial time whether there is an AC-projection of the pattern into each of the transactions. In order to optimize this support calculation phase, the algorithm associates with each graph g of size k the set E(g) of transactions such that for each graph G ∈ E(g), g ⇁ G. Given the graph g1 ∪ g2 representing the union of the two graphs g1 and g2, the intersection E(g1) ∩ E(g2) is then calculated. As E(g1 ∪ g2) ⊆ E(g1) ∩ E(g2), it is possible to eliminate the candidate g1 ∪ g2 if the number of transactions in E(g1) ∩ E(g2) is low enough to make g1 ∪ g2 infrequent. On the other hand, an AC-projection of g1 ∪ g2 need only be searched for in the transactions of E(g1) ∩ E(g2).

D. Frequent AC-reduction

This step takes place at the end of each iteration of the main loop of the algorithm. It avoids the extraction of non-AC-reduced frequent graphs, keeping only the representative elements of the graph equivalence classes w.r.t. AC-equivalence. This process is based on the AC-reduce function described previously; we note that this step takes advantage of the polynomial complexity of the AC-reduction algorithm.
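The Section IV-C optimisation can be sketched as follows (hypothetical names; fsg-gen/fsg-join from [11] are assumed given). Each frequent pattern g carries its transaction set E(g); a joined candidate is tested only against the intersection of its parents' sets, and is discarded without any AC-projection test when that intersection is already too small.

    def candidate_support(candidate, e_g1, e_g2, dataset, min_count):
        # E(g1 U g2) is a subset of E(g1) & E(g2): restrict the search.
        possible = e_g1 & e_g2
        if len(possible) < min_count:
            return None              # cannot reach the minimal support: prune
        tids = {t for t in possible
                if ac_projection(candidate, dataset[t]) is not None}
        return tids if len(tids) >= min_count else None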
V. EXPERIMENTS AND COMPARATIVE STUDY

In order to prove the usefulness of the AC-projection for graph mining, we present in the following an experimental study of the FGMAC algorithm. We stress that the set of frequent AC-reduced graphs found by FGMAC is not exhaustive w.r.t. isomorphic patterns. So, in the following, we present a quantitative study of FGMAC's performance, followed by a qualitative evaluation of the AC-reduced patterns which consists in measuring their discriminative power within a supervised graph classification process.

A. Datasets

We carried out performance and classification experiments on five biological activity datasets widely cited in the literature (see Table I). These datasets can be divided into two groups:
• The Predictive Toxicology Challenge (PTC) [12] contains a set of chemical compounds classified according to their toxicity in male rats (PTC-MR), female rats (PTC-FR), male mice (PTC-MM), and female mice (PTC-FM).
• The Human Intestinal Absorption (HIA) dataset [13] contains chemical compounds classified by intestinal absorption activity.

TABLE I
CLASSIFICATION DATASETS STATISTICS

Dataset   Transactions   Distinct labels    Edges / transaction    Vertices / transaction
                         Edge    Vertex     Average    Max         Average    Max
HIA       86             3       8          24         44          22         40
PTC-FM    349            3       19         26         108         25         109
PTC-FR    351            3       20         27         108         26         109
PTC-MM    336            3       21         25         108         25         109
PTC-MR    344            3       19         26         108         26         109

B. Performance Point Of View

In this subsection we present a quantitative study of the computational performance of FGMAC compared to FSG. The results depicted in Figure 2 clearly show that FGMAC outperforms FSG in runtime for all the minimal supports selected, and they confirm the theoretical results on the polynomiality of the AC-projection operator compared to the exponential complexity of the subgraph isomorphism adopted by FSG. In the following section, we study the frequent AC-reduced patterns from a qualitative point of view.

Fig. 2. Runtime comparison of FGMAC versus FSG on the two datasets HIA and PTC-FM.

C. Qualitative Point Of View: Graph Classification

Graph classification is a supervised learning problem in which the goal is to categorize an entire graph as a positive or negative instance of a concept. Feature mining on graphs is usually performed by finding all frequent or informative substructures in the graph instances. These substructures are used to transform the graph data into a single-table representation, and traditional classifiers are then used to classify the instances. The aim of using graph classification in this paper is to evaluate the quality and discriminative power of frequent AC-reduced subgraph patterns, and to compare them with isomorphic frequent subgraphs. We carried out classification experiments on the five biological activity datasets and measured classifier prediction accuracy using the well-known C4.5 decision tree classifier [14]. The classification methods are described in more detail in the following subsections, along with the associated results.

1) Methods: We evaluated the classification accuracy using four different feature sets. The first feature set (Frequent) consists of all frequent subgraphs. Those subgraphs are mined using the FSG software [5] with different minimal supports. Each chemical compound is represented by a binary vector whose length equals the number of mined subgraphs. Each subgraph is mapped to a specific vector index; if a chemical compound contains a subgraph, the bit at the corresponding index is set to one, otherwise it is set to zero. The second feature set (Closed) is simply a subset of the first set: it consists of only the closed frequent subgraphs.
Those subgraphs are also mined using FSG, with a special parameter (-x) so as to keep only the closed frequent subgraphs. The third feature set (AC-reduced) contains the FGMAC output, which consists of only the AC-reduced frequent subgraphs. We represent each chemical compound by a binary vector whose length equals the number of AC-reduced mined subgraphs. Each AC-reduced subgraph is mapped to a specific vector index; if there is an AC-projection from the AC-reduced subgraph to the chemical compound, the bit at the corresponding index is set to one, otherwise it is set to zero. Finally, the fourth feature set (Closed AC-reduced) is similar to the third one, the difference being that we only consider closed AC-reduced frequent subgraphs, obtained with a special parameter passed to FGMAC.

2) Results: All classifications have been done with the Weka data mining software package [15], and we report the prediction accuracy over 10 cross-validation trials. In the following we analyze the AC-reduced patterns from quantitative and qualitative points of view.

a) Patterns Count: According to the results shown in Figure 3, for all datasets we obtain very few AC-reduced frequent patterns compared to the isomorphic ones: on average 35% fewer patterns. This ratio is larger for lower supports and can reach up to 70% for the HIA dataset with a minimal support of 10%. These experimental results confirm that the search space for extracting AC-reduced patterns is smaller than the one for classical isomorphic subgraphs. So an algorithm which looks for all AC-reduced frequent subgraphs benefits from the polynomiality of the projection operation as well as from a smaller search space (i.e., fewer AC-projection tests).

Fig. 3. Comparison of the number of patterns of the different feature sets (Frequent and Closed) for the PTC-FM and HIA datasets.

b) Classification Relevance: Since the number of frequent subgraph patterns decreases drastically after the AC-reduction process, one naturally wonders about the relevance of these few patterns for supervised graph classification. We have therefore conducted classification accuracy experiments using AC-reduced and isomorphic patterns in order to compare them. As shown in Figure 4, for the average over all datasets and all classifiers, the percentage of correctly classified (PCC) instances is almost the same for every minimal support, and the same holds for the other datasets individually. Taking a more in-depth look at the results, we see that, for some datasets and minimal support values, we even obtain a better PCC with the AC-reduced feature set. This is due to the better generalization power of the AC-reduction process, which helps supervised classifiers avoid over-fitting.

Fig. 4. Comparison of the classification accuracy (PCC) of the different feature sets (Frequent and Closed) for all datasets (average), PTC-FM and HIA.
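The binary encoding described in the Methods subsection above can be sketched as follows (illustrative names; for the Frequent feature sets the occurrence test would be subgraph isomorphism, for the AC-reduced ones the AC-projection of the previous sections).

    def feature_vectors(patterns, compounds, occurs):
        # occurs(p, g) -> bool: does pattern p occur in compound g?
        # Bit j of a compound's vector is 1 iff pattern j occurs in it.
        return [[1 if occurs(p, g) else 0 for p in patterns]
                for g in compounds]

    # e.g., for the AC-reduced feature sets:
    # vectors = feature_vectors(patterns, compounds,
    #                           lambda p, g: ac_projection(p, g) is not None)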
VI. CONCLUSION AND FUTURE WORK

In this paper, we have studied the use of a new polynomial projection operator named AC-projection, initially introduced in [3] and based on a key technique of constraint programming, namely Arc Consistency (AC). We have shown that using the AC-projection and its properties yields fewer patterns than all frequent or closed subgraphs, but with a very comparable quality and discriminative power. AC-projection is intended to replace the use of the exponential subgraph isomorphism, as well as to reduce the search space when seeking frequent subgraphs.

As a near-term perspective, we are working on a depth-first frequent subgraph mining approach based on the AC-projection operator. Given a graph dataset, this novel approach will be able to look for all frequent AC-reduced patterns with a reduced search space.

REFERENCES

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994, pp. 478-499.
[2] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1979.
[3] M. Liquiere, "Arc consistency projection: A new generalization relation for graphs," in ICCS, ser. LNCS, U. Priss, S. Polovina, and R. Hill, Eds., vol. 4604. Springer, 2007, pp. 333-346.
[4] M. Wörlein, T. Meinl, I. Fischer, and M. Philippsen, "A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston," in European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ser. LNCS, vol. 3721. Springer, 2005, pp. 392-403.
[5] M. Kuramochi and G. Karypis, "Frequent subgraph discovery," in International Conference on Data Mining, N. Cercone, T. Y. Lin, and X. Wu, Eds. IEEE Computer Society, 2001, pp. 313-320.
[6] X. Yan and J. Han, "gSpan: Graph-based substructure pattern mining," in International Conference on Data Mining. IEEE Computer Society, 2002, pp. 721-724.
[7] S. Nijssen and J. N. Kok, "The Gaston tool for frequent subgraph mining," in International Workshop on Graph-Based Tools (GraBaTs). Electronic Notes in Theoretical Computer Science, 2004, pp. 77-87.
[8] J. Huan, W. Wang, and J. Prins, "Efficient mining of frequent subgraphs in the presence of isomorphism," in International Conference on Data Mining. IEEE Computer Society, 2003, p. 549.
[9] A. K. Mackworth, "Consistency in networks of relations," Artificial Intelligence, vol. 8, no. 1, pp. 99-118, 1977.
[10] P. Hell and J. Nešetřil, Graphs and Homomorphisms, ser. Oxford Lecture Series in Mathematics and its Applications, vol. 28. Oxford: Oxford University Press, 2004.
[11] M. Kuramochi and G. Karypis, "An efficient algorithm for discovering frequent subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, pp. 1038-1051, 2004.
[12] C. Helma, R. D. King, S. Kramer, and A. Srinivasan, "The Predictive Toxicology Challenge 2000-2001," Bioinformatics, vol. 17, no. 1, pp. 107-108, 2001.
[13] M. D. Wessel, P. C. Jurs, J. W. Tolan, and S. M. Muskal, "Prediction of human intestinal absorption of drug compounds from molecular structure," Journal of Chemical Information and Computer Sciences, vol. 38, no. 4, pp. 726-735, 1998.
[14] J. R. Quinlan, C4.5: Programs for Machine Learning, 1st ed. Morgan Kaufmann, January 1993.
[15] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005.
[16] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[17] B. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, 1991.
[18] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: John Wiley and Sons, 1973.

APPENDIX: FULL GRAPH CLASSIFICATION RESULTS (%)

TABLE II
PCC RESULTS FOR ALL DATASETS, ALL CLASSIFIERS AND DIFFERENT MINIMAL SUPPORTS
Each group of columns reports the number of patterns (#) and the PCC of each classifier.

Minimal support = 10%
          Frequent                        Closed                         AC-reduced                     Closed AC-reduced
Dataset   #     SVM   NN    NB    C4.5  | #    SVM   NN    NB    C4.5  | #     SVM   NN    NB    C4.5  | #    SVM   NN    NB    C4.5
HIA       1964  60,69 61,67 56,11 44,86 | 216  66,67 57,22 52,5  51,53 | 467   54,03 52,5  49,31 54,72 | 118  60,97 57,64 54,72 55,97
PTC-FM    2492  56,48 60,17 51,31 58,18 | 285  62,77 63,6  51,88 62,73 | 1271  59,34 60,48 53    61,87 | 225  59,06 61,63 51,87 63,04
PTC-FR    2749  64,96 62,98 54,13 64,39 | 336  65,53 61,83 54,98 61,25 | 1347  62,67 61,25 58,1  63,25 | 245  66,95 63,26 59,83 63,23
PTC-MM    2472  64,29 59,48 46,43 61,04 | 261  64,02 61,35 56,6  63,18 | 1270  59,55 59,8  46,16 60,17 | 212  63,73 63,44 59,27 62,58
PTC-MR    2665  63,03 56,95 54,39 57,24 | 345  59,27 53,43 58,16 57,51 | 1346  62,74 56,36 54,66 55,51 | 262  61,61 52,25 57,28 56,95

Minimal support = 20%
          Frequent                        Closed                         AC-reduced                     Closed AC-reduced
Dataset   #     SVM   NN    NB    C4.5  | #    SVM   NN    NB    C4.5  | #     SVM   NN    NB    C4.5  | #    SVM   NN    NB    C4.5
HIA       336   52,5  56,81 49,03 51,94 | 71   54,58 61,81 51,11 57,92 | 119   56,11 53,89 46,81 46,94 | 47   52,36 54,44 47,78 56,94
PTC-FM    631   61,92 57,29 52,15 59,02 | 103  59,34 57,56 49,01 57,02 | 408   60,46 59,88 55,3  57,3  | 86   55,05 58,17 48,72 50,98
PTC-FR    694   64,94 63,56 51,56 60,12 | 102  63,24 61,83 51,85 64,95 | 445   60,42 61,53 56,96 57,29 | 89   61,8  60,39 54,96 63,52
PTC-MM    634   62,78 58,3  49,11 65,77 | 99   61,33 58,34 47,92 56,85 | 416   55,98 55,29 49,67 64,01 | 83   58,98 53,87 48,81 57,12
PTC-MR    652   64,23 59,29 50,26 57,29 | 99   65,37 56,07 54,07 58,43 | 418   56,04 56,05 52,32 54,93 | 85   59,82 54,9  52,86 58,99

Minimal support = 30%
          Frequent                        Closed                         AC-reduced                     Closed AC-reduced
Dataset   #     SVM   NN    NB    C4.5  | #    SVM   NN    NB    C4.5  | #     SVM   NN    NB    C4.5  | #    SVM   NN    NB    C4.5
HIA       152   50,14 44,17 50,28 47,78 | 25   49,31 52,5  46,53 47,78 | 71    58,19 47,64 46,67 43,19 | 18   52,78 52,64 51,25 60,69
PTC-FM    214   55,87 58,15 55,61 56,18 | 25   55,28 57,59 51,86 51,31 | 149   59,61 57,89 56,18 58,48 | 20   55,56 60,47 53,6  53,01
PTC-FR    240   56,97 61,26 46,15 58,7  | 31   59,82 61,81 54,13 60,1  | 166   56,13 57,27 49,56 57,56 | 26   57,54 61,82 55,83 61,25
PTC-MM    221   58,9  55,08 53,86 56    | 32   53,85 52,16 50,35 53,57 | 158   53,81 53,3  53,56 55,67 | 28   52,99 50,3  47,9  54,75
PTC-MR    234   59,33 59,27 53,43 58,14 | 36   62,49 58,14 54,92 59,01 | 164   52,29 55,45 49,99 51,71 | 31   60,46 55,19 54,34 59,6

Minimal support = 40%
          Frequent                        Closed                         AC-reduced                     Closed AC-reduced
Dataset   #     SVM   NN    NB    C4.5  | #    SVM   NN    NB    C4.5  | #     SVM   NN    NB    C4.5  | #    SVM   NN    NB    C4.5
HIA       89    46,94 33,33 52,78 44,58 | 16   58,47 50,69 49,86 50,14 | 38    48,89 47,22 48,75 46,25 | 12   55,97 54,58 47,36 55
PTC-FM    102   54,73 53,6  55,88 55,3  | 9    57    52,13 59,03 58,17 | 92    50,97 59,04 55,03 62,49 | 8    54,73 55    59,03 57,59
PTC-FR    104   58,41 58,12 51,56 61,86 | 10   54,98 60,11 64,96 65,53 | 92    56,43 59,53 50,99 63,85 | 8    50,14 61,83 65,53 65,53
PTC-MM    99    54,79 59,55 54,45 61,34 | 9    58,04 58,66 62,78 55,94 | 93    52,99 56,87 53,85 60,42 | 8    53,54 60,12 63,07 57,13
PTC-MR    103   56,43 53,76 52,87 56,7  | 9    52,91 55,5  56,39 57,26 | 92    54,07 54,92 53,17 53,76 | 8    52,88 56,73 57,84 53,51

Minimal support = 50%
          Frequent                        Closed                         AC-reduced                     Closed AC-reduced
Dataset   #     SVM   NN    NB    C4.5  | #    SVM   NN    NB    C4.5  | #     SVM   NN    NB    C4.5  | #    SVM   NN    NB    C4.5
HIA       52    41,11 50,28 57,36 44,31 | 11   55,14 51,25 53,75 46,53 | 20    48,75 54,72 54,58 48,89 | 9    54,86 51,25 55    45,14
PTC-FM    65    54,16 53,92 54,74 62,2  | 9    54,16 61,33 55,88 61,61 | 61    47,02 58,19 55,31 61,34 | 8    50,16 59,03 55,31 61,61
PTC-FR    66    56,41 62,67 55,85 63,56 | 9    47,83 62,11 65,25 65,53 | 62    54,13 61,53 55,56 63,56 | 8    51,01 63,8  65,53 65,53
PTC-MM    65    58,59 58,94 53,54 58,92 | 8    51,46 57,74 60,11 61,91 | 61    52,38 55,66 53,54 58,91 | 7    52,64 61,88 59,21 60,1
PTC-MR    62    51,46 52,31 54,31 55,22 | 11   52,87 56,14 56,4  59,33 | 58    50,91 56,97 53,45 54,38 | 10   50,85 58,16 56,95 60,18

• SVM: Support Vector Machine [16]
• NN: Nearest Neighbors [17]
• NB: Naive Bayesian [18]
• C4.5: Decision Trees [14]