FGMAC: Frequent subGraph Mining with Arc Consistency

Brahim Douar and Michel Liquiere
LIRMM, 161 rue Ada, 34392 Montpellier, France
{douar,liquiere}@lirmm.fr

Chiraz Latiri and Yahya Slimani
URPAH Team, Faculty of Sciences of Tunis
[email protected], [email protected]
Abstract—With the important growth of requirements to analyze large amounts of structured data such as chemical compounds, protein structures and XML documents, to cite but a few, graph mining has become an attractive track and a real challenge in the data mining field. Among the various kinds of graph patterns, frequent subgraphs seem to be relevant for characterizing graph sets, discriminating different groups of sets, and classifying and clustering graphs. Because of the NP-completeness of the subgraph isomorphism test as well as the huge search space, fragment miners are exponential in runtime and/or memory consumption. In this paper we study a new polynomial projection operator named AC-projection, based on a key technique of constraint programming, namely Arc Consistency (AC). It is intended to replace the use of the exponential subgraph isomorphism. We study the relevance of frequent AC-reduced graph patterns for classification and we show that we can achieve an important performance gain with no or non-significant loss in the quality of the discovered patterns.
Index Terms—Graph mining; AC-projection; Graph classification;
I. INTRODUCTION
In front of the urgent need to analyze large amounts of structured data such as chemical compounds, protein structures and XML documents, to cite but a few, graph mining has become a compelling issue in the data mining field. Indeed, discovering frequent subgraphs, i.e., subgraphs which occur frequently enough over the entire set of graphs, is a real challenge due to their exponential number: based on the APRIORI principle [1], a frequent n-edge graph may contain 2^n frequent subgraphs. This raises a serious problem related to the exponential search space as well as the counting of complete sub-patterns, while the kernel operation of frequent subgraph mining is the subgraph isomorphism test, which has been proved to be NP-complete [2].
In this paper, we study an innovative projection operator intended to replace the costly subgraph isomorphism. In the second section we give a brief literature review of the subgraph mining field. Then, we present the AC-projection operator initially introduced in [3], as well as its interesting properties, and we propose an efficient graph mining algorithm using the AC-projection operator. Finally, we study the relevance of the AC-reduced patterns for supervised graph classification.
II. FREQUENT SUBGRAPH MINING
978-1-4244-9925-0/11/$26.00 ©2011 IEEE

Given a database consisting of small graphs, for example molecular graphs, the problem of mining frequent subgraphs is to find all subgraphs that are subgraph isomorphic to a large number of example graphs in the database. In this section we define preliminary concepts and give a brief review of the literature related to frequent subgraph mining.
A. Preliminary Concepts
Definition II.1 (Labeled Graph) A labeled graph can be represented by a 4-tuple G = (V, A, L, l), where
• V is a set of vertices,
• A ⊆ V × V is a set of edges,
• L is a set of labels,
• l : V ∪ A → L is a function assigning labels to the vertices and the edges.
Definition II.2 (Isomorphism, Subgraph Isomorphism) Given
two graphs G1 and G2 , an isomorphism is a bijective function
f : V (G1 ) → V (G2 ), such that
∀x ∈ V (G1 ), l(x) = l(f (x)), and
∀(x, y) ∈ A(G1 ), (f (x), f (y)) ∈ A(G2 ) and l(x, y) =
l(f (x), f (y)).
A subgraph isomorphism from G1 to G2 is an isomorphism
from G1 to a subgraph of G2 .
Definition II.3 (Frequent Subgraph Mining) Given a graph dataset GS = {Gi | i = 0, ..., n} and a minimal support (minSup), let

ς(g, G) = 1 if there is a projection from g to G, and 0 otherwise;

σ(g, GS) = Σ_{Gi ∈ GS} ς(g, Gi).

σ(g, GS) denotes the occurrence frequency of g in GS, i.e., the support of g in GS. Frequent subgraph mining is to find every graph g such that σ(g, GS) is greater than or equal to minSup.
Known frequent subgraph miners are based on this definition and deal with the special case where the projection operator is subgraph isomorphism.
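The support definition above can be sketched in code. This is a minimal illustration, not the paper's implementation; `projects` is a hypothetical pluggable predicate standing for whichever projection operator is used (subgraph isomorphism in classical miners, AC-projection in FGMAC):

```python
def support(g, gs, projects):
    """sigma(g, GS): number of graphs in gs that pattern g projects into.

    `projects(g, gi)` is an illustrative parameter: any boolean
    projection test (subgraph isomorphism, AC-projection, ...).
    """
    # sigma(g, GS) = sum over Gi in GS of varsigma(g, Gi)
    return sum(1 for gi in gs if projects(g, gi))

def is_frequent(g, gs, projects, min_sup):
    # g is frequent iff its support reaches the minimal support threshold.
    return support(g, gs, projects) >= min_sup
```

For example, with strings as toy "graphs" and substring containment as the projection test, `support('ab', ['ab', 'abc', 'xyz'], lambda g, G: g in G)` is 2.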
B. Related Works
Algorithms for frequent subgraph mining are based on two pattern discovery paradigms, namely breadth-first search and depth-first search. They aim to find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. Most of these algorithms use different methods for determining the number of edge-disjoint embeddings of a subgraph and employ different ways of generating candidates and counting support. An interesting quantitative comparison of the most cited subgraph miners is given in [4].
The novel graph mining approach that we present in this paper is based on a breadth-first approach intensively cited in the literature. The following section is intended to present this approach, named FSG [5].
C. The FSG Algorithm

Principal breadth-first approaches take advantage of the APRIORI [1] levelwise approach. The FSG algorithm finds frequent subgraphs using the same level-by-level expansion. FSG uses a sparse graph representation which minimizes both storage and computation, and it increases the size of frequent subgraphs by adding one edge at a time, allowing candidates to be generated efficiently. Various optimizations have been proposed for candidate generation and counting which allow it to scale to large graph databases. For problems in which a moderately large number of different types of vertices and edges exist, FSG was able to achieve good performance and to scale linearly with the database size. For problems where the number of edge and vertex labels was small, the performance of FSG was worse, as the exponential complexity of graph isomorphism dominates the overall performance.
In this paper, we are particularly interested in the FSG algorithm. Indeed, we propose a basic subgraph mining approach which is a modified FSG version that uses a novel operator for the support counting process.

D. Critical Discussion

Developing algorithms that discover all frequently occurring subgraphs in a large graph database is computationally intensive, as graph and subgraph isomorphisms play a key role throughout the computations. Since subgraph isomorphism testing is a hard problem, fragment miners are exponential in runtime. Many frequent subgraph miners have tried to avoid the NP-completeness of the subgraph isomorphism problem by storing all embeddings in embedding lists, which consist of a mapping of the vertices and edges of a fragment to the corresponding vertices and edges in the graph it occurs in. It is clear that with this trick we can avoid excessive subgraph isomorphism tests when counting fragment support and thereby avoid exponential runtime. However, these approaches face exponential memory consumption instead. So they are only trading time for storage, and can cause problems if not enough memory is available or if the memory throughput is not high enough.
The authors in [4], after an extensive experimental study of different subgraph miners, state that embedding lists do not considerably speed up the search for frequent fragments: even though gSpan [6] does not use them, it is competitive with Gaston [7] and FFSM [8], at least for not too big fragments. So, it seems that a better way to avoid exponential runtime and/or memory consumption is to use another projection operator instead of subgraph isomorphism. This projection has to have polynomial complexity as well as polynomial memory consumption.
In [3], the author introduced an interesting projection operator named AC-projection which has good properties and ensures polynomial time and memory consumption. The forthcoming sections present this operator with its many interesting properties and show an optimized algorithm for computing it.

III. AC-PROJECTION

The approach suggested in [3] advocates a projection operator based on the arc consistency algorithm. This projection method has the required properties: polynomiality, local validation, parallelization, structural interpretation.
A. AC-projection And Arc Consistency
Definition III.1 (Labeling) Let G1 and G2 be two graphs. We call a labeling from G1 into G2 a mapping I : V(G1) → 2^V(G2) such that ∀x ∈ V(G1), ∀y ∈ I(x), l(x) = l(y).
Thus, for a vertex x ∈ V(G1), I(x) is a set of vertices of G2 with the same label l(x). We can say that I(x) is the set of "possible images" of the vertex x in G2.
This first labeling is trivial but can be refined using the neighborhood relations between vertices.
Definition III.2 (AC-compatibility ⌣) Let G be a graph, V1 ⊆ V(G), V2 ⊆ V(G).
V1 is AC-compatible with V2 iff
1) ∀xk ∈ V1, ∃yp ∈ V2 such that (xk, yp) ∈ A(G), and
2) ∀yq ∈ V2, ∃xm ∈ V1 such that (xm, yq) ∈ A(G).
We note V1 ⌣ V2.

Definition III.3 (Consistency for one arc) Let G1 and G2 be two graphs. We say that a labeling I from G1 into G2 is consistent with an arc (x, y) ∈ A(G1) iff I(x) ⌣ I(y).
Definition III.4 (AC-labeling) Let G1 and G2 be two graphs.
A labeling I from G1 into G2 is an AC-labeling iff I is
consistent with all the arcs e ∈ A(G1 ).
Definition III.5 (AC-projection ⇁) Let G1 and G2 be two graphs. An AC-labeling I from G1 into G2 is an AC-projection iff for every AC-labeling I′ from G1 into G2 and ∀x ∈ V(G1), I′(x) ⊆ I(x). We note it G1 ⇁ G2.
Algorithm 1: AC-projection
Input : Two graphs G1 and G2
Output: An AC-projection I from G1 into G2 if one exists, otherwise the empty set

Function ReviseArc
Input : A graph G2, a labeling I from G1 into G2, an arc (x, y) ∈ A(G1)
Output: A new labeling I′ from G1 into G2
  I′ ← I;
  I′(x) ← I(x) \ {x′ ∈ V(G2) | ∄ y′ ∈ I(y) with (x′, y′) ∈ A(G2)};
  I′(y) ← I(y) \ {y′ ∈ V(G2) | ∄ x′ ∈ I(x) with (x′, y′) ∈ A(G2)};

// Initialisation
foreach x ∈ V(G1) do
  I(x) ← {y ∈ V(G2) | l(x) = l(y)};
S ← A(G1);
P ← ∅;
while S ≠ ∅ do
  Choose an arc (x, y) from S; // in general the first element of S
  I′ ← ReviseArc((x, y), I, G2);
  // If for one vertex x ∈ V(G1) we have I′(x) = ∅ then there is no arc consistency
  if (I′(x) = ∅) or (I′(y) = ∅) then
    return ∅;
  // I′ is now consistent with the arc (x, y); but it can be non-consistent with
  // some other previously tested arcs, so we have to verify and restore (if
  // necessary) the consistency of all these arcs.
  if I(x) ≠ I′(x) then
    R ← {(x′, y′) ∈ P | x′ = x or y′ = x};
    S ← S ∪ R;
    P ← P \ R;
  if I(y) ≠ I′(y) then
    R ← {(x′, y′) ∈ P | x′ = y or y′ = y};
    S ← S ∪ R;
    P ← P \ R;
  S ← S \ {(x, y)};
  P ← P ∪ {(x, y)};
  I ← I′;
return I;
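Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation; the dict-based graph representation (`nodes` mapping vertices to labels, `arcs` as a set of directed vertex pairs) is an assumption made for the example, and re-enqueuing is simplified with respect to the P/S bookkeeping above:

```python
from collections import deque

def ac_projection(g1, g2):
    """AC3-style arc-consistency projection test (sketch).

    Graphs are dicts: g['nodes'] maps vertex -> label and
    g['arcs'] is a set of (u, v) tuples (illustrative encoding).
    Returns the labeling I (vertex of g1 -> set of candidate
    vertices of g2), or None if some domain becomes empty.
    """
    # Initial rough labeling: candidates are the vertices sharing the label.
    I = {x: {y for y, ly in g2['nodes'].items() if ly == g1['nodes'][x]}
         for x in g1['nodes']}
    queue = deque(g1['arcs'])
    while queue:
        x, y = queue.popleft()
        # Revise: keep only candidates supported along this arc in g2.
        new_x = {xp for xp in I[x]
                 if any((xp, yp) in g2['arcs'] for yp in I[y])}
        new_y = {yp for yp in I[y]
                 if any((xp, yp) in g2['arcs'] for xp in I[x])}
        if not new_x or not new_y:
            return None  # no arc consistency: g1 does not AC-project into g2
        # Re-enqueue arcs touching a vertex whose domain shrank.
        for v, new in ((x, new_x), (y, new_y)):
            if new != I[v]:
                I[v] = new
                queue.extend(a for a in g1['arcs']
                             if v in a and a != (x, y))
    return I
```

On a single-arc pattern A→B and a target containing the arc (A, B) plus an isolated A vertex, the isolated vertex is pruned from the candidate set, illustrating the refinement by neighborhood.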
B. AC-projection: Improved Algorithm

We give an improved AC-projection algorithm for graphs (based on the AC3 algorithm [9]). The AC-projection algorithm takes two graphs G1 and G2 and tests if there is an AC-projection from G1 into G2 (see Algorithm 1). It begins by creating a first rough labeling I and reduces, for each vertex x, the given lists I(x) to consistent lists using the function ReviseArc. The consistency check fails if some I(x) becomes empty; otherwise the consistency check succeeds and the algorithm gives the labeling I, which is an AC-projection G1 ⇁ G2. Like the AC3 algorithm, the AC-projection algorithm has a worst-case time complexity of O(e × d³) and space complexity of O(e), where e is the number of arcs and d is the size of the largest domain. In our case, the size of the largest domain is the size of the largest subset of nodes with the same label.

C. AC-projection And Reduction

The following definition introduces an equivalence relation between graphs w.r.t. AC-projection.

Definition III.6 (AC-equivalent graphs) Two graphs G1 and G2 are AC-equivalent iff both G1 ⇁ G2 and G2 ⇁ G1 hold. We note G1 ⇌ G2.

We thus have an equivalence relation between graphs using the AC-projection. In this paragraph we study the properties of this operation and search for a reduced element in an equivalence class of graphs. This element will be the unique representative of this equivalence class, which we call the "AC-reduced graph".

Fig. 1. AC-equivalent graphs and the associated AC-reduced one (extreme right)

1) Auto AC-projection And AC-reduction: We study the auto AC-projection operation (G ⇁ G), which we will use to find the minimal graph of an equivalence class of graphs, and we will prove in the following that the obtained graph is minimal.
Proposition III.7 Given an AC-projection I : G ⇁ G′, x′ ∈ I(x) iff for each tree T(VT, AT) (with VT the set of vertices of T and AT its set of arcs) and each t ∈ VT we have:
if there is a morphism from T to G which associates t to x, then there is a morphism from T to G′ which associates t to x′. [10]
Proposition III.8 (Order relation on I) For an AC-projection I : G ⇁ G, if xi ∈ I(x) then I(xi) ⊆ I(x).

Proof: If xi ∈ I(x), then for every tree T having a morphism in G which associates t to x, there is also a morphism from T in G which associates t to xi (Proposition III.7). We call T(x, t) this set of trees.
Let us now see whether we can have xj ∈ I(xi) and xj ∉ I(x). For xj ∈ I(xi), according to Proposition III.7, all trees from T(x, t) associate t to the vertex xi. Since xj ∈ I(xi), the same holds for xj, so xj ∈ I(x).
We conclude that we cannot have xj ∉ I(x) when xj ∈ I(xi), so I(xi) ⊆ I(x).
Proposition III.9 Given a graph G, an AC-projection I : G ⇁ G, and a vertex x ∈ V(G) with |I(x)| > 1: if xi ∈ I(x), the graph G′ formed by merging x and xi is AC-equivalent to G.

Proof: To prove that G ⇌ G′ we have to prove that G ⇁ G′ and G′ ⇁ G.
Since G′ ⇁ G by construction, we only have to prove that G ⇁ G′. We construct this AC-projection by replacing x by xi in the auto AC-projection G ⇁ G. Since I(xi) ⊆ I(x) (Proposition III.8), this indeed yields an AC-projection. We conclude that G ⇁ G′.
Now, we want to find the smallest element of the equivalence class of graphs. For two AC-equivalent graphs G and
G′ , we will consider that G < G′ iff |V (G)| < |V (G′ )|.
Proposition III.10 (Minimality) A graph G is minimal in its equivalence class iff for the AC-projection I : G ⇁ G, ∀x ∈ V(G), I(x) = {x}.

Proof: According to Proposition III.9, if there were a vertex x such that |I(x)| > 1, then we would be able to do another reduction.
Now, the question is: can a graph G′ = G \ x be AC-equivalent to G? If this were true, then we would have an AC-projection from G to G′, meaning that x in G has another image x′ in G′. So x′ must be in I(x), which contradicts the initial hypothesis.
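Proposition III.10 yields a direct minimality test: a graph is AC-reduced exactly when its auto AC-projection maps every vertex onto itself alone. A one-line sketch (the dict encoding of the labeling is an assumption for the example):

```python
def is_ac_reduced(labeling):
    """Minimality check from Proposition III.10 (sketch).

    `labeling` is the auto AC-projection I : G -> G, given as a
    dict mapping each vertex to its set of candidate images.
    The graph is AC-reduced iff every vertex maps only to itself.
    """
    return all(images == {x} for x, images in labeling.items())
```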
IV. FGMAC: FREQUENT SUBGRAPH MINING WITH ARC CONSISTENCY
In this section, we present FGMAC, a modified version of
the FSG algorithm [5] based on the AC-projection operator. In
fact, in this version we have changed the support counting part.
Instead of subgraph isomorphism, the AC-projection is used
to verify whether a candidate graph appears in a transaction
or not.
A. The Algorithm

The FGMAC algorithm initially enumerates all the frequent single and double edge graphs. Then, based on those two sets, it starts the main computational loop. During each iteration it first generates candidate subgraphs whose size is greater than the previous frequent ones by one edge (Algorithm 4, line 5). Next, it counts the frequency of each of these candidates, and prunes subgraphs that do not satisfy the support constraint (Algorithm 4, lines 6-11). Discovered frequent subgraphs satisfy the downward closure property of the support condition, which allows us to effectively prune the lattice of frequent subgraphs.
FGMAC's particularity is to return only frequent AC-reduced graphs (Algorithm 4, line 11), which form a subset of the whole set of frequent isomorphic patterns.
In the following we present the three key steps of the FGMAC main process.
Algorithm 4: FGMAC
Input : A graph dataset D, minimal support σ
Output: The set F of frequent subgraphs
F1 ← detect all frequent 1-edge subgraphs in D;
F2 ← detect all frequent 2-edge subgraphs in D;
k ← 3;
while F^(k-1) ≠ ∅ do
  C^k ← fsg-gen(F^(k-1));
  foreach candidate g^k ∈ C^k do
    g^k.count ← 0;
    foreach transaction t ∈ D do
      if g^k ⇁ t then
        g^k.count ← g^k.count + 1;
  F^k ← {AC-reduce(g^k) | g^k ∈ C^k, g^k.count ≥ σ|D|};
  k ← k + 1;
return F;

Algorithm 3: AC-reduce
Input : A graph G
Output: G′ = AC-reduced G
G′ ← G;
I ← AC-projection(G, G);
Q ← V(G);
Sort Q such that x comes before y if |I(x)| < |I(y)|;
foreach v in Q do
  foreach i in I(v) do
    if i ≠ v then
      N(v) ← N(v) ∪ N(i); // if v and i are neighbors, this creates a reflexive arc
      Q ← Q \ {i};
      V(G′) ← V(G′) \ {i};
return G′;
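Algorithm 3 can be sketched as follows. This is a minimal illustration under assumptions: the auto AC-projection labeling `I` is taken as precomputed input (e.g. from an AC-projection(G, G) routine), and the graph is encoded as a label dict plus an arc set introduced for the example:

```python
def ac_reduce(nodes, arcs, I):
    """Merge-based AC-reduction sketch (Proposition III.9).

    nodes: dict vertex -> label; arcs: set of (u, v) pairs;
    I: auto AC-projection labeling (vertex -> set of vertices).
    Each vertex i found in the candidate set of a surviving
    vertex v is merged into v, and its arcs are redirected.
    """
    nodes = dict(nodes)
    arcs = set(arcs)
    # Process vertices with the smallest candidate sets first.
    queue = sorted(nodes, key=lambda v: len(I[v]))
    alive = set(nodes)
    for v in queue:
        if v not in alive:
            continue
        for i in I[v]:
            if i != v and i in alive:
                # Redirect i's arcs onto v, then delete i
                # (may create a reflexive arc if v and i are neighbors).
                arcs = {(v if a == i else a, v if b == i else b)
                        for (a, b) in arcs}
                alive.discard(i)
                del nodes[i]
    arcs = {(a, b) for (a, b) in arcs if a in alive and b in alive}
    return nodes, arcs
```

For instance, two same-labeled vertices that both point to the same target are AC-equivalent and collapse into one, shrinking the graph while preserving its AC-equivalence class.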
2) AC-reduce Algorithm: The AC-reduce algorithm is based on the properties given in the section above; these properties allow us to construct the AC-reduced graph for any graph G. To do this, we simply perform an auto AC-projection G ⇁ G and then make the necessary merges. This algorithm is thus very simple and has polynomial complexity, since the AC-projection's complexity is polynomial.

B. Candidate Generation

This step is ensured by the same fsg-gen function (see Algorithm 4, line 5) used in the FSG algorithm. This function uses a precise joining operator (fsg-join) which generates (k+1)-edge subgraphs by joining two frequent k-edge subgraphs. In order for two such frequent k-edge subgraphs to be eligible for joining, they must contain the same (k-1)-edge subgraph, named the core. The complete description of these functions as well as their detailed algorithms is given in [11].
TABLE I
CLASSIFICATION DATASETS STATISTICS

Datasets | Transactions | Distinct labels (Edges / Vertices) | Edges/Transaction (Avg / Max) | Vertices/Transaction (Avg / Max)
HIA      |  86 | 3 /  8 | 24 /  44 | 22 /  40
PTC-FM   | 349 | 3 / 19 | 26 / 108 | 25 / 109
PTC-FR   | 351 | 3 / 20 | 27 / 108 | 26 / 109
PTC-MM   | 336 | 3 / 21 | 25 / 108 | 25 / 109
PTC-MR   | 344 | 3 / 19 | 26 / 108 | 26 / 109

Fig. 2. Runtime comparison of FGMAC versus FSG with the two datasets HIA and PTC-FM
C. Support Calculation

The key operator in this step is the previously described AC-projection. To verify whether a pattern appears in a transaction or not, FGMAC calculates in polynomial time whether there is an AC-projection of the pattern into each one of the transactions. In order to optimize this support calculation phase, the algorithm associates to each graph g of size k the set E(g) of transactions such that for each graph G ∈ E(g), g ⇁ G. Given the graph g1 ∪ g2 representing the union of the two graphs g1 and g2, the intersection of the two sets E(g1) ∩ E(g2) is then calculated.
Since E(g1 ∪ g2) ⊆ E(g1) ∩ E(g2), it is possible to eliminate the graph g1 ∪ g2 if the transaction count of E(g1) ∩ E(g2) is low enough to make g1 ∪ g2 infrequent. On the other hand, an AC-projection of g1 ∪ g2 need only be searched for in the transactions of E(g1) ∩ E(g2).
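This occurrence-list pruning can be sketched as follows (a minimal illustration; the function and parameter names are assumptions made for the example):

```python
def prune_with_occurrence_lists(e_g1, e_g2, min_count):
    """Occurrence-list pruning for a join candidate (sketch).

    e_g1, e_g2: sets of transaction ids where g1 (resp. g2)
    AC-projects.  Since E(g1 U g2) is a subset of E(g1) & E(g2),
    the candidate can be discarded without any projection test
    when the intersection is already too small.
    """
    common = e_g1 & e_g2
    if len(common) < min_count:
        return None   # candidate cannot be frequent: skip all tests
    return common     # only these transactions need an AC-projection test
```

This reflects the two uses described above: early elimination of infrequent candidates, and restriction of the projection tests to the intersected transaction list.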
D. Frequent AC-reduction

This step runs at the end of each iteration of the main loop of the algorithm. It is intended to avoid the extraction of non-AC-reduced frequent graphs, keeping only the representative elements of graph equivalence classes w.r.t. AC-equivalence. This process is based on the AC-reduce function described previously. We note that this step takes advantage of the polynomial complexity of the AC-reduction algorithm.
V. EXPERIMENTS AND COMPARATIVE STUDY

In order to prove the usefulness of the AC-projection for graph mining, we present in the following an experimental study of the FGMAC algorithm. We stress that the set of frequent AC-reduced graphs found by FGMAC is not exhaustive w.r.t. isomorphic patterns. So, in the following, we present a quantitative study of FGMAC's performance, followed by a qualitative evaluation of the AC-reduced patterns which consists in measuring their discriminative power within a supervised graph classification process.

A. Datasets

We carried out performance and classification experiments on five biological activity datasets widely cited in the literature. These datasets can be divided into two groups:
• The Predictive Toxicology Challenge (PTC) [12] contains a set of chemical compounds classified according to their toxicity in male rats (PTC-MR), female rats (PTC-FR), male mice (PTC-MM), and female mice (PTC-FM).
• The Human Intestinal Absorption (HIA) dataset [13] contains chemical compounds classified by intestinal absorption activity.

B. Performance Point Of View

In this subsection we present a quantitative study of the computational performance of FGMAC compared to FSG. Results depicted in Figure 2 clearly show that FGMAC outperforms FSG in runtime for all selected minimal supports, and confirm the theoretical results about the polynomiality of the AC-projection operator compared to the exponential complexity of the subgraph isomorphism adopted by FSG.
In the following section, we present a qualitative study of the frequent AC-reduced patterns.

C. Qualitative Point Of View: Graph Classification
Graph classification is a supervised learning problem in which the goal is to categorize an entire graph as a positive or negative instance of a concept. Feature mining on graphs is usually performed by finding all frequent or informative substructures in the graph instances. These substructures are used for transforming the graph data into data represented as a single table, and then traditional classifiers are used for classifying the instances. The aim of using graph classification in this paper is to evaluate the quality and discriminative power of frequent AC-reduced subgraph patterns, and to compare them with isomorphic frequent subgraphs.

Fig. 3. Comparison of the number of patterns of different feature sets for the PTC-FM and HIA datasets (panels: HIA and PTC-FM, Frequent and Closed)
We carried out classification experiments on five biological activity datasets, and measured classifier prediction accuracy using the well-known C4.5 decision tree classifier [14]. The classification methods are described in more detail in the following subsections, along with the associated results.
1) Methods: We evaluated the classification accuracy using four different feature sets. The first set of features (Frequent) consists of all frequent subgraphs. Those subgraphs are mined using the FSG software [5] with different minimal supports.
Each chemical compound is represented by a binary vector
with length equal to the number of mined subgraphs. Each
subgraph is mapped to a specific vector index, and if a
chemical compound contains a subgraph then the bit at the
corresponding index is set to one, otherwise it is set to zero.
The second feature set (Closed) is simply a subset of the
first set. In fact, it consists of only closed frequent subgraphs.
Those subgraphs are also mined using F SG with the special
parameter (-x) to hold only closed frequent subgraphs.
The third feature set (AC-reduced) contains the FGMAC
output which consists of only AC-reduced frequent subgraphs.
We have represented each chemical compound by a binary
vector with length equal to the number of AC-reduced mined
subgraphs. Each AC-reduced subgraph is mapped to a specific
vector index, and if there is an AC-projection from the AC-reduced subgraph to the chemical compound then the bit at the corresponding index is set to one, otherwise it is set to zero.
Finally, the fourth feature set (Closed AC-reduced) is similar to the third one; the difference is that we only consider closed AC-reduced frequent subgraphs, obtained with a special parameter passed to FGMAC.
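The binary-vector construction used for all four feature sets can be sketched as follows (illustrative names; `occurs` stands for the projection test, i.e. subgraph isomorphism for the isomorphic feature sets or AC-projection for the AC-reduced ones):

```python
def to_feature_vectors(compounds, patterns, occurs):
    """Binary feature vectors from mined subgraph patterns (sketch).

    Each compound becomes a 0/1 vector with one position per
    pattern; occurs(pattern, compound) -> bool is the pluggable
    projection test.
    """
    return [[1 if occurs(p, c) else 0 for p in patterns]
            for c in compounds]
```

The resulting single-table representation is what is fed to the traditional classifiers (here via Weka).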
2) Results: All classifications have been done with the Weka data mining software package [15], and we report the prediction accuracy averaged over 10 cross-validation trials. In the following we analyze the AC-reduced patterns from quantitative and qualitative points of view.
a) Patterns Count: According to the results shown in Figure 3, for all datasets we have far fewer AC-reduced frequent patterns than isomorphic ones: on average 35% fewer patterns. This ratio is bigger for lower supports and can reach up to 70% for the HIA dataset with a minimal support of 10%. These experimental results confirm that the search space for extracting AC-reduced patterns is smaller than the one for classical isomorphic subgraphs. So, an algorithm which looks for all AC-reduced frequent subgraphs benefits from the polynomiality of the projection operation as well as from a smaller search space (i.e., fewer AC-projection tests).
b) Classification Relevance: Since the number of frequent subgraph patterns decreases drastically after the AC-reduction process, one naturally wonders about the relevance of these fewer patterns for supervised graph classification. That is why we have conducted classification accuracy experiments comparing AC-reduced and isomorphic patterns.
As shown in Figure 4, for the average over all datasets and all classifiers, the percentage of correctly classified (PCC) instances is almost the same for all minimal supports, as it is for the other datasets individually.
Taking a more in-depth look at the results, we see that, for some datasets and minimal support values, we even obtain better PCC with the AC-reduced feature sets. This is due to the better generalization power of the AC-reduction process, which helps supervised classifiers avoid over-fitting.

Fig. 4. Comparison of the classification accuracy (PCC) of different feature sets for All datasets (average), PTC-FM and HIA (panels: Frequent and Closed)
VI. CONCLUSION AND FUTURE WORK

In this paper, we have studied the use of a new polynomial projection operator named AC-projection, initially introduced in [3] and based on a key technique of constraint programming, namely Arc Consistency (AC). We have shown that using the AC-projection and its properties yields fewer patterns than all frequent or closed subgraphs, but with very comparable quality and discriminative power. The AC-projection is intended to replace the use of the exponential subgraph isomorphism, as well as to reduce the search space when seeking frequent subgraphs.
As future work, we are developing a depth-first frequent subgraph mining approach based on the AC-projection operator. Given a graph dataset, this novel approach will be able to look for all frequent AC-reduced patterns within a reduced search space.
REFERENCES
[1] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994, pp. 478–499.
[2] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide
to the Theory of NP-Completeness.
New York, NY, USA: W. H.
Freeman & Co., 1979.
[3] M. Liquiere, “Arc consistency projection: A new generalization relation
for graphs,” in ICCS, ser. LNCS, U. Priss, S. Polovina, and R. Hill, Eds.,
vol. 4604. Springer, 2007, pp. 333–346.
[4] M. Wörlein, T. Meinl, I. Fischer, and M. Philippsen, “A quantitative
comparison of the subgraph miners mofa, gspan, ffsm, and gaston,” in
European Conference on Machine Learning and Principles and Practice
of Knowledge Discovery in Databases, ser. LNCS, vol. 3721. Springer,
2005, pp. 392–403.
[5] M. Kuramochi and G. Karypis, “Frequent subgraph discovery.” in
International Conference on Data Mining, N. Cercone, T. Y. Lin, and
X. Wu, Eds. IEEE Computer Society, 2001, pp. 313–320.
[6] X. Yan and J. Han, “gSpan: Graph-based substructure pattern mining,”
in International Conference on Data Mining. IEEE Computer Society,
2002, pp. 721–724.
[7] S. Nijssen and J. N. Kok, “The gaston tool for frequent subgraph
mining,” in International Workshop on Graph-Based Tools (Grabats).
Electronic Notes in Theoretical Computer Science, 2004, pp. 77–87.
[8] J. Huan, W. Wang, and J. Prins, “Efficient mining of frequent subgraphs
in the presence of isomorphism.” in International Conference on Data
Mining. IEEE Computer Society, 2003, p. 549.
[9] A. K. Mackworth, “Consistency in networks of relations,” Artif. Intell.,
vol. 8, no. 1, pp. 99–118, 1977.
[10] P. Hell and J. Nešetřil, Graphs and Homomorphisms, ser. Oxford Lecture Series in Mathematics and its Applications, vol. 28. Oxford: Oxford University Press, 2004.
[11] M. Kuramochi and G. Karypis, “An efficient algorithm for discovering frequent subgraphs,” IEEE Transactions on Knowledge and Data
Engineering, vol. 16, pp. 1038–1051, 2004.
[12] C. Helma, R. D. King, S. Kramer, and A. Srinivasan, “The predictive toxicology challenge 2000-2001,” Bioinformatics, vol. 17, no. 1, pp. 107–108, 2001.
[13] M. D. Wessel, P. C. Jurs, J. W. Tolan, and S. M. Muskal, “Prediction
of human intestinal absorption of drug compounds from molecular
structure,” Journal of Chemical Information and Computer Sciences,
vol. 38, no. 4, pp. 726–735, 1998.
[14] J. R. Quinlan, C4.5: Programs for Machine Learning, 1st ed. Morgan
Kaufmann, January 1993.
[15] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data
Management Systems). San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc., 2005.
[16] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning,
vol. 20, no. 3, pp. 273–297, 1995.
[17] B. Dasarathy, Nearest Neighbor (NN) norms : NN pattern classification
techniques. IEEE Computer Society Press, 1991.
[18] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New
York: John Wiley and Sons, 1973.
APPENDIX: FULL GRAPH CLASSIFICATION RESULTS (%)

TABLE II
PCC RESULTS FOR ALL DATASETS, ALL CLASSIFIERS AND DIFFERENT MINIMAL SUPPORTS.

Each cell group gives, per feature set (Frequents | Closed | AC-reduced | Closed AC-reduced), the number of patterns (#) followed by the PCC of SVM, NN, NB and C4.5.

Minimal support = 10%
HIA    | 1964 60,69 61,67 56,11 44,86 | 216 66,67 57,22 52,5 51,53 | 467 54,03 52,5 49,31 54,72 | 118 60,97 57,64 54,72 55,97
PTC-FM | 2492 56,48 60,17 51,31 58,18 | 285 62,77 63,6 51,88 62,73 | 1271 59,34 60,48 53 61,87 | 225 59,06 61,63 51,87 63,04
PTC-FR | 2749 64,96 62,98 54,13 64,39 | 336 65,53 61,83 54,98 61,25 | 1347 62,67 61,25 58,1 63,25 | 245 66,95 63,26 59,83 63,23
PTC-MM | 2472 64,29 59,48 46,43 61,04 | 261 64,02 61,35 56,6 63,18 | 1270 59,55 59,8 46,16 60,17 | 212 63,73 63,44 59,27 62,58
PTC-MR | 2665 63,03 56,95 54,39 57,24 | 345 59,27 53,43 58,16 57,51 | 1346 62,74 56,36 54,66 55,51 | 262 61,61 52,25 57,28 56,95

Minimal support = 20%
HIA    | 336 52,5 56,81 49,03 51,94 | 71 54,58 61,81 51,11 57,92 | 119 56,11 53,89 46,81 46,94 | 47 52,36 54,44 47,78 56,94
PTC-FM | 631 61,92 57,29 52,15 59,02 | 103 59,34 57,56 49,01 57,02 | 408 60,46 59,88 55,3 57,3 | 86 55,05 58,17 48,72 50,98
PTC-FR | 694 64,94 63,56 51,56 60,12 | 102 63,24 61,83 51,85 64,95 | 445 60,42 61,53 56,96 57,29 | 89 61,8 60,39 54,96 63,52
PTC-MM | 634 62,78 58,3 49,11 65,77 | 99 61,33 58,34 47,92 56,85 | 416 55,98 55,29 49,67 64,01 | 83 58,98 53,87 48,81 57,12
PTC-MR | 652 64,23 59,29 50,26 57,29 | 99 65,37 56,07 54,07 58,43 | 418 56,04 56,05 52,32 54,93 | 85 59,82 54,9 52,86 58,99

Minimal support = 30%
HIA    | 152 50,14 44,17 50,28 47,78 | 25 49,31 52,5 46,53 47,78 | 71 58,19 47,64 46,67 43,19 | 18 52,78 52,64 51,25 60,69
PTC-FM | 214 55,87 58,15 55,61 56,18 | 25 55,28 57,59 51,86 51,31 | 149 59,61 57,89 56,18 58,48 | 20 55,56 60,47 53,6 53,01
PTC-FR | 240 56,97 61,26 46,15 58,7 | 31 59,82 61,81 54,13 60,1 | 166 56,13 57,27 49,56 57,56 | 26 57,54 61,82 55,83 61,25
PTC-MM | 221 58,9 55,08 53,86 56 | 32 53,85 52,16 50,35 53,57 | 158 53,81 53,3 53,56 55,67 | 28 52,99 50,3 47,9 54,75
PTC-MR | 234 59,33 59,27 53,43 58,14 | 36 62,49 58,14 54,92 59,01 | 164 52,29 55,45 49,99 51,71 | 31 60,46 55,19 54,34 59,6

Minimal support = 40%
HIA    | 89 46,94 33,33 52,78 44,58 | 16 58,47 50,69 49,86 50,14 | 38 48,89 47,22 48,75 46,25 | 12 55,97 54,58 47,36 55
PTC-FM | 102 54,73 53,6 55,88 55,3 | 9 57 52,13 59,03 58,17 | 92 50,97 59,04 55,03 62,49 | 8 54,73 55 59,03 57,59
PTC-FR | 104 58,41 58,12 51,56 61,86 | 10 54,98 60,11 64,96 65,53 | 92 56,43 59,53 50,99 63,85 | 8 50,14 61,83 65,53 65,53
PTC-MM | 99 54,79 59,55 54,45 61,34 | 9 58,04 58,66 62,78 55,94 | 93 52,99 56,87 53,85 60,42 | 8 53,54 60,12 63,07 57,13
PTC-MR | 103 56,43 53,76 52,87 56,7 | 9 52,91 55,5 56,39 57,26 | 92 54,07 54,92 53,17 53,76 | 8 52,88 56,73 57,84 53,51

Minimal support = 50%
HIA    | 52 41,11 50,28 57,36 44,31 | 11 55,14 51,25 53,75 46,53 | 20 48,75 54,72 54,58 48,89 | 9 54,86 51,25 55 45,14
PTC-FM | 65 54,16 53,92 54,74 62,2 | 9 54,16 61,33 55,88 61,61 | 61 47,02 58,19 55,31 61,34 | 8 50,16 59,03 55,31 61,61
PTC-FR | 66 56,41 62,67 55,85 63,56 | 9 47,83 62,11 65,25 65,53 | 62 54,13 61,53 55,56 63,56 | 8 51,01 63,8 65,53 65,53
PTC-MM | 65 58,59 58,94 53,54 58,92 | 8 51,46 57,74 60,11 61,91 | 61 52,38 55,66 53,54 58,91 | 7 52,64 61,88 59,21 60,1
PTC-MR | 62 51,46 52,31 54,31 55,22 | 11 52,87 56,14 56,4 59,33 | 58 50,91 56,97 53,45 54,38 | 10 50,85 58,16 56,95 60,18

• SVM: Support Vector Machine [16];
• NN: Nearest Neighbors [17];
• NB: Naive Bayesian [18];
• C4.5: Decision Trees [14].