Academia.eduAcademia.edu

MoSS: A Program for Molecular Substructure Mining

Molecular substructure mining is currently an intensively studied research area. In this paper we present an implementation of an algorithm for finding frequent substructures in a set of molecules, which may also be used to find substructures that discriminate well between a focus and a complement group. In addition to the basic algorithm, we discuss advanced pruning techniques, demonstrating their effectiveness with experiments on two publicly available molecular data sets, and briefly mention some other extensions.

MoSS: A Program for Molecular Substructure Mining Christian Borgelt Dept. of Knowledge Processing and Language Engineering University of Magdeburg Universitätsplatz 2, 39106 Magdeburg, Germany [email protected] Thorsten Meinl Michael Berthold Computer Science Dept. of Computer and Department 2 Information Science University of Erlangen-Nuremberg University of Konstanz Martenstraße 3, 91058 Erlangen, Germany 78457 Konstanz, Germany [email protected] ABSTRACT [email protected] though not the 3-dimensional structure into account. The resulting set of graphs is then searched for common subgraphs, that is, molecular fragments that appear with a user-specified minimum frequency. For this approach several algorithms have been proposed recently, with many of them based on methods developed for association rule mining. In particular, the Apriori algorithm [1] and the Eclat algorithm [21] are taken as starting points. The general ideas of these algorithms can be transferred to molecular substructure mining, even though the fact that the input consists of graphs instead of sets poses some problems. Examples of algorithms developed in this way are Subdue [5], MolFea [12], FSG [13], MoFa [2], gSpan [20], FFSM [9], and Gaston [16]. Other approaches rely, for instance, on principles from inductive logic programming [6] and describe the graph structure by logical expressions. Molecular substructure mining is currently an intensively studied research area. In this paper we present an implementation of an algorithm for finding frequent substructures in a set of molecules, which may also be used to find substructures that discriminate well between a focus and a complement group. In addition to the basic algorithm, we discuss advanced pruning techniques, demonstrating their effectiveness with experiments on two publicly available molecular data sets, and briefly mention some other extensions. 1. INTRODUCTION A frequent task in biochemistry is the search for common features in large sets of molecules. Examples are drug discovery, where one is interested in identifying properties shared by molecules that have been classified as “active” (and rarely shared by those classified as “inactive”) w.r.t., for example, the protection of human cells against a virus, and compound synthesis, where one tries to identify properties that enable or inhibit the synthesis of new molecules, so that one can predict the chances for a successful synthesis. Of course, all of the mentioned approaches are generally applicable to attributed graphs, with molecules being only one specific application example. However, the Java implementation discussed in this paper is geared to molecules as it supports reading molecules in standard description languages like SMILES (Daylight, Inc.) and SLN (Tripos, Inc.). It also has to be mentioned that the first version of the presented program was developed in cooperation with the biochemical software company Tripos, Inc., St. Louis, MO, USA. Finally, it should be noted that the algorithm described here is also known as MoFa (Molecular Fragment miner). The Java program presented here, however, is called MoSS (Molecular SubStructure miner) in order to distinguish it from other implementations. Since the features one may use to describe molecules are manifold, there are approaches in abundance, ranging from simple one-dimensional measurements to complex thousanddimensional descriptors. The molecular weight and the number of hydrogen donors or acceptors are examples of simple features. More complex ones include binary vectors, which can be several thousand bits long with each bit representing a specific constellation of atoms like aromatic rings or amino groups, as well as shape descriptors that try to capture geometric properties of a molecule. 2. BASIC ALGORITHM In order to capture the bond structure of molecules, we model them as attributed graphs, in which each vertex represents an atom and each edge a bond between atoms. Each vertex carries attributes that indicate the atom type (i.e., the chemical element), a possible charge, and whether it is part of an aromatic ring. Each edge carries an attribute that indicates the bond type (single, double, triple, or aromatic). In this paper we focus on an approach that models molecules as attributed graphs, thus taking the connection structure, Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. OSDM’05, August 21, 2005, Chicago, Illinois, USA. Copyright 2005 ACM 1-59593-210-0/05/08 ...$5.00. Our goal is to find substructures that have a certain minimum support in a given set of molecules, i.e., are part of at least a certain percentage of the molecules. However, in order to restrict the search space, we usually consider only 6 connected substructures, i.e., graphs having only one connected component. For most applications, this restriction is harmless, because connected substructures are most often exactly what is desired. (Note, however, that the program considered here also allows for starting from a disconnected core structure, though not for extending a substructure by an unconnected atom.) We do not constrain the connectivity of the graph in any other way: The graphs may be chains or trees or may contain an arbitrary number of cycles. 2.1 The resulting extended embeddings are then sorted into equivalence classes, each of which represents a new substructure. This sorting is very simple, because only the added bond and maybe the added atom have to be compared. In our implementation we use a sorted array of lists of embeddings to group the extensions. After all extended embeddings have been processed, each array element contains the list of embeddings of a new substructure. Each of these new substructures corresponds to a child node in the search tree, each of which is then processed in turn by searching recursively on the list of embeddings corresponding to it. Search Tree Traversal Most naturally, the search is carried out by traversing a tree of fragments of molecules, similar to the tree of item sets that is traversed in frequent item set mining. The root of the tree is a core structure to start from, which for now we assume to be a single atom (more complex cores are discussed below). Going down one level in the search tree means to extend a substructure by a bond (and maybe an atom, if the bond does not close a ring), just like going down in an item set tree means adding an item to an item set. That is, with a single atom at the root of the tree, the root level contains the substructures with no bonds, the second level those with one bond, the third level those with two bonds and so on. 2.2 Basic Search Tree Pruning Of course, subtrees of the search tree can be pruned if they refer to substructures not having enough support, i.e., if too few molecules are referred to in the associated list of embeddings. We call this support based pruning. We may also prune the search tree if a user-defined threshold for the number of atoms in a fragment has been reached. We call this size based pruning. However, the most important pruning type is structural pruning, which is meant to ensure that each substructure is considered only once in the search tree. In frequent item set mining such structural pruning can easily be achieved by drawing on an (arbitrary, but fixed) order of the items and disallowing extensions by items preceding the item added last to the set (cf. [3] for more details). There are basically two ways in which such a search tree can be traversed: We can use either a breadth first search and explicit subset tests (like Apriori) or a depth first search and intersections of transaction lists (like Eclat). For our task the Eclat approach seems preferable, because the Apriori approach has certain drawbacks: even subset tests can be costly, but substructure tests, which consist mathematically in checking whether a given attributed graph is a subgraph of another attributed graph, can be extremely costly. Furthermore, the number of small substructures (1 to 4 atoms) can be enormous, so that even storing only the topmost levels of the tree can require a prohibitively large amount of memory. Of course, the Eclat approach also has its drawbacks, for example, the transaction lists are now lists of embeddings of a substructure into the given molecules. Since there can be several embeddings of the same substructure into one molecule, these lists tend to get fairly long. This drawback can make it necessary to start from a reasonably sized core structure (see below). In molecular substructure mining, however, such an approach is impossible, since there is no possible global order of, say, the atoms, as we have to take the bond structure into account. However, we can number the atoms in a substructure and record how a substructure was constructed in order to constrain its extensions. The number we assign to an atom reflects the step in which it was added. That is, the core atom is numbered 0, the atom added with the first bond is numbered 1 and so on. Note that this number does not tell anything about the type of the atom, as two completely different atoms may receive the same number, simply because they were added in the same step. Whenever an embedding is extended, we record in the resulting extension the number of the atom from which the added bond started. When the extended embedding is to be extended itself, we consider only bonds that start from atoms having numbers no less than this recorded number. That is, only the atom extended in the preceding step and atoms added later than this atom can be the starting point of a new bond. This rule is directly analogous to the rule that only items following the item added last may be added to an item set. With this simple scheme we immediately avoid that two bonds, call them A and B, which start from different atoms, are added in the order A, B in one branch of the search tree and in the order B, A in another. Since either the atom A starts from or the atom B starts from must have a smaller number, one of the orders is ruled out. To be more specific, our algorithm searches as follows: The given core structure is embedded into all molecules, resulting in a list of embeddings. Each embedding consists of references into a molecule that point out the atoms and bonds that form the substructure. Remember that a list of embeddings may contain several embeddings for the same molecule if the molecule contains the substructure in more than one place or if the substructure is symmetric. In a second step each embedding is extended in every possible way. This is done by adding all bonds in the corresponding molecule that start from an atom already in the embedding (to ensure connectedness and, of course, to reduce the number of bonds that have to be considered). This may or may not involve adding the atom the bond leads to, because this atom may or may not be part of the embedding already. More technically, by following the references of an embedding the atoms and bonds of the corresponding molecule are marked and only unmarked bonds emanating from marked atoms are considered as possible extensions. However, two or more bonds can start from the same atom. Therefore we also have to define an order on bonds, so that we do not add two different bonds A and B that start from the same atom in the order A, B in one branch of the search tree and in the order B, A in another. This order on bonds is, of course, arbitrary. In our implementation, single bonds precede aromatic bonds, which precede double bonds, which 7 a) d) O C C S N O b) N C C S N e) O C C S N C c) N C S N f) O C S N C O 1) C S N 3 (50%) 2) C C S N 3 (50%) 3) C S N 5 (83%) N C S O 4) C S O 4 (67%) 5) C S N 3 (50%) 6) C S 6 (100%) Figure 1: A set of six example molecules. Figure 3: The six frequent substructures that are found in the order in which they are generated. S a bc de f 111111 S C a bc de f 122111 C S C a bc de f 022000 C S N a bc de f 122110 O C S N a bc de f 222000 N C S N a bc de f 000110 O C S N O a bc de f 200000 O C C S N a bc de f 210000 S N a bc de f 111110 C S O a bc de f 222001 N C C S a bc de f 110100 O C S O a bc de f 200000 N C S O a bc de f 000001 C S N a bc de f 000111 O C C S a bc de f 210000 C C S a bc de f 110100 N C C S a bc de f 000100 O S N a bc de f 211000 N S N a bc de f 000110 S O a bc de f 211001 S N a bc de f 000111 O S O a bc de f 200000 O S N a bc de f 000001 First the sulfur atom is embedded into the six molecules. This results in six embeddings, one for each molecule, which form the root of the search tree (see Figure 2; the table in the root node records that there is one embedding for each molecule). Then the embeddings are extended in all possible ways, which leads to the four different substructures shown on the second level (i.e., S C, S N, S O, S N). These substructures are ordered, from left to right, as they are considered by our algorithm, i.e., extensions by single bonds precede extensions by double bonds, and within extensions by bonds of the same type the element type of the atom a bond leads to determines the order. Note that there are two embeddings of S C into both the molecules b and c and two embeddings of S O into the molecule a. O O S N a bc de f 200000 Figure 2: The search tree for the six molecule example. The tables below the fragments indicate the numbers of embeddings per molecule. precede triple bonds. Finally, within extensions by bonds of the same type starting from the same atom, the order is determined by (1) whether the atom the bond leads to is already in the substructure or not and (2) the type of this atom. To take care of the bond type etc., we record in each embedding which bond was added last. In the third step the extensions of the substructure S C are constructed. This leads to the first five substructures on the third level (i.e., C S C, C S N, C S O, C S N and C C S). Again the order of these substructures, from left to right, reflects the order in which they are considered. Since we search depth first, the next substructure to be extended is C S C.2 However, this substructure does not have enough support and therefore the subtree is pruned. As a final rule, we disallow extensions by bonds leading to an atom already in the substructure if the number of the destination atom is higher than the number of the source atom. All bonds leading to already captured atoms (that is, all bonds closing rings) must lead “backward”, that is, from an atom with a higher number to an atom with a lower one. The substructure C S N is considered next etc. However, we confine ourselves to pointing out situations in which specific aspects of our method become obvious. Effects of structural pruning can be seen, for instance, at the fragment C S N, which does not have a child in which a second carbon atom is attached to the sulfur atom. The reason is that the extension by the bond to the nitrogen atom rules out all single bonds leading to atoms of a type preceding nitrogen (like carbon). Similarly, C S N does not have children with another atom attached to the sulfur atom by a single bond, not even an oxygen atom, which follows nitrogen in the periodic table of elements. The reason is that a double bond succeeds a single bond and thus the extension by the double bond to the nitrogen atom rules out all single bonds emanating from the sulfur atom. Finally, the structure C C S has no children at all, even though it has enough support. The reason is that in this substructure a bond was added to the carbon atom adjacent to the sulfur atom. This carbon atom is numbered 1 and thus no bonds can be added to the sulfur atom, which has number 0. Only the carbon atoms can be starting points of a new bond, but there are no such bonds in the molecules a, b, and d. Note that ruling out extensions of the sulfur atom in this branch is not harmful, since all fragments having another atom attached to the sulfur atom are considered in the subtrees to the left of C C S. The above rules provide us with a structural pruning scheme, but unfortunately this scheme is not perfect and making it perfect would be very expensive computationally. The problem is that we do not have any precedence rule for two bonds of the same type starting from an atom with the same number and leading to atoms of the same type, and that it is not possible to give any precedence rule for this case that is based exclusively on locally available information. We consider the problems that result from this imperfection and our solution below, but think it advisable to precede this consideration by an illustrative example of the search process as we defined it up to now. 2.3 An Illustrative Example As an illustration we consider how our algorithm finds the frequent substructures of the six example molecules shown in Figure 11 , starting from a sulfur atom. We use a minimum support of 50%, i.e., a substructure must occur in at least three of the six molecules to qualify as frequent. 1 2 Please note that these structures were made up to demonstrate certain aspects of the search scheme. None of them has any real (i.e., chemical) meaning. It may seem strange that there are two embeddings of this substructure into both the molecules b and c. The reason for this is explained below. 8 During the recursive search all frequent substructures encountered are recorded. The resulting set of six frequent substructures, together with their absolute and relative support is shown in Figure 3. Note that C C S is not reported, because it has the same support as its superstructure C C S N. Likewise, O S N, S N, and S are not reported. Transferring a common term from frequent item set mining, we may say that our algorithm reports only closed fragments, that is, fragments no superstructure of which has the same support (see also below, Section 3.1). Note that considering the same substructure several times cannot lead to wrong results, only to multiple reporting of the same substructure. Multiple reporting, however, can be suppressed by maintaining a list of frequent substructures and suppressing new ones that are identical to already known ones. It is more important that the missing rule for equivalent bonds can lead to considerable redundant search in certain structures, especially molecules containing one or more aromatic rings. (A solution to this is discussed below.) However, it should be noted that even if we could amend the weakness of our structural pruning, we would still be unable to guarantee that each substructure is considered only once. If, for instance, some substructure X can be embedded twice into some molecules and if there are frequent substructures that contain both embeddings (and thus X twice), then these substructures can be grown from either embedding. If the connection between the two embeddings of X is not symmetric, the same substructure is reached in two different branches of the search tree in this case. Obviously, there is no simple way to avoid such situations. The example makes it clear that our algorithm can find arbitrary substructures, even though it does not show how cyclic structures are treated. Unfortunately, search trees for cyclic structures are too big to be depicted here. 2.4 Incomplete Structural Pruning We indicated above that our structural pruning is not perfect. In order to understand the problems that can arise, consider two molecules A and B with the common substructure N C S C O. We try to find this substructure starting from the sulfur atom. Since the two bonds emanating from the sulfur atom are equivalent, we have no precedence rule and thus the order in which they are added to an embedding depends on the order in which they occur in the corresponding molecule (which we have no real control over) . 2.5 Embedding a Core Structure Up to now we assumed that we start the search from a single atom. This usually works fairly well as long as this atom is rare in the molecules to work on. For example, sulfur or phosphorus are often good starting points in biochemical applications, while starting with carbon is a bad idea: Every organic molecule contains several carbon atoms, often twenty or more, and thus we end up with an already very high number of embeddings of the initial atom. As a consequence, the algorithm is likely to run out of memory before reaching substructures of reasonable size. Suppose that in molecule A the bond going to the left precedes the bond going to the right, while in molecule B it is the other way round. As a consequence, in embeddings into molecule A the left carbon atom will precede the right one, while in embeddings into molecule B it will be the other way round. Now consider the substructure C S C and its extensions. In molecule A the carbon numbered 1 (the left one) will be extended by adding the nitrogen atom and thus the oxygen atom can be added in the next step (to the carbon on the right, which is numbered 2), resulting in the full substructure. However, in molecule B the nitrogen atom has to be added by extending the carbon atom numbered 2 (again the left one; in embeddings into molecule B the right carbon is numbered 1). Hence it is not possible to add the oxygen atom in the next step, because this would mean adding a bond starting at an atom with a lower number than the atom extended in the preceding step. Therefore the common substructure is not found. This example also shows that it does not help to look “one step ahead” to the next atom, because there could be arbitrarily long equivalent chains, which differ only at the ends. There are no “local” criteria, which would enable us to decide how to order and thus number equivalent extensions of the same atom. However, if we cannot start from a rare element, it is sometimes possible to specify a core—for instance, an aromatic ring with one or two side chains—from which the search can be started. Provided the core structure is specific enough, there are only few, at best only one embedding per molecule, so that the list of embeddings is short. While it is trivial to embed a single atom into a molecule, embedding a core structure can be much more difficult. In our implementation we rely on a simple observation: embedding a core structure is the same as finding a common substructure of the molecule and the core that is as big as the core itself. This leads to the idea to grow a substructure into both the core and the molecule until it completely covers the core. That is, we do a substructure search for the core and the molecule starting from an arbitrary atom of the core and requiring a support of 100% (i.e., both the core and the molecule must contain the substructure). In addition, we can restrict the search to one embedding of a substructure into the core at all times, since we know that it must be completely covered in the end. (For the molecule, however, we must consider all possible embeddings.) If, however, we accept to reach identical substructures in different branches of the search tree in cases like this, we can correct the imperfection of our structural pruning. Whenever we extended an embedding by following a bond, we allow adding an equivalent bond in the next step, regardless of whether it precedes or succeeds, in the corresponding molecule, the bond added in the preceding step. This relaxation explains why there are two embeddings of the substructure C-S-C into both the molecules b and c of our example. In one embedding the left carbon atom is numbered 1 and the one at the bottom is numbered 2, while in the other it is the other way round (cf. Figure 1). Note that the same mechanism of growing a substructure into two molecules can also be used for substructure tests as they are needed to suppress multiple reporting of the same fragment (see above) as well as reporting redundant fragments (fragments that are substructures of some other fragment and have the same support as this fragment). 9 O N C C O C O N C C O C S O N C C O C O O S C N C O Cl S C N O O S C N Figure 6: Some (fictitious) example molecules. Figure 4: The amino acids clycin, cystein and serin. S * S N S C N C S C C N C C O O S C O C O C C 12 S C C C S C C N S C C C N S C C C O 3 2.6 S C C C O O O C O C O O C C 7 S O S C N S C O C C O C O O O S C C C C 5 S C C C O S C C Figure 5: The tree of fragments for the amino acids example. O S C N O S C N O O S C N O S C N C Figure 7: Search tree for the molecules in Figure 6. lent sibling pruning and perfect extension pruning. The first of these optimizations can be applied even if the search is not restricted to closed fragments (see below), while the second presupposes that non-closed fragments can be discarded. Starting with an Empty Core Up to now we assumed that we are given a core structure to start the search from. However, in [7, 8] an extension to empty cores was suggested. We explain this approach with a simple example. Figure 4 shows the amino acids clycin, cystein and serin. The upper part of the tree (or forest if the empty fragment at the root is removed) that is traversed by our algorithm for these molecules is shown in Figure 5. The first level contains individual atoms, the second connected pairs and so on. The dots indicate subtrees that are not depicted in order to keep the figure understandable. The numbers next to these dots state the number of fragments in these subtrees, indicating the total size of the tree. 3.1 Closed Molecular Fragments As already mentioned above, the notion of a closed fragment is derived from the corresponding notion of a closed item set, which is defined as an item set no superset of which has the same support, i.e., is contained in the same number of transactions. Analogously, a closed fragment is a fragment no superstructure of which has the same support, i.e., is contained in the same number of molecules. As an example consider the three example molecules shown in Figure 6 and the corresponding (unpruned) MoFa search tree (starting from sulfur as a seed) shown in Figure 7: the closed fragments (for a minimum support of two molecules, i.e., 66%) in inner nodes are circled. (It should be clear that every leaf of the search tree is necessarily a closed fragment). The order, in which the atoms on the first level of the tree are processed, is determined by their frequency of occurrence in the molecules. The least frequent atom type is considered first. Therefore the algorithm starts on the left by embedding a sulfur atom into the example molecules. That is, the molecules are searched for sulfur atoms and their locations are recorded. In our example there is only one sulfur atom in cystein, which leads to one embedding of this (one atom) fragment. This fragment is then extended in the way described above. Next the subtree rooted at the nitrogen atom is processed. However, in this subtree extensions by a bond to a sulfur atom are ruled out, since all fragments containing a sulfur atom have already been considered in the tree rooted at the sulfur atom. Similarly, neither sulfur nor nitrogen are considered in the tree rooted at the oxygen atom, and the rightmost tree contains fragments that consist of carbon atoms only. The program offers the option to skip this last tree, because for biochemical molecules it can be very large and thus may govern the search time. As for item sets, restricting the search for molecular fragments to closed fragments does not lose any information: all frequent fragments can be constructed from the closed ones by simply forming all substructures of closed fragments that are not closed fragments themselves and assigning to them as their support the maximum of the support values of those closed fragments of which they are substructures. Consequently, closed fragment mining is a very convenient way to reduce the size of the output. In addition, chemists are usually not interested in all frequent fragments, but only in the closed ones, presumably because they contain all relevant information without redundancy. Here, however, we are interested in the fact that in closed fragment mining certain pruning techniques become applicable, which can speed up the search considerably. These and other advantages of closed fragment mining were also studied in [20, 14, 4]. 3. ADVANCED PRUNING In the preceding section we considered basic search tree pruning, which is meant to ensure that each substructure is considered as few times as possible. In this section we discuss further optimizations, suggested in [14, 4], which consist in two advanced pruning strategies, namely equiva- 3.2 Equivalent Sibling Pruning In order to suppress all redundant search, one would have to check whether any two subtrees of the search tree have the 10 C C O C C C C O C C O C C C C C C O C C C C C bedding into an arbitrary molecule is selected. In the corresponding molecule all bonds and atoms belonging to this embedding are marked and all other bonds and atoms are unmarked. Then all embeddings of the fragment corresponding to a sibling node, which is to be checked for equivalence, are traversed. For each of these embeddings it is checked whether it refers only to marked atoms and bonds, that is, whether it essentially represents the same substructure. If such an embedding can be found, the two fragments are equivalent and consequently the current node (the extensions of which are more severely restricted than those of its already processed sibling) can be skipped. Figure 8: Three phenols: phenol, p-cresol, and catechol. 5 4 C C O 0C C3 C C 1 2 0 5 C C O 1C C4 C C 2 3 1 0 C C O 2C C5 C C 3 4 2 3 C C O 1C C4 C C 0 5 Note that it suffices to check one molecule, because if there are identical embeddings into one molecule there must be identical embeddings into all molecules referred to by the fragment. As a consequence the check is comparatively cheap and thus does not degrade performance much even if the pruning cannot be carried out. Figure 9: Equivalent extended embeddings. same fragment as their root (or fragments which only differ in the restrictions placed on their possible extensions), so that only one (namely the least restricted one) is actually searched. Such a general check, however, is costly, because one would have to store all fragments that have been considered in the search so far. In addition, one needs an efficient way of checking whether these stored fragments contain one that is equivalent to the one currently considered. Finally, it is difficult to find out whether a subtree to be searched is the least restricted one appearing in the whole search tree, even though the depth first traversal applied in our algorithm always considers the least restricted sibling node first. 3.3 Perfect Extension Pruning Perfect extension pruning is based on the observation that sometimes there is a fairly large common fragment in all currently considered molecules. As long as the search has not grown a fragment to this maximal common one, it is not necessary to branch in the search tree. From the definition of a closed fragment it is clear that if there is a common substructure in all currently considered molecules of which the current fragment is only a part, then any extension that does not grow the current fragment towards the maximal common one can be postponed until this maximal common fragment has been reached. The reason is, obviously, that the maximal common fragment is part of all closed fragments that can be found in the currently considered set of molecules. Consequently, it suffices to follow one path in the search tree that leads to this maximal common fragment and to start branching only from there. However, what can be checked fairly easily is whether two sibling nodes in the search tree correspond to fragments that are equivalent (represent the same substructure) and differ only in the restrictions placed on their possible extensions. That such a situation can actually occur can be seen from the example molecules shown in Figure 8. Suppose we start the search by embedding a benzene ring into these molecules. Since such a ring exhibits a high symmetry (which is a necessary prerequisite for equivalent siblings), it can be embedded into each of the molecules in twelve different ways (the ring can be rotated to six positions and it can be traversed in two directions). Consider now the extensions of this benzene ring: each of the twelve embeddings is extended by a bond and an oxygen atom. This leads to twelve new fragments, all of which are equivalent and differ only in the label of the ring atom that was extended (see Figure 9 for examples). Obviously it suffices in this situation to consider the fragment in which the ring atom with the smallest number was extended. All other fragments must lead to redundant search, because they allow for a subset of the possible extensions, but no additional ones. As an example consider again the set of molecules shown in Figure 6. If the search is seeded with a single sulfur atom, considering extensions by a single bond starting at the sulfur atom and leading to an oxygen atom can be postponed until the structure S-C-N common to all molecules has been grown (provided that the extensions of this maximal common fragment are not restricted in any way—see below). Since our algorithm considers sibling nodes w.r.t. a local order that is based on the extension information of the parent fragment (i.e. bond and atom type, number of the atom the bond is incident to) and processes the least restricted extensions first, it is fairly easy to implement the described pruning approach: for each node its preceding sibling nodes are checked for equivalence and if an equivalent sibling is found, the current node is skipped. Technically, the search tree pruning is based on the notion of a perfect extension. An extension of a fragment, consisting of a bond and possibly an atom (if the bond does not close a ring), is called perfect if all of its embeddings can be extended in exactly the same way by this bond and atom and if it is a bridge3 in all embeddings into the molecules or if it closes a ring. (Note that there may be multiple ways of extending an embedding by this bond and atom. In this case all embeddings must be extendable in the same number of ways and in all the bond must be a bridge or must close a ring. The additional condition that the bond has to be a bridge or must close a ring in all embeddings was unfortunately not recognized in [4].) If there is a perfect ex- The equivalence test is carried out as follows: for the fragment corresponding to the current node an arbitrary em- 3 A bridge in a graph is an edge that, if removed, splits the graph into two unconnected parts. 11 N O C S C S S C S O O O C S C N O C S C C S C N 1+1 embs. O S C S C N S C O O S C N O S C N O O S C N O S C N C 2+2 embs. O C S C 1+3 embs. Figure 11: Examples of non-perfect extensions. N O Figure 10: Search with perfect extension pruning. N O Figure 12: Perfect extensions may fail in rings. test whether the bond either closes a ring or is a bridge in all embeddings. (Bridges are, of course, found and marked in a preprocessing step, so that the test only has to check this flag for the bonds referred to by the embeddings.) tension of a fragment, all closed fragments can, in principle, be found by searching only the corresponding branch. Note that it is actually necessary to count the embeddings per molecule. Checking only whether the total number of embeddings into all molecules coincides with (or is an integer multiple of) the parent embeddings does not suffice as can be seen from the example shown in Figure 11. Even though the total number of embeddings is the same in the right branch, the extension is not perfect. The left extension is not perfect, because the number of extended embeddings, even though the same for each parent embedding, is reduced from the number of extensions of its parents. This indicates that some symmetry has been destroyed by the extension, which therefore cannot be perfect. There is, however, a minor complication, because perfect extension pruning interferes with the normal structural pruning done in MoFa. Normal structural pruning prevents extensions of a fragment by bonds that start from atoms with smaller numbers than the one extended in the preceding step. However, a perfect extension should not lead to such a restriction, because otherwise some search results are lost. As an example consider again the search tree shown in Figure 7. If we simply confined the search to the subtree rooted at the fragment S-C-N, we would lose the fragment O-S-C-N in the left branch. The reason is that the extension of S-C to S-C-N, due to the normal structural pruning rules, prevents an extension of the sulfur atom in this subtree, because an atom with a higher number, namely the carbon atom, has been extended in the preceding step. Why it is necessary to constrain perfect extensions to bridges if the extension does not close a ring can be seen from the example in Figure 12. Suppose we want to find all common substructures of these two molecules (minimum support 100%). We seed the algorithm with a nitrogen atom. Then adding the oxygen atom is a perfect extension in the sense that it can be done in the same way in all molecules. The same holds for a subsequent extension of the oxygen atom with a carbon atom, or generally, to walk around the rings in counterclockwise order. This yields the fragment N O C C C C as the only result. However, C C N O C C and C C C N O are also common substructures. These we would lose if we allowed such extensions in rings to be considered as perfect. However, a perfect extension should not restrict possible future extensions of the fragment in this way. Therefore the extension information of a fragment obtained by a perfect extension are set to those of its parent fragment, bypassing the normal structural pruning rules. In the considered example, this allows us to extend the fragment S-C-N by bonds starting from the sulfur atom and thus we get the search tree shown in Figure 10, in which the fragment O-S-C-N is found. When it comes to implementing perfect extension pruning, one should bear in mind that checking whether an extension is perfect by testing the extensions of all embeddings is costly. Therefore we precede the actual test by cheap tests in order to, if possible, fail early or succeed early. The ideas underlying these tests are fairly simple: an extension leading to a fragment cannot be perfect (1) if the number of molecules referred to by the fragment differs from those referred to by its parent and (2) if the number of embeddings of the fragment is not an integer multiple of the number of embeddings of its parent. On the other hand, if these tests do not indicate that the extension is not perfect, we can immediately conclude that it is perfect if the fragment refers only to one molecule. The costly test whether each parent embedding leads to the same number of extended embeddings is, of course, carried out only if the number of molecules referred is not equal the number of embeddings (otherwise there is exactly one embedding per molecule). Only afterwards we 4. OTHER EXTENSIONS Apart from the advanced pruning strategies described above, there are several ways in which the basic algorithm can be extended, some of which are already incorporated in the Java implementation referred to here. Due to reasons of space these extensions are only briefly mentioned here. Details can be found in [2, 7, 8, 14, 15]. 4.1 Finding Discriminative Fragments Our approach to find frequent substructures can easily be extended to find discriminative fragments, that is, substructures that are frequent in a predefined subset of the molecules and infrequent in the complement of this subset. Finding discriminative fragments requires two parameters: a minimum support for the focus subset and a maximum support for the complement. The search is carried out in exactly the same way as described above. The only difference is that two 12 O F pruning neither equivalent sibling perfect extension both Figure 13: Example of a steroid, which can lead to problems due to the large number of rings. # nodes 48762 22423 4355 1577 # frags. 48762 23122 6581 2947 # embs. 78209 38475 16731 7786 Table 1: Effects of pruning on the steroids data set. Kekulé structures—use alternating single and double bonds. A chemist, however, considers all three representations as equivalent. With the basic algorithm this is not possible, because it would require treating single, double, and aromatic bonds alike, which obviously leads to undesired side effects. By treating rings as units, however, these special types of rings can be identified in the preprocessing step and then be transformed into a standard aromatic ring. Figure 14: Aromatic and Kekulé representations of benzene. support numbers are determined: one for the focus subset and one for the complement. Only the support in the focus subset is used to prune the search tree. The support in the complement determines whether a frequent substructure is recorded or not, thus filtering out substructures that do not satisfy the requirements for a discriminative structure. 4.2 4.3 Variable Length Chains Often the exact length of a carbon chain in a molecule is not important, as long as it is within a certain range. In this case, the fragment may contain only an indication of a chain, which match a certain number of chained carbon atoms in a molecule, where the number may differ for different molecules. This type of imprecise match can be handled in a similar way as the ring extension we studied in the previous section: Instead of bond by bond, a carbon chain is added in one step, allowing for different lengths. The implementation described here offers this possibility, but does not yet take a range for the acceptable lengths of the chain. Rings An unpleasant problem of the search algorithm as we presented it up to now is that the local ordering rules for suppressing redundant searches are not complete. Two extensions by identical bonds leading to identical atoms cannot be ordered based on locally available information alone (see above). As a consequence, a certain amount of redundant search has to be accepted in this case, because otherwise incorrect support values would be computed and it could not be guaranteed that all frequent substructures are found. 4.4 Unfortunately, situations leading to such redundant search occur almost always in molecules with rings, whenever the fragment extension process enters a ring. Consequently, molecules with a lot of rings, like the steroid shown in Figure 13—despite all clever search tree pruning—still lead to a prohibitively high search complexity. Wildcard Atoms In principle it is possible to extend the search algorithm to handle wildcard atoms, which may match any atom in a user-defined group of atoms. Details can be found in [7, 8]. This possibility is available in a sibling implementation of the algorithm described here. This implementation is, however, property of Tripos, Inc., and thus cannot be made available. The Java implementation described here currently lacks this capability. To cope with this problem, we added an (optional) preprocessing step to the algorithm, in which the rings of the molecules are found and marked. The implementation described here offers the possibility to specify a range for the ring size, so that this feature can be restricted to chemically meaningful rings, for example, of size 5 and 6. In the search process itself, whenever an extension adds a bond to a fragment that is part of a ring, not only this bond, but the whole ring is added (all bonds and atoms of the ring are added in one step). As a consequence, the depth of the search tree is reduced (the basic algorithm needs five or six levels to construct a ring bond by bond) and many of the redundant searches of the basic algorithm are avoided, which leads to enormous speed-ups. Details can be found in [7, 8]. 5. EXPERIMENTAL RESULTS In our experiments we restrict ourselves to demonstrating the effects of the two advanced pruning approaches described above. We carried out experiments on three data sets, all of which are publicly available. The first is a very small data set consisting of 17 steroid molecules.4 The effects of the two pruning methods are shown in Table 1 (all experiments were carried out with a minimum support of one molecule). As can be seen, both pruning methods have considerable effects, with those of perfect extension pruning being much more pronounced. However, this is presumable due to the very low minimum support, which makes perfect extension pruning highly effective. At higher support values equivalent sibling pruning seems to be more effective (see below). Besides a considerable reduction of the running time, treating rings as units when extending fragments has other advantages as well. In the first place, it makes a special treatment of atoms and bonds in rings possible, which is especially important for aromatic rings. For example, benzene can be represented in different ways as shown in Figure 14. On the left, all six bonds are represented as aromatic bonds, while the other two representations—called In a second test we ran the program on the well-known HIV data set available from the National Cancer Institute [10]. This library contains about 44,000 molecules tested for their activity against the HI-virus, which are grouped into three 4 13 See Section 7 below for a download URL. log(time/s) over support 300 150 200 100 100 50 0.5 1 1.5 2 2.5 log(time/s) over support 200 3 3 Figure 15: Execution times on NCI’s HIV data. 3.5 4 4.5 5 6 Figure 17: Execution times on the IC93 data. log(time/s) over support 600 5.5 log(time/s) over support 1000 400 500 200 0 0 0.5 1 1.5 2 2.5 3 3 3.5 4 4.5 5 5.5 6 Figure 16: Numbers of nodes on NCI’s HIV data. Figure 18: Numbers of nodes on the IC93 data. classes: about 400 belong to class CA (confirmed active), about 1000 to CM (confirmed medium active) and the rest belongs to CI (confirmed inactive). In the experiments we report here, however, we neglected these class assignments and tried to find common substructures of all molecules at minimum support thresholds ranging from 0.7 to 3%. In addition, we tested on the IC93 data set [11]. For all experiments we used a Pentium 4C 2.6GHz system with 1GB of main memory running S.u.S.E. Linux 9.3 and Java 1.5.0. number of nodes (see Figure 18), which indicates that part of the gains resulting from this pruning type are eaten up by the somewhat costly test for a perfect extension. 6. CONCLUSIONS In this paper we presented a Java implementation of an efficient graph substructure mining algorithm that finds closed frequent fragments of molecules. The algorithm is based on a depth first search in a tree of substructures, in which lists of embeddings into the set of molecules are processed and extended. The core ingredients of the approach, which make it efficient, are a structural pruning scheme, which tries to minimize the number of times a fragment is considered in the search, and two advanced pruning techniques, which speed up the search considerably. In addition, the implementation offers other extensions, like ring and chain mining, which make it very useful for biochemical applications. Results on the NIC’s HIV data are shown in Figures 15 and 16. These results were obtained with activated ring mining for rings of size 5 and 6. Results on the IC93 data are shown in Figures 17 and 18 and were obtained without ring mining. The former figure of each pair (i.e., 15 and 17) shows the execution time in seconds, the latter (i.e., 16 and 18) the number of search tree nodes (in thousands) over the minimum support in percent. Solid grey lines refer to the basic algorithm with neither of the two advanced pruning strategies. The dashed grey line refers to the results for equivalent sibling pruning only, the dotted grey line to the results with perfect extension pruning only. Finally, the black line shows the results obtained with both pruning methods activated. 7. PROGRAM The implementation of the molecular substructure mining algorithm described in this paper (executable Java archive as well as the source code, distributed under the LGPL) can be downloaded free of charge at As can be seen from Figures 15 and 16, equivalent sibling pruning yields the highest gains for the HIV data set, while perfect extension pruning yields only fairly small gains. On the IC93 data the picture is reversed: perfect extension pruning is more effective. However, both methods yield considerable gains, which add nicely when both pruning strategies are activated. It is interesting to note that the effects of perfect extension pruning are more pronounced for the http://fuzzy.cs.uni-magdeburg.de/˜borgelt/software.html Explanations of how to apply the program can be found at http://www.inf.uni-konstanz.de/bioml/projects/mofa/ It takes a simple set of 17 steroids (also used above) as an example, which can also be retrieved from the above URLs. 14 8. REFERENCES [13] M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. Proc. 1st IEEE Int. Conf. on Data Mining (ICDM 2001, San Jose, CA), 313–320. IEEE Press, Piscataway, NJ, USA 2001 [1] R. Agrawal, T. Imielienski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proc. Conf. on Management of Data, 207–216. ACM Press, New York, NY, USA 1993 [14] T. Meinl, C. Borgelt, and M.R. Berthold. Discriminative Closed Fragment Mining and Perfect Extensions in MoFa. Proc. 2nd Starting AI Researchers’ Symposium (STAIRS 2004, Valencia, Spain), 3–14. IOS Press, Amsterdam, Netherlands 2004 [2] C. Borgelt and M.R. Berthold. Mining Molecular Fragments: Finding Relevant Substructures of Molecules. Proc. IEEE Int. Conf. on Data Mining (ICDM 2002, Maebashi, Japan), 51–58. IEEE Press, Piscataway, NJ, USA 2002 [15] T. Meinl, C. Borgelt, and M.R. Berthold. Mining Fragments with Fuzzy Chains in Molecular Databases. Proc. 2nd Int. Workshop on Mining Graphs, Trees and Sequences (MGTS 2004, Pisa, Italy), 49–60. University of Pisa, Pisa, Italy 2004 [3] C. Borgelt. Efficient Implementations of Apriori and Eclat. Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL). CEUR Workshop Proceedings 90, Aachen, Germany 2003. http://www.ceur-ws.org/Vol-90/ [16] S. Nijssen and J.N. Kok. A Quickstart in Frequent Structure Mining can Make a Difference. Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD2004, Seattle, WA), 647–652. ACM Press, New York, NY, USA 2004 [4] C. Borgelt, T. Meinl, and M.R. Berthold. Advanced Pruning Strategies to Speed Up Mining Closed Molecular Fragments. Proc. IEEE Conf. on Systems, Man and Cybernetics (SMC 2004, The Hague, Netherlands), CD-ROM. IEEE Press, Piscataway, NJ, USA 2004 [17] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering Frequent Closed Itemsets for Association Rules. Proc. 7th Int. Conf. on Database Theory (ICDT’99, Jerusalem, Israel), LNCS 1540:398–416. Springer-Verlag, Heidelberg, Germany 1999 [5] D.J. Cook and L.B. Holder. Graph-Based Data Mining. IEEE Trans. on Intelligent Systems 15(2):32–41. IEEE Press, Piscataway, NJ, USA 2000 [18] T. Washio and H. Motoda. State of the Art of Graph-Based Data Mining. SIGKDD Explorations Newsletter 5(1):59–68. ACM Press, New York, NY, USA 2003 [6] P.W. Finn, S. Muggleton, D. Page, and A. Srinivasan. Pharmacore Discovery Using the Inductive Logic Programming System PROGOL. Machine Learning, 30(2-3):241–270. Kluwer, Amsterdam, Netherlands 1998 [19] X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. Proc. 2nd IEEE Int. Conf. on Data Mining (ICDM 2003, Maebashi, Japan), 721–724. IEEE Press, Piscataway, NJ, USA 2002 [7] H. Hofer, C. Borgelt, and M.R. Berthold. Large Scale Mining of Molecular Fragments with Wildcards. Proc. 5th Int. Symposium on Intelligent Data Analysis (IDA 2003, Berlin, Germany), LNCS 2810:380–389. Springer-Verlag, Heidelberg, Germany 2003 [20] X. Yan and J. Han. Closegraph: Mining Closed Frequent Graph Patterns. Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2003, Washington, DC), 286–295. ACM Press, New York, NY, USA 2003 [8] H. Hofer, C. Borgelt, and M.R. Berthold. Large Scale Mining of Molecular Fragments with Wildcards. Intelligent Data Analysis 8:495–504. IOS Press, Amsterdam, Netherlands 2004 [21] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New Algorithms for Fast Discovery of Association Rules. Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD’97, Newport Beach, CA), 283–296. AAAI Press, Menlo Park, CA, USA 1997 [9] J. Huan, W. Wang, and J. Prins. Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. Proc. 3rd IEEE Int. Conf. on Data Mining (ICDM 2003, Melbourne, FL), 549–552. IEEE Press, Piscataway, NJ, USA 2003 [10] HIV Antiviral Screen. National Cancer Institute. http://dtp.nci.nih.gov/docs/aids/aids data.html [11] Index Chemicus — Subset from 1993. Institute of Scientific Information, Inc. (ISI). Thomson Scientific, Philadelphia, PA, USA 1993 [12] S. Kramer, L. de Raedt, and C. Helma. Molecular Feature Mining in HIV Data. Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2001, San Francisco, CA), 136–143. ACM Press, New York, NY, USA 2001 15