Mining Tree-Based Association Rules From XML Documents: Mirjana Mazuran Elisa Quintarelli Letizia Tanca

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

Mining tree-based association rules from XML documents

Mirjana Mazuran
Dipartimento di Elettronica e Informazione Politecnico di Milano Italy

Elisa Quintarelli
Dipartimento di Elettronica e Informazione Politecnico di Milano Italy

Letizia Tanca
Dipartimento di Elettronica e Informazione Politecnico di Milano Italy

[email protected] ABSTRACT

[email protected]

[email protected]

The increasing amount of XML datasets available to casual users increases the necessity of investigating techniques to extract knowledge from these data. Data mining is widely applied in the database research area in order to extract frequent correlations of values from both structured and semistructured datasets. In this work we describe an approach to mine Tree-based association rules from XML documents. Such rules provide information on both the structure and the content of XML documents; moreover, they can be stored in XML format to be queried later on. The mined knowledge is approximate, intensional knowledge used to provide: (i) quick, approximate answers to queries and (ii) information about structural regularities that can be used as dataguides for document querying. A prototype of the proposed system is also briey described.

and 2) they are very often confused by the large amount of information available. This limitation of XML is a crucial problem, which did not emerge in the past years in the context of traditional (relational) database management systems, and must be addressed in order to provide access to these data to a wider set of users. The application of data mining techniques to extract useful knowledge from XML datasets has received a lot of attention in the recent years due to the wide availability of these datasets. In particular, the process of mining association rules to provide summarized representations of XML documents has been investigated in many proposals and in particular either by using languages (e.g. XQuery) and techniques developed in the XML context, or by implementing graph/tree-based algorithms. By mining frequent patterns from XML documents, we provide the users with partial, and often approximate, information both on the document structure and on the content. Such patterns can be useful for the users to obtain information and implicit knowledge on the documents and thus be more eective in query formulation. Moreover, this information is also useful for the system, which is provided with discovered information, like hidden integrity constraints, which can be used for semantic optimization.

1.

INTRODUCTION

In the recent years the database research eld has concentrated on XML (eXtensible Markup Language [29]) as an expressive and exible hierarchical model suitable to represent huge amounts of data with no absolute and xed schema, and with a possibly irregular and incomplete structure. Despite its impressive growth in popularity, XML is still lacking appropriate techniques to retrieve datasets available to casual users; such datasets, on one hand, have a limited or absent structure, and on the other hand contain a huge amount of data. Together with intrinsically unstructured documents, there is a signicant portion of XML documents which have only an implicit structure, that is, their structure has not been declared in advance, for example via a DTD or an XMLSchema [26]. Querying such documents is quite dicult for users for two main reasons: 1) they are not able to specify a reasonably probable structure in the query conditions
This research is partially supported by the Italian MIUR project ARTDECO

Goal and contributions


The goal of this paper is to provide a method for mining intensional knowledge from XML datasets expressed by means of association rules. In particular, we propose Tree-based association rules as a means to represent such an intensional knowledge in native XML language. We introduce a proposal for mining, and also storing tree-based association rules for two main purposes: 1) to get a concise view of both the structure and the content of XML documents, and 2) to use them for intensional query answering. Our mining procedure is characterized by the following key aspects: 1) it works directly on the XML documents, without transforming the data into relational or any other intermediate format, 2) it looks for general association rules, without the need to impose what should be contained in the antecedent and consequent of the rule, and 3) it stores association rules in XML format.

Structure of the paper


The paper is organized as follows. Section 2 introduces other work related to XML association rule mining and usage. Section 3 explains what tree-based association rules are and two possible applications of such rules then Section 4 presents how tree-based rules are extracted from XML documents and how they are used to respond to intensional queries. Section 5 describes a prototype that implements our proposal whereas Section 6 explains the experimental results obtained by testing our prototype on real XML datasets. Section 7, at last, states the possible follow-ups to this work.

of extracting association rules from XML documents, i.e. to mine all frequent rules, without having any a-priori knowledge of the XML dataset. A similar idea was presented in [24] where the authors introduced HoPS, an algorithm for extracting association rules in a set of XML documents. Such rules are called XML association rules and are implications of the form X Y , where X and Y are fragments of an XML document. In particular the two trees X and Y have to be disjunct. The limitation of the proposal in [24] is that it does not contemplate the possibility to mine general association rules within a single XML dataset, while achieving this feature is one of our goals. The idea of using association rules as summarized representations of XML documents was also introduced in [5] where the XML summary is based on the extraction of association rules both on the structure (schema patterns) and on content values (instance patterns) from XML datasets. The limitation of such an approach is that the so-called schema patterns, used to describe general properties of the schema applying to all instances, are not mined, but derived as an abstraction of similar instance patterns. In our work, XML association rules are mined starting from frequent subtrees of the tree-based representation of a document. In the database literature it is possible to nd many proposals of algorithms to extract frequent structures from tree/graph-based data structures. Just to cite some of them, Tree Miner [36], Path Join [34], CloseGraph [35] propose algorithms to directly mine frequent itemsets -not association rules- from XML documents. Tree Miner and CloseGraph do not preserve the exact structure of the itemsets extracted - only the descendant-of (and not the child-of) relationship between nodes is preserved - whereas Path Join does. In this work we propose an algorithm that extends Path Join to mine generic tree-based association rules directly from XML documents.

2.

RELATED WORK

The problem of association rule mining was initially proposed in Agrawal [2] and successively many implementations of the algorithms, downloadable from [14], were developed and described in the database literature, Weka1 being a known framework. More recently the problem has been investigated also in the XML context [7, 30, 24]. In [30] the authors use XQuery [28] to extract association rules from simple XML documents. They propose a set of functions written only in XQuery which implement together the Apriori algorithm [2]. The same authors show in [30] that their approach performs well on simple XML documents; however it is very dicult to apply this proposal to complex XML documents with an irregular structure. This limitation has been overcome in [7], where the authors introduce a proposal to enrich XQuery with data mining and knowledge discovery capabilities, by introducing XMINE RULE, a specic operator for mining association rules for native XML documents. They formalize the syntax and an intuitive semantics for the operator and propose some examples of complex association rules. However, the operator proposed in [7] uses the MINE RULE operator, which works on relational data only. This means that, after a step of pruning of unnecessary information, the XML document is translated into the relational format. Moreover, both [7] and [30] force the designer to specify the structure of the rule to be extracted and then to mine it, if possible. This means that the designer has to specify what should be contained in the body and head of the rule, i.e. the designer has to know the structure of the XML document in advance, and this is an unreasonable requirement when the document has not an explicit DTD. Another limitation of these approaches is that the extracted rules have a xed root, thus once the root node of the rules to mine has been xed, only its descendants are analyzed. Let us consider the dataset in Figure 1 to explain this consideration. In order to infer the co-author relationship among authors of conferences it is necessary to x the root node of the rules in the article element, the body and head in author. In such way it is possible to learn that John Black and Mark Green frequently write papers together. However, it is not possible to mine itemsets stating that frequently, during 2008 conferences have been held in Milan. Indeed, to mine such property the body of the rules should be xed in the year element, which is not contained in the subtree of the article node, and the head in place. Our idea is to take a more general approach to the problem
1

3.

TREE-BASED ASSOCIATION RULES

Association rules [2] describe the co-occurrence of data items in a large amount of collected data and are usually represented as implications in the form X Y , where X and Y are two arbitrary sets of data items, such that X Y = . The quality of an association rule is usually measured by means of support and condence. Support corresponds to the frequency of the set X Y in the dataset, while condence corresponds to the conditional probability of nding Y , having found X and is given by sup(X Y )/sup(X ). In this work we extend the notion of association rule originally introduced in the context of relational databases, in order to adapt it to the hierarchical nature of XML documents. In particular, we consider the element-only Infoset content model [27], which allows an XML nonterminal tag to include only other elements and/or attributes, while the text is conned to terminal elements. Furthermore, without loss of generality, we do not consider some features of the Infoset that are not relevant to the present work, such as names-

http://www.cs.waikato.ac.nz/ml/weka/

conferences

name articles article author

conference place

year

name articles article

conference place

year

conference name place articles article "Milan" author year

"POLLU"

"2008" "Milan" author title

"TGUA" author

"2008" "Rome" author title

"ToDy" author

"2008" title

"John Black"

"Mark Green"

"On the problem of pollution"

"Rufus White"

"Mark Green"

"Tree growth in urban areas"

"John Black"

"Mark Green"

"A study of tourism dynamics"

Figure 1: XML sample le: conferences.xml paces, the ordering label, the referencing formalism through ID-IDREF attributes, URIs, and Links. Following the Infoset conventions, we represent an XML document by a labeled tree2 N, E, r where N is the set of nodes, r N is the root of the tree (i.e. the root of the XML document), E is the set of edges. Moreover, the following properties on nodes and edges hold: 1) Each node ni has a tuple of labels N Li = N tagi , N typei , N contenti ; the type label N typei indicates whether the node is the root, an element, text, or attribute 3 , whereas the label N contenti can assume as value a PCDATA or (undened, for nonterminals). 2) Each edge ej = (nh , nk ), ELj , with nh and nk in N , has a label ELj = Etypej , Etypej {attribute of, sub-element of}. Note that edges represent the containment relationship between dierent items of an XML document, thus edges do not have names. We are interested in nding relationships among subtrees of XML documents. Thus, we do not distinguish between textual content of leaf elements and value of attributes. As a consequence, in order to draw graphical concepts in a more readable way, we do not report the edge label and the node type label. Attributes and elements are characterized by empty circles, whereas the textual content of elements, or the value of attributes, is reported under the outgoing edge of the element or attribute it refers to.
<A> <B> <E></E> <F> x </F> </B> <C> <D></D> </C> <D> <F> <B> y </B> <C></C> </F> </D> </A>

C D F

D F

"y"

B C

E "x" "y"

(a)

(b)

(c)

Figure 2: a) an example of XML document, b) its tree-based representation, and c) three induced subtrees form Tr = SB , SH , sTr , cTr , where SB =< NB , EB , rB > and SH =< NH , EH , rH > are trees and sTr and cTr are real numbers representing the support and condence of the rule respectively. Furthermore, SB is an induced subtree of SH with an additional property on the node labels. Indeed, the following properties must hold: NB NH EB EH and n, m NB , (n, m), EL EB i (n, m), EL EH the set of tags of SB is equal to the set of tags of SH with the addition of the empty label . The empty label is introduced because the body of a rule may contain nodes with unspecied tags (a sort of placeholder nodes) and these tags will be declared in the head part of the rule (see the rule (4) of Figure 5). For the sake of clarity the label is omitted in the gures and all nodes with empty labels do not present any label at all. It is worth noticing that Tree-based association rules are dierent from XML association rules [24] because the rst require that (X Y ) (Y X ), i.e. the two trees X and Y have to be disjunct, while Tree-based association rules require that X is an induced subtree of Y . A TAR represents intensional knowledge in the form SB

3.1

Fundamental concepts

Given two labeled trees T =< NT , ET , rT > and S =< NS , ES , rS >, S is said to be an induced subtree of T [36] if and only if there exists a mapping : NS NT such that ni NS , N Li = N Lj , where (ni ) = nj and for each edge ej = (n1 , n2 ), ELj ES , ((n1 ), (n2 )), ELj ET . Figure 2 shows an example of an XML document (Figure 2(a)), its tree-based rappresentation (Figure 2(b)) and three induced subtrees of the document (Figure 2(c)). A Tree-based Association Rule (TAR) is a tuple of the Note that XML documents are here tree-like structures (and not generic graphs) because we do not include referencing. 3 XML documents may also contain ENTITY nodes (not unlike macro calls that must be expanded when parsing) or processing instructions or comments. We do not consider such elements in this paper.
2

SH , where SB is the body tree and SH the head tree of the rule. Indeed, the rule SB SH states that if the tree SB appears in an XML document D, it is likely that the bigger, or more specic, tree SH also appears. In our graphical representation, we will render the nodes of the body of a rule by black circles, and the nodes of the head by empty circles. Thus, every tree-based association rule is characterized by two measures: sTr support, measures the frequency of the tree SH in the XML document cTr condence, measures the reliability of a rule, that is the frequency of the tree SH , once SB has already been found Given function count (S, D) denoting the number of occurrences of a subtree S in the tree D and function cardinality (D) denoting the number of nodes of D, it is possible to dene formally the two measures as:

rule (1) (2) (3)

rule support 3/28 = 0.10 2/28 = 0.07 3/28 = 0.07

body support 3/28 = 0.10 3/28 = 0.10 3/28 = 0.10

rule condence 3/3 = 1.00 2/3 = 0.66 3/3 = 1.00

Table 1: Support and condence of rules in Fig 3


A

B C A B C E A B E C B D B C B C B

D A B B A C

E A D C B E

E B

Figure 4: Sample dataset It is possible to notice, from the examples, that tree-based association rules, in addition to correlation of PCDATA values, provide information about the structure of frequent portions of XML documents; thus they are more expressive than classical association rules which only provide us with frequent correlations of at values. Similarly, Figure 5 shows some examples of sTARs referred to the XML document in Figure 4. Such dataset is intentionally more complex and irregular than the dataset in Figure 3, in order to better show the potential of sTARs. In particular, rule (1) states that if there is a node labeled A in the document, probably that node has a child labeled B . Rule (2) states that if there is a node labeled B , probably its parent is labeled A. Rule (3) states that if a node C has a child labeled B , it probably also has a child labeled E . To conclude, Rule (4) states that if a node A is the grandparent of a node C (note, in the body of the rule, the empty label of the parent of node C ), probably the child of A and parent of C , is labeled B . Table 2 shows, for each mined rule, its support and condence.

support(SB SH ) =

count(SH , D) cardinality (D) count(SH , D) count(SB , D)

conf idence(SB SH ) =

Given an XML document it is possible to extract two types of tree-based association rules: iTARs: instance TARs are association rules providing information both on the structure and on the PCDATA values contained in a target XML document (see Figure 3). sTARs: structure TARs are association rules on the structure of the XML document. An sTAR is a tuple Ti = SB , SH , sTi , cTi where, for each node n either in SB or in SH , n has as label a triple TAG, TYPE, , i.e. no PCDATA is present in an sTAR (see Figure 5 for some examples). Figure 3 shows some examples of iTARs referred to the XML document in Figure 1. Rule (1) states that if there is a node labeled conference in the document, it probably has a child labeled year whose value is 2008. Rule (2) states that if there is a path composed by the following sequence of nodes: conference - articles - article - author, and the content of author is Mark Green, then node authors probably has another child labeled author whose content is John Black. Finally, rule (3), states that if there is a path composed by the following sequence of nodes: conference - articles article and the node article has two children labeled author whose contents are Mark Green and John Black, then node conference probably has two other children labeled year and place whose contents are respectively 2008 and Milan. Table 1 shows, for each one of these rules, its support and condence.

3.2

The use of TARs

Tree-based association rules provide an approximate view of both the content and the structure of an XML document. Such features can be used for: 1. providing intensional, although approximate, answers rule (1) (2) (3) (4) rule support 4/34 = 0.11 4/34 = 0.11 3/34 = 0.08 2/34 = 0.05 body support 7/34 = 0.2 12/34 = 0.35 5/34 = 0.14 6/34 = 0.17 rule condence 4/7 = 0.57 4/12 = 0.33 3/5 = 0.6 2/6 = 0.33

Table 2: Support and condence of rules in Fig 5

conference conference articles conference year article author "2008" "Mark Green" (1)

conference

conference

conference

articles article author "Mark Green" (2) author "John Black" author "Mark Green"

articles article author "John Black"

year "2008" author "Mark Green" (3)

articles article

place "Milan"

author "John Black"

Figure 3: Sample iTARs (instance Tree-based Association Rules)


A A A B A C C B B B B B E C (1) (2) (3) (4) C A

Figure 5: Sample sTARs (structure Tree-based Association Rules) to user queries; 2. providing an approximate DataGuide [15] for the document.

<result> { for $t in document("conference.xml")//title order by $t ascending return $t } </result> Class 3: this kind of query is used to count the number of elements with a specic content. To answer this queries we use an association rule whose body matches the query conditions, and obtain as answer sup (where sup and conf are support and condence conf of the rule. More specically, to count the elements satisfying a condition on their content, we use the (AB ) equality conf(A B ) = sup , where A B sup(A) is an association rule. For example, in the conferences.xml document, we can count the dierent articles written by Mark Green by using an association rule A B that has the path article - author (with content Mark Green) in the body, thus the number sup (AB ) of articles is given by sup(A) = . conf (AB ) Class 4: this kind of query is used to select the best k answers satisfying a counting and grouping condition, for example Retrieve the k authors who wrote the highest number of articles.

3.2.1

Intensional answers to queries

The classes of queries that can be managed with our approach have been introduced in [13] and further analyzed in the relational database context in [5]. We show some simple examples for four classes of queries, discussed in the following.

Class 1: this kind of query is used to impose a simple, or complex (containing AND and OR operators), restriction on the value of an attribute or the content of a leaf node. An example is Find the articles written by Mark Green. The query imposes a constraint on the content of the author element; the XQuery expression, to be applied on the original XML dataset, is: <result> { for $a in document("conferences.xml")//article where $a/author = Mark Green return $a } </result> Class 2: this kind of query is used to retrieve some properties described in the subtrees rooted in a specied element, possibly ordering the result. An example is Retrieve in an ascending order the titles of all articles. The XQuery expression, which does not specify any constraint on the content of elements, is:

3.2.2

Provide a DataGuide

The extracted sTARs can be used as a sort of approximate, but also partial, DataGuide [15] of the original document; these rules can help the user to gain knowledge about the structure of the most frequent subtrees and to formulate queries on these subtrees. As an example, the sTAR mined from the XML dataset in Figure 1 with support 0.03

conference conference year place articles articles name

article author title

Algorithm 2 Compute-Rules (s, minconf) 1: ruleSet = 2: for all cs subtrees of s do 3: conf = supp (s) / supp (cs ) 4: if conf minconf then 5: newRule =< cs , s, conf, supp(s) > 6: ruleSet = ruleSet {newRule} 7: end if 8: end for 9: return ruleSet

Figure 6: A sTAR mined from the XML dataset in Figure 1 proposed in [2]. Such optimization will be adapted to our XML context, for mining tree-based association rules. and condence 1, is shown in Figure 6: whenever there is a conference element with an articles sub-element (see the antecedent), then the structure of the XML document rooted in the conference node is the one represented by the tree in the consequent part of the rule (right-hand-side of the gure). In fact, given a frequent subtree S , all rules derived from that tree have the same support and dierent condences. Since the condence of a rule SB SH can be computed as support(SH )/support(SB ), the support of SB inuences the condence of rules having the same body tree; the higher the support of the body tree, the lower is the condence of the rule. Figure 7 shows a frequent subtree (Figure 7 (a)) and three possible tree-based rules mined from the tree; all three rules have the same support k and condence to be determined. Let the support of the body tree of rule (1) be s. Since the body trees of rules (2) and (3) are subtrees of the body tree of rule (1), their support is at least s, and possibly higher. This means that the condences of rule (2) and (3) are equal, or lower, than the condence of rule (1). Starting from this consideration, it is possible to state the following property, that allows us to optimize Algorithm 2:

4.

TAR EXTRACTION AND PROCESSING

In this section we show how Tree-based rules are extracted from an XML document. Moreover, we explain how they are stored and used to respond to the types of queries introduced in Section 3.2.1.

4.1

Association Rule Extraction

Mining tree-based association rules is a process composed of two steps: 1. mining frequent subtrees from the XML document; 2. computing interesting rules from the previously mined frequent subtrees. As already said, the problem of nding frequent subtrees has been widely discussed in the literature [35, 4, 32, 34, 36]. The problem of computing interesting rules from frequent subtrees can be compared with the problem of extracting classical association rules from large sets of elements, initially introduced in [2]. Algorithm 1 shows how tree-based association rules are mined. The inputs of the algorithm are the set of frequent subtrees, FS , and the minimal threshold for the condence of the rules, minconf . Algorithm 1 Get-Interesting-Rules (FS , minconf ) 1: ruleSet = 2: for all s FS do 3: tempSet = Compute-Rules(s, minconf) 4: ruleSet = ruleSet tempSet 5: end for 6: return ruleSet Depending on the amount of frequent subtrees and their cardinality, the amount of generated rules may be very high. The explosion of the number of mined association rules w.r.t. the number of frequent itemsets occurs also in the relational context; an optimization of the basic algorithm has been

Property If the condence of a rule T is below the established threshold c then the condence of every other rule Ti , such that its body SBTi is an induced subtree of the body SBT , is no greater than c. Algorithm 3 shows how tree-based rules are mined exploiting the introduced optimization. Note that it is advisable to rst generate the rules with the highest number of nodes in the body tree. For example, let us consider two rules Tr1 and Tr2 whose body trees contain 1 and 3 nodes respectively, as shown in Figure 8(a). Suppose both rules have condence below the xed threshold. If the algorithm considers rule Tr2 rst, all rules that have the bodies shown in Figure 8(b) will be discarded when Tr2 is eliminated. On the other hand, since the body tree of Tr1 has only one node and therefore has no induced subtree dierent from itself, starting from Tr1 will not produce any optimization. Therefore, it is more convenient to rst generate rule Tr2 and in general, to start the mining process from the rules with a larger body. Once the mining process has been performed and frequent tree-based rules have been extracted, we store them in XML format. This decision has been taken to allow the use of the same language (e.g. XQuery) for both the original dataset, and the mined rules.

A B D C E B

A C B

A A C B D D E C D B

A A C B E D B

A C

(a)

(1)

(2)

(3)

Figure 7: Rules optimization example

X X X Y X Y Z Y Z Z Y

<!ELEMENT instanceRules (Rule+) > <!ELEMENT Rule (Node) > <!ELEMENT Node (Value, Node*) > <!ELEMENT Value (#PCDATA) > <!ATTLIST Rule ID #PCDATA #REQUIRED > <!ATTLIST Rule support #PCDATA #REQUIRED > <!ATTLIST Rule condence #PCDATA #REQUIRED > <!ATTLIST Node label #REQUIRED > <!ATTLIST Node role (head) > <!ATTLIST Node isGeneric (true | false) > (a)

(a)

(b)

Figure 8: Rules optimization example Algorithm 3 Compute-Rules (s, minconf) 1: ruleSet = 2: blackList = 3: for all cs , subtrees of s do 4: if cs is not an induced subtree of any element in blackList then 5: conf = supp (s) / supp (cs ) 6: if conf minconf then 7: newRule =< cs , s, conf, supp(s) > 8: ruleSet = ruleSet {newRule} 9: else 10: blackList = blackList cs 11: end if 12: end if 13: end for 14: return ruleSet

<!ELEMENT structureRules (Rule+) > <!ELEMENT Rule (Node) > <!ELEMENT Node (Node*) > <!ATTLIST Rule ID #PCDATA #REQUIRED > <!ATTLIST Rule support #PCDATA #REQUIRED > <!ATTLIST Rule condence #PCDATA #REQUIRED > <!ATTLIST Node label #REQUIRED > <!ATTLIST Node role (body | head) > (b)

Figure 9: DTD of the XML document containing tree-based rules answering. We consider indices on both structure TARs and instance TARs. Given a set of rules, the index associates with every distinct node n in this set, the references to those rules which contain n. Therefore, an index is a set of tuples < n, rif > where n is a node and rif is a list of references to the rules. Algorithm 4 shows how the data structure for the index, which is then translated in XML format, is constructed, given the set of rules R. Algorithm 4 Create-Index (R) 1: Index 2: for all rule ri R do 3: for all node nj ri do 4: if < n, rif > Index: n = nj then 5: Index = Index {< nj , >} 6: end if 7: < nj , rif >=< nj , rif {ri } > 8: end for 9: end for 10: return Index Algorithm 4 is used to create both the index on sTARs and the index on iTARs. The dierence between the two indices

Figure 9 shows the two DTDs representing the structure of iTARs (Figure 9(a)) and sTARs (Figure 9(b)). In order to save space, only the head tree of a rule is stored. We exploit the fact that the body of the rule is a subtree of the head and use two attributes in the Node element to identify 1) the nodes that are also part of the body tree (the role attribute), and 2) those that have an empty label in the body (the isGeneric attribute).

4.2

Intensional query processing

iTARs can be used to provide intensional information about the actual data contained in the mined XML documents, when these documents are no available or reachable any more, or when the user prefers to obtain a faster (possibly partial) answer. Of course, this choice is worthwhile only if intensional queries processing is faster than extensional query processing. Therefore we introduce indices on tree-based association rules in order to improve the access to mined trees and in general the process of intensional query

<!ELEMENT instanceIndex (Node+) > <!ELEMENT Node (Value, Reference+) > <!ELEMENT Value (#PCDATA) > <!ATTLIST Node label #PCDATA #REQUIRED > <!ATTLIST Reference RuleID #PCDATA #REQUIRED > (a) <!ELEMENT structureIndex (Node+) > <!ELEMENT Node (Value, Reference+) > <!ATTLIST Node label #PCDATA #REQUIRED > <!ATTLIST Reference RuleID #PCDATA #REQUIRED > (b)

define function setinclusion(element $elem, element $set) as xs:boolean { for $s in $set where $s = $elem return true; }

union: the following XQuery function returns the union of two sets of elements: define function union (element $rif1, element $rif2) returns element { let $ins := ( for $r2 in $rif2 where every $r1 in $rif1 satisfies $r1 != $r2 return $r2 ) return $rif1 union $ins }

Figure 10: DTD of the XML document containing indices is that while for sTARs a node is identied only by its label, for iTARs a node is identied by the pair < label, value >, where value is the content of the node labeled label. Indices, like rules, are stored in XML format. Figure 10 shows the two DTDs for the index on instance rules (Figure 10(a)) and the index on structure rules (Figure 10(b)). Given a query q and the intensional data represented by the le containing instance TARs and the le containing the index on such rules, it is possible to obtain the intensional answer in two steps: 1. query the index le, with a rewriting of the query q , looking for the references to the rules which satisfy the conditions imposed by the query q ; 2. query the TARs le in order to return the rules whose reference was found in the previous step; In this section we show the process of obtaining an intensional answer when the user composes an XQuery expression. In order to perform the two steps described above we introduce the XQuery denition of the following functions: references: given a pair (name, content), describing a node of an XML tree, the function returns a set of references of the rules containing the node. The XQuery implementation of the function is: define function references (element $name, element $content) returns element { for $node in document("index.xml")//Node where $node/@label = $name and $node/Value = $content return $node/Reference }

intersection: the following XQuery function returns the intersection of two sets of elements: define function intersection (element $rif1, element $rif2) returns element { for $r1 in $rif1 where setinclusion($r1, $rif2) return $r1 }

ruleset: given a set of references, the function returns the set of references of the rules associated with the input references. The XQuery implementation of the function is: define function ruleset(element $rif) returns element { for $r in document("rules.xml")//Rule where setinclusion ($r/@ID, $rif//RuleID) return $r }

These functions allow us to easily obtain the intensional answer to the classes of XQuery expressions we have analyzed in Section 3.2. Indeed the intensional answer for an XQuery e is obtained by applying such functions to the XML document of previously mined tree-based rules. As an example, the intensional answer to the considered Class 1 query Find the articles written by Mark Green of Section 3.2, is obtained by the following function call: query (author, Mark Green) where the function query is dened as follows:

setinclusion: given a set of elements and a specic element e, the function returns true when the element e is in the set, false otherwise. The XQuery implementation of the function is:

intensional knowledge

XML le extensional answer

extraction Index TARs

translation Query

Intensional Query

intensional answer

User

Figure 11: TreeRuler architecture

define function query (element $name, element $content) returns element { let $Rif = references($name, $content) let $Rule = rules ($Rif) return $Rule } Figure 12: TreeRuler get the Gist tab This expression retrieves useful intensional information by performing the two steps 1 and 2 described above: rst, the variable $Rif is initialized with the set of index references of the rules containing the element author with content Mark Green. Then, the variable $Rule is initialized with the set of rules whose reference is in $Rif. The answer is an XML document composed by the tree-based rules satisfying the constraints indicated in the original query. get the Idea allows the visualization of the intensional information as well as the original document, in order to give the user the possibility to compare the two kinds of information. get the Answers (Figure 13) allows to query the intensional knowledge and the original XML document. The user has to write an extensional query in the box on the left; when the query belongs to the classes we have analyzed it is translated into the intensional form, shown to the user in the right part of the form. Finally, once the query is executed, the Tree-based rules that reect the search criteria are shown in the box at the bottom of the form.

5.

THE TREERULER PROTOTYPE

TreeRuler is a prototype tool that integrates all the functionalities proposed in our approach. Given an XML document, the tool is able to extract intensional knowledge, and allows the user to compose traditional queries as well as queries over the intensional knowledge. Figure 11 shows the architecture of the tool. In particular, given an XML document, it is possible to extract Tree-based rules and the corresponding index le. The user formulates XQuery expressions on the data, and these queries are automatically translated in order to be executed on the intensional knowledge. The answer is given in terms of the set of Tree-based rules which reect the search criteria. A schreenshot of the tool is shown in Figure 12: it is composed by several tabs for performing dierent tasks. In particular, there are three tabs:

5.1

Technical aspects

TreeRuler is implemented in C++ with the use of the eXpat4 library for XML parsing, and of the wxWidgets5 tools for the GUI. At present, TreeRuler implements the PathJoin [34] algorithm for the extraction of frequent subtrees from the XML document.

6.

EXPERIMENTAL RESULTS

In this section we introduce three types of experiments we performed with TreeRuler to evaluate the proposed underlying approach: experiments performed in order to monitor the time required for the extraction of the intensional knowledge from an XML database;
4 5

get the Gist (Figure 12) allows intensional information extraction from an XML document, given the desired support, condence and the les where the extracted tree-based rules and their index must be stored.

http://expat.sourceforge.net http://www.wxwidgets.org/

0,2 0,15 0,1 0,05


0 500 100 0 150 0 200 0 250 0 300 0

Figure 14: Extraction time growth (y = seconds), w.r.t. the number of nodes (x ) in the XML tree

6.2

Answer time

We have performed some experiments to monitor the time needed to obtain both intensional and extensional answers to user queries. Figure 15 shows, for each XML document we considered, the time TreeRuler took to give an intensional and extensional answer to a simple selection query. In particular we performed the following extensional query on the analyzed XML documents: define function ExtQuery (element document, element node-name) returns element { for $x in doc//node-name order by $x ascending return $x }

Figure 13: TreeRuler get the Answers tab

experiments performed in order to monitor the time needed to answer intensional and extensional queries over an XML le; a use case scenario on the DocBook6 XML database, in order to monitor extraction times given a specic support or condence.

6.1

Extraction time

We have performed some experiments on real XML datasets. First we tried to execute TreeRuler on the datasets found at the XMLData Respository7 but all the documents were too structured and the extracted intensional knowledge was very poor. Moreover the DTD for all document was already provided. Then we used GCC_XML8 in order to create XML documents starting from C++ source code. The documents produced by GCC_XML were unstructured and without a DTD. In particular, we used these XML datasets to monitor the time TreeRuler takes to extract intensional knowledge, i.e. tree-based association rules and the corresponding index, from the XML le. Figure 14 shows how extraction time depends on the number of nodes in the XML document. It is possible to notice that extraction time growth is almost linear with respect to the cardinality of an XML tree. Moreover, the compression factor provided by the XML representation of TARs, compared to the size of the original XML dataset, is signicant (e.g. 264 KB w.r.t. 4 KB). http://www.docbook.org/ http://www.cs.washington.edu/research/ xmldatasets/ 8 http://www.gccxml.org/HTML/Index.html
7 6

where document is the name of the considered XML le, and node-name is the name of a node contained in the document (do note that it changes on the basis of the XML document). Such query was translated by the prototype into the following intensional form and was executed on the knowledge we mined from each XML document: define function IntQuery (element node-name) returns element { for $x in document("rules.xml")//Node [@name=node-name] order by $x ascending return $x }

It is possible to notice that intensional times are always signicantly smaller than extensional ones (actually almost constant), thus proving the eectiveness of our approach. Recall and precision are commonly used to evaluate the accuracy of approaches which return approximate answers. Recall measures the fraction of data in the extensional query result which is actually returned by querying TARs. Precision measures the fraction of the returned data which correctly matches the query. TARs are characterized by 100% precision for all query classes described in Section 3.2.1,

0,16 0,12 0,08 0,04


0 500 100 0 150 0 200 0 250 0 300 0

0,31 0,305 0,3 0,295


0 0 ,2 5 0 ,5 0 ,7 5 1

Figure 15: Extensional (curve with squares) and intensional (curve with diamonds) time answering (y = seconds) w.r.t. each document (x = number of nodes)

Figure 17: Extraction time growth (y = seconds), w.r.t. the condence (x ), given support = 0.02

0,5 0,4 0,3 0,2 0,1


0 0,01 0,02 0,03 0,04 0,05 0,06

We have developed a C++ prototype that has been used to test the eectiveness of our proposal. As an ongoing work, we are studying how to further optimize our mining algorithm; moreover, we would like to nd the exact fragment of XQuery expressions we can manage with intensional knowledge.

Figure 16: Extraction time growth (y = seconds), w.r.t. the support (x ), given condence = 0.95

while recall depends on the support threshold established before applying the mining algorithm. Indeed, the application of our mining algorithm returns only frequent subtrees and the number of such trees depends on the support threshold. Thus, the minimum support threshold strongly inuences query recall, it is a relevant parameter for the intensional representation design.

6.3

Use case scenario

In this section we report the results we have obtained by applying TreeRuler on the DocBook XML database. In particular we show the relationships between support, condence, and the extraction time of tree-based rules. By setting the condence at 0.95, Figure 16 shows how the extraction time changes with respect to the support. Similarly, by setting the support at 0.02, Figure 17 shows how the extraction time changes with respect to the condence.

7.

CONCLUSIONS AND FUTURE WORK

In this work we have proposed an algorithm that extends PathJoin [34] and allows us to extract frequent tree-based association rules from XML documents. The main goals we have achieved are: 1) we mine frequent association rules without imposing any a-priori restriction on the structure and the content of the rules; 2) we store mined information in XML format; as a consequence, 3) we can eectively use the extracted knowledge to gain information, by using query languages for XML, about the original datasets where the mining algorithm has been applied.

[1] N. Zhang 0002, M. T. Ozsu, I. F. Ilyas, and A. Aboulnaga. Fix: Feature-based indexing technique for xml documents. In VLDB 06: Proceedings of the 20th International Conference on Very Large Data Bases, pages 259270, 2006. [2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB 94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 487499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. [3] S. Amer-Yahia, S. Cho, L. V. S. Lakshmanan, and D. Srivastava. Minimization of tree pattern queries. In SIGMOD Conference, pages 497508, 2001. [4] T. Asai, H. Arimura, T. Uno, and S. Nakano. Discovering frequent substructures in large unordered trees, 2003. [5] E. Baralis, P. Garza, E. Quintarelli, and L. Tanca. Answering xml queries by means of data summaries. ACM Transactions of Information Systems, 25(3):10, 2007. [6] E. Bertino, B. Catania, and W. Q. Wang. Xjoin index: Indexing xml data for ecient handling of branching path expressions. In International Conference on Data Engineering, page 828, 2004. [7] D. Braga, A. Campi, S. Ceri, M. Klemettinen, and P. Lanzi. Discovering interesting information in xml data with association rules. In SAC 03: Proceedings of the 2003 ACM symposium on Applied computing, pages 450454, New York, NY, USA, 2003. ACM Press. [8] B. Bringmann and S. Nijssen. What is frequent in a single graph? In PAKDD, pages 858863, 2008. [9] M. Cannataro, C. Comito, and A. Pugliese. Squeezex: Synthesis and compression of xml data. In International Symposium on Information Technology, pages 326331, 2002. [10] C.W. Chung, J.K. Min, and K. Shim. Apex: an adaptive path index for xml data. In ACM SIGMOD Conference, pages 121132, 2002.

8.

REFERENCES

[11] C. Combi, B. Oliboni, and R. Rossato. Querying xml documents by using association rules. In 16th International Workshop on Database and Expert Systems Applications (DEXA05), pages 10201024, 2005. [12] L. Feng, T. S. Dillon, H. Weigand, and E. Chang. An xml-enabled association rule framework. In 14th International Conference on Database and Expert Systems Applications (DEXA 03), pages 8897, 2003. [13] S. Gasparini and E. Quintarelli. Intensional query answering to xquery expressions. In 16th International Conference on Database and Expert Systems Applications (DEXA 05), pages 544553, 2005. [14] B. Goethals and M. J. Zaki. Advances in frequent itemset mining implementations: report on FIMI03. SIGKDD Explor. Newsl., 6(1):109117, 2004. [15] R. Goldman and J. Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. In VLDB 97: Proceedings of the 23rd International Conference on Very Large Data Bases, pages 436445, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc. [16] R. Goldman and J. Widom. Approximate DataGuides, 1999. [17] R. Goldman and J. Widom. Summarizing and searching sequential semistructured sources, 2000. [18] D. Katsaros, A. Nanopoulos, and Y. Manolopoulos. Fast mining of frequent tree structures by hashing and indexing. Information & Software Technology, 47(2):129140, 2005. [19] Q. Li and B. Moon. Indexing and querying xml data for regular path expressions. In Proceedings of the 27th International Conference on Very Large Data Bases, pages 361370, 2001. [20] H. C. Liu and J. Zeleznikow. Relational computation for mining association rules from xml data. In ACM 14th Conference on Information and Knowledge Management, pages 253254, 2005. [21] J. McHugh, J. Widom, S. Abiteboul, Q. Luo, and A. Rajamaran. Indexing semistructured data, 1998. [22] S. Nestorov, J. D. Ullman, J. L. Wiener, and S. S. Chawathe. Representative objects: Concise representations of semistructured, hierarchial data. In In Proceedings of the International Conference on Data Engineering (ICDE 97), pages 7990, 1997. [23] S. Nijssen and J.N. Kok. Ecient discovery of frequent unordered trees. In Proceedings of the rst International Workshop on Mining Graphs, Trees and Sequences (MGTS03), 2003. [24] J. Paik, H. Y. Youn, and U. M. Kim. A new method for mining association rules from a collection of xml documents. In Computational Science and Its U ICCSA 2005, pages 936945, 2005. Applications A [25] et al. T. Asai. Ecient substructure discovery from large semi-structured data. [26] World Wide Web Consortium. XML Schema, 2001. http://www.w3C.org/TR/xmlschema-1/. [27] World Wide Web Consortium. XML Information Set, 2001. http://www.w3C.org/xml-infoset/. [28] World Wide Web Consortium. XQuery 1.0: An XML query language, 2007.

http://www.w3C.org/TR/xquery. [29] World Wide Web Consortium. Extensible Markup Language (XML) 1.0, 1998. http://www.w3C.org/TR/REC-xml/. [30] J. W. W. Wan and G. Dobbie. Extracting association rules from xml documents using xquery. In WIDM 03: Proceedings of the 5th ACM international workshop on Web information and data management, pages 9497, New York, NY, USA, 2003. ACM Press. [31] H. Wang, S. Park, W. Fan, and P. S. Yu. Vist: A dynamic index method for querying xml data by tree structures. In Proceedings of the ACM SIGMOD International Conference on Management of Data SIGMOD 2003, pages 110121, 2003. [32] K. Wang and H. Liu. Discovering typical structures of documents: a road map approach. In SIGIR 98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 146154. ACM Press, 1998. [33] K. Wang and H. Liu. Discovering structural association of semistructured data. IEEE Transactions on Knowledge and Data Engineering, 12(3):353371, 2000. [34] Y. Xiao, J. F. Yao, Z. Li, and M. H. Dunham. Ecient data mining for maximal frequent subtrees. In ICDM 03: Proceedings of the Third IEEE International Conference on Data Mining, page 379, Washington, DC, USA, 2003. IEEE Computer Society. [35] X. Yan and J. Han. Closegraph: mining closed frequent graph patterns. In KDD 03: Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 286295. ACM Press, 2003. [36] M. J. Zaki. Eciently mining frequent trees in a forest: algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, 17(8):10211035, 2005. [37] M. J. Zaki and Charu C. Aggarwal. Xrules: An eective algorithm for structural classication of xml data. Machine Learning, 62(1-2):137170, 2006. [38] S. Zhang, J. Zhang, H. Liu, and W. Wang. Xar-miner: ecient association rules mining for xml data. In WWW (Special interest tracks and posters), pages 894895, 2005.

You might also like